Message ID: 20200115134303.30668-1-milanpa@amazon.de (mailing list archive)
State: New, archived
On 15.01.20 14:43, Milan Pandurov wrote:
> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>
> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>
> Two new ioctl methods are added:
> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>
> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>
> Signed-off-by: Milan Pandurov <milanpa@amazon.de>

Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.

I have a few comments overall though:

1)

This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?

Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?

I see two alternatives to this approach here:

a) ONE_REG

We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.

That would give us a very easily structured approach to retrieve individual counters.

b) part of the mmap'ed vcpu struct

We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.

For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?

2) vcpu counters

Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).

The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.

Thanks,

Alex
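To make alternative (a) concrete, a minimal userspace sketch follows. The DEBUG "arch" value and the stat index are invented here for illustration and exist in no kernel header; note also that KVM_GET_ONE_REG is only defined on vcpu fds today, so a per-VM variant would be new:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical ids: KVM defines no DEBUG arch field for ONE_REG,
	 * and the 0x0001 "exits" index is made up for this sketch. */
	#define KVM_REG_DEBUG		0x8000000000000000ULL
	#define KVM_REG_DEBUG_EXITS	(KVM_REG_DEBUG | KVM_REG_SIZE_U64 | 0x0001)

	static int read_exit_count(int vcpu_fd, __u64 *exits)
	{
		struct kvm_one_reg reg = {
			.id   = KVM_REG_DEBUG_EXITS,
			.addr = (__u64)(unsigned long)exits,
		};

		/* One explicit identifier per counter, one ioctl per read. */
		return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	}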
On 1/15/20 3:04 PM, Alexander Graf wrote:
> On 15.01.20 14:43, Milan Pandurov wrote:
>> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>>
>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>
>> Two new ioctl methods are added:
>> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
>> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>>
>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>
>> Signed-off-by: Milan Pandurov <milanpa@amazon.de>
>
> Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.
>
> I have a few comments overall though:
>
> 1)
>
> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?

In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.

> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>
> I see two alternatives to this approach here:
>
> a) ONE_REG
>
> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>
> That would give us a very easily structured approach to retrieve individual counters.
>
> b) part of the mmap'ed vcpu struct
>
> We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.
>
> For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?
>
> 2) vcpu counters
>
> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>
> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.

I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.

Best,

Milan

> Thanks,
>
> Alex
On 15.01.20 15:43, milanpa@amazon.com wrote:
> On 1/15/20 3:04 PM, Alexander Graf wrote:
>> On 15.01.20 14:43, Milan Pandurov wrote:
>>> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>>>
>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>
>>> Two new ioctl methods are added:
>>> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
>>> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>>>
>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>
>>> Signed-off-by: Milan Pandurov <milanpa@amazon.de>
>>
>> Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.
>>
>> I have a few comments overall though:
>>
>> 1)
>>
>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>
> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>
>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>
>> I see two alternatives to this approach here:
>>
>> a) ONE_REG
>>
>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>
>> That would give us a very easily structured approach to retrieve individual counters.
>>
>> b) part of the mmap'ed vcpu struct
>>
>> We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.
>>
>> For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?
>>
>> 2) vcpu counters
>>
>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>
>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>
> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.

We could do it in any approach, even with statfs if we do directories per vcpu :).

The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.

Alex
On 15/01/20 15:59, Alexander Graf wrote:
> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>
>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>
>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>
>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>
>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?

In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).

>>> I see two alternatives to this approach here:
>>>
>>> a) ONE_REG
>>>
>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.

ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.

>>> b) part of the mmap'ed vcpu struct

Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.

So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).

>>> 2) vcpu counters
>>>
>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>
>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>
>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>
> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.

I agree, anything in sysfs is complementary to vmfd/vcpufd access.

Paolo
On 18.01.20 00:38, Paolo Bonzini wrote:
> On 15/01/20 15:59, Alexander Graf wrote:
>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>
>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>
>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>
>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>
>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>
> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>
>>>> I see two alternatives to this approach here:
>>>>
>>>> a) ONE_REG
>>>>
>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>
> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.

Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.

It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.

>>>> b) part of the mmap'ed vcpu struct
>
> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>
> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).

I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.

>>>> 2) vcpu counters
>>>>
>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>
>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>
>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>
>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>
> I agree, anything in sysfs is complementary to vmfd/vcpufd access.

Cool :).

So we only need to agree on ONE_REG vs. this patch, mostly?

Alex
On 1/20/20 6:53 PM, Alexander Graf wrote:
> On 18.01.20 00:38, Paolo Bonzini wrote:
>> On 15/01/20 15:59, Alexander Graf wrote:
>>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>>
>>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>>
>>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>>
>>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>>
>>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>
>> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>>
>>>>> I see two alternatives to this approach here:
>>>>>
>>>>> a) ONE_REG
>>>>>
>>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>
>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>
> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>
> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>
>>>>> b) part of the mmap'ed vcpu struct
>>
>> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>>
>> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).
>
> I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.
>
>>>>> 2) vcpu counters
>>>>>
>>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>>
>>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>>
>>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>>
>>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>>
>> I agree, anything in sysfs is complementary to vmfd/vcpufd access.
>
> Cool :).
>
> So we only need to agree on ONE_REG vs. this patch, mostly?

What about extending KVM_GET_SUPPORTED_DEBUGFS_STAT with some additional information, like the data type and unit? ONE_REG exposes additional semantics about the data.

> Alex
On 20.01.20 19:57, milanpa@amazon.com wrote:
> On 1/20/20 6:53 PM, Alexander Graf wrote:
>> On 18.01.20 00:38, Paolo Bonzini wrote:
>>> On 15/01/20 15:59, Alexander Graf wrote:
>>>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>>>
>>>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>>>
>>>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>>>
>>>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>>>
>>>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>>
>>> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>>>
>>>>>> I see two alternatives to this approach here:
>>>>>>
>>>>>> a) ONE_REG
>>>>>>
>>>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>>
>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>
>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>
>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>>
>>>>>> b) part of the mmap'ed vcpu struct
>>>
>>> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>>>
>>> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).
>>
>> I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.
>>
>>>>>> 2) vcpu counters
>>>>>>
>>>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>>>
>>>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>>>
>>>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>>>
>>>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>>>
>>> I agree, anything in sysfs is complementary to vmfd/vcpufd access.
>>
>> Cool :).
>>
>> So we only need to agree on ONE_REG vs. this patch, mostly?
>
> What about extending KVM_GET_SUPPORTED_DEBUGFS_STAT with some additional information, like the data type and unit? ONE_REG exposes additional semantics about the data.

It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".

The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.

Alex
On 21/01/20 16:38, Alexander Graf wrote:
>>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>>
>>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>>
>>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.

The devil is in the details. For example, would you retire uapi constants and cause programs to fail compilation? Or do you keep the obsolete constants forever? Also, fixing the mapping from ONE_REG number to stat would mean a switch statement (or loop of some kind; a switch statement is basically an unrolled binary search) to access the stats. Instead, returning the id in KVM_GET_SUPPORTED_DEBUGFS_STAT would simplify returning the stats to a simple copy_to_user.

Of course, some of the complexity would be punted to userspace. But userspace is much closer to the humans that ultimately look at the stats, so the question is: does userspace really care about knowing which stat is which, or does it just care about having a name that will ultimately be consumed by humans down the pipe? If the latter (which is also my gut feeling), that is also a point against ONE_REG.

> It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".
>
> The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.

See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...

Paolo
On 23.01.20 13:08, Paolo Bonzini wrote:
> On 21/01/20 16:38, Alexander Graf wrote:
>>>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>>>
>>>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>>>
>>>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>
> The devil is in the details. For example, would you retire uapi constants and cause programs to fail compilation? Or do you keep the obsolete constants forever? Also, fixing the mapping from ONE_REG number to stat would mean a switch statement (or loop of some kind; a switch statement is basically an unrolled binary search) to access the stats. Instead, returning the id in KVM_GET_SUPPORTED_DEBUGFS_STAT would simplify returning the stats to a simple copy_to_user.

If you look at the way RISC-V implemented ONE_REG, I think we can agree that it's possible with constant identifiers as well :). The only downside is of course that you may potentially end up with an identifier array to map from "ONE_REG id" to "offset in vcpu/vm struct". I fail to see how that's worse than the struct kvm_stats_debugfs_item[] we have today.

> Of course, some of the complexity would be punted to userspace. But userspace is much closer to the humans that ultimately look at the stats, so the question is: does userspace really care about knowing which stat is which, or does it just care about having a name that will ultimately be consumed by humans down the pipe? If the latter (which is also my gut feeling), that is also a point against ONE_REG.
>
>> It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".
>>
>> The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.
>
> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...

I don't fully agree on the "human" part here. At the end of the day, you want stats because you want to act on stats. Ideally, you want to do that fully automatically. Let me give you a few examples:

1) insn_emulation_fail triggers

You may want to feed all the failures into a database to check whether there is something wrong in the emulator.

2) (remote_)tlb_flush beyond a certain threshold

If you see that you're constantly flushing remote TLBs, there's a good chance that you found a workload that may need tuning in KVM. You want to gather those stats across your full fleet of hosts, so that for the few occasions when you hit it, you can work with the actual VM owners to potentially improve their performance.

3) exits beyond a certain threshold

You know roughly how many exits your fleet would usually see, so you can configure an upper threshold on that. When you have an automated way to be notified when the threshold is exceeded, you can check what that particular guest did to raise so many exits.

... and I'm sure there's room for a lot more potential stats that could be useful to gather to determine the health of a KVM environment, such as a "vcpu steal time" one or a "maximum time between two VMENTERs while the guest was in a running state".

All of these should eventually feed into something bigger that collects the numbers across your full VM fleet, so that a human can take actions based on them. However, that means the values are no longer directly impacting a human; they need to feed into machines. And for that, exact, constant identifiers make much more sense :)

Alex
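For context, a sketch of the identifier array Alex alludes to; everything here is hypothetical, the ids reuse the invented KVM_REG_DEBUG_* constants from the sketch earlier in the thread, and the field names mirror x86's struct kvm_vcpu_stat only as an example:

	/* Hypothetical kernel-side table, in the spirit of the existing
	 * struct kvm_stats_debugfs_item[]; none of these ids are real. */
	struct kvm_stat_reg {
		u64 id;		/* ONE_REG identifier for the stat */
		size_t offset;	/* offset into struct kvm_vcpu_stat */
	};

	static const struct kvm_stat_reg kvm_stat_regs[] = {
		{ KVM_REG_DEBUG_EXITS,     offsetof(struct kvm_vcpu_stat, exits) },
		{ KVM_REG_DEBUG_TLB_FLUSH, offsetof(struct kvm_vcpu_stat, tlb_flush) },
	};

	/* Retrieval walks the table (linear scan or binary search) and
	 * returns -EINVAL for ids that were never defined or were retired. */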
On 23/01/20 13:32, Alexander Graf wrote:
>> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...
>
> I don't fully agree on the "human" part here.

I agree it's not entirely about humans, but in general it's going to be rules and queries on monitoring tools, where 1) the monitoring tools' output is generally not KVM-specific, and 2) the rules and queries will be written by humans.

So if the kernel produces insn_emulation_fail, the plugin for the monitoring tool will just log kvm.insn_emulation_fail. If the kernel produces 0x10042, the plugin will have to convert it and then log it. This is why I'm not sure that providing strings is actually less work for userspace.

Paolo

> At the end of the day, you want stats because you want to act on stats. Ideally, you want to do that fully automatically. Let me give you a few examples:
>
> 1) insn_emulation_fail triggers
>
> You may want to feed all the failures into a database to check whether there is something wrong in the emulator.
>
> 2) (remote_)tlb_flush beyond a certain threshold
>
> If you see that you're constantly flushing remote TLBs, there's a good chance that you found a workload that may need tuning in KVM. You want to gather those stats across your full fleet of hosts, so that for the few occasions when you hit it, you can work with the actual VM owners to potentially improve their performance.
>
> 3) exits beyond a certain threshold
>
> You know roughly how many exits your fleet would usually see, so you can configure an upper threshold on that. When you have an automated way to be notified when the threshold is exceeded, you can check what that particular guest did to raise so many exits.
>
> ... and I'm sure there's room for a lot more potential stats that could be useful to gather to determine the health of a KVM environment, such as a "vcpu steal time" one or a "maximum time between two VMENTERs while the guest was in a running state".
>
> All of these should eventually feed into something bigger that collects the numbers across your full VM fleet, so that a human can take actions based on them. However, that means the values are no longer directly impacting a human; they need to feed into machines. And for that, exact, constant identifiers make much more sense
On 23.01.20 15:19, Paolo Bonzini wrote:
> On 23/01/20 13:32, Alexander Graf wrote:
>>> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...
>>
>> I don't fully agree on the "human" part here.
>
> I agree it's not entirely about humans, but in general it's going to be rules and queries on monitoring tools, where 1) the monitoring tools' output is generally not KVM-specific, and 2) the rules and queries will be written by humans.
>
> So if the kernel produces insn_emulation_fail, the plugin for the monitoring tool will just log kvm.insn_emulation_fail. If the kernel produces 0x10042, the plugin will have to convert it and then log it. This is why I'm not sure that providing strings is actually less work for userspace.

I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).

I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.

So how strongly do you feel about the string-based approach?

Alex

PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).
On 23/01/20 15:45, Alexander Graf wrote:
> I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).

And for me it's: if I don't know whether a #define is necessary, I'd rather not have a #define. So yeah, we agree on everything except the userspace API (which is no small thing, but it's a start).

> I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.
>
> So how strongly do you feel about the string-based approach?

I like it, of course.

> PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).

That could actually be somewhat useful for VCPU registers as well (give me the string and type, and a list of valid ONE_REG ids). If that was the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.

Paolo
On 23.01.20 15:50, Paolo Bonzini wrote:
> On 23/01/20 15:45, Alexander Graf wrote:
>> I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).
>
> And for me it's: if I don't know whether a #define is necessary, I'd rather not have a #define. So yeah, we agree on everything except the userspace API (which is no small thing, but it's a start).
>
>> I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.
>>
>> So how strongly do you feel about the string-based approach?
>
> I like it, of course.
>
>> PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).
>
> That could actually be somewhat useful for VCPU registers as well (give me the string and type, and a list of valid ONE_REG ids). If that was

You don't need the type explicitly, it's part of the ONE_REG identifier :).

> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.

We don't have all of ONE_REG as part of the user space ABI today. On Book3S PPC you can not access BookE ONE_REG variables, for obvious reasons. Neither can you access ARM ONE_REGs. Not implementing a ONE_REG is perfectly ok.

But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!

Alex
On 23/01/20 15:58, Alexander Graf wrote:
>> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.
>
> We don't have all of ONE_REG as part of the user space ABI today.

Still, those that exist cannot change their id. This makes them a part of the userspace ABI.

> But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!

Yeah, wouldn't it? Milan, what do you think about it?

Paolo
On 1/23/20 4:05 PM, Paolo Bonzini wrote:
> On 23/01/20 15:58, Alexander Graf wrote:
>>> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.
>>
>> We don't have all of ONE_REG as part of the user space ABI today.
>
> Still, those that exist cannot change their id. This makes them a part of the userspace ABI.
>
>> But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!
>
> Yeah, wouldn't it? Milan, what do you think about it?
>
> Paolo

I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.

My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.

Milan
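A rough sketch of the shape such a query could take; every name, size, and ioctl number below is a placeholder invented for illustration, not an existing uapi definition:

	/* Hypothetical uapi structs for the GET_AVAILABLE_ONE_REGS idea. */
	struct kvm_one_reg_description {
		__u64 id;	/* ONE_REG identifier */
		char name[32];	/* e.g. "insn_emulation_fail" */
	};

	struct kvm_available_one_regs {
		__u32 nregs;	/* in: array capacity; out: regs available */
		__u32 flags;	/* in: bitmask narrowing the register class */
		struct kvm_one_reg_description regs[0];
	};

Userspace would issue this query once per fd, then read values with plain KVM_GET_ONE_REG using the returned ids, logging them under the returned names.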
On 23/01/20 16:27, milanpa@amazon.com wrote:
> I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.
>
> My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.

For now, just do the implementation for VM ONE_REGs. We'll worry about the existing VCPU registers later.

Paolo
On 1/23/20 5:15 PM, Paolo Bonzini wrote:
> On 23/01/20 16:27, milanpa@amazon.com wrote:
>> I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.
>>
>> My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.
>
> For now, just do the implementation for VM ONE_REGs. We'll worry about the existing VCPU registers later.
>
> Paolo

Sounds good. Thanks for the help, I will update the patch soon.

Milan
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f0a16b4adbbd..07ad35ddc14f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1473,6 +1473,27 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	_IOW(KVMIO, 0xc2, int)
 
+#define KVM_DBG_DESCR_NAME_MAX_SIZE 30
+struct kvm_debugfs_entry_description {
+	char name[KVM_DBG_DESCR_NAME_MAX_SIZE + 1];
+};
+
+struct kvm_debugfs_entries_description {
+	__u32 nentries;
+	struct kvm_debugfs_entry_description entry[0];
+};
+
+struct kvm_debug_stats {
+	__u32 nentries;
+	__u64 values[0];
+};
+
+/* Get description of available debugfs counters */
+#define KVM_GET_SUPPORTED_DEBUGFS_STATS \
+	_IOWR(KVMIO, 0xc2, struct kvm_debugfs_entries_description)
+/* Get values from debugfs */
+#define KVM_GET_DEBUGFS_VALUES	_IOWR(KVMIO, 0xc3, struct kvm_debug_stats)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9eb6e081da3a..66b36b7e347e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -146,6 +146,10 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
 				    gfn_t gfn);
 
+static long kvm_get_debugfs_entry_description(struct kvm *kvm,
+					      void __user *argp);
+static long kvm_get_debugfs_values(struct kvm *kvm, void __user *argp);
+
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
@@ -3452,6 +3456,12 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_GET_SUPPORTED_DEBUGFS_STATS:
+		r = kvm_get_debugfs_entry_description(kvm, argp);
+		break;
+	case KVM_GET_DEBUGFS_VALUES:
+		r = kvm_get_debugfs_values(kvm, argp);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -4202,6 +4212,66 @@ static const struct file_operations *stat_fops[] = {
 	[KVM_STAT_VM]   = &vm_stat_fops,
 };
 
+static long kvm_get_debugfs_entry_description(struct kvm *kvm,
+					      void __user *argp)
+{
+	struct kvm_debugfs_entries_description *description = argp;
+	struct kvm_stats_debugfs_item *dbgfs_item = debugfs_entries;
+	bool should_copy = true;
+	size_t name_length = 0;
+	__u32 i = 0;
+
+	for (; dbgfs_item->name != NULL; dbgfs_item++, i++) {
+		if (i >= description->nentries)
+			should_copy = false;
+
+		if (should_copy) {
+			name_length = strlen(dbgfs_item->name);
+			name_length =
+				(name_length > KVM_DBG_DESCR_NAME_MAX_SIZE) ?
+					KVM_DBG_DESCR_NAME_MAX_SIZE :
+					name_length;
+
+			copy_to_user(description->entry[i].name,
+				     dbgfs_item->name, name_length);
+			put_user('\0',
+				 description->entry[i].name + name_length);
+		}
+	}
+	put_user(i, &description->nentries);
+	return (should_copy) ? 0 : -ENOMEM;
+}
+
+static long kvm_get_debugfs_values(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_debug_stats *stats = argp;
+	struct kvm_stats_debugfs_item *dbgfs_item = debugfs_entries;
+	bool should_copy = true;
+	__u32 i = 0;
+	__u64 tmp = 0;
+
+	for (; dbgfs_item->name != NULL; dbgfs_item++, i++) {
+		if (i >= stats->nentries)
+			should_copy = false;
+
+		if (should_copy) {
+			switch (dbgfs_item->kind) {
+			case KVM_STAT_VM:
+				kvm_get_stat_per_vm(kvm, dbgfs_item->offset,
+						    &tmp);
+				break;
+			case KVM_STAT_VCPU:
+				kvm_get_stat_per_vcpu(kvm, dbgfs_item->offset,
+						      &tmp);
+				break;
+			}
+			put_user(tmp, stats->values + i);
+		}
+	}
+	put_user(i, &stats->nentries);
+	return (should_copy) ? 0 : -ENOMEM;
+}
+
 static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
 {
 	struct kobj_uevent_env *env;
KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.

Let's expose a new interface to userspace for gathering these statistics with one ioctl.

Two new ioctl methods are added:
- KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
- KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.

A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.

Signed-off-by: Milan Pandurov <milanpa@amazon.de>
---
The current approach returns all available counters to userspace, which might be overkill. This can be further extended with an interface in which userspace provides the indices of the counters it is interested in, and only those counters will be filled accordingly.

NOTE: This patch is placed on top of: https://www.spinics.net/lists/kvm/msg202599.html
---
 include/uapi/linux/kvm.h | 21 ++++++++++++
 virt/kvm/kvm_main.c      | 70 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)
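For illustration, a minimal userspace sketch of the read-names-once / poll-values flow described above, written against the structs added by this patch. The helper name and the capacity of 64 are arbitrary choices for this sketch; a real caller would retry with a larger buffer when the ioctl fails with ENOMEM:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>	/* with this patch applied */

	static int dump_kvm_stats(int vm_fd)
	{
		const __u32 cap = 64;	/* guessed capacity; grow on ENOMEM */
		struct kvm_debugfs_entries_description *desc;
		struct kvm_debug_stats *vals;
		int ret = -1;
		__u32 i;

		desc = calloc(1, sizeof(*desc) + cap * sizeof(desc->entry[0]));
		vals = calloc(1, sizeof(*vals) + cap * sizeof(vals->values[0]));
		if (!desc || !vals)
			goto out;

		/* Read the counter names once. */
		desc->nentries = cap;
		if (ioctl(vm_fd, KVM_GET_SUPPORTED_DEBUGFS_STATS, desc) < 0)
			goto out;

		/* Then each monitoring tick is a single ioctl. */
		vals->nentries = cap;
		if (ioctl(vm_fd, KVM_GET_DEBUGFS_VALUES, vals) < 0)
			goto out;

		/* The kernel writes back the total entry count, which may
		 * exceed our capacity, so clamp to what we allocated. */
		for (i = 0; i < vals->nentries && i < cap; i++)
			printf("%s = %llu\n", desc->entry[i].name,
			       (unsigned long long)vals->values[i]);
		ret = 0;
	out:
		free(desc);
		free(vals);
		return ret;
	}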