Message ID: 20250319-hybrid-v1-1-4d1ada10e705@daynix.com
State: New
Series: [RFC] KVM: arm64: PMU: Use multiple host PMUs
Hi Akihiko, On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: > Problem > ------- > > arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: > > The observant among you will notice that the supported_cpus > > mask does not get updated for the default PMU even though it > > is quite possible the selected instance supports only a > > subset of cores in the system. This is intentional, and > > upholds the preexisting behavior on heterogeneous systems > > where vCPUs can be scheduled on any core but the guest > > counters could stop working. > > Despite the reference manual says counters may not continuously > incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 > and crashes with a division-by-zero error and it also crashes when the > PMU is not present. > > To avoid such a problem, the userspace should pin the vCPU threads to > pCPUs supported by one host PMU when initializing the vCPUs or specify > the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the > initialization. However, QEMU/libvirt can pin vCPU threads only after the > vCPUs are initialized. It also limits the pCPUs the guest can use even > for VMMs that support proper pinning. > > Solution > -------- > > Ideally, Windows should fix the division-by-zero error and QEMU/libvirt > should support pinning better, but neither of them are going to happen > anytime soon. > > To allow running Windows on QEMU/libvirt or with heterogeneous cores, > combine all host PMUs necessary to cover the cores vCPUs can run and > keep PMCCNTR_EL0 working. I'm extremely uneasy about making this a generalized solution. PMUs are deeply tied to the microarchitecture of a particular implementation, and that isn't something we can abstract away from the guest in KVM. For example, you could have an event ID that counts on only a subset of cores, or better yet an event that counts something completely different depending on where a vCPU lands. I do appreciate the issue that you're trying to solve. The good news though is that the fixed PMU cycle counter is the only thing guaranteed to be present in any PMUv3 implementation. Since that's the only counter Windows actually needs, perhaps we could special-case this in KVM. I have the following (completely untested) patch, do you want to give it a try? There's still going to be observable differences between PMUs (e.g. CPU frequency) but at least it should get things booting. Thanks, Oliver --- diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index a1bc10d7116a..913a7bab50b5 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -724,14 +724,21 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) return; memset(&attr, 0, sizeof(struct perf_event_attr)); - attr.type = arm_pmu->pmu.type; + + if (pmc->idx == ARMV8_PMU_CYCLE_IDX) { + attr.type = PERF_TYPE_HARDWARE; + attr.config = PERF_COUNT_HW_CPU_CYCLES; + } else { + attr.type = arm_pmu->pmu.type; + attr.config = eventsel; + } + attr.size = sizeof(attr); attr.pinned = 1; attr.disabled = !kvm_pmu_counter_is_enabled(pmc); attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); attr.exclude_hv = 1; /* Don't count EL2 events */ attr.exclude_host = 1; /* Don't count host events */ - attr.config = eventsel; /* * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the
On 2025/03/19 16:34, Oliver Upton wrote: > Hi Akihiko, > > On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: >> Problem >> ------- >> >> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: >>> The observant among you will notice that the supported_cpus >>> mask does not get updated for the default PMU even though it >>> is quite possible the selected instance supports only a >>> subset of cores in the system. This is intentional, and >>> upholds the preexisting behavior on heterogeneous systems >>> where vCPUs can be scheduled on any core but the guest >>> counters could stop working. >> >> Despite the reference manual says counters may not continuously >> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 >> and crashes with a division-by-zero error and it also crashes when the >> PMU is not present. >> >> To avoid such a problem, the userspace should pin the vCPU threads to >> pCPUs supported by one host PMU when initializing the vCPUs or specify >> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the >> initialization. However, QEMU/libvirt can pin vCPU threads only after the >> vCPUs are initialized. It also limits the pCPUs the guest can use even >> for VMMs that support proper pinning. >> >> Solution >> -------- >> >> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt >> should support pinning better, but neither of them are going to happen >> anytime soon. >> >> To allow running Windows on QEMU/libvirt or with heterogeneous cores, >> combine all host PMUs necessary to cover the cores vCPUs can run and >> keep PMCCNTR_EL0 working. > > I'm extremely uneasy about making this a generalized solution. PMUs are > deeply tied to the microarchitecture of a particular implementation, and > that isn't something we can abstract away from the guest in KVM. > > For example, you could have an event ID that counts on only a subset of > cores, or better yet an event that counts something completely different > depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. > > The good news though is that the fixed PMU cycle counter is the only > thing guaranteed to be present in any PMUv3 implementation. Since > that's the only counter Windows actually needs, perhaps we could > special-case this in KVM. > > I have the following (completely untested) patch, do you want to give it > a try? There's still going to be observable differences between PMUs > (e.g. CPU frequency) but at least it should get things booting. I don't think it will work, unfortunately. perf_init_event() binds a perf event to a particular PMU so the event will stop working when the thread migrates away. It should also be the reason why the perf program creates an event for each PMU. tools/perf/Documentation/intel-hybrid.txt has more descriptions. Allowing to enable more than one counter and/or an event type other than the cycle counter is not the goal. Enabling another event type may result in a garbage value, but I don't think it's worse than the current situation where the count stays zero; please tell me if I miss something. There is still room for improvement. Returning a garbage value may not be worse than returning zero, but counters and event types not supported by some cores shouldn't be advertised as available in the first place. More concretely: - The vCPU should be limited to run only on cores covered by PMUs when KVM_ARM_VCPU_PMU_V3 is set. 
- PMCR_EL0.N advertised to the guest should be the minimum of those of the host PMUs. - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the result of ANDing those of the host PMUs. Special-casing the cycle counter may make sense if we are going to fix the advertised values of PMCR_EL0.N, PMCEID0_EL0, and PMCEID1_EL0, as we can simply return zero for these registers. We can also prevent enabling a counter that returns zero or a garbage value. Do you think it's worth fixing these registers? If so, I'll do that by special-casing the cycle counter. Regards, Akihiko Odaki > > Thanks, > Oliver > > --- > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c > index a1bc10d7116a..913a7bab50b5 100644 > --- a/arch/arm64/kvm/pmu-emul.c > +++ b/arch/arm64/kvm/pmu-emul.c > @@ -724,14 +724,21 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) > return; > > memset(&attr, 0, sizeof(struct perf_event_attr)); > - attr.type = arm_pmu->pmu.type; > + > + if (pmc->idx == ARMV8_PMU_CYCLE_IDX) { > + attr.type = PERF_TYPE_HARDWARE; > + attr.config = PERF_COUNT_HW_CPU_CYCLES; > + } else { > + attr.type = arm_pmu->pmu.type; > + attr.config = eventsel; > + } > + > attr.size = sizeof(attr); > attr.pinned = 1; > attr.disabled = !kvm_pmu_counter_is_enabled(pmc); > attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); > attr.exclude_hv = 1; /* Don't count EL2 events */ > attr.exclude_host = 1; /* Don't count host events */ > - attr.config = eventsel; > > /* > * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the
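A minimal sketch of the min/AND combination described above (illustrative only; the struct and field names are hypothetical, not KVM's):

/*
 * Illustrative sketch: combine per-PMU capabilities into what could be
 * safely advertised to a guest that may run on any of these PMUs.
 * Hypothetical types; not code from this series.
 */
#include <stdint.h>

struct host_pmu_caps {
	uint8_t  nr_counters;	/* PMCR_EL0.N of this host PMU */
	uint64_t pmceid0;	/* PMCEID0_EL0 */
	uint64_t pmceid1;	/* PMCEID1_EL0 */
};

static struct host_pmu_caps combine_pmu_caps(const struct host_pmu_caps *pmu, int n)
{
	struct host_pmu_caps out = {
		.nr_counters = UINT8_MAX,
		.pmceid0 = ~0ULL,
		.pmceid1 = ~0ULL,
	};

	for (int i = 0; i < n; i++) {
		/* Advertise no more counters than the smallest PMU provides. */
		if (pmu[i].nr_counters < out.nr_counters)
			out.nr_counters = pmu[i].nr_counters;
		/* Advertise only events that every PMU implements. */
		out.pmceid0 &= pmu[i].pmceid0;
		out.pmceid1 &= pmu[i].pmceid1;
	}
	return out;
}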
On Wed, 19 Mar 2025 08:37:29 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/19 16:34, Oliver Upton wrote: > > Hi Akihiko, > > > > On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: > >> Problem > >> ------- > >> > >> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: > >>> The observant among you will notice that the supported_cpus > >>> mask does not get updated for the default PMU even though it > >>> is quite possible the selected instance supports only a > >>> subset of cores in the system. This is intentional, and > >>> upholds the preexisting behavior on heterogeneous systems > >>> where vCPUs can be scheduled on any core but the guest > >>> counters could stop working. > >> > >> Despite the reference manual says counters may not continuously > >> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 > >> and crashes with a division-by-zero error and it also crashes when the > >> PMU is not present. > >> > >> To avoid such a problem, the userspace should pin the vCPU threads to > >> pCPUs supported by one host PMU when initializing the vCPUs or specify > >> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the > >> initialization. However, QEMU/libvirt can pin vCPU threads only after the > >> vCPUs are initialized. It also limits the pCPUs the guest can use even > >> for VMMs that support proper pinning. > >> > >> Solution > >> -------- > >> > >> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt > >> should support pinning better, but neither of them are going to happen > >> anytime soon. > >> > >> To allow running Windows on QEMU/libvirt or with heterogeneous cores, > >> combine all host PMUs necessary to cover the cores vCPUs can run and > >> keep PMCCNTR_EL0 working. > > > I'm extremely uneasy about making this a generalized solution. PMUs are > > deeply tied to the microarchitecture of a particular implementation, and > > that isn't something we can abstract away from the guest in KVM. > > > > For example, you could have an event ID that counts on only a subset of > > cores, or better yet an event that counts something completely different > > depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. > > > > The good news though is that the fixed PMU cycle counter is the only > > thing guaranteed to be present in any PMUv3 implementation. Since > > that's the only counter Windows actually needs, perhaps we could > > special-case this in KVM. > > > > I have the following (completely untested) patch, do you want to give it > > a try? There's still going to be observable differences between PMUs > > (e.g. CPU frequency) but at least it should get things booting. > > I don't think it will work, unfortunately. perf_init_event() binds a > perf event to a particular PMU so the event will stop working when the > thread migrates away. > > It should also be the reason why the perf program creates an event for > each PMU. tools/perf/Documentation/intel-hybrid.txt has more > descriptions. But perf on non-Intel behaves pretty differently. ARM PMUs behaves pretty differently, because there is no guarantee of homogeneous events. > > Allowing to enable more than one counter and/or an event type other > than the cycle counter is not the goal. Enabling another event type > may result in a garbage value, but I don't think it's worse than the > current situation where the count stays zero; please tell me if I miss > something. 
> > There is still room for improvement. Returning a garbage value may not > be worse than returning zero, but counters and event types not > supported by some cores shouldn't be advertised as available in the > first place. More concretely: > > - The vCPU should be limited to run only on cores covered by PMUs when > KVM_ARM_VCPU_PMU_V3 is set. That's userspace's job. Bind to the desired PMU, and run. KVM will actively prevent you from running on the wrong CPU. > - PMCR_EL0.N advertised to the guest should be the minimum of ones of > host PMUs. How do you find out? CPUs can be hot-plugged long after a VM has started, bringing in a new PMU, with a different number of counters. > - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the > result of the AND operations of ones of host PMUs. Same problem. > > Special-casing the cycle counter may make sense if we are going to fix > the advertised values of PMCR_EL0.N, PMCEID0_EL0, and > PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these > registers. We can also prevent enabling a counter that returns zero or > a garbage value. > > Do you think it's worth fixing these registers? If so, I'll do that by > special-casing the cycle counter. I think this is really going in the wrong direction. The whole design of the PMU emulation is that we expose a single, architecturally correct PMU implementation. This is clearly documented. Furthermore, userspace is being given all the relevant information to place vcpus on the correct physical CPUs. Why should we add this sort of hack in the kernel, creating a new userspace ABI that we will have to support forever, when userspace can do the correct thing right now? Worst case, this is just a 'taskset' away, and everything will work. Frankly, I'm not prepared to add more hacks to KVM for the sake of the combination of broken userspace and broken guest. M.
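For what "a 'taskset' away" looks like from inside a VMM, a rough sketch (illustrative only): pin the calling vCPU thread to the CPUs served by the chosen host PMU before initializing the vCPU. The CPU range 0-3 is an assumption; a real VMM would read the mask from the PMU's cpumask (e.g. /sys/bus/event_source/devices/<pmu>/cpus).

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Pin the calling (vCPU) thread to the CPUs served by one host PMU,
 * equivalent to launching the VMM under taskset. CPU range is assumed.
 */
static int pin_to_pmu_cpus(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	for (int cpu = 0; cpu <= 3; cpu++)	/* assumption: PMU covers CPUs 0-3 */
		CPU_SET(cpu, &set);

	if (sched_setaffinity(0 /* calling thread */, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}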
On 2025/03/19 18:47, Marc Zyngier wrote: > On Wed, 19 Mar 2025 08:37:29 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2025/03/19 16:34, Oliver Upton wrote: >>> Hi Akihiko, >>> >>> On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: >>>> Problem >>>> ------- >>>> >>>> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: >>>>> The observant among you will notice that the supported_cpus >>>>> mask does not get updated for the default PMU even though it >>>>> is quite possible the selected instance supports only a >>>>> subset of cores in the system. This is intentional, and >>>>> upholds the preexisting behavior on heterogeneous systems >>>>> where vCPUs can be scheduled on any core but the guest >>>>> counters could stop working. >>>> >>>> Despite the reference manual says counters may not continuously >>>> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 >>>> and crashes with a division-by-zero error and it also crashes when the >>>> PMU is not present. >>>> >>>> To avoid such a problem, the userspace should pin the vCPU threads to >>>> pCPUs supported by one host PMU when initializing the vCPUs or specify >>>> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the >>>> initialization. However, QEMU/libvirt can pin vCPU threads only after the >>>> vCPUs are initialized. It also limits the pCPUs the guest can use even >>>> for VMMs that support proper pinning. >>>> >>>> Solution >>>> -------- >>>> >>>> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt >>>> should support pinning better, but neither of them are going to happen >>>> anytime soon. >>>> >>>> To allow running Windows on QEMU/libvirt or with heterogeneous cores, >>>> combine all host PMUs necessary to cover the cores vCPUs can run and >>>> keep PMCCNTR_EL0 working. >>>> I'm extremely uneasy about making this a generalized solution. PMUs are >>> deeply tied to the microarchitecture of a particular implementation, and >>> that isn't something we can abstract away from the guest in KVM. >>> >>> For example, you could have an event ID that counts on only a subset of >>> cores, or better yet an event that counts something completely different >>> depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. >>> >>> The good news though is that the fixed PMU cycle counter is the only >>> thing guaranteed to be present in any PMUv3 implementation. Since >>> that's the only counter Windows actually needs, perhaps we could >>> special-case this in KVM. >>> >>> I have the following (completely untested) patch, do you want to give it >>> a try? There's still going to be observable differences between PMUs >>> (e.g. CPU frequency) but at least it should get things booting. >> >> I don't think it will work, unfortunately. perf_init_event() binds a >> perf event to a particular PMU so the event will stop working when the >> thread migrates away. >> >> It should also be the reason why the perf program creates an event for >> each PMU. tools/perf/Documentation/intel-hybrid.txt has more >> descriptions. > > But perf on non-Intel behaves pretty differently. ARM PMUs behaves > pretty differently, because there is no guarantee of homogeneous > events. It works in the same manner in this particular aspect (i.e., "perf stat -e cycles -a" creates events for all PMUs). > >> >> Allowing to enable more than one counter and/or an event type other >> than the cycle counter is not the goal. 
Enabling another event type >> may result in a garbage value, but I don't think it's worse than the >> current situation where the count stays zero; please tell me if I miss >> something. >> >> There is still room for improvement. Returning a garbage value may not >> be worse than returning zero, but counters and event types not >> supported by some cores shouldn't be advertised as available in the >> first place. More concretely: >> >> - The vCPU should be limited to run only on cores covered by PMUs when >> KVM_ARM_VCPU_PMU_V3 is set. > > That's userspace's job. Bind to the desired PMU, and run. KVM will > actively prevent you from running on the wrong CPU. > >> - PMCR_EL0.N advertised to the guest should be the minimum of ones of >> host PMUs. > > How do you find out? CPUs can be hot-plugged on long after a VM has > started, bringing in a new PMU, with a different number of counters. > >> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the >> result of the AND operations of ones of host PMUs. > > Same problem. I guess special-casing the cycle counter is the only option if the kernel is going to deal with this. > >> >> Special-casing the cycle counter may make sense if we are going to fix >> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and >> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these >> registers. We can also prevent enabling a counter that returns zero or >> a garbage value. >> >> Do you think it's worth fixing these registers? If so, I'll do that by >> special-casing the cycle counter. > > I think this is really going in the wrong direction. > > The whole design of the PMU emulation is that we expose a single, > architecturally correct PMU implementation. This is clearly > documented. > > Furthermore, userspace is being given all the relevant information to > place vcpus on the correct physical CPUs. Why should we add this sort > of hack in the kernel, creating a new userspace ABI that we will have > to support forever, when usespace can do the correct thing right now? > > Worse case, this is just a 'taskset' away, and everything will work. It's surprisingly difficult to do that with libvirt; of course it is a userspace problem though. > > Frankly, I'm not prepared to add more hacks to KVM for the sake of the > combination of broken userspace and broken guest. The only counter argument I have in this regard is that some change is also needed to expose all CPUs to Windows guest even when the userspace does its best. It may result in odd scheduling, but still gives the best throughput. Regards, Akihiko Odaki > > M. >
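To show what "an event for each PMU" means at the syscall level, here is a rough sketch (illustrative only; the PMU instance names are assumptions, and 0x11 is the architectural ARMv8 CPU_CYCLES event number) that opens one per-thread cycles event on each ARM PMU via its dynamic perf type. Each event counts only while the thread runs on CPUs served by that PMU, which is exactly the behaviour being discussed.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a per-thread CPU_CYCLES event bound to one specific ARM PMU. */
static int open_cycles_on_pmu(const char *pmu, pid_t tid)
{
	char path[128];
	unsigned int type;
	struct perf_event_attr attr;
	FILE *f;

	/* Each PMU instance exports its dynamic perf type in sysfs. */
	snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", pmu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%u", &type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;	/* bind the event to this PMU */
	attr.config = 0x11;	/* architectural CPU_CYCLES event number */

	/* Count for 'tid' on any CPU; it only ticks on this PMU's CPUs. */
	return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
}

int main(void)
{
	/* Assumed instance names for a two-cluster big.LITTLE system. */
	const char *pmus[] = { "armv8_pmuv3_0", "armv8_pmuv3_1" };

	for (unsigned int i = 0; i < sizeof(pmus) / sizeof(pmus[0]); i++)
		printf("%s: fd %d\n", pmus[i], open_cycles_on_pmu(pmus[i], 0));
	return 0;
}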
On Wed, 19 Mar 2025 10:26:57 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > >> It should also be the reason why the perf program creates an event for > >> each PMU. tools/perf/Documentation/intel-hybrid.txt has more > >> descriptions. > > > > But perf on non-Intel behaves pretty differently. ARM PMUs behaves > > pretty differently, because there is no guarantee of homogeneous > > events. > > It works in the same manner in this particular aspect (i.e., "perf > stat -e cycles -a" creates events for all PMUs). But it then becomes a system-wide counter, and that's not what KVM needs to do. > >> Allowing to enable more than one counter and/or an event type other > >> than the cycle counter is not the goal. Enabling another event type > >> may result in a garbage value, but I don't think it's worse than the > >> current situation where the count stays zero; please tell me if I miss > >> something. > >> > >> There is still room for improvement. Returning a garbage value may not > >> be worse than returning zero, but counters and event types not > >> supported by some cores shouldn't be advertised as available in the > >> first place. More concretely: > >> > >> - The vCPU should be limited to run only on cores covered by PMUs when > >> KVM_ARM_VCPU_PMU_V3 is set. > > > > That's userspace's job. Bind to the desired PMU, and run. KVM will > > actively prevent you from running on the wrong CPU. > > > >> - PMCR_EL0.N advertised to the guest should be the minimum of ones of > >> host PMUs. > > > > How do you find out? CPUs can be hot-plugged on long after a VM has > > started, bringing in a new PMU, with a different number of counters. > > > >> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the > >> result of the AND operations of ones of host PMUs. > > > > Same problem. > > I guess special-casing the cycle counter is the only option if the > kernel is going to deal with this. Indeed. I think Oliver's idea is the least bad of them all, but man, this is really ugly. > >> Special-casing the cycle counter may make sense if we are going to fix > >> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and > >> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these > >> registers. We can also prevent enabling a counter that returns zero or > >> a garbage value. > >> > >> Do you think it's worth fixing these registers? If so, I'll do that by > >> special-casing the cycle counter. > > > > I think this is really going in the wrong direction. > > > > The whole design of the PMU emulation is that we expose a single, > > architecturally correct PMU implementation. This is clearly > > documented. > > > > Furthermore, userspace is being given all the relevant information to > > place vcpus on the correct physical CPUs. Why should we add this sort > > of hack in the kernel, creating a new userspace ABI that we will have > > to support forever, when usespace can do the correct thing right now? > > > > Worse case, this is just a 'taskset' away, and everything will work. > > It's surprisingly difficult to do that with libvirt; of course it is a > userspace problem though. Sorry, I must admit I'm completely ignorant of libvirt. I tried it years ago, and concluded that 95% of what I needed was adequately done with a shell script... > > Frankly, I'm not prepared to add more hacks to KVM for the sake of the > > combination of broken userspace and broken guest. 
> > The only counter argument I have in this regard is that some change is > also needed to expose all CPUs to Windows guest even when the > userspace does its best. It may result in odd scheduling, but still > gives the best throughput. But that'd be a new ABI, which again would require buy-in from userspace. Maybe there is scope for an all CPUs, cycle-counter only PMUv3 exposed to the guest, but that cannot be set automatically, as we would otherwise regress existing setups. At this stage, and given that you need to change userspace, I'm not sure what the best course of action is. Thanks, M.
On 2025/03/19 20:07, Marc Zyngier wrote: > On Wed, 19 Mar 2025 10:26:57 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >>>> It should also be the reason why the perf program creates an event for >>>> each PMU. tools/perf/Documentation/intel-hybrid.txt has more >>>> descriptions. >>> >>> But perf on non-Intel behaves pretty differently. ARM PMUs behaves >>> pretty differently, because there is no guarantee of homogeneous >>> events. >> >> It works in the same manner in this particular aspect (i.e., "perf >> stat -e cycles -a" creates events for all PMUs). > > But it then becomes a system-wide counter, and that's not what KVM > needs to do. There is also an example of program profiling: "perf stat -e cycles \-- taskset -c 16 ./triad_loop" This also creates events for all PMUs. > >>>> Allowing to enable more than one counter and/or an event type other >>>> than the cycle counter is not the goal. Enabling another event type >>>> may result in a garbage value, but I don't think it's worse than the >>>> current situation where the count stays zero; please tell me if I miss >>>> something. >>>> >>>> There is still room for improvement. Returning a garbage value may not >>>> be worse than returning zero, but counters and event types not >>>> supported by some cores shouldn't be advertised as available in the >>>> first place. More concretely: >>>> >>>> - The vCPU should be limited to run only on cores covered by PMUs when >>>> KVM_ARM_VCPU_PMU_V3 is set. >>> >>> That's userspace's job. Bind to the desired PMU, and run. KVM will >>> actively prevent you from running on the wrong CPU. >>> >>>> - PMCR_EL0.N advertised to the guest should be the minimum of ones of >>>> host PMUs. >>> >>> How do you find out? CPUs can be hot-plugged on long after a VM has >>> started, bringing in a new PMU, with a different number of counters. >>> >>>> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the >>>> result of the AND operations of ones of host PMUs. >>> >>> Same problem. >> >> I guess special-casing the cycle counter is the only option if the >> kernel is going to deal with this. > > Indeed. I think Oliver's idea is the least bad of them all, but man, > this is really ugly. > >>>> Special-casing the cycle counter may make sense if we are going to fix >>>> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and >>>> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these >>>> registers. We can also prevent enabling a counter that returns zero or >>>> a garbage value. >>>> >>>> Do you think it's worth fixing these registers? If so, I'll do that by >>>> special-casing the cycle counter. >>> >>> I think this is really going in the wrong direction. >>> >>> The whole design of the PMU emulation is that we expose a single, >>> architecturally correct PMU implementation. This is clearly >>> documented. >>> >>> Furthermore, userspace is being given all the relevant information to >>> place vcpus on the correct physical CPUs. Why should we add this sort >>> of hack in the kernel, creating a new userspace ABI that we will have >>> to support forever, when usespace can do the correct thing right now? >>> >>> Worse case, this is just a 'taskset' away, and everything will work. >> >> It's surprisingly difficult to do that with libvirt; of course it is a >> userspace problem though. > > Sorry, I must admit I'm completely ignorant of libvirt. I tried it > years ago, and concluded that 95% of what I needed was adequately done > with a shell script... 
> >>> Frankly, I'm not prepared to add more hacks to KVM for the sake of the >>> combination of broken userspace and broken guest. >> >> The only counter argument I have in this regard is that some change is >> also needed to expose all CPUs to Windows guest even when the >> userspace does its best. It may result in odd scheduling, but still >> gives the best throughput. > > But that'd be a new ABI, which again would require buy-in from > userspace. Maybe there is scope for an all CPUs, cycle-counter only > PMUv3 exposed to the guest, but that cannot be set automatically, as > we would otherwise regress existing setups. > > At this stage, and given that you need to change userspace, I'm not > sure what the best course of action is. Having an explicit flag for userspace is fine for QEMU, which is what I care about. It can flip the flag if and only if threads are not pinned to one PMU and the machine is a new setup. I also wonder what regression you think setting it automatically causes. Regards, Akihiko Odaki > > Thanks, > > M. >
On Wed, 19 Mar 2025 11:26:18 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/19 20:07, Marc Zyngier wrote: > > On Wed, 19 Mar 2025 10:26:57 +0000, > >> > > But that'd be a new ABI, which again would require buy-in from > > userspace. Maybe there is scope for an all CPUs, cycle-counter only > > PMUv3 exposed to the guest, but that cannot be set automatically, as > > we would otherwise regress existing setups. > > > > At this stage, and given that you need to change userspace, I'm not > > sure what the best course of action is. > > Having an explicit flag for the userspace is fine for QEMU, which I > care. It can flip the flag if and only if threads are not pinned to > one PMU and the machine is a new setup. > > I also wonder what regression you think setting it automatically causes. The current behaviour is that if you don't specify anything other than creating a PMUv3 (without KVM_ARM_VCPU_PMU_V3_SET_PMU), you get *some* PMU, and userspace is responsible for running the vcpu on CPUs that will implement that PMU. When it does, all the counters, all the events are valid. If it doesn't, nothing counts, but the counters/events are still valid. If you now add this flag automatically, the guest doesn't see the full PMU anymore. Only the cycle counter. That's the regression. Thanks, M.
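For reference, a rough sketch of the explicit path described here: read the chosen host PMU's perf type from sysfs and hand it to KVM through the KVM_ARM_VCPU_PMU_V3_CTRL / KVM_ARM_VCPU_PMU_V3_SET_PMU vCPU device attribute. The PMU name and the error handling are simplified assumptions; a real VMM would pick the PMU matching its intended CPU affinity.

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Tell KVM which host PMU should back this vCPU's PMUv3 (arm64 only). */
static int set_vcpu_pmu(int vcpu_fd, const char *pmu_name)
{
	char path[128];
	int pmu_type;
	FILE *f;

	/* The identifier KVM expects is the PMU's perf 'type' from sysfs. */
	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/type", pmu_name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &pmu_type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
		.addr  = (__u64)(unsigned long)&pmu_type,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}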
On 2025/03/19 20:41, Marc Zyngier wrote: > On Wed, 19 Mar 2025 11:26:18 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2025/03/19 20:07, Marc Zyngier wrote: >>> On Wed, 19 Mar 2025 10:26:57 +0000, >>>> >>> But that'd be a new ABI, which again would require buy-in from >>> userspace. Maybe there is scope for an all CPUs, cycle-counter only >>> PMUv3 exposed to the guest, but that cannot be set automatically, as >>> we would otherwise regress existing setups. >>> >>> At this stage, and given that you need to change userspace, I'm not >>> sure what the best course of action is. >> >> Having an explicit flag for the userspace is fine for QEMU, which I >> care. It can flip the flag if and only if threads are not pinned to >> one PMU and the machine is a new setup. >> >> I also wonder what regression you think setting it automatically causes. > > The current behaviour is that if you don't specify anything other than > creating a PMUv3 (without KVM_ARM_VCPU_PMU_V3_SET_PMU), you get *some* > PMU, and userspace is responsible for running the vcpu on CPUs that > will implement that PMU. When if does, all the counters, all the > events are valid. If it doesn't, nothing counts, but the > counters/events are still valid. > > If you now add this flag automatically, the guest doesn't see the full > PMU anymore. Only the cycle counter. That's the regression. What about setting the flag automatically when a user fails to pin vCPUs to CPUs that are covered by one PMU? There would be no change if a user correctly pins vCPUs as it is. Otherwise, they will see a correct feature set advertised to the guest and the cycle counter working. Regards, Akihiko Odaki > > Thanks, > > M. >
On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/19 20:41, Marc Zyngier wrote: > > On Wed, 19 Mar 2025 11:26:18 +0000, > > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2025/03/19 20:07, Marc Zyngier wrote: > >>> On Wed, 19 Mar 2025 10:26:57 +0000, > >>>> > >>> But that'd be a new ABI, which again would require buy-in from > >>> userspace. Maybe there is scope for an all CPUs, cycle-counter only > >>> PMUv3 exposed to the guest, but that cannot be set automatically, as > >>> we would otherwise regress existing setups. > >>> > >>> At this stage, and given that you need to change userspace, I'm not > >>> sure what the best course of action is. > >> > >> Having an explicit flag for the userspace is fine for QEMU, which I > >> care. It can flip the flag if and only if threads are not pinned to > >> one PMU and the machine is a new setup. > >> > >> I also wonder what regression you think setting it automatically causes. > > > > The current behaviour is that if you don't specify anything other than > > creating a PMUv3 (without KVM_ARM_VCPU_PMU_V3_SET_PMU), you get *some* > > PMU, and userspace is responsible for running the vcpu on CPUs that > > will implement that PMU. When if does, all the counters, all the > > events are valid. If it doesn't, nothing counts, but the > > counters/events are still valid. > > > > If you now add this flag automatically, the guest doesn't see the full > > PMU anymore. Only the cycle counter. That's the regression. > > What about setting the flag automatically when a user fails to pin > vCPUs to CPUs that are covered by one PMU? There would be no change if > a user correctly pins vCPUs as it is. Otherwise, they will see a > correct feature set advertised to the guest and the cycle counter > working. How do you know that the affinity is "correct"? VCPU affinity can be changed at any time. I, for one, do not want my VMs to change behaviour because I let the vcpus bounce around as the scheduler sees fit. Honestly, this is not a can of worm I want to open. We already have a pretty terrible userspace API for the PMU, let's not add to the confusion. *If* we are going down the road of presenting a dumbed-down PMU to the guest, it has to be an explicit buy-in from userspace. M.
On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: > On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > What about setting the flag automatically when a user fails to pin > > vCPUs to CPUs that are covered by one PMU? There would be no change if > > a user correctly pins vCPUs as it is. Otherwise, they will see a > > correct feature set advertised to the guest and the cycle counter > > working. > > How do you know that the affinity is "correct"? VCPU affinity can be > changed at any time. I, for one, do not want my VMs to change > behaviour because I let the vcpus bounce around as the scheduler sees > fit. > > Honestly, this is not a can of worm I want to open. We already have a > pretty terrible userspace API for the PMU, let's not add to the > confusion. *If* we are going down the road of presenting a dumbed-down > PMU to the guest, it has to be an explicit buy-in from userspace. I also have a very strong distaste for the crappy UAPI we have around a 'default' PMU. At the same time I do recognize this hurts practical usecases since some VMMs can't be bothered to configure the vPMU / vCPU pinning correctly. I'm at least willing to plug my nose and do the following: 1) When the VMM does not specify a vPMU type: - We continue to present the 'default' PMU (including event counters) to the VM - KVM ensures that the fixed CPU cycle counter works on any PMUv3 implementation in the system, even if it is different from the default - Otherwise, event counters will only count on the default implementation and will not count on different PMUs 2) Implement your suggestion of a UAPI where the VMM can select a PMU that only has the CPU cycle counter and works on any PMUv3 implementation. Either way KVM will need to have some special case handling of the fixed CPU cycle counter. That'd allow users to actually run Windows *now* and provide a clear mechanism for userspace to present a less-broken vPMU if it cares. Thanks, Oliver
On 2025/03/20 3:51, Oliver Upton wrote: > On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: >> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>> What about setting the flag automatically when a user fails to pin >>> vCPUs to CPUs that are covered by one PMU? There would be no change if >>> a user correctly pins vCPUs as it is. Otherwise, they will see a >>> correct feature set advertised to the guest and the cycle counter >>> working. >> >> How do you know that the affinity is "correct"? VCPU affinity can be >> changed at any time. I, for one, do not want my VMs to change >> behaviour because I let the vcpus bounce around as the scheduler sees >> fit. Checking the affinity when picking the default PMU; the vCPU affinity is the only thing that rules the choice of the default PMU even now. Perhaps we may model the API as follows: introduce another "composite" PMU that works on any core but only exposes the cycle counter. Robust VMMs will choose it or one of hardware PMUs with KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If the affinity is covered by one hardware PMU, that PMU will be chosen as the default. The "composite" PMU will be the default otherwise. Regards, Akihiko Odaki >> >> Honestly, this is not a can of worm I want to open. We already have a >> pretty terrible userspace API for the PMU, let's not add to the >> confusion. *If* we are going down the road of presenting a dumbed-down >> PMU to the guest, it has to be an explicit buy-in from userspace. > > I also have a very strong distaste for the crappy UAPI we have around a > 'default' PMU. At the same time I do recognize this hurts practical > usecases since some VMMs can't be bothered to configure the vPMU / vCPU > pinning correctly. > > I'm at least willing to plug my nose and do the following: > > 1) When the VMM does not specify a vPMU type: > > - We continue to present the 'default' PMU (including event counters) > to the VM > > - KVM ensures that the fixed CPU cycle counter works on any PMUv3 > implementation in the system, even if it is different from the > default > > - Otherwise, event counters will only count on the default > implementation and will not count on different PMUs > > 2) Implement your suggestion of a UAPI where the VMM can select a PMU > that only has the CPU cycle counter and works on any PMUv3 > implementation. > > Either way KVM will need to have some special case handling of the fixed > CPU cycle counter. That'd allow users to actually run Windows *now* and > provide a clear mechanism for userspace to present a less-broken vPMU if > it cares. > > Thanks, > Oliver
On Thu, 20 Mar 2025 06:03:35 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/20 3:51, Oliver Upton wrote: > > On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: > >> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>> What about setting the flag automatically when a user fails to pin > >>> vCPUs to CPUs that are covered by one PMU? There would be no change if > >>> a user correctly pins vCPUs as it is. Otherwise, they will see a > >>> correct feature set advertised to the guest and the cycle counter > >>> working. > >> > >> How do you know that the affinity is "correct"? VCPU affinity can be > >> changed at any time. I, for one, do not want my VMs to change > >> behaviour because I let the vcpus bounce around as the scheduler sees > >> fit. > > Checking the affinity when picking the default PMU; the vCPU affinity > is the only thing that rules the choice of the default PMU even now. > > Perhaps we may model the API as follows: introduce another "composite" > PMU that works on any core but only exposes the cycle counter. Robust > VMMs will choose it or one of hardware PMUs with > KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according > to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If > the affinity is covered by one hardware PMU, that PMU will be chosen > as the default. The "composite" PMU will be the default otherwise. This makes no sense to me. A VCPU is always affine to a PMU, because we do not support configurations where only some CPUs have a PMU. This is an all-or-nothing situation. More importantly, you keep suggesting the same "new default", and I keep saying NO. My position is clear: if you want a *new* behaviour, you *must* add a new flag that the VMM explicitly provides to enable this CC-only PMU. No change in default behaviour at all. I'm not going to move from that. M.
On Wed, 19 Mar 2025 18:51:28 +0000, Oliver Upton <oliver.upton@linux.dev> wrote: > > On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: > > On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > What about setting the flag automatically when a user fails to pin > > > vCPUs to CPUs that are covered by one PMU? There would be no change if > > > a user correctly pins vCPUs as it is. Otherwise, they will see a > > > correct feature set advertised to the guest and the cycle counter > > > working. > > > > How do you know that the affinity is "correct"? VCPU affinity can be > > changed at any time. I, for one, do not want my VMs to change > > behaviour because I let the vcpus bounce around as the scheduler sees > > fit. > > > > Honestly, this is not a can of worm I want to open. We already have a > > pretty terrible userspace API for the PMU, let's not add to the > > confusion. *If* we are going down the road of presenting a dumbed-down > > PMU to the guest, it has to be an explicit buy-in from userspace. > > I also have a very strong distaste for the crappy UAPI we have around a > 'default' PMU. At the same time I do recognize this hurts practical > usecases since some VMMs can't be bothered to configure the vPMU / vCPU > pinning correctly. > > I'm at least willing to plug my nose and do the following: > > 1) When the VMM does not specify a vPMU type: > > - We continue to present the 'default' PMU (including event counters) > to the VM > > - KVM ensures that the fixed CPU cycle counter works on any PMUv3 > implementation in the system, even if it is different from the > default > > - Otherwise, event counters will only count on the default > implementation and will not count on different PMUs I think this is confusing. The CC is counting, but nothing else, and people using the cycle counters in conjunction with other events (a very common use case) will not be able to correlate things correctly. The current behaviour is, for all its sins, at least consistent. > > 2) Implement your suggestion of a UAPI where the VMM can select a PMU > that only has the CPU cycle counter and works on any PMUv3 > implementation. > > Either way KVM will need to have some special case handling of the fixed > CPU cycle counter. That'd allow users to actually run Windows *now* and > provide a clear mechanism for userspace to present a less-broken vPMU if > it cares. Honestly, I don't care about one guest or another. My point is that if we are changing the behaviour of the PMU to deal with this sort of things, then it has to be a userspace buy-in. M.
On 2025/03/20 18:10, Marc Zyngier wrote: > On Thu, 20 Mar 2025 06:03:35 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2025/03/20 3:51, Oliver Upton wrote: >>> On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: >>>> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>> What about setting the flag automatically when a user fails to pin >>>>> vCPUs to CPUs that are covered by one PMU? There would be no change if >>>>> a user correctly pins vCPUs as it is. Otherwise, they will see a >>>>> correct feature set advertised to the guest and the cycle counter >>>>> working. >>>> >>>> How do you know that the affinity is "correct"? VCPU affinity can be >>>> changed at any time. I, for one, do not want my VMs to change >>>> behaviour because I let the vcpus bounce around as the scheduler sees >>>> fit. >> >> Checking the affinity when picking the default PMU; the vCPU affinity >> is the only thing that rules the choice of the default PMU even now. >> >> Perhaps we may model the API as follows: introduce another "composite" >> PMU that works on any core but only exposes the cycle counter. Robust >> VMMs will choose it or one of hardware PMUs with >> KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according >> to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If >> the affinity is covered by one hardware PMU, that PMU will be chosen >> as the default. The "composite" PMU will be the default otherwise. > > This makes no sense to me. A VCPU is always affine to a PMU, because > we do not support configurations where only some CPUs have a PMU. This > is an all-or-nothing situation. At least isn't it fine to have the composite PMU with a new value for KVM_ARM_VCPU_PMU_V3_SET_PMU? > > More importantly, you keep suggesting the same "new default", and I > keep saying NO. > > My position is clear: if you want a *new* behaviour, you *must* add a > new flag that the VMM explicitly provides to enable this CC-only PMU. > No change in default behaviour at all. > > I'm not going to move from that. Why not? It will not break anything guaranteed to work in the past. Currently KVM only guarantees that the emulated PMU correctly counts only when 1) the vCPU affinity is contained by one PMU and 2) it will not expand Breaking these conditions will make the behavior of the emulated PMU undefined. Now I'm proposing to remove 1). Regards, Akihiko Odaki > > M. >
On Thu, 20 Mar 2025 09:52:59 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/20 18:10, Marc Zyngier wrote: > > On Thu, 20 Mar 2025 06:03:35 +0000, > > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2025/03/20 3:51, Oliver Upton wrote: > >>> On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: > >>>> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>> What about setting the flag automatically when a user fails to pin > >>>>> vCPUs to CPUs that are covered by one PMU? There would be no change if > >>>>> a user correctly pins vCPUs as it is. Otherwise, they will see a > >>>>> correct feature set advertised to the guest and the cycle counter > >>>>> working. > >>>> > >>>> How do you know that the affinity is "correct"? VCPU affinity can be > >>>> changed at any time. I, for one, do not want my VMs to change > >>>> behaviour because I let the vcpus bounce around as the scheduler sees > >>>> fit. > >> > >> Checking the affinity when picking the default PMU; the vCPU affinity > >> is the only thing that rules the choice of the default PMU even now. > >> > >> Perhaps we may model the API as follows: introduce another "composite" > >> PMU that works on any core but only exposes the cycle counter. Robust > >> VMMs will choose it or one of hardware PMUs with > >> KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according > >> to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If > >> the affinity is covered by one hardware PMU, that PMU will be chosen > >> as the default. The "composite" PMU will be the default otherwise. > > > > This makes no sense to me. A VCPU is always affine to a PMU, because > > we do not support configurations where only some CPUs have a PMU. This > > is an all-or-nothing situation. > > At least isn't it fine to have the composite PMU with a new value for > KVM_ARM_VCPU_PMU_V3_SET_PMU? Not sure KVM_ARM_VCPU_PMU_V3_SET_PMU is the right hook (it takes a PMU 'type', which is under control of the perf subsystem). But if we can find a value that is guaranteed to be unique, why not. > > More importantly, you keep suggesting the same "new default", and I > > keep saying NO. > > > > My position is clear: if you want a *new* behaviour, you *must* add a > > new flag that the VMM explicitly provides to enable this CC-only PMU. > > No change in default behaviour at all. > > > > I'm not going to move from that. > > Why not? It will not break anything guaranteed to work in the past. It doesn't have to be guaranteed. It just has to *exist*. That's the Linux ABI for you. > Currently KVM only guarantees that the emulated PMU correctly counts > only when > 1) the vCPU affinity is contained by one PMU and > 2) it will not expand > > Breaking these conditions will make the behavior of the emulated PMU > undefined. Now I'm proposing to remove 1). And I'm saying no to that. I'm also getting tired of arguing the same point on and on. We currently have two different behaviours: - either you explicitly set a PMU, and the affinity of this PMU constraints the affinity of the vcpus. The vcpus are not allowed to run outside of this affinity. Everything counts all the time. - or you don't explicitly set a PMU, and a default PMU will be picked from the current affinity of the first vcpu. Your vcpus will be able to run anywhere, but the virtual PMU will *only* count when the vcpus are affine to the default PMU. When the vcpus are not affine to default PMU, *nothing* counts. 
These two behaviours are ABI. They are not changing. They don't get relaxed, they don't get tightened, they stay just as they are, forever. You want a *third* behaviour, go ahead. Define it the way you want. But the behaviours described above will stay unchanged. I'm looking forward to your patches implementing it, but I am also done arguing on it. M.
On Thu, Mar 20, 2025 at 09:19:02AM +0000, Marc Zyngier wrote: > On Wed, 19 Mar 2025 18:51:28 +0000, Oliver Upton <oliver.upton@linux.dev> wrote: > > I'm at least willing to plug my nose and do the following: > > > > 1) When the VMM does not specify a vPMU type: > > > > - We continue to present the 'default' PMU (including event counters) > > to the VM > > > > - KVM ensures that the fixed CPU cycle counter works on any PMUv3 > > implementation in the system, even if it is different from the > > default > > > > - Otherwise, event counters will only count on the default > > implementation and will not count on different PMUs > > I think this is confusing. The CC is counting, but nothing else, and > people using the cycle counters in conjunction with other events (a > very common use case) will not be able to correlate things correctly. > The current behaviour is, for all its sins, at least consistent. You of course have a good point. What Windows is doing is definitely an outlier. > > > > 2) Implement your suggestion of a UAPI where the VMM can select a PMU > > that only has the CPU cycle counter and works on any PMUv3 > > implementation. > > > > Either way KVM will need to have some special case handling of the fixed > > CPU cycle counter. That'd allow users to actually run Windows *now* and > > provide a clear mechanism for userspace to present a less-broken vPMU if > > it cares. > > Honestly, I don't care about one guest or another. My point is that if > we are changing the behaviour of the PMU to deal with this sort of > things, then it has to be a userspace buy-in. I'm fine with just the user buy-in then. But I still do care about the guest compatibility issue, especially since the end user of all this crap is unlikely to know/care about the fine details of the implementation. So, Akihiko, I would *greatly* appreciate it if you propose a complete solution to the problem, including the KVM and VMM patches to make it all work. Thanks, Oliver
On 2025/03/21 2:14, Marc Zyngier wrote: > On Thu, 20 Mar 2025 09:52:59 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2025/03/20 18:10, Marc Zyngier wrote: >>> On Thu, 20 Mar 2025 06:03:35 +0000, >>> Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2025/03/20 3:51, Oliver Upton wrote: >>>>> On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: >>>>>> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>> What about setting the flag automatically when a user fails to pin >>>>>>> vCPUs to CPUs that are covered by one PMU? There would be no change if >>>>>>> a user correctly pins vCPUs as it is. Otherwise, they will see a >>>>>>> correct feature set advertised to the guest and the cycle counter >>>>>>> working. >>>>>> >>>>>> How do you know that the affinity is "correct"? VCPU affinity can be >>>>>> changed at any time. I, for one, do not want my VMs to change >>>>>> behaviour because I let the vcpus bounce around as the scheduler sees >>>>>> fit. >>>> >>>> Checking the affinity when picking the default PMU; the vCPU affinity >>>> is the only thing that rules the choice of the default PMU even now. >>>> >>>> Perhaps we may model the API as follows: introduce another "composite" >>>> PMU that works on any core but only exposes the cycle counter. Robust >>>> VMMs will choose it or one of hardware PMUs with >>>> KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according >>>> to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If >>>> the affinity is covered by one hardware PMU, that PMU will be chosen >>>> as the default. The "composite" PMU will be the default otherwise. >>> >>> This makes no sense to me. A VCPU is always affine to a PMU, because >>> we do not support configurations where only some CPUs have a PMU. This >>> is an all-or-nothing situation. >> >> At least isn't it fine to have the composite PMU with a new value for >> KVM_ARM_VCPU_PMU_V3_SET_PMU? > > Not sure KVM_ARM_VCPU_PMU_V3_SET_PMU is the right hook (it takes a PMU > 'type', which is under control of the perf subsystem). But if we can > find a value that is guaranteed to be unique, why not. > >>> More importantly, you keep suggesting the same "new default", and I >>> keep saying NO. >>> >>> My position is clear: if you want a *new* behaviour, you *must* add a >>> new flag that the VMM explicitly provides to enable this CC-only PMU. >>> No change in default behaviour at all. >>> >>> I'm not going to move from that. >> >> Why not? It will not break anything guaranteed to work in the past. > > It doesn't have to be guaranteed. It just has to *exist*. That's the > Linux ABI for you. > >> Currently KVM only guarantees that the emulated PMU correctly counts >> only when >> 1) the vCPU affinity is contained by one PMU and >> 2) it will not expand >> >> Breaking these conditions will make the behavior of the emulated PMU >> undefined. Now I'm proposing to remove 1). > > And I'm saying no to that. I'm also getting tired of arguing the same > point on and on. > > We currently have two different behaviours: > > - either you explicitly set a PMU, and the affinity of this PMU > constraints the affinity of the vcpus. The vcpus are not allowed to > run outside of this affinity. Everything counts all the time. > > - or you don't explicitly set a PMU, and a default PMU will be picked > from the current affinity of the first vcpu. 
Your vcpus will be able > to run anywhere, but the virtual PMU will *only* count when the > vcpus are affine to the default PMU. When the vcpus are not affine > to default PMU, *nothing* counts. > > These two behaviours are ABI. They are not changing. They don't get > relaxed, they don't get tightened, they stay just as they are, > forever. Is the latter one really ABI? I see it as part of behaviors that are undefined and not included in the ABI, for the following reasons: 1) It depends on the scheduler behavior, which cannot be ABI. 2) It provides a broken PMU, so the proposed behavioral change is similar to a bug fix, though I call it an undefined behavior instead of a bug since it is explicitly stated that there is no assurance that the PMU works in such a scenario. 3) Userspace could not have relied on it, so the "no regressions" rule cannot be applied here; how can anyone have userspace that relies on a kernel behavior that depends on scheduling? But regarding 3), Oliver raised a concern about guest compatibility, so I'd like to hear an explanation of that concern. > > You want a *third* behaviour, go ahead. Define it the way you want. > But the behaviours described above will stay unchanged. > > I'm looking forward to your patches implementing it, but I am also > done arguing on it. I understand the discussion is tiring, but I want to know the reasoning behind such a design decision before sending an RFC patch to a VMM (QEMU) so that I can explain to them why it is necessary in turn. Regards, Akihiko Odaki > > M. >
On Fri, 21 Mar 2025 06:20:49 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/21 2:14, Marc Zyngier wrote: > > On Thu, 20 Mar 2025 09:52:59 +0000, > > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2025/03/20 18:10, Marc Zyngier wrote: > >>> On Thu, 20 Mar 2025 06:03:35 +0000, > >>> Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2025/03/20 3:51, Oliver Upton wrote: > >>>>> On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote: > >>>>>> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>> What about setting the flag automatically when a user fails to pin > >>>>>>> vCPUs to CPUs that are covered by one PMU? There would be no change if > >>>>>>> a user correctly pins vCPUs as it is. Otherwise, they will see a > >>>>>>> correct feature set advertised to the guest and the cycle counter > >>>>>>> working. > >>>>>> > >>>>>> How do you know that the affinity is "correct"? VCPU affinity can be > >>>>>> changed at any time. I, for one, do not want my VMs to change > >>>>>> behaviour because I let the vcpus bounce around as the scheduler sees > >>>>>> fit. > >>>> > >>>> Checking the affinity when picking the default PMU; the vCPU affinity > >>>> is the only thing that rules the choice of the default PMU even now. > >>>> > >>>> Perhaps we may model the API as follows: introduce another "composite" > >>>> PMU that works on any core but only exposes the cycle counter. Robust > >>>> VMMs will choose it or one of hardware PMUs with > >>>> KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according > >>>> to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If > >>>> the affinity is covered by one hardware PMU, that PMU will be chosen > >>>> as the default. The "composite" PMU will be the default otherwise. > >>> > >>> This makes no sense to me. A VCPU is always affine to a PMU, because > >>> we do not support configurations where only some CPUs have a PMU. This > >>> is an all-or-nothing situation. > >> > >> At least isn't it fine to have the composite PMU with a new value for > >> KVM_ARM_VCPU_PMU_V3_SET_PMU? > > > > Not sure KVM_ARM_VCPU_PMU_V3_SET_PMU is the right hook (it takes a PMU > > 'type', which is under control of the perf subsystem). But if we can > > find a value that is guaranteed to be unique, why not. > > > >>> More importantly, you keep suggesting the same "new default", and I > >>> keep saying NO. > >>> > >>> My position is clear: if you want a *new* behaviour, you *must* add a > >>> new flag that the VMM explicitly provides to enable this CC-only PMU. > >>> No change in default behaviour at all. > >>> > >>> I'm not going to move from that. > >> > >> Why not? It will not break anything guaranteed to work in the past. > > > > It doesn't have to be guaranteed. It just has to *exist*. That's the > > Linux ABI for you. > > > >> Currently KVM only guarantees that the emulated PMU correctly counts > >> only when > >> 1) the vCPU affinity is contained by one PMU and > >> 2) it will not expand > >> > >> Breaking these conditions will make the behavior of the emulated PMU > >> undefined. Now I'm proposing to remove 1). > > > > And I'm saying no to that. I'm also getting tired of arguing the same > > point on and on. > > > > We currently have two different behaviours: > > > > - either you explicitly set a PMU, and the affinity of this PMU > > constraints the affinity of the vcpus. The vcpus are not allowed to > > run outside of this affinity. Everything counts all the time. 
> >
> > - or you don't explicitly set a PMU, and a default PMU will be picked
> > from the current affinity of the first vcpu. Your vcpus will be able
> > to run anywhere, but the virtual PMU will *only* count when the
> > vcpus are affine to the default PMU. When the vcpus are not affine
> > to the default PMU, *nothing* counts.
> >
> > These two behaviours are ABI. They are not changing. They don't get
> > relaxed, they don't get tightened, they stay just as they are,
> > forever.
>
> Is the latter one really ABI? I see it as part of behaviors that
> are undefined and not included in the ABI, for the following reasons:
>
> 1) It depends on the scheduler behavior, which cannot be ABI.

You're wrong. The task affinity is under complete control of userspace. sched_setaffinity(2) is your friend.

> 2) It provides a broken PMU, so the proposed behavioral change is
> similar to a bug fix, though I call it undefined behavior instead of
> a bug since it is explicitly stated that there is no assurance that
> the PMU works in such a scenario.

Call it undefined behaviour if you want, I call it ABI. This is how the PMU has been handled on KVM/arm64 for the past 10 years, and the fact that you (and I) don't like it doesn't make it less of an existing behaviour that must be preserved.

> 3) Userspace could not have relied on it, so the "no regressions"
> rule cannot be applied here; how can anyone have a userspace that
> relies on a kernel behavior that depends on scheduling?

Of course it has. Read the above.

Even more, I contend that your approach is just wrong. You want to make random events count at random times, which is even worse. We can at least pretend that the PMU hasn't recorded anything (which is true) rather than saying "oh look, the cycle counter has ticked but we didn't execute any instruction -- WTF were we doing?". That approach is far worse than what we have today.

Let me give you an example:

- I have an asymmetric system with two types of CPUs (any odd big-little nonsense).

- I statically place my vcpus on physical CPUs so that, using my knowledge of the "default PMU" behaviour, only some vcpus effectively generate events.

- I perform system-wide profiling inside the guest with the full knowledge that only the vcpus I'm interested in will affect the event counts. This is deterministic.

- I can move vcpus around as I see fit to capture new counter sets with the same guarantees.

Your "count random stuff at random times on random CPUs" breaks this.

Is this a contrived example? Of course it is. Do we know for sure that nobody in their right mind would do this? No, because we don't have a mandate to decide what people can or cannot do with a behaviour that has been established for that long.

> But for 3), Oliver raised a concern about guest compatibility, so I'd
> like to hear an explanation for that concern.

I've given it above, and in my reply to Oliver as well.

> > You want a *third* behaviour, go ahead. Define it the way you want.
> > But the behaviours described above will stay unchanged.
> > I'm looking forward to your patches implementing it, but I am also
> > done arguing on it.
>
> I understand the discussion is tiring, but I want to know the reasoning
> behind such a design decision before sending an RFC patch to a VMM
> (QEMU) so that I can, in turn, explain to them why it is necessary.

But that's *again* where you got things completely wrong:

<allcaps>
This is not a design decision, this is how things have always worked and we cannot break previous behaviours.
This is unfortunate, but that's how the Linux kernel works.
</allcaps>

The only "design decision" is that we don't change established behaviours, explicit or implicit.

If you want to implement something that matches this requirement, go ahead and post patches. I'll gladly review them. If you cannot abide by this rule, then please stop wasting your time (and mine).

Feel free to relay this to the QEMU folks who, in my experience, very much value the notion of ABI and the preservation of existing behaviours.

Now, that's my last email on this thread. Do not bother replying.

M. (pretty annoyed)
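For reference, a minimal userspace sketch of the two knobs the thread keeps coming back to: pinning a vCPU thread with sched_setaffinity(2), and explicitly binding the virtual PMU to one host PMU with KVM_ARM_VCPU_PMU_V3_SET_PMU. The vcpu_fd, pmu_cpus, and pmu_id values are assumed to be supplied by the VMM (pmu_id being the perf "type" read from /sys/bus/event_source/devices/<pmu>/type); helper names are illustrative and error handling is omitted.

/* Hypothetical VMM helpers; not part of the patch under discussion. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Pin the calling vCPU thread to the cores covered by one host PMU. */
static int pin_to_pmu_cpus(const cpu_set_t *pmu_cpus)
{
	return sched_setaffinity(0 /* current thread */, sizeof(*pmu_cpus), pmu_cpus);
}

/* Explicitly select the host PMU backing this vCPU's virtual PMU. */
static int set_vcpu_pmu(int vcpu_fd, int pmu_id)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
		.addr  = (uint64_t)(uintptr_t)&pmu_id,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

On a heterogeneous system, pin_to_pmu_cpus() would be called on each vCPU thread before KVM_ARM_VCPU_INIT so the default PMU is picked from the intended cores, or set_vcpu_pmu() used after initialization to name the host PMU explicitly.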
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index caa1357fa367..06292a4ccdc8 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -449,8 +449,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) /* Set up the timer */ kvm_timer_vcpu_init(vcpu); - kvm_pmu_vcpu_init(vcpu); - kvm_arm_pvtime_vcpu_init(&vcpu->arch); vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu; diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index aae5713d8993..631752f5327f 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -22,16 +22,16 @@ DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available); static LIST_HEAD(arm_pmus); static DEFINE_MUTEX(arm_pmus_lock); -static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc); -static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc); -static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc); +static void kvm_pmu_create_pmc(struct kvm_vcpu *vcpu, u8 idx); +static void kvm_pmu_release_pmc(struct kvm_vcpu *vcpu, u8 idx); +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u8 idx); -static struct kvm_vcpu *kvm_pmc_to_vcpu(const struct kvm_pmc *pmc) +static struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc * const *pmc) { - return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[pmc->idx]); + return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[(*pmc)->idx]); } -static struct kvm_pmc *kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx) +static struct kvm_pmc **kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx) { return &vcpu->arch.pmu.pmc[cnt_idx]; } @@ -80,30 +80,27 @@ u64 kvm_pmu_evtyper_mask(struct kvm *kvm) * kvm_pmc_is_64bit - determine if counter is 64bit * @pmc: counter context */ -static bool kvm_pmc_is_64bit(struct kvm_pmc *pmc) +static bool kvm_pmc_is_64bit(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - - return (pmc->idx == ARMV8_PMU_CYCLE_IDX || + return (idx == ARMV8_PMU_CYCLE_IDX || kvm_has_feat(vcpu->kvm, ID_AA64DFR0_EL1, PMUVer, V3P5)); } -static bool kvm_pmc_has_64bit_overflow(struct kvm_pmc *pmc) +static bool kvm_pmc_has_64bit_overflow(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 val = kvm_vcpu_read_pmcr(vcpu); - if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) + if (kvm_pmu_counter_is_hyp(vcpu, idx)) return __vcpu_sys_reg(vcpu, MDCR_EL2) & MDCR_EL2_HLP; - return (pmc->idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || - (pmc->idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); + return (idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || + (idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); } -static bool kvm_pmu_counter_can_chain(struct kvm_pmc *pmc) +static bool kvm_pmu_counter_can_chain(struct kvm_vcpu *vcpu, u8 idx) { - return (!(pmc->idx & 1) && (pmc->idx + 1) < ARMV8_PMU_CYCLE_IDX && - !kvm_pmc_has_64bit_overflow(pmc)); + return (!(idx & 1) && (idx + 1) < ARMV8_PMU_CYCLE_IDX && + !kvm_pmc_has_64bit_overflow(vcpu, idx)); } static u32 counter_index_to_reg(u64 idx) @@ -116,28 +113,30 @@ static u32 counter_index_to_evtreg(u64 idx) return (idx == ARMV8_PMU_CYCLE_IDX) ? 
PMCCFILTR_EL0 : PMEVTYPER0_EL0 + idx; } -static u64 kvm_pmc_read_evtreg(const struct kvm_pmc *pmc) +static u64 kvm_pmc_read_evtreg(const struct kvm_vcpu *vcpu, u8 idx) { - return __vcpu_sys_reg(kvm_pmc_to_vcpu(pmc), counter_index_to_evtreg(pmc->idx)); + return __vcpu_sys_reg(vcpu, counter_index_to_evtreg(idx)); } -static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) +static u64 kvm_pmu_get_pmc_value(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); u64 counter, reg, enabled, running; + unsigned int i; - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); counter = __vcpu_sys_reg(vcpu, reg); /* * The real counter value is equal to the value of counter register plus * the value perf event counts. */ - if (pmc->perf_event) - counter += perf_event_read_value(pmc->perf_event, &enabled, - &running); + if (pmc) + for (i = 0; i < pmc->nr_perf_events; i++) + counter += perf_event_read_value(pmc->perf_events[i], + &enabled, &running); - if (!kvm_pmc_is_64bit(pmc)) + if (!kvm_pmc_is_64bit(vcpu, idx)) counter = lower_32_bits(counter); return counter; @@ -150,20 +149,18 @@ static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) */ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx) { - return kvm_pmu_get_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx)); + return kvm_pmu_get_pmc_value(vcpu, select_idx); } -static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) +static void kvm_pmu_set_pmc_value(struct kvm_vcpu *vcpu, u8 idx, u64 val, bool force) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 reg; - kvm_pmu_release_perf_event(pmc); + kvm_pmu_release_pmc(vcpu, idx); - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); - if (vcpu_mode_is_32bit(vcpu) && pmc->idx != ARMV8_PMU_CYCLE_IDX && - !force) { + if (vcpu_mode_is_32bit(vcpu) && idx != ARMV8_PMU_CYCLE_IDX && !force) { /* * Even with PMUv3p5, AArch32 cannot write to the top * 32bit of the counters. 
The only possible course of @@ -176,8 +173,8 @@ static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) __vcpu_sys_reg(vcpu, reg) = val; - /* Recreate the perf event to reflect the updated sample_period */ - kvm_pmu_create_perf_event(pmc); + /* Recreate the pmc to reflect the updated sample_period */ + kvm_pmu_create_pmc(vcpu, idx); } /** @@ -188,7 +185,7 @@ static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) */ void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) { - kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx), val, false); + kvm_pmu_set_pmc_value(vcpu, select_idx, val, false); } /** @@ -199,21 +196,28 @@ void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) */ void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) { - kvm_pmu_release_perf_event(kvm_vcpu_idx_to_pmc(vcpu, select_idx)); + kvm_pmu_release_pmc(vcpu, select_idx); __vcpu_sys_reg(vcpu, counter_index_to_reg(select_idx)) = val; kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu); } /** - * kvm_pmu_release_perf_event - remove the perf event + * kvm_pmu_release_pmc - remove the pmc * @pmc: The PMU counter pointer */ -static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) +static void kvm_pmu_release_pmc(struct kvm_vcpu *vcpu, u8 idx) { - if (pmc->perf_event) { - perf_event_disable(pmc->perf_event); - perf_event_release_kernel(pmc->perf_event); - pmc->perf_event = NULL; + struct kvm_pmc **pmc = kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (*pmc) { + for (i = 0; i < (*pmc)->nr_perf_events; i++) { + perf_event_disable((*pmc)->perf_events[i]); + perf_event_release_kernel((*pmc)->perf_events[i]); + } + + kfree(*pmc); + *pmc = NULL; } } @@ -223,35 +227,17 @@ static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) * * If this counter has been configured to monitor some event, release it here. 
*/ -static void kvm_pmu_stop_counter(struct kvm_pmc *pmc) +static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 reg, val; - if (!pmc->perf_event) - return; - - val = kvm_pmu_get_pmc_value(pmc); + val = kvm_pmu_get_pmc_value(vcpu, idx); - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); __vcpu_sys_reg(vcpu, reg) = val; - kvm_pmu_release_perf_event(pmc); -} - -/** - * kvm_pmu_vcpu_init - assign pmu counter idx for cpu - * @vcpu: The vcpu pointer - * - */ -void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) -{ - int i; - struct kvm_pmu *pmu = &vcpu->arch.pmu; - - for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) - pmu->pmc[i].idx = i; + kvm_pmu_release_pmc(vcpu, idx); } /** @@ -264,7 +250,7 @@ void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) int i; for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) - kvm_pmu_release_perf_event(kvm_vcpu_idx_to_pmc(vcpu, i)); + kvm_pmu_release_pmc(vcpu, i); irq_work_sync(&vcpu->arch.pmu.overflow_work); } @@ -321,22 +307,31 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) return GENMASK(val - 1, 0) | BIT(ARMV8_PMU_CYCLE_IDX); } -static void kvm_pmc_enable_perf_event(struct kvm_pmc *pmc) +static void kvm_pmc_enable_perf_event(struct kvm_vcpu *vcpu, u8 idx) { - if (!pmc->perf_event) { - kvm_pmu_create_perf_event(pmc); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (!pmc) { + kvm_pmu_create_pmc(vcpu, idx); return; } - perf_event_enable(pmc->perf_event); - if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE) - kvm_debug("fail to enable perf event\n"); + for (i = 0; i < pmc->nr_perf_events; i++) { + perf_event_enable(pmc->perf_events[i]); + if (pmc->perf_events[i]->state != PERF_EVENT_STATE_ACTIVE) + kvm_debug("fail to enable perf event\n"); + } } -static void kvm_pmc_disable_perf_event(struct kvm_pmc *pmc) +static void kvm_pmc_disable_perf_event(struct kvm_vcpu *vcpu, u8 idx) { - if (pmc->perf_event) - perf_event_disable(pmc->perf_event); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (pmc) + for (i = 0; i < pmc->nr_perf_events; i++) + perf_event_disable(pmc->perf_events[i]); } void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) @@ -347,15 +342,13 @@ void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) return; for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); - if (!(val & BIT(i))) continue; - if (kvm_pmu_counter_is_enabled(pmc)) - kvm_pmc_enable_perf_event(pmc); + if (kvm_pmu_counter_is_enabled(vcpu, i)) + kvm_pmc_enable_perf_event(vcpu, i); else - kvm_pmc_disable_perf_event(pmc); + kvm_pmc_disable_perf_event(vcpu, i); } kvm_vcpu_pmu_restore_guest(vcpu); @@ -486,7 +479,6 @@ static void kvm_pmu_counter_increment(struct kvm_vcpu *vcpu, mask &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); for_each_set_bit(i, &mask, ARMV8_PMU_CYCLE_IDX) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); u64 type, reg; /* Filter on event type */ @@ -497,29 +489,30 @@ static void kvm_pmu_counter_increment(struct kvm_vcpu *vcpu, /* Increment this counter */ reg = __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) + 1; - if (!kvm_pmc_is_64bit(pmc)) + if (!kvm_pmc_is_64bit(vcpu, i)) reg = lower_32_bits(reg); __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) = reg; /* No overflow? move on */ - if (kvm_pmc_has_64bit_overflow(pmc) ? reg : lower_32_bits(reg)) + if (kvm_pmc_has_64bit_overflow(vcpu, i) ? 
reg : lower_32_bits(reg)) continue; /* Mark overflow */ __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i); - if (kvm_pmu_counter_can_chain(pmc)) + if (kvm_pmu_counter_can_chain(vcpu, i)) kvm_pmu_counter_increment(vcpu, BIT(i + 1), ARMV8_PMUV3_PERFCTR_CHAIN); } } /* Compute the sample period for a given counter value */ -static u64 compute_period(struct kvm_pmc *pmc, u64 counter) +static u64 compute_period(struct kvm_pmc **pmc, u64 counter) { + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 val; - if (kvm_pmc_is_64bit(pmc) && kvm_pmc_has_64bit_overflow(pmc)) + if (kvm_pmc_is_64bit(vcpu, (*pmc)->idx) && kvm_pmc_has_64bit_overflow(vcpu, (*pmc)->idx)) val = (-counter) & GENMASK(63, 0); else val = (-counter) & GENMASK(31, 0); @@ -534,10 +527,10 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event, struct perf_sample_data *data, struct pt_regs *regs) { - struct kvm_pmc *pmc = perf_event->overflow_handler_context; + struct kvm_pmc **pmc = perf_event->overflow_handler_context; struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu); struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - int idx = pmc->idx; + int idx = (*pmc)->idx; u64 period; cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE); @@ -554,7 +547,7 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event, __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx); - if (kvm_pmu_counter_can_chain(pmc)) + if (kvm_pmu_counter_can_chain(vcpu, idx)) kvm_pmu_counter_increment(vcpu, BIT(idx + 1), ARMV8_PMUV3_PERFCTR_CHAIN); @@ -613,95 +606,74 @@ void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val) ~BIT(ARMV8_PMU_CYCLE_IDX); for_each_set_bit(i, &mask, 32) - kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, i), 0, true); + kvm_pmu_set_pmc_value(vcpu, i, 0, true); } } -static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc) +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); unsigned int mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); - if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx))) + if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(idx))) return false; - if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) + if (kvm_pmu_counter_is_hyp(vcpu, idx)) return mdcr & MDCR_EL2_HPME; return kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E; } -static bool kvm_pmc_counts_at_el0(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el0(struct kvm_vcpu *vcpu, u8 idx) { - u64 evtreg = kvm_pmc_read_evtreg(pmc); + u64 evtreg = kvm_pmc_read_evtreg(vcpu, idx); bool nsu = evtreg & ARMV8_PMU_EXCLUDE_NS_EL0; bool u = evtreg & ARMV8_PMU_EXCLUDE_EL0; return u == nsu; } -static bool kvm_pmc_counts_at_el1(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el1(struct kvm_vcpu *vcpu, u8 idx) { - u64 evtreg = kvm_pmc_read_evtreg(pmc); + u64 evtreg = kvm_pmc_read_evtreg(vcpu, idx); bool nsk = evtreg & ARMV8_PMU_EXCLUDE_NS_EL1; bool p = evtreg & ARMV8_PMU_EXCLUDE_EL1; return p == nsk; } -static bool kvm_pmc_counts_at_el2(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el2(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); - if (!kvm_pmu_counter_is_hyp(vcpu, pmc->idx) && (mdcr & MDCR_EL2_HPMD)) + if (!kvm_pmu_counter_is_hyp(vcpu, idx) && (mdcr & MDCR_EL2_HPMD)) return false; - return kvm_pmc_read_evtreg(pmc) & ARMV8_PMU_INCLUDE_EL2; + return kvm_pmc_read_evtreg(vcpu, idx) & ARMV8_PMU_INCLUDE_EL2; } -/** - * kvm_pmu_create_perf_event - create a perf event for a counter - * @pmc: Counter context - */ -static void 
kvm_pmu_create_perf_event(struct kvm_pmc *pmc) +static struct kvm_pmc *kvm_pmu_alloc_pmc(u8 idx, unsigned int capacity) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; - struct perf_event *event; - struct perf_event_attr attr; - u64 eventsel, evtreg; + struct kvm_pmc *pmc = kzalloc(struct_size(pmc, perf_events, capacity), GFP_KERNEL_ACCOUNT); - evtreg = kvm_pmc_read_evtreg(pmc); - - kvm_pmu_stop_counter(pmc); - if (pmc->idx == ARMV8_PMU_CYCLE_IDX) - eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; - else - eventsel = evtreg & kvm_pmu_event_mask(vcpu->kvm); + if (pmc) + pmc->idx = idx; - /* - * Neither SW increment nor chained events need to be backed - * by a perf event. - */ - if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR || - eventsel == ARMV8_PMUV3_PERFCTR_CHAIN) - return; + return pmc; +} - /* - * If we have a filter in place and that the event isn't allowed, do - * not install a perf event either. - */ - if (vcpu->kvm->arch.pmu_filter && - !test_bit(eventsel, vcpu->kvm->arch.pmu_filter)) - return; +static void kvm_pmu_create_perf_event(struct kvm_pmc **pmc, + struct arm_pmu *arm_pmu, u64 eventsel) +{ + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); + struct perf_event *event; + struct perf_event_attr attr; memset(&attr, 0, sizeof(struct perf_event_attr)); attr.type = arm_pmu->pmu.type; attr.size = sizeof(attr); attr.pinned = 1; - attr.disabled = !kvm_pmu_counter_is_enabled(pmc); - attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); + attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, (*pmc)->idx); + attr.exclude_user = !kvm_pmc_counts_at_el0(vcpu, (*pmc)->idx); attr.exclude_hv = 1; /* Don't count EL2 events */ attr.exclude_host = 1; /* Don't count host events */ attr.config = eventsel; @@ -711,19 +683,19 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) * guest's EL2 filter. */ if (unlikely(is_hyp_ctxt(vcpu))) - attr.exclude_kernel = !kvm_pmc_counts_at_el2(pmc); + attr.exclude_kernel = !kvm_pmc_counts_at_el2(vcpu, (*pmc)->idx); else - attr.exclude_kernel = !kvm_pmc_counts_at_el1(pmc); + attr.exclude_kernel = !kvm_pmc_counts_at_el1(vcpu, (*pmc)->idx); /* * If counting with a 64bit counter, advertise it to the perf * code, carefully dealing with the initial sample period * which also depends on the overflow. 
*/ - if (kvm_pmc_is_64bit(pmc)) + if (kvm_pmc_is_64bit(vcpu, (*pmc)->idx)) attr.config1 |= PERF_ATTR_CFG1_COUNTER_64BIT; - attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc)); + attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx)); event = perf_event_create_kernel_counter(&attr, -1, current, kvm_pmu_perf_overflow, pmc); @@ -731,10 +703,93 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) if (IS_ERR(event)) { pr_err_once("kvm: pmu event creation failed %ld\n", PTR_ERR(event)); + } + + (*pmc)->perf_events[(*pmc)->nr_perf_events] = event; + (*pmc)->nr_perf_events++; +} + +/** + * kvm_pmu_create_pmc - create a pmc + * @pmc: Counter context + */ +static void kvm_pmu_create_pmc(struct kvm_vcpu *vcpu, u8 idx) +{ + struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; + struct arm_pmu_entry *entry; + struct kvm_pmc **pmc = kvm_vcpu_idx_to_pmc(vcpu, idx); + u64 eventsel, evtreg; + cpumask_t cpus; + size_t n; + + + evtreg = kvm_pmc_read_evtreg(vcpu, idx); + + kvm_pmu_stop_counter(vcpu, idx); + if (idx == ARMV8_PMU_CYCLE_IDX) + eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; + else + eventsel = evtreg & kvm_pmu_event_mask(vcpu->kvm); + + /* + * Neither SW increment nor chained events need to be backed + * by a perf event. + */ + if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR || + eventsel == ARMV8_PMUV3_PERFCTR_CHAIN) return; + + /* + * If we have a filter in place and that the event isn't allowed, do + * not install a perf event either. + */ + if (vcpu->kvm->arch.pmu_filter && + !test_bit(eventsel, vcpu->kvm->arch.pmu_filter)) + return; + + if (cpumask_subset(current->cpus_ptr, &arm_pmu->supported_cpus)) { + *pmc = kvm_pmu_alloc_pmc(idx, 1); + if (!*pmc) + goto err; + + kvm_pmu_create_perf_event(pmc, arm_pmu, eventsel); + } else { + guard(mutex)(&arm_pmus_lock); + + n = 0; + cpumask_clear(&cpus); + list_for_each_entry(entry, &arm_pmus, entry) { + arm_pmu = entry->arm_pmu; + + if (!cpumask_intersects(current->cpus_ptr, &arm_pmu->supported_cpus) || + cpumask_intersects(&cpus, &arm_pmu->supported_cpus)) + continue; + + cpumask_or(&cpus, &cpus, &arm_pmu->supported_cpus); + n++; + } + + *pmc = kvm_pmu_alloc_pmc(idx, n); + if (!*pmc) + goto err; + + cpumask_clear(&cpus); + list_for_each_entry(entry, &arm_pmus, entry) { + arm_pmu = entry->arm_pmu; + + if (!cpumask_intersects(current->cpus_ptr, &arm_pmu->supported_cpus) || + cpumask_intersects(&cpus, &arm_pmu->supported_cpus)) + continue; + + cpumask_or(&cpus, &cpus, &arm_pmu->supported_cpus); + kvm_pmu_create_perf_event(pmc, arm_pmu, eventsel); + } } - pmc->perf_event = event; + return; + +err: + pr_err_once("kvm: pmc allocation failed\n"); } /** @@ -750,13 +805,12 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, u64 select_idx) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, select_idx); u64 reg; - reg = counter_index_to_evtreg(pmc->idx); + reg = counter_index_to_evtreg(select_idx); __vcpu_sys_reg(vcpu, reg) = data & kvm_pmu_evtyper_mask(vcpu->kvm); - kvm_pmu_create_perf_event(pmc); + kvm_pmu_create_pmc(vcpu, select_idx); } void kvm_host_pmu_init(struct arm_pmu *pmu) @@ -994,7 +1048,8 @@ static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu) * subset of cores in the system. This is intentional, and * upholds the preexisting behavior on heterogeneous systems * where vCPUs can be scheduled on any core but the guest - * counters could stop working. 
+ * may see features not supported on some cores being advertised + * via PMCR_EL0, PMCEID0_EL0, and PMCEID1_EL0. */ int kvm_arm_set_default_pmu(struct kvm *kvm) { @@ -1208,17 +1263,15 @@ void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) mask = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); for_each_set_bit(i, &mask, 32) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); - /* * We only need to reconfigure events where the filter is * different at EL1 vs. EL2, as we're multiplexing the true EL1 * event filter bit for nested. */ - if (kvm_pmc_counts_at_el1(pmc) == kvm_pmc_counts_at_el2(pmc)) + if (kvm_pmc_counts_at_el1(vcpu, i) == kvm_pmc_counts_at_el2(vcpu, i)) continue; - kvm_pmu_create_perf_event(pmc); + kvm_pmu_create_pmc(vcpu, i); reprogrammed = true; } diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h index d6ad13925978..79ab5ff203eb 100644 --- a/include/kvm/arm_pmu.h +++ b/include/kvm/arm_pmu.h @@ -15,7 +15,8 @@ #if IS_ENABLED(CONFIG_HW_PERF_EVENTS) && IS_ENABLED(CONFIG_KVM) struct kvm_pmc { u8 idx; /* index into the pmu->pmc array */ - struct perf_event *perf_event; + unsigned int nr_perf_events; + struct perf_event *perf_events[] __counted_by(nr_perf_events); }; struct kvm_pmu_events { @@ -26,7 +27,7 @@ struct kvm_pmu_events { struct kvm_pmu { struct irq_work overflow_work; struct kvm_pmu_events events; - struct kvm_pmc pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; + struct kvm_pmc *pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; int irq_num; bool created; bool irq_level; @@ -52,7 +53,6 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1); -void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu); void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu); void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val); void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu); @@ -124,7 +124,6 @@ static inline u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) { return 0; } -static inline void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) {} static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {} static inline void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) {} static inline void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) {}
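The kvm_pmc change above replaces the single perf_event pointer with a flexible array sized by how many host PMUs are needed to cover the vCPU's affinity. As a standalone illustration of that allocation pattern only (names are illustrative, not the patch's), a counter context owning a variable number of perf events could be allocated like this:

/* Illustrative only; mirrors the struct_size()/__counted_by() pattern above. */
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/overflow.h>
#include <linux/perf_event.h>

struct pmc_ctx {
	u8 idx;
	unsigned int nr_events;
	struct perf_event *events[] __counted_by(nr_events);
};

static struct pmc_ctx *pmc_ctx_alloc(u8 idx, unsigned int capacity)
{
	/* struct_size() computes sizeof(header) + capacity * sizeof(*events) without overflow. */
	struct pmc_ctx *p = kzalloc(struct_size(p, events, capacity),
				    GFP_KERNEL_ACCOUNT);

	if (p)
		p->idx = idx;

	return p;
}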