
KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly

Message ID 20240301075007.644152-1-sandipan.das@amd.com (mailing list archive)
State New, archived
Series KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly

Commit Message

Sandipan Das March 1, 2024, 7:50 a.m. UTC
With PerfMonV2, a performance monitoring counter will start operating
only when both the PERF_CTLx enable bit as well as the corresponding
PerfCntrGlobalCtl enable bit are set.

When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
for a guest but the guest kernel does not support PerfMonV2 (such as
kernels older than v5.19), the guest counters do not count since the
PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
writes to it.

This is not observed on bare-metal as the default value of the
PerfCntrGlobalCtl MSR after a reset is 0x3f (assuming there are six
counters) and the counters can still be operated by using the enable
bit in the PERF_CTLx MSRs. Replicate the same behaviour in guests for
compatibility with older kernels.

Before:

  $ perf stat -e cycles:u true

   Performance counter stats for 'true':

                   0      cycles:u

         0.001074773 seconds time elapsed

         0.001169000 seconds user
         0.000000000 seconds sys

After:

  $ perf stat -e cycles:u true

   Performance counter stats for 'true':

             227,850      cycles:u

         0.037770758 seconds time elapsed

         0.000000000 seconds user
         0.037886000 seconds sys

Reported-by: Babu Moger <babu.moger@amd.com>
Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
 arch/x86/kvm/svm/pmu.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Like Xu March 1, 2024, 8:37 a.m. UTC | #1
On 1/3/2024 3:50 pm, Sandipan Das wrote:
> With PerfMonV2, a performance monitoring counter will start operating
> only when both the PERF_CTLx enable bit as well as the corresponding
> PerfCntrGlobalCtl enable bit are set.
> 
> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> for a guest but the guest kernel does not support PerfMonV2 (such as
> kernels older than v5.19), the guest counters do not count since the
> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> writes to it.

If the vcpu has the PerfMonV2 feature, it should not work the way a legacy
PMU does. Users need to use the new driver to operate the new hardware,
don't they? One practical approach is that the hypervisor should not set
the PerfMonV2 bit for this unpatched 'v5.19' guest.

> 
> This is not observed on bare-metal as the default value of the
> PerfCntrGlobalCtl MSR after a reset is 0x3f (assuming there are six
> counters) and the counters can still be operated by using the enable
> bit in the PERF_CTLx MSRs. Replicate the same behaviour in guests for
> compatibility with older kernels.
> 
> Before:
> 
>    $ perf stat -e cycles:u true
> 
>     Performance counter stats for 'true':
> 
>                     0      cycles:u
> 
>           0.001074773 seconds time elapsed
> 
>           0.001169000 seconds user
>           0.000000000 seconds sys
> 
> After:
> 
>    $ perf stat -e cycles:u true
> 
>     Performance counter stats for 'true':
> 
>               227,850      cycles:u
> 
>           0.037770758 seconds time elapsed
> 
>           0.000000000 seconds user
>           0.037886000 seconds sys
> 
> Reported-by: Babu Moger <babu.moger@amd.com>
> Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> ---
>   arch/x86/kvm/svm/pmu.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index b6a7ad4d6914..14709c564d6a 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>   	if (pmu->version > 1) {
>   		pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>   		pmu->global_status_mask = pmu->global_ctrl_mask;
> +		pmu->global_ctrl = ~pmu->global_ctrl_mask;
>   	}
>   
>   	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
Sandipan Das March 1, 2024, 9 a.m. UTC | #2
On 3/1/2024 2:07 PM, Like Xu wrote:
> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>> With PerfMonV2, a performance monitoring counter will start operating
>> only when both the PERF_CTLx enable bit as well as the corresponding
>> PerfCntrGlobalCtl enable bit are set.
>>
>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>> for a guest but the guest kernel does not support PerfMonV2 (such as
>> kernels older than v5.19), the guest counters do not count since the
>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>> writes to it.
> 
> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> PMU does. Users need to use the new driver to operate the new hardware,
> don't they ? One practical approach is that the hypervisor should not set
> the PerfMonV2 bit for this unpatched 'v5.19' guest.
> 

My understanding is that the legacy method of managing the counters should
still work because the enable bits in PerfCntrGlobalCtl are expected to be
set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
is set to 0x3f after a system reset. That way, the guest kernel can use either
the new or legacy method.

>>
>> This is not observed on bare-metal as the default value of the
>> PerfCntrGlobalCtl MSR after a reset is 0x3f (assuming there are six
>> counters) and the counters can still be operated by using the enable
>> bit in the PERF_CTLx MSRs. Replicate the same behaviour in guests for
>> compatibility with older kernels.
>>
>> Before:
>>
>>    $ perf stat -e cycles:u true
>>
>>     Performance counter stats for 'true':
>>
>>                     0      cycles:u
>>
>>           0.001074773 seconds time elapsed
>>
>>           0.001169000 seconds user
>>           0.000000000 seconds sys
>>
>> After:
>>
>>    $ perf stat -e cycles:u true
>>
>>     Performance counter stats for 'true':
>>
>>               227,850      cycles:u
>>
>>           0.037770758 seconds time elapsed
>>
>>           0.000000000 seconds user
>>           0.037886000 seconds sys
>>
>> Reported-by: Babu Moger <babu.moger@amd.com>
>> Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>> ---
>>   arch/x86/kvm/svm/pmu.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index b6a7ad4d6914..14709c564d6a 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>       if (pmu->version > 1) {
>>           pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>           pmu->global_status_mask = pmu->global_ctrl_mask;
>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>>       }
>>         pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
Mi, Dapeng March 4, 2024, 7:59 a.m. UTC | #3
On 3/1/2024 5:00 PM, Sandipan Das wrote:
> On 3/1/2024 2:07 PM, Like Xu wrote:
>> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>>> With PerfMonV2, a performance monitoring counter will start operating
>>> only when both the PERF_CTLx enable bit as well as the corresponding
>>> PerfCntrGlobalCtl enable bit are set.
>>>
>>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>>> for a guest but the guest kernel does not support PerfMonV2 (such as
>>> kernels older than v5.19), the guest counters do not count since the
>>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>>> writes to it.
>> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
>> PMU does. Users need to use the new driver to operate the new hardware,
>> don't they ? One practical approach is that the hypervisor should not set
>> the PerfMonV2 bit for this unpatched 'v5.19' guest.
>>
> My understanding is that the legacy method of managing the counters should
> still work because the enable bits in PerfCntrGlobalCtl are expected to be
> set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
> is set to 0x3f after a system reset. That way, the guest kernel can use either


If so, please add the PPR description here as comments.


> the new or legacy method.
>
>>> This is not observed on bare-metal as the default value of the
>>> PerfCntrGlobalCtl MSR after a reset is 0x3f (assuming there are six
>>> counters) and the counters can still be operated by using the enable
>>> bit in the PERF_CTLx MSRs. Replicate the same behaviour in guests for
>>> compatibility with older kernels.
>>>
>>> Before:
>>>
>>>     $ perf stat -e cycles:u true
>>>
>>>      Performance counter stats for 'true':
>>>
>>>                      0      cycles:u
>>>
>>>            0.001074773 seconds time elapsed
>>>
>>>            0.001169000 seconds user
>>>            0.000000000 seconds sys
>>>
>>> After:
>>>
>>>     $ perf stat -e cycles:u true
>>>
>>>      Performance counter stats for 'true':
>>>
>>>                227,850      cycles:u
>>>
>>>            0.037770758 seconds time elapsed
>>>
>>>            0.000000000 seconds user
>>>            0.037886000 seconds sys
>>>
>>> Reported-by: Babu Moger <babu.moger@amd.com>
>>> Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
>>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>>> ---
>>>    arch/x86/kvm/svm/pmu.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>> index b6a7ad4d6914..14709c564d6a 100644
>>> --- a/arch/x86/kvm/svm/pmu.c
>>> +++ b/arch/x86/kvm/svm/pmu.c
>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>        if (pmu->version > 1) {
>>>            pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>            pmu->global_status_mask = pmu->global_ctrl_mask;
>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;


It seems easier to understand to calculate global_ctrl first and
then derive global_ctrl_mask from it (negative logic).

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index e886300f0f97..7ac9b080aba6 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
kvm_pmu_cap.num_counters_gp);

         if (pmu->version > 1) {
-               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
+               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
+               pmu->global_ctrl_mask = ~pmu->global_ctrl;
                 pmu->global_status_mask = pmu->global_ctrl_mask;
         }

>>>        }
>>>          pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
>
Sean Christopherson March 4, 2024, 7:46 p.m. UTC | #4
On Mon, Mar 04, 2024, Dapeng Mi wrote:
> 
> On 3/1/2024 5:00 PM, Sandipan Das wrote:
> > On 3/1/2024 2:07 PM, Like Xu wrote:
> > > On 1/3/2024 3:50 pm, Sandipan Das wrote:
> > > > With PerfMonV2, a performance monitoring counter will start operating
> > > > only when both the PERF_CTLx enable bit as well as the corresponding
> > > > PerfCntrGlobalCtl enable bit are set.
> > > > 
> > > > When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> > > > for a guest but the guest kernel does not support PerfMonV2 (such as
> > > > kernels older than v5.19), the guest counters do not count since the
> > > > PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> > > > writes to it.
> > > If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> > > PMU does. Users need to use the new driver to operate the new hardware,
> > > don't they ? One practical approach is that the hypervisor should not set
> > > the PerfMonV2 bit for this unpatched 'v5.19' guest.
> > > 
> > My understanding is that the legacy method of managing the counters should
> > still work because the enable bits in PerfCntrGlobalCtl are expected to be
> > set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
> > is set to 0x3f after a system reset. That way, the guest kernel can use either
> 
> If so, please add the PPR description here as comments.

Or even better, make that architectural behavior that's documented in the APM.

> > > > ---
> > > >    arch/x86/kvm/svm/pmu.c | 1 +
> > > >    1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > > > index b6a7ad4d6914..14709c564d6a 100644
> > > > --- a/arch/x86/kvm/svm/pmu.c
> > > > +++ b/arch/x86/kvm/svm/pmu.c
> > > > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > > >        if (pmu->version > 1) {
> > > >            pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > > >            pmu->global_status_mask = pmu->global_ctrl_mask;
> > > > +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
> 
> It seems to be more easily understand to calculate global_ctrl firstly and
> then derive the globol_ctrl_mask (negative logic).

Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
something like global_ctrl_rsvd_bits), the computation of the reserved bits should
come from the capabilities of the PMU, not from the RESET value.

On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
counters.  But, that's a future theoretical problem, so I'm inclined to vote for
Sandipan's approach.

> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index e886300f0f97..7ac9b080aba6 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> kvm_pmu_cap.num_counters_gp);
> 
>         if (pmu->version > 1) {
> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>                 pmu->global_status_mask = pmu->global_ctrl_mask;
>         }
> 
> > > >        }
> > > >          pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
> >
Mingwei Zhang March 4, 2024, 8:06 p.m. UTC | #5
On Fri, Mar 01, 2024, Like Xu wrote:
> On 1/3/2024 3:50 pm, Sandipan Das wrote:
> > With PerfMonV2, a performance monitoring counter will start operating
> > only when both the PERF_CTLx enable bit as well as the corresponding
> > PerfCntrGlobalCtl enable bit are set.
> > 
> > When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> > for a guest but the guest kernel does not support PerfMonV2 (such as
> > kernels older than v5.19), the guest counters do not count since the
> > PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> > writes to it.
> 
> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> PMU does. Users need to use the new driver to operate the new hardware,
> don't they ? One practical approach is that the hypervisor should not set
> the PerfMonV2 bit for this unpatched 'v5.19' guest.
> 
How could the hypervisor know the guest's 'Linux version'? KVM should not even
assume it is Linux. So if the 'guest driver' does not support PerfMonV2,
the guest should just continue to use its legacy code.
Otherwise, the guest is considered broken.

Thanks.
-Mingwei
> > 
> > This is not observed on bare-metal as the default value of the
> > PerfCntrGlobalCtl MSR after a reset is 0x3f (assuming there are six
> > counters) and the counters can still be operated by using the enable
> > bit in the PERF_CTLx MSRs. Replicate the same behaviour in guests for
> > compatibility with older kernels.
> > 
> > Before:
> > 
> >    $ perf stat -e cycles:u true
> > 
> >     Performance counter stats for 'true':
> > 
> >                     0      cycles:u
> > 
> >           0.001074773 seconds time elapsed
> > 
> >           0.001169000 seconds user
> >           0.000000000 seconds sys
> > 
> > After:
> > 
> >    $ perf stat -e cycles:u true
> > 
> >     Performance counter stats for 'true':
> > 
> >               227,850      cycles:u
> > 
> >           0.037770758 seconds time elapsed
> > 
> >           0.000000000 seconds user
> >           0.037886000 seconds sys
> > 
> > Reported-by: Babu Moger <babu.moger@amd.com>
> > Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
> > Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> > ---
> >   arch/x86/kvm/svm/pmu.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > index b6a7ad4d6914..14709c564d6a 100644
> > --- a/arch/x86/kvm/svm/pmu.c
> > +++ b/arch/x86/kvm/svm/pmu.c
> > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> >   	if (pmu->version > 1) {
> >   		pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> >   		pmu->global_status_mask = pmu->global_ctrl_mask;
> > +		pmu->global_ctrl = ~pmu->global_ctrl_mask;
> >   	}
> >   	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
Mingwei Zhang March 4, 2024, 8:17 p.m. UTC | #6
On Mon, Mar 04, 2024, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
> > 
> > On 3/1/2024 5:00 PM, Sandipan Das wrote:
> > > On 3/1/2024 2:07 PM, Like Xu wrote:
> > > > On 1/3/2024 3:50 pm, Sandipan Das wrote:
> > > > > With PerfMonV2, a performance monitoring counter will start operating
> > > > > only when both the PERF_CTLx enable bit as well as the corresponding
> > > > > PerfCntrGlobalCtl enable bit are set.
> > > > > 
> > > > > When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> > > > > for a guest but the guest kernel does not support PerfMonV2 (such as
> > > > > kernels older than v5.19), the guest counters do not count since the
> > > > > PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> > > > > writes to it.
> > > > If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> > > > PMU does. Users need to use the new driver to operate the new hardware,
> > > > don't they ? One practical approach is that the hypervisor should not set
> > > > the PerfMonV2 bit for this unpatched 'v5.19' guest.
> > > > 
> > > My understanding is that the legacy method of managing the counters should
> > > still work because the enable bits in PerfCntrGlobalCtl are expected to be
> > > set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
> > > is set to 0x3f after a system reset. That way, the guest kernel can use either
> > 
> > If so, please add the PPR description here as comments.
> 
> Or even better, make that architectural behavior that's documented in the APM.
> 
> > > > > ---
> > > > >    arch/x86/kvm/svm/pmu.c | 1 +
> > > > >    1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > > > > index b6a7ad4d6914..14709c564d6a 100644
> > > > > --- a/arch/x86/kvm/svm/pmu.c
> > > > > +++ b/arch/x86/kvm/svm/pmu.c
> > > > > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > > > >        if (pmu->version > 1) {
> > > > >            pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > > > >            pmu->global_status_mask = pmu->global_ctrl_mask;
> > > > > +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
> > 
> > It seems to be more easily understand to calculate global_ctrl firstly and
> > then derive the globol_ctrl_mask (negative logic).
> 
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should
> come from the capabilities of the PMU, not from the RESET value.
> 
+1

> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.
> 

right. I am ok with either approach.

Thanks.
-Mingwei
> > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > index e886300f0f97..7ac9b080aba6 100644
> > --- a/arch/x86/kvm/svm/pmu.c
> > +++ b/arch/x86/kvm/svm/pmu.c
> > @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > kvm_pmu_cap.num_counters_gp);
> > 
> >         if (pmu->version > 1) {
> > -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
> > +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
> >                 pmu->global_status_mask = pmu->global_ctrl_mask;
> >         }
> > 
> > > > >        }
> > > > >          pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
> > >
Like Xu March 5, 2024, 2:31 a.m. UTC | #7
On 5/3/2024 3:46 am, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
>>
>> On 3/1/2024 5:00 PM, Sandipan Das wrote:
>>> On 3/1/2024 2:07 PM, Like Xu wrote:
>>>> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>>>>> With PerfMonV2, a performance monitoring counter will start operating
>>>>> only when both the PERF_CTLx enable bit as well as the corresponding
>>>>> PerfCntrGlobalCtl enable bit are set.
>>>>>
>>>>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>>>>> for a guest but the guest kernel does not support PerfMonV2 (such as
>>>>> kernels older than v5.19), the guest counters do not count since the
>>>>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>>>>> writes to it.
>>>> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
>>>> PMU does. Users need to use the new driver to operate the new hardware,
>>>> don't they ? One practical approach is that the hypervisor should not set
>>>> the PerfMonV2 bit for this unpatched 'v5.19' guest.
>>>>
>>> My understanding is that the legacy method of managing the counters should
>>> still work because the enable bits in PerfCntrGlobalCtl are expected to be
>>> set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
>>> is set to 0x3f after a system reset. That way, the guest kernel can use either
>>
>> If so, please add the PPR description here as comments.
> 
> Or even better, make that architectural behavior that's documented in the APM.

On the AMD side, we can't even reason that "PerfMonV3" will be compatible with
"PerfMonV2" without APM clarification, which is a concern for both the driver and
virtualization implementations.

> 
>>>>> ---
>>>>>     arch/x86/kvm/svm/pmu.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>>> index b6a7ad4d6914..14709c564d6a 100644
>>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>         if (pmu->version > 1) {
>>>>>             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>>>             pmu->global_status_mask = pmu->global_ctrl_mask;
>>>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>>
>> It seems to be more easily understand to calculate global_ctrl firstly and
>> then derive the globol_ctrl_mask (negative logic).
> 
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should
> come from the capabilities of the PMU, not from the RESET value.
> 
> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.

I suspect that Intel hardware also has this behaviour [*] although guest
kernels using Intel pmu version 1 are pretty much non-existent.

[*] Table 10-1. IA-32 and Intel® 64 Processor States Following Power-up, Reset, or INIT (Contd.)

We need to update the selftest to guard this.

> 
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index e886300f0f97..7ac9b080aba6 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>> kvm_pmu_cap.num_counters_gp);
>>
>>          if (pmu->version > 1) {
>> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
>> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>>                  pmu->global_status_mask = pmu->global_ctrl_mask;
>>          }
>>
>>>>>         }
>>>>>           pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
>>>
>
Sean Christopherson March 5, 2024, 5:22 p.m. UTC | #8
On Tue, Mar 05, 2024, Like Xu wrote:
> On 5/3/2024 3:46 am, Sean Christopherson wrote:
> > > > > > ---
> > > > > >     arch/x86/kvm/svm/pmu.c | 1 +
> > > > > >     1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > > > > > index b6a7ad4d6914..14709c564d6a 100644
> > > > > > --- a/arch/x86/kvm/svm/pmu.c
> > > > > > +++ b/arch/x86/kvm/svm/pmu.c
> > > > > > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > > > > >         if (pmu->version > 1) {
> > > > > >             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > > > > >             pmu->global_status_mask = pmu->global_ctrl_mask;
> > > > > > +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
> > > 
> > > It seems to be more easily understand to calculate global_ctrl firstly and
> > > then derive the globol_ctrl_mask (negative logic).
> > 
> > Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> > something like global_ctrl_rsvd_bits), the computation of the reserved bits should
> > come from the capabilities of the PMU, not from the RESET value.
> > 
> > On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> > if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> > counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> > Sandipan's approach.
> 
> I suspect that Intel hardware also has this behaviour [*] although guest
> kernels using Intel pmu version 1 are pretty much non-existent.
> 
> [*] Table 10-1. IA-32 and Intel® 64 Processor States Following Power-up,
> Reset, or INIT (Contd.)

Aha!  Nice.  To save people lookups, the table says:

  IA32_PERF_GLOBAL_CTRL:  Sets bits n-1:0 and clears the upper bits.

and 

  Where "n" is the number of general-purpose counters available in the processor.

Which means that (a) KVM can handle this in common code and (b) we can dodge the
whole reserved bits chicken-and-egg problem since global_ctrl *can't* be derived
from global_ctrl_mask.

This?  (compile tested only)

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 5 Mar 2024 09:02:26 -0800
Subject: [PATCH] KVM: x86/pmu: Set enable bits for GP counters in
 PERF_GLOBAL_CTRL at "RESET"

Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL
when refreshing the PMU to emulate the MSR's architecturally defined
post-RESET behavior.  Per Intel's SDM:

  IA32_PERF_GLOBAL_CTRL:  Sets bits n-1:0 and clears the upper bits.

and

  Where "n" is the number of general-purpose counters available in the processor.

This is a long-standing bug that was recently exposed when KVM added
support for AMD's PerfMonV2, i.e. when KVM started exposing a vPMU with
PERF_GLOBAL_CTRL to guest software that only knew how to program v1 PMUs
(that don't support PERF_GLOBAL_CTRL).  Failure to emulate the post-RESET
behavior results in such guests unknowingly leaving all general purpose
counters globally disabled (the entire reason the post-RESET value sets
the GP counter enable bits is to maintain backwards compatibility).

The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been
supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone
is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel
PMUs.

Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a
"pure" RESET flow.  But it can only be called prior to the first KVM_RUN,
i.e. the guest will only ever observe the final value.

Reported-by: Babu Moger <babu.moger@amd.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/pmu.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 87cc6c8809ad..f61ce26aeb90 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
  */
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
 	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
 		return;
 
@@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 	 */
 	kvm_pmu_reset(vcpu);
 
-	bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
+	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 	static_call(kvm_x86_pmu_refresh)(vcpu);
+
+	/*
+	 * At RESET, both Intel and AMD CPUs set all enable bits for general
+	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
+	 * was written for v1 PMUs doesn't unknowingly leave GP counters disabled
+	 * in the global controls).  Emulate that behavior when refreshing the
+	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
+	 */
+	if (kvm_pmu_has_perf_global_ctrl(pmu))
+		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
 }
 
 void kvm_pmu_init(struct kvm_vcpu *vcpu)

base-commit: 1d7ae977d219e68698fdb9bed1049dc561038aa1
--
Sean Christopherson March 5, 2024, 6:04 p.m. UTC | #9
On Tue, Mar 05, 2024, Sean Christopherson wrote:
> On Tue, Mar 05, 2024, Like Xu wrote:
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 87cc6c8809ad..f61ce26aeb90 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>   */
>  void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
>  	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
>  		return;
>  
> @@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  	 */
>  	kvm_pmu_reset(vcpu);
>  
> -	bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
> +	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  	static_call(kvm_x86_pmu_refresh)(vcpu);
> +
> +	/*
> +	 * At RESET, both Intel and AMD CPUs set all enable bits for general
> +	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
> +	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
> +	 * in the global controls).  Emulate that behavior when refreshing the
> +	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> +	 */
> +	if (kvm_pmu_has_perf_global_ctrl(pmu))
> +		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>  }

Doh, this is based on kvm/kvm-uapi, I'll rebase to kvm-x86/next before posting.

I'll also update the changelog to call out that KVM has always clobbered global_ctrl
during PMU refresh, i.e. there is no danger of breaking existing setups by
clobbering a value set by userspace, e.g. during live migration.

Lastly, I'll also update the changelog to call out that KVM *did* actually set
the general purpose counter enable bits in global_ctrl at "RESET" until v6.0,
and that KVM intentionally removed that behavior because of what appears to be
an Intel SDM bug.

Of course, in typical KVM fashion, that old code was also broken in its own way
(the history of this code is a comedy of errors).  Initial vPMU support in commit
f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") *almost*
got it right, but for some reason only set the bits if the guest PMU was
advertised as v1:

        if (pmu->version == 1) {
                pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
                return;
        }


Commit f19a0c2c2e6a ("KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on
reset") then tried to remedy that goof, but botched things and also enabled the
fixed counters:

        pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
                (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
        pmu->global_ctrl_mask = ~pmu->global_ctrl;

Which was KVM's behavior up until commit c49467a45fe0 ("KVM: x86/pmu: Don't overwrite
the pmu->global_ctrl when refreshing") incorrectly removed *everything*.  Very
ironically, that commit came from Like.

Author: Like Xu <likexu@tencent.com>
Date:   Tue May 10 12:44:07 2022 +0800

    KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing
    
    Assigning a value to pmu->global_ctrl just to set the value of
    pmu->global_ctrl_mask is more readable but does not conform to the
    specification. The value is reset to zero on Power up and Reset but
    stays unchanged on INIT, like most other MSRs.

But wait, it gets even better.  Like wasn't making up that behavior, Intel's SDM
circa December 2022 states that "Global Perf Counter Controls" is '0' at Power-Up
and RESET.  But then the March 2023 SDM rolls out and says

  IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.

So presumably someone at Intel noticed that what their CPUs do and what the
documentation says didn't match.

*sigh*
Mingwei Zhang March 5, 2024, 6:31 p.m. UTC | #10
On Tue, Mar 05, 2024, Sean Christopherson wrote:
> On Tue, Mar 05, 2024, Sean Christopherson wrote:
> > On Tue, Mar 05, 2024, Like Xu wrote:
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 87cc6c8809ad..f61ce26aeb90 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
> >   */
> >  void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
> >  {
> > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +
> >  	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
> >  		return;
> >  
> > @@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
> >  	 */
> >  	kvm_pmu_reset(vcpu);
> >  
> > -	bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
> > +	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
> >  	static_call(kvm_x86_pmu_refresh)(vcpu);
> > +
> > +	/*
> > +	 * At RESET, both Intel and AMD CPUs set all enable bits for general
> > +	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
> > +	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
> > +	 * in the global controls).  Emulate that behavior when refreshing the
> > +	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> > +	 */
> > +	if (kvm_pmu_has_perf_global_ctrl(pmu))
> > +		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> >  }
> 
> Doh, this is based on kvm/kvm-uapi, I'll rebase to kvm-x86/next before posting.
> 
> I'll also update the changelog to call out that KVM has always clobbered global_ctrl
> during PMU refresh, i.e. there is no danger of breaking existing setups by
> clobbering a value set by userspace, e.g. during live migration.
> 
> Lastly, I'll also update the changelog to call out that KVM *did* actually set
> the general purpose counter enable bits in global_ctrl at "RESET" until v6.0,
> and that KVM intentionally removed that behavior because of what appears to be
> an Intel SDM bug.
> 
> Of course, in typical KVM fashion, that old code was also broken in its own way
> (the history of this code is a comedy of errors).  Initial vPMU support in commit
> f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") *almost*
> got it right, but for some reason only set the bits if the guest PMU was
> advertised as v1:
> 
>         if (pmu->version == 1) {
>                 pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
>                 return;
>         }
> 
> 
> Commit f19a0c2c2e6a ("KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on
> reset") then tried to remedy that goof, but botched things and also enabled the
> fixed counters:
> 
>         pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
>                 (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
>         pmu->global_ctrl_mask = ~pmu->global_ctrl;
> 
> Which was KVM's behavior up until commit c49467a45fe0 ("KVM: x86/pmu: Don't overwrite
> the pmu->global_ctrl when refreshing") incorrectly removed *everything*.  Very
> ironically, that commit came from Like.
> 
> Author: Like Xu <likexu@tencent.com>
> Date:   Tue May 10 12:44:07 2022 +0800
> 
>     KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing
>     
>     Assigning a value to pmu->global_ctrl just to set the value of
>     pmu->global_ctrl_mask is more readable but does not conform to the
>     specification. The value is reset to zero on Power up and Reset but
>     stays unchanged on INIT, like most other MSRs.
> 
> But wait, it gets even better.  Like wasn't making up that behavior, Intel's SDM
> circa December 2022 states that "Global Perf Counter Controls" is '0' at Power-Up
> and RESET.  But then the March 2023 SDM rolls out and says
> 
>   IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
> 
> So presumably someone at Intel noticed that what their CPUs do and what the
> documentation says didn't match.
> 

Sean, can you update your commit message with the table name of the
Intel SDM and the version of the Intel SDM (2023 version). It was quite
hard to find where exactly SDM mentioned this, since I was using the
2022 version.

Thanks.
-Mingwei
> *sigh*
Mi, Dapeng March 6, 2024, 2:32 a.m. UTC | #11
On 3/6/2024 2:04 AM, Sean Christopherson wrote:
> On Tue, Mar 05, 2024, Sean Christopherson wrote:
>> On Tue, Mar 05, 2024, Like Xu wrote:
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 87cc6c8809ad..f61ce26aeb90 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>>    */
>>   void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>   {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>>   	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
>>   		return;
>>   
>> @@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>   	 */
>>   	kvm_pmu_reset(vcpu);
>>   
>> -	bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>> +	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>>   	static_call(kvm_x86_pmu_refresh)(vcpu);
>> +
>> +	/*
>> +	 * At RESET, both Intel and AMD CPUs set all enable bits for general
>> +	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
>> +	 * was written for v1 PMUs doesn't unknowingly leave GP counters disabled
>> +	 * in the global controls).  Emulate that behavior when refreshing the
>> +	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>> +	 */
>> +	if (kvm_pmu_has_perf_global_ctrl(pmu))
>> +		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>>   }
> Doh, this is based on kvm/kvm-uapi, I'll rebase to kvm-x86/next before posting.
>
> I'll also update the changelog to call out that KVM has always clobbered global_ctrl
> during PMU refresh, i.e. there is no danger of breaking existing setups by
> clobbering a value set by userspace, e.g. during live migration.
>
> Lastly, I'll also update the changelog to call out that KVM *did* actually set
> the general purpose counter enable bits in global_ctrl at "RESET" until v6.0,
> and that KVM intentionally removed that behavior because of what appears to be
> an Intel SDM bug.
>
> Of course, in typical KVM fashion, that old code was also broken in its own way
> (the history of this code is a comedy of errors).  Initial vPMU support in commit
> f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") *almost*
> got it right, but for some reason only set the bits if the guest PMU was
> advertised as v1:
>
>          if (pmu->version == 1) {
>                  pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
>                  return;
>          }
>
>
> Commit f19a0c2c2e6a ("KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on
> reset") then tried to remedy that goof, but botched things and also enabled the
> fixed counters:
>
>          pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
>                  (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
>          pmu->global_ctrl_mask = ~pmu->global_ctrl;
>
> Which was KVM's behavior up until commit c49467a45fe0 ("KVM: x86/pmu: Don't overwrite
> the pmu->global_ctrl when refreshing") incorrectly removed *everything*.  Very
> ironically, that commit came from Like.
>
> Author: Like Xu <likexu@tencent.com>
> Date:   Tue May 10 12:44:07 2022 +0800
>
>      KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing
>      
>      Assigning a value to pmu->global_ctrl just to set the value of
>      pmu->global_ctrl_mask is more readable but does not conform to the
>      specification. The value is reset to zero on Power up and Reset but
>      stays unchanged on INIT, like most other MSRs.
>
> But wait, it gets even better.  Like wasn't making up that behavior, Intel's SDM
> circa December 2022 states that "Global Perf Counter Controls" is '0' at Power-Up
> and RESET.  But then the March 2023 SDM rolls out and says
>
>    IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
>
> So presumably someone at Intel noticed that what their CPUs do and what the
> documentation says didn't match.

It's a really long story... thanks for figuring it out.

>
> *sigh*
>
Mi, Dapeng March 6, 2024, 2:48 a.m. UTC | #12
On 3/5/2024 3:46 AM, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
>> On 3/1/2024 5:00 PM, Sandipan Das wrote:
>>> On 3/1/2024 2:07 PM, Like Xu wrote:
>>>> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>>>>> With PerfMonV2, a performance monitoring counter will start operating
>>>>> only when both the PERF_CTLx enable bit as well as the corresponding
>>>>> PerfCntrGlobalCtl enable bit are set.
>>>>>
>>>>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>>>>> for a guest but the guest kernel does not support PerfMonV2 (such as
>>>>> kernels older than v5.19), the guest counters do not count since the
>>>>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>>>>> writes to it.
>>>> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
>>>> PMU does. Users need to use the new driver to operate the new hardware,
>>>> don't they? One practical approach is that the hypervisor should not set
>>>> the PerfMonV2 bit for this unpatched 'v5.19' guest.
>>>>
>>> My understanding is that the legacy method of managing the counters should
>>> still work because the enable bits in PerfCntrGlobalCtl are expected to be
>>> set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
>>> is set to 0x3f after a system reset. That way, the guest kernel can use either
>> If so, please add the PPR description here as comments.
> Or even better, make that architectural behavior that's documented in the APM.
>
>>>>> ---
>>>>>     arch/x86/kvm/svm/pmu.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>>> index b6a7ad4d6914..14709c564d6a 100644
>>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>         if (pmu->version > 1) {
>>>>>             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>>>             pmu->global_status_mask = pmu->global_ctrl_mask;
>>>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>> It seems easier to understand to first calculate global_ctrl and
>> then derive the global_ctrl_mask from it (negative logic).
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should

Yeah, same feeling here. global_ctrl_mask is ambiguous; it's hard for 
users to get its real meaning from the name alone without reading all 
the code. global_ctrl_rsvd_bits looks to be a better name. There are 
several other variables with similar "xxx_mask" names. Sean, do you 
think it's a good time to rename all of these variables? Thanks.


> come from the capabilities of the PMU, not from the RESET value.
>
> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.
>
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index e886300f0f97..7ac9b080aba6 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>> kvm_pmu_cap.num_counters_gp);
>>
>>          if (pmu->version > 1) {
>> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters)
>> - 1);
>> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
>> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>>                  pmu->global_status_mask = pmu->global_ctrl_mask;
>>          }
>>
>>>>>         }
>>>>>           pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
Like Xu March 6, 2024, 3:23 a.m. UTC | #13
On 6/3/2024 10:32 am, Mi, Dapeng wrote:
> 
> On 3/6/2024 2:04 AM, Sean Christopherson wrote:
>> On Tue, Mar 05, 2024, Sean Christopherson wrote:
>>> On Tue, Mar 05, 2024, Like Xu wrote:
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>> index 87cc6c8809ad..f61ce26aeb90 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>>>    */
>>>   void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>>   {
>>> +    struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>> +
>>>       if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
>>>           return;
>>> @@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>>        */
>>>       kvm_pmu_reset(vcpu);
>>> -    bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>>> +    bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>>>       static_call(kvm_x86_pmu_refresh)(vcpu);
>>> +
>>> +    /*
>>> +     * At RESET, both Intel and AMD CPUs set all enable bits for general
>>> +     * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
>>> +     * was written for v1 PMUs doesn't unknowingly leave GP counters disabled
>>> +     * in the global controls).  Emulate that behavior when refreshing the
>>> +     * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>>> +     */
>>> +    if (kvm_pmu_has_perf_global_ctrl(pmu))
>>> +        pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>>>   }
>> Doh, this is based on kvm/kvm-uapi, I'll rebase to kvm-x86/next before posting.
>>
>> I'll also update the changelog to call out that KVM has always clobbered 
>> global_ctrl
>> during PMU refresh, i.e. there is no danger of breaking existing setups by
>> clobbering a value set by userspace, e.g. during live migration.
>>
>> Lastly, I'll also update the changelog to call out that KVM *did* actually set
>> the general purpose counter enable bits in global_ctrl at "RESET" until v6.0,
>> and that KVM intentionally removed that behavior because of what appears to be
>> an Intel SDM bug.
>>
>> Of course, in typical KVM fashion, that old code was also broken in its own way
>> (the history of this code is a comedy of errors).  Initial vPMU support in commit
>> f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") *almost*
>> got it right, but for some reason only set the bits if the guest PMU was
>> advertised as v1:
>>
>>          if (pmu->version == 1) {
>>                  pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
>>                  return;
>>          }
>>
>>
>> Commit f19a0c2c2e6a ("KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on
>> reset") then tried to remedy that goof, but botched things and also enabled the
>> fixed counters:
>>
>>          pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
>>                  (((1ull << pmu->nr_arch_fixed_counters) - 1) << 
>> X86_PMC_IDX_FIXED);
>>          pmu->global_ctrl_mask = ~pmu->global_ctrl;
>>
>> Which was KVM's behavior up until commit c49467a45fe0 ("KVM: x86/pmu: Don't 
>> overwrite
>> the pmu->global_ctrl when refreshing") incorrectly removed *everything*.  Very
>> ironically, that commit came from Like.
>>
>> Author: Like Xu <likexu@tencent.com>
>> Date:   Tue May 10 12:44:07 2022 +0800
>>
>>      KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing
>>      Assigning a value to pmu->global_ctrl just to set the value of
>>      pmu->global_ctrl_mask is more readable but does not conform to the
>>      specification. The value is reset to zero on Power up and Reset but
>>      stays unchanged on INIT, like most other MSRs.
>>
>> But wait, it gets even better.  Like wasn't making up that behavior, Intel's SDM
>> circa December 2022 states that "Global Perf Counter Controls" is '0' at Power-Up
>> and RESET.  But then the March 2023 SDM rolls out and says
>>
>>    IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
>>
>> So presumably someone at Intel noticed that what their CPUs do and what the
>> documentation says didn't match.

This reminds me quite a bit of the past, where this kind of thing happened
occasionally (e.g. some pmu events don't count as expected, and ucode
updates often sneak in to change hardware behaviour). Sometimes we have
to rely on hardware behaviour, sometimes we have to trust the
documentation. My experience has been to find a balance that is more
favourable to the end-user and to push hardware vendors toward more
careful documentation and implementation. :D

> 
> It's a really long story... thanks for figuring it out.
> 
>>
>> *sigh* >>
>
Sandipan Das March 6, 2024, 5:17 a.m. UTC | #14
On 3/5/2024 10:52 PM, Sean Christopherson wrote:
> On Tue, Mar 05, 2024, Like Xu wrote:
>> On 5/3/2024 3:46 am, Sean Christopherson wrote:
>>>>>>> ---
>>>>>>>     arch/x86/kvm/svm/pmu.c | 1 +
>>>>>>>     1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>>>>> index b6a7ad4d6914..14709c564d6a 100644
>>>>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>>>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>>>         if (pmu->version > 1) {
>>>>>>>             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>>>>>             pmu->global_status_mask = pmu->global_ctrl_mask;
>>>>>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>>>>
>>>> It seems easier to understand to first calculate global_ctrl and
>>>> then derive the global_ctrl_mask from it (negative logic).
>>>
>>> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
>>> something like global_ctrl_rsvd_bits), the computation of the reserved bits should
>>> come from the capabilities of the PMU, not from the RESET value.
>>>
>>> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
>>> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
>>> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
>>> Sandipan's approach.
>>
>> I suspect that Intel hardware also has this behaviour [*] although guest
>> kernels using Intel pmu version 1 are pretty much non-existent.
>>
>> [*] Table 10-1. IA-32 and Intel® 64 Processor States Following Power-up,
>> Reset, or INIT (Contd.)
> 
> Aha!  Nice.  To save people lookups, the table says:
> 
>   IA32_PERF_GLOBAL_CTRL:  Sets bits n-1:0 and clears the upper bits.
> 
> and 
> 
>   Where "n" is the number of general-purpose counters available in the processor.
> 
> Which means that (a) KVM can handle this in common code and (b) we can dodge the
> whole reserved bits chicken-and-egg problem since global_ctrl *can't* be derived
> from global_ctrl_mask.
> 
> This?  (compile tested only)
> 
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 5 Mar 2024 09:02:26 -0800
> Subject: [PATCH] KVM: x86/pmu: Set enable bits for GP counters in
>  PERF_GLOBAL_CTRL at "RESET"
> 
> Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL
> when refreshing the PMU to emulate the MSR's architecturally defined
> post-RESET behavior.  Per Intel's SDM:
> 
>   IA32_PERF_GLOBAL_CTRL:  Sets bits n-1:0 and clears the upper bits.
> 
> and
> 
>   Where "n" is the number of general-purpose counters available in the processor.
> 
> This is a long-standing bug that was recently exposed when KVM added
> support for AMD's PerfMonV2, i.e. when KVM started exposing a vPMU with
> PERF_GLOBAL_CTRL to guest software that only knew how to program v1 PMUs
> (that don't support PERF_GLOBAL_CTRL).  Failure to emulate the post-RESET
> behavior results in such guests unknowingly leaving all general purpose
> counters globally disabled (the entire reason the post-RESET value sets
> the GP counter enable bits is to maintain backwards compatibility).
> 
> The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been
> supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone
> is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel
> PMUs.
> 
> Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a
> "pure" RESET flow.  But it can only be called prior to the first KVM_RUN,
> i.e. the guest will only ever observe the final value.
> 
> Reported-by: Babu Moger <babu.moger@amd.com>
> Cc: Like Xu <like.xu.linux@gmail.com>
> Cc: Mingwei Zhang <mizhang@google.com>
> Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Cc: Sandipan Das <sandipan.das@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/pmu.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 87cc6c8809ad..f61ce26aeb90 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -741,6 +741,8 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>   */
>  void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
>  	if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
>  		return;
>  
> @@ -750,8 +752,18 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  	 */
>  	kvm_pmu_reset(vcpu);
>  
> -	bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
> +	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  	static_call(kvm_x86_pmu_refresh)(vcpu);
> +
> +	/*
> +	 * At RESET, both Intel and AMD CPUs set all enable bits for general
> +	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
> +	 * was written for v1 PMUs doesn't unknowingly leave GP counters disabled
> +	 * in the global controls).  Emulate that behavior when refreshing the
> +	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> +	 */
> +	if (kvm_pmu_has_perf_global_ctrl(pmu))
> +		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>  }
>  
>  void kvm_pmu_init(struct kvm_vcpu *vcpu)
> 
> base-commit: 1d7ae977d219e68698fdb9bed1049dc561038aa1

Thanks. This looks good.

Tested-by: Sandipan Das <sandipan.das@amd.com>
diff mbox series

Patch

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index b6a7ad4d6914..14709c564d6a 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -205,6 +205,7 @@  static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	if (pmu->version > 1) {
 		pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
 		pmu->global_status_mask = pmu->global_ctrl_mask;
+		pmu->global_ctrl = ~pmu->global_ctrl_mask;
 	}
 
 	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;