
[v2] KVM/arm64: reconfigure the event filters for guest context

Message ID 20230810072906.4007-1-shijie@os.amperecomputing.com (mailing list archive)
State New, archived
Series [v2] KVM/arm64: reconfigure the event filters for guest context

Commit Message

Huang Shijie Aug. 10, 2023, 7:29 a.m. UTC
1.) Background.
   1.1) On arm64, start a guest with QEMU (running as a VMM for KVM),
        bind the guest to core 33, and run program "a" in the guest.
        The code of "a" is shown below:
   	----------------------------------------------------------
		#include <stdio.h>

		int main()
		{
			unsigned long i = 0;

			for (;;) {
				i++;
			}

			printf("i:%ld\n", i);
			return 0;
		}
   	----------------------------------------------------------

   1.2) Use the following perf command in host:
      #perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
          #           time             counts unit events
               1.000817400      3,299,471,572      cycles:G
               1.000817400          3,240,586      cycles:H

       This result is correct; my CPU's frequency is 3.3 GHz.

   1.3) Use the following perf command in host:
      #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
            time             counts unit events
     1.000831480        153,634,097      cycles:G                                                                (70.03%)
     1.000831480      3,147,940,599      cycles:H                                                                (70.03%)
     1.000831480      1,143,598,527      L1-dcache-loads                                                         (70.03%)
     1.000831480              9,986      L1-dcache-load-misses            #    0.00% of all L1-dcache accesses   (70.03%)
     1.000831480    <not supported>      LLC-loads
     1.000831480    <not supported>      LLC-load-misses
     1.000831480        580,887,696      L1-icache-loads                                                         (70.03%)
     1.000831480             77,855      L1-icache-load-misses            #    0.01% of all L1-icache accesses   (70.03%)
     1.000831480      6,112,224,612      dTLB-loads                                                              (70.03%)
     1.000831480             16,222      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  (69.94%)
     1.000831480        590,015,996      iTLB-loads                                                              (59.95%)
     1.000831480                505      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  (59.95%)

       This result is wrong. The "cycles:G" count should be nearly 3.3G.
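
   1.4) For reference, perf's :G and :H event modifiers map to the
        exclude_host and exclude_guest bits in struct perf_event_attr.
        A minimal userspace sketch of the "cycles:G" counter used above
        (the CPU number and the required privileges are assumptions for
        illustration, not part of the patch):
   	----------------------------------------------------------
		#include <linux/perf_event.h>
		#include <sys/syscall.h>
		#include <unistd.h>
		#include <string.h>
		#include <stdint.h>
		#include <stdio.h>

		int main(void)
		{
			struct perf_event_attr attr;
			uint64_t count;
			int fd;

			memset(&attr, 0, sizeof(attr));
			attr.size = sizeof(attr);
			attr.type = PERF_TYPE_HARDWARE;
			attr.config = PERF_COUNT_HW_CPU_CYCLES;
			/* "cycles:G": count only while a guest is running */
			attr.exclude_host = 1;

			/* pid == -1, cpu == 33: CPU-wide counter on core 33 */
			fd = syscall(__NR_perf_event_open, &attr, -1, 33, -1, 0);
			if (fd < 0) {
				perror("perf_event_open");
				return 1;
			}

			sleep(1);
			if (read(fd, &count, sizeof(count)) != sizeof(count))
				return 1;
			printf("guest cycles: %llu\n", (unsigned long long)count);
			return 0;
		}
   	----------------------------------------------------------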

2.) Root cause.
	There are only 7 counters on my arm64 platform:
	  (one cycle counter) + (6 normal counters)

	The command in 1.3 above needs 10 event counters.
	Since we only have 7 counters, the perf core triggers
	multiplexing from an hrtimer:
	     perf_mux_hrtimer_restart() --> perf_rotate_context().

	If the hrtimer fires while the host is running, everything is
	fine. If it fires while the guest is running,
	perf_rotate_context() programs the PMU with the filters for the
	host context, and KVM gets no chance to restore the PMU
	registers with kvm_vcpu_pmu_restore_guest(). The PMU then
	counts with the wrong filters, so we get wrong results.
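
	Note the (70.03%) column in 1.3: it is the multiplexing duty
	cycle, and 7 counters serving 10 events gives each event
	roughly 7/10 of the wall time. A toy model of the round-robin
	rotation (not the real perf scheduler) reproduces that figure:
   	----------------------------------------------------------
		#include <stdio.h>

		#define NR_COUNTERS 7     /* 1 cycle + 6 event counters */
		#define NR_EVENTS   10    /* events requested in 1.3 */
		#define NR_TICKS    10000 /* simulated mux hrtimer ticks */

		int main(void)
		{
			int scheduled[NR_EVENTS] = { 0 };
			int first = 0, t, i, e;

			for (t = 0; t < NR_TICKS; t++) {
				/* the 7 counters serve 7 consecutive events */
				for (i = 0; i < NR_COUNTERS; i++)
					scheduled[(first + i) % NR_EVENTS]++;
				/* one rotation step per hrtimer tick */
				first = (first + 1) % NR_EVENTS;
			}

			for (e = 0; e < NR_EVENTS; e++)
				printf("event %d: scheduled %.2f%% of the time\n",
				       e, 100.0 * scheduled[e] / NR_TICKS);
			return 0;
		}
   	----------------------------------------------------------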

3.) About this patch.
	Make a KVM_REQ_RELOAD_PMU request before reentering the
	guest. The request will call kvm_vcpu_pmu_restore_guest()
	to reconfigure the event filters for the guest context.

4.) Test result of this patch:
      #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
            time             counts unit events
     1.001006400      3,298,348,656      cycles:G                                                                (70.03%)
     1.001006400          3,144,532      cycles:H                                                                (70.03%)
     1.001006400            941,149      L1-dcache-loads                                                         (70.03%)
     1.001006400             17,937      L1-dcache-load-misses            #    1.91% of all L1-dcache accesses   (70.03%)
     1.001006400    <not supported>      LLC-loads
     1.001006400    <not supported>      LLC-load-misses
     1.001006400          1,101,889      L1-icache-loads                                                         (70.03%)
     1.001006400            121,638      L1-icache-load-misses            #   11.04% of all L1-icache accesses   (70.03%)
     1.001006400          1,031,228      dTLB-loads                                                              (70.03%)
     1.001006400             26,952      dTLB-load-misses                 #    2.61% of all dTLB cache accesses  (69.93%)
     1.001006400          1,030,678      iTLB-loads                                                              (59.94%)
     1.001006400                338      iTLB-load-misses                 #    0.03% of all iTLB cache accesses  (59.94%)

    The result is correct. The "cycles:G" count is nearly 3.3G now.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
v1 --> v2:
	Do not change the perf/core code; only change the arm64 KVM code.
	v1: https://lkml.org/lkml/2023/8/8/1465

---
 arch/arm64/kvm/arm.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Marc Zyngier Aug. 10, 2023, 3:27 p.m. UTC | #1
Huang,

Please make sure you add everyone who commented on v1 (I've Cc'd Mark
so that he can chime in as needed).

On Thu, 10 Aug 2023 08:29:06 +0100,
Huang Shijie <shijie@os.amperecomputing.com> wrote:
> [...]
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index c2c14059f6a8..475a2f0e0e40 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -919,8 +919,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		if (!ret)
>  			ret = 1;
>  
> -		if (ret > 0)
> +		if (ret > 0) {
> +			/*
> +			 * perf_rotate_context() may rotate the events and
> +			 * reprogram the PMU with filters for the host context.
> +			 * So make a request before reentering the guest to
> +			 * reconfigure the event filters for the guest context.
> +			 */
> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
> +
>  			ret = check_vcpu_requests(vcpu);
> +		}

This looks extremely heavy-handed. You're performing the reload on
*every* entry, and I don't think this is right (exit-heavy workloads
will suffer from it).

Furthermore, you're also reloading the virtual state of the PMU
(recreating guest events and other things), all of which looks pretty
pointless, as all we're interested in is what is being counted on the
*host*.

Instead, we can restrict the reload of the host state (and only that)
to situations where:

- we're running on a VHE system

- we have a host PMUv3 (not everybody does), as that's the only way we
  can profile a guest

and ideally we would have a way to detect that a rotation happened
(which may require some help from the low-level PMU code).

Thanks,

	M.
Shijie Huang Aug. 11, 2023, 1:46 a.m. UTC | #2
Hi Marc,

On 2023/8/10 23:27, Marc Zyngier wrote:
> Huang,
>
> Please make sure you add everyone who commented on v1 (I've Cc'd Mark
> so that he can chime in as needed).
thanks.
>
> On Thu, 10 Aug 2023 08:29:06 +0100,
> Huang Shijie <shijie@os.amperecomputing.com> wrote:
>> [...]
> This looks extremely heavy-handed. You're performing the reload on
> *every* entry, and I don't think this is right (exit-heavy workloads
> will suffer from it).
>
> Furthermore, you're also reloading the virtual state of the PMU
> (recreating guest events and other things), all of which looks pretty
> pointless, as all we're interested in is what is being counted on the
> *host*.

okay. What about adding a _new_ request, such as KVM_REQ_RESTORE_PMU_GUEST?


> Instead, we can restrict the reload of the host state (and only that)
> to situations where:
>
> - we're running on a VHE system
>
> - we have a host PMUv3 (not everybody does), as that's the only way we
>    can profile a guest

okay. No problem.


>
> and ideally we would have a way to detect that a rotation happened
> (which may require some help from the low-level PMU code).

I will check it, and hopefully we can find a better way.


Thanks

Huang Shijie
Marc Zyngier Aug. 11, 2023, 6:10 a.m. UTC | #3
On Fri, 11 Aug 2023 02:46:49 +0100,
Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
> 
> Hi Marc,
> 
> On 2023/8/10 23:27, Marc Zyngier wrote:
> > Huang,
> > 
> > Please make sure you add everyone who commented on v1 (I've Cc'd Mark
> > so that he can chime in as needed).
> thanks.
> > 
> > On Thu, 10 Aug 2023 08:29:06 +0100,
> > Huang Shijie <shijie@os.amperecomputing.com> wrote:
> >> [...]
> > This looks extremely heavy-handed. You're performing the reload on
> > *every* entry, and I don't think this is right (exit-heavy workloads
> > will suffer from it).
> > 
> > Furthermore, you're also reloading the virtual state of the PMU
> > (recreating guest events and other things), all of which looks pretty
> > pointless, as all we're interested in is what is being counted on the
> > *host*.
> 
> okay. What about adding a _new_ request, such as KVM_REQ_RESTORE_PMU_GUEST?
> 
> 
> > Instead, we can restrict the reload of the host state (and only that)
> > to situations where:
> > 
> > - we're running on a VHE system
> > 
> > - we have a host PMUv3 (not everybody does), as that's the only way we
> >    can profile a guest
> 
> okay. No problem.
> 
> 
> > 
> > and ideally we would have a way to detect that a rotation happened
> > (which may require some help from the low-level PMU code).
> 
> I will check it, and hopefully we can find a better way.

I came up with the following patch, completely untested. Let me know
how that fares for you.

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 93c541111dea..fb875c5c0347 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -49,6 +49,7 @@
 #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
 #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
 #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
+#define KVM_REQ_RELOAD_GUEST_PMU_EVENTS	KVM_ARCH_REQ(7)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 8b51570a76f8..b40db24f1f0b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -804,6 +804,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 			kvm_pmu_handle_pmcr(vcpu,
 					    __vcpu_sys_reg(vcpu, PMCR_EL0));
 
+		if (kvm_check_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu))
+			kvm_vcpu_pmu_restore_guest(vcpu);
+
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
 
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 08b3a1bf0ef6..7012de417092 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -772,6 +772,9 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
 
 	/* Enable all counters */
 	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
+
+	if (in_interrupt())
+		kvm_resync_guest_context();
 }
 
 static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 847da6fc2713..d66f7216b5a9 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -74,6 +74,7 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
 struct kvm_pmu_events *kvm_get_pmu_events(void);
 void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu);
 void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
+void kvm_resync_guest_context(void);
 
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
@@ -171,6 +172,7 @@ static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
 {
 	return 0;
 }
+static inline void kvm_resync_guest_context(void) {}
 
 #endif
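
One gap in the diff above: kvm_resync_guest_context() gets a declaration
and a stub for the no-PMU case, but its body is not shown. A minimal
sketch of what it could look like, following the constraints listed
earlier in the thread (VHE only, and only when a vCPU is actually loaded
on this CPU); the body below is an assumption, not part of the posted
patch:

	/* Sketch only -- not part of the posted diff */
	void kvm_resync_guest_context(void)
	{
		struct kvm_vcpu *vcpu;

		/* Only makes sense on VHE, per the constraints above */
		if (!has_vhe())
			return;

		/* kvm_get_running_vcpu() returns NULL outside vCPU context */
		vcpu = kvm_get_running_vcpu();
		if (!vcpu)
			return;

		kvm_make_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu);
	}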
Shijie Huang Aug. 11, 2023, 7:10 a.m. UTC | #4
Hi Marc,

On 2023/8/11 14:10, Marc Zyngier wrote:
> On Fri, 11 Aug 2023 02:46:49 +0100,
> Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>> Hi Marc,
>>
>> On 2023/8/10 23:27, Marc Zyngier wrote:
>>> Huang,
>>>
>>> Please make sure you add everyone who commented on v1 (I've Cc'd Mark
>>> so that he can chime in as needed).
>> thanks.
>>> On Thu, 10 Aug 2023 08:29:06 +0100,
>>> Huang Shijie <shijie@os.amperecomputing.com> wrote:
>>>> [...]
>>> This looks extremely heavy-handed. You're performing the reload on
>>> *every* entry, and I don't think this is right (exit-heavy workloads
>>> will suffer from it).
>>>
>>> Furthermore, you're also reloading the virtual state of the PMU
>>> (recreating guest events and other things), all of which looks pretty
>>> pointless, as all we're interested in is what is being counted on the
>>> *host*.
>> okay. What about adding a _new_ request, such as KVM_REQ_RESTORE_PMU_GUEST?
>>
>>
>>> Instead, we can restrict the reload of the host state (and only that)
>>> to situations where:
>>>
>>> - we're running on a VHE system
>>>
>>> - we have a host PMUv3 (not everybody does), as that's the only way we
>>>     can profile a guest
>> okay. No problem.
>>
>>
>>> and ideally we would have a way to detect that a rotation happened
>>> (which may require some help from the low-level PMU code).
>> I will check it, and hopefully we can find a better way.
> I came up with the following patch, completely untested. Let me know
> how that fares for you.
>
> Thanks,
>
> 	M.
>
> [...]
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 08b3a1bf0ef6..7012de417092 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -772,6 +772,9 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
>   
>   	/* Enable all counters */
>   	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
> +
> +	if (in_interrupt())
> +		kvm_resync_guest_context();

I currently added a similar check in armv8pmu_get_event_idx().

Event multiplexing will call armv8pmu_get_event_idx(), which will
definitely fail at least once.

+++ b/drivers/perf/arm_pmuv3.c
@@ -882,6 +882,8 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
        struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
        struct hw_perf_event *hwc = &event->hw;
        unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
+       struct kvm_vcpu *vcpu;
+       int index;

        /* Always prefer to place a cycle counter into the cycle counter. */
        if (evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) {
@@ -897,9 +899,22 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
         * Otherwise use events counters
         */
        if (armv8pmu_event_is_chained(event))
-               return  armv8pmu_get_chain_idx(cpuc, cpu_pmu);
+               index = armv8pmu_get_chain_idx(cpuc, cpu_pmu);
        else
-               return armv8pmu_get_single_idx(cpuc, cpu_pmu);
+               index = armv8pmu_get_single_idx(cpuc, cpu_pmu);
+
+       /*
+        * If we are in pmu multiplexing, we will definitely meet a failure.
+        * Please see perf_rotate_context().
+        * If we are in the guest context, we can mark it.
+        */
+       if (index < 0) {
+               vcpu = kvm_get_running_vcpu();
+               if (vcpu && in_interrupt() && !event->attr.pinned) {
+                       kvm_resync_guest_context();
+               }
+       }
+       return index;
 }

IMHO, it's better to change armv8pmu_get_event_idx().

But if you think it is also okay to change armv8pmu_start() to fix the
bug, I am fine with that too.


Thanks

Huang Shijie


> [...]
Marc Zyngier Aug. 11, 2023, 7:42 a.m. UTC | #5
On Fri, 11 Aug 2023 08:10:26 +0100,
Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
> 
> Hi Marc,
> 
> On 2023/8/11 14:10, Marc Zyngier wrote:
> > On Fri, 11 Aug 2023 02:46:49 +0100,
> > Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
> >> Hi Marc,
> >> 
> >> On 2023/8/10 23:27, Marc Zyngier wrote:
> >>> Huang,
> >>> 
> >>> Please make sure you add everyone who commented on v1 (I've Cc'd Mark
> >>> so that he can chime in as needed).
> >> thanks.
> >>> On Thu, 10 Aug 2023 08:29:06 +0100,
> >>> Huang Shijie <shijie@os.amperecomputing.com> wrote:
> >>>> [...]
> >>> This looks extremely heavy-handed. You're performing the reload on
> >>> *every* entry, and I don't think this is right (exit-heavy workloads
> >>> will suffer from it).
> >>> 
> >>> Furthermore, you're also reloading the virtual state of the PMU
> >>> (recreating guest events and other things), all of which looks pretty
> >>> pointless, as all we're interested in is what is being counted on the
> >>> *host*.
> >> okay. What about adding a _new_ request, such as KVM_REQ_RESTORE_PMU_GUEST?
> >> 
> >> 
> >>> Instead, we can restrict the reload of the host state (and only that)
> >>> to situations where:
> >>> 
> >>> - we're running on a VHE system
> >>> 
> >>> - we have a host PMUv3 (not everybody does), as that's the only way we
> >>>     can profile a guest
> >> okay. No problem.
> >> 
> >> 
> >>> and ideally we would have a way to detect that a rotation happened
> >>> (which may require some help from the low-level PMU code).
> >> I will check it, and hopefully we can find a better way.
> > I came up with the following patch, completely untested. Let me know
> > how that fares for you.
> > 
> > Thanks,
> > 
> > 	M.
> > 
> > [...]
> > diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> > index 08b3a1bf0ef6..7012de417092 100644
> > --- a/drivers/perf/arm_pmuv3.c
> > +++ b/drivers/perf/arm_pmuv3.c
> > @@ -772,6 +772,9 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
> > 
> >  	/* Enable all counters */
> >  	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
> > +
> > +	if (in_interrupt())
> > +		kvm_resync_guest_context();
> 
> I currently added a similar check in armv8pmu_get_event_idx().
> 
> Event multiplexing will call armv8pmu_get_event_idx(), which will
> definitely fail at least once.
> 
> [...]
> 
> IMHO, it's better to change armv8pmu_get_event_idx().
> 
> But if you think it is also okay to change armv8pmu_start() to fix the
> bug, I am fine with that too.

But that's doing work each time you rotate an event. And if you rotate
a bunch of them, you'll hit this path multiple times, reloading the
stuff again. What's the point?

My take is that we can hook at the point where the PMU gets
re-enabled, and have the full context once and for all.

Unless of course I miss something, which is very likely as the whole
perf subsystem generally escapes me altogether.

In any case, I'd welcome your testing the proposed patch.

Thanks,

	M.
Shijie Huang Aug. 11, 2023, 7:52 a.m. UTC | #6
Hi Marc,

On 2023/8/11 15:42, Marc Zyngier wrote:
> On Fri, 11 Aug 2023 08:10:26 +0100,
> Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>> Hi Marc,
>>
>> On 2023/8/11 14:10, Marc Zyngier wrote:
>>> On Fri, 11 Aug 2023 02:46:49 +0100,
>>> Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:
>>>> Hi Marc,
>>>>
>>>> On 2023/8/10 23:27, Marc Zyngier wrote:
>>>>> Huang,
>>>>>
>>>>> Please make sure you add everyone who commented on v1 (I've Cc'd Mark
>>>>> so that he can chime in as needed).
>>>> thanks.
>>>>> On Thu, 10 Aug 2023 08:29:06 +0100,
>>>>> Huang Shijie <shijie@os.amperecomputing.com> wrote:
>>>>>> [...]
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index c2c14059f6a8..475a2f0e0e40 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -919,8 +919,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>>>>>     		if (!ret)
>>>>>>     			ret = 1;
>>>>>>     -		if (ret > 0)
>>>>>> +		if (ret > 0) {
>>>>>> +			/*
>>>>>> +			 * The perf_rotate_context() may rotate the events and
>>>>>> +			 * reprogram the PMU with filters for the host context.
>>>>>> +			 * So make a request before reentering the guest to
>>>>>> +			 * reconfigure the event filters for the guest context.
>>>>>> +			 */
>>>>>> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
>>>>>> +
>>>>>>     			ret = check_vcpu_requests(vcpu);
>>>>>> +		}
>>>>> This looks extremely heavy handed. You're performing the reload on
>>>>> *every* entry, and I don't think this is right (exit-heavy workloads
>>>>> will suffer from it)
>>>>>
>>>>> Furthermore, you're also reloading the virtual state of the PMU
>>>>> (recreating guest events and other things), all of which looks pretty
>>>>> pointless, as all we're interested in is what is being counted on the
>>>>> *host*.
>>>> okay. What about adding a _new_ request, such as KVM_REQ_RESTORE_PMU_GUEST?
>>>>
>>>>
>>>>> Instead, we can restrict the reload of the host state (and only that)
>>>>> to situations where:
>>>>>
>>>>> - we're running on a VHE system
>>>>>
>>>>> - we have a host PMUv3 (not everybody does), as that's the only way we
>>>>>      can profile a guest
>>>> okay. No problem.
>>>>
>>>>
>>>>> and ideally we would have a way to detect that a rotation happened
>>>>> (which may require some help from the low-level PMU code).
>>>> I will check it, hope we can find a better way.
>>> I came up with the following patch, completely untested. Let me know
>>> how that fares for you.
>>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 93c541111dea..fb875c5c0347 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -49,6 +49,7 @@
>>>    #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
>>>    #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
>>>    #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
>>> +#define KVM_REQ_RELOAD_GUEST_PMU_EVENTS	KVM_ARCH_REQ(7)
>>>      #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>>    				     KVM_DIRTY_LOG_INITIALLY_SET)
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index 8b51570a76f8..b40db24f1f0b 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -804,6 +804,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>    			kvm_pmu_handle_pmcr(vcpu,
>>>    					    __vcpu_sys_reg(vcpu, PMCR_EL0));
>>>    +		if (kvm_check_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu))
>>> +			kvm_vcpu_pmu_restore_guest(vcpu);
>>> +
>>>    		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
>>>    			return kvm_vcpu_suspend(vcpu);
>>>    diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
>>> index 08b3a1bf0ef6..7012de417092 100644
>>> --- a/drivers/perf/arm_pmuv3.c
>>> +++ b/drivers/perf/arm_pmuv3.c
>>> @@ -772,6 +772,9 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
>>>      	/* Enable all counters */
>>>    	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
>>> +
>>> +	if (in_interrupt())
>>> +		kvm_resync_guest_context();
>> I currently added a similar check in armv8pmu_get_event_idx().
>>
>> The event multiplexing will call armv8pmu_get_event_idx(), and it will
>> definitely fail at least once: rotation only happens when there are
>> more events than counters, so scheduling them back in must run out of
>> counters at some point.
>>
>> +++ b/drivers/perf/arm_pmuv3.c
>> @@ -882,6 +882,8 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
>>          struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
>>          struct hw_perf_event *hwc = &event->hw;
>>          unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
>> +       struct kvm_vcpu *vcpu;
>> +       int index;
>>
>>          /* Always prefer to place a cycle counter into the cycle counter. */
>>          if (evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) {
>> @@ -897,9 +899,22 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
>>           * Otherwise use events counters
>>           */
>>          if (armv8pmu_event_is_chained(event))
>> -               return  armv8pmu_get_chain_idx(cpuc, cpu_pmu);
>> +               index = armv8pmu_get_chain_idx(cpuc, cpu_pmu);
>>          else
>> -               return armv8pmu_get_single_idx(cpuc, cpu_pmu);
>> +               index = armv8pmu_get_single_idx(cpuc, cpu_pmu);
>> +
>> +       /*
>> +        * If we are in PMU multiplexing, we will definitely hit a failure.
>> +        * Please see perf_rotate_context().
>> +        * If we are in the guest context, mark the vcpu for a resync.
>> +        */
>> +       if (index < 0) {
>> +               vcpu = kvm_get_running_vcpu();
>> +               if (vcpu && in_interrupt() && !event->attr.pinned) {
>> +                       kvm_resync_guest_context();

xxxx.


>> +               }
>> +       }
>> +       return index;
>>   }
>>
>> IMHO, it's better to change armv8pmu_get_event_idx().
>>
>> But if you think it is also okay to change armv8pmu_start() to fix
>> the bug, I am okay with that too.
> But that's doing work each time you rotate an event. And if you rotate
> a bunch of them, you'll hit this path multiple times, reloading the
> stuff again. What's the point?

In my code, I just put the kvm_make_request() at the "xxxx." mark above.
Even if we hit that path multiple times, it just sets a bit in
vcpu->requests.
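
A toy model of the request mechanism (not the kernel code -- the real
helpers use set_bit()/test_bit() and memory barriers -- just to show why
repeated requests during one rotation are harmless):

	#include <stdio.h>

	#define REQ_RELOAD_GUEST_PMU_EVENTS 7	/* stand-in request number */

	struct toy_vcpu { unsigned long requests; };

	/* Like kvm_make_request(): setting the same bit twice is a no-op. */
	static void make_request(int req, struct toy_vcpu *vcpu)
	{
		vcpu->requests |= 1UL << req;
	}

	/* Like kvm_check_request(): test and clear, so the work runs once. */
	static int check_request(int req, struct toy_vcpu *vcpu)
	{
		if (!(vcpu->requests & (1UL << req)))
			return 0;
		vcpu->requests &= ~(1UL << req);
		return 1;
	}

	int main(void)
	{
		struct toy_vcpu v = { 0 };

		/* Three failed index allocations in one rotation... */
		make_request(REQ_RELOAD_GUEST_PMU_EVENTS, &v);
		make_request(REQ_RELOAD_GUEST_PMU_EVENTS, &v);
		make_request(REQ_RELOAD_GUEST_PMU_EVENTS, &v);

		/* ...still only one reload on the next guest entry. */
		printf("reloads: %d\n",
		       check_request(REQ_RELOAD_GUEST_PMU_EVENTS, &v) +
		       check_request(REQ_RELOAD_GUEST_PMU_EVENTS, &v));
		return 0;
	}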


>
> My take is that we can hook at the point where the PMU gets
> re-enabled, and have the full context once and for all.
>
> Unless of course I miss something, which is very likely as the whole
> perf subsystem generally escapes me altogether.
>
> In any case, I'd welcome your testing the proposed patch.

No problem.


Thanks

Huang Shijie

Marc Zyngier Aug. 11, 2023, 7:56 a.m. UTC | #7
On Fri, 11 Aug 2023 08:52:40 +0100,
Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> > In any case, I'd welcome your testing the proposed patch.
> 
> No problem.

As Oliver pointed out offline, I have only posted half of the patch...

Here is the whole thing below.

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 93c541111dea..fb875c5c0347 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -49,6 +49,7 @@
 #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
 #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
 #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
+#define KVM_REQ_RELOAD_GUEST_PMU_EVENTS	KVM_ARCH_REQ(7)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 8b51570a76f8..b40db24f1f0b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -804,6 +804,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 			kvm_pmu_handle_pmcr(vcpu,
 					    __vcpu_sys_reg(vcpu, PMCR_EL0));
 
+		if (kvm_check_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu))
+			kvm_vcpu_pmu_restore_guest(vcpu);
+
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
 
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 121f1a14c829..7bd1facc8f15 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -236,3 +236,17 @@ bool kvm_set_pmuserenr(u64 val)
 	ctxt_sys_reg(hctxt, PMUSERENR_EL0) = val;
 	return true;
 }
+
+void kvm_resync_guest_context(void)
+{
+	struct kvm_vcpu *vcpu;
+
+	if (!kvm_arm_support_pmu_v3() || !has_vhe())
+		return;
+
+	vcpu = kvm_get_running_vcpu();
+	if (!vcpu)
+		return;
+
+	kvm_make_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu);
+}
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 08b3a1bf0ef6..7012de417092 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -772,6 +772,9 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
 
 	/* Enable all counters */
 	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
+
+	if (in_interrupt())
+		kvm_resync_guest_context();
 }
 
 static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 847da6fc2713..d66f7216b5a9 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -74,6 +74,7 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
 struct kvm_pmu_events *kvm_get_pmu_events(void);
 void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu);
 void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
+void kvm_resync_guest_context(void);
 
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
@@ -171,6 +172,7 @@ static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
 {
 	return 0;
 }
+static inline void kvm_resync_guest_context(void) {}
 
 #endif
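
For completeness, my understanding of the resulting flow when a rotation
fires while the guest is running (assuming VHE and a host PMUv3 -- this
is a sketch of the intent, not part of the patch):

	/*
	 * hrtimer fires (interrupt context, guest PMU filters live):
	 *   perf_mux_hrtimer_handler()
	 *     -> perf_rotate_context()
	 *          -> perf_pmu_enable()
	 *               -> armv8pmu_start()       // in_interrupt() is true
	 *                    -> kvm_resync_guest_context()
	 *                         -> kvm_make_request(KVM_REQ_RELOAD_GUEST_PMU_EVENTS, vcpu)
	 *
	 * next guest entry:
	 *   check_vcpu_requests()
	 *     -> kvm_vcpu_pmu_restore_guest(vcpu) // guest filters reinstated
	 */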
Shijie Huang Aug. 11, 2023, 8:16 a.m. UTC | #8
Hi Marc,

On 2023/8/11 15:56, Marc Zyngier wrote:

[...]

> As Oliver pointed out offline, I have only posted half of the patch...
>
> Here is the whole thing below.

I tested it just now, and it works fine.

Shall I create a v3 patch based on your patch, or will you do it yourself?


Thanks

Huang Shijie

Marc Zyngier Aug. 11, 2023, 6:09 p.m. UTC | #9
On Fri, 11 Aug 2023 09:16:20 +0100,
Shijie Huang <shijie@amperemail.onmicrosoft.com> wrote:

[...]

> > Here is the whole thing below.
> 
> I tested it just now, and it works fine.
> 
> Shall I create a v3 patch based on your patch, or will you do it yourself?

I just posted a (slightly reworked) version at [1]. Please let us know
how it fares.

Thanks,

	M.

[1] https://lore.kernel.org/r/20230811180520.131727-1-maz@kernel.org
diff mbox series

Patch

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index c2c14059f6a8..475a2f0e0e40 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -919,8 +919,17 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (!ret)
 			ret = 1;
 
-		if (ret > 0)
+		if (ret > 0) {
+			/*
+			 * The perf_rotate_context() may rotate the events and
+			 * reprogram the PMU with filters for the host context.
+			 * So make a request before reentering the guest to
+			 * reconfigure the event filters for the guest context.
+			 */
+			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
+
 			ret = check_vcpu_requests(vcpu);
+		}
 
 		/*
 		 * Preparing the interrupts to be injected also