| Message ID | 20220819110939.78013-11-likexu@tencent.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | x86/pmu: Test case optimization, fixes and additions |
Hi Like,

On 8/19/2022 4:39 PM, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> For most unit tests, the basic framework and use cases which test
> any PMU counter do not require any changes, except for two things:
>
> - No access to registers introduced only in PMU version 2 and above;
> - Expanded tolerance for testing counter overflows
>   due to the loss of uniform control of the global_ctrl register
>
> Adding some pmu_version() return value checks can seamlessly support
> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 21 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 25fafbe..826472c 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> [...]
> @@ -520,10 +544,13 @@ static void check_emulated_instr(void)
>                "instruction count");
>         report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
>                "branch count");
> -       // Additionally check that those counters overflowed properly.
> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> -       report(status & 1, "instruction counter overflow");
> -       report(status & 2, "branch counter overflow");
> +
> +       if (pmu_version() > 1) {
> +               // Additionally check that those counters overflowed properly.
> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> +               report(status & 1, "instruction counter overflow");
> +               report(status & 2, "branch counter overflow");
> +       }
>

This should use status bit 1 for instructions and bit 0 for branches.

>        report_prefix_pop();
>  }
> [...]

- Sandipan
Hi Like,

On 8/19/2022 4:39 PM, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> For most unit tests, the basic framework and use cases which test
> any PMU counter do not require any changes, except for two things:
>
> - No access to registers introduced only in PMU version 2 and above;
> - Expanded tolerance for testing counter overflows
>   due to the loss of uniform control of the global_ctrl register
>
> Adding some pmu_version() return value checks can seamlessly support
> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 21 deletions(-)
>
> [...]
> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>                 cnt.config &= ~EVNTSEL_INT;
>                 idx = event_to_global_idx(&cnt);
>                 __measure(&cnt, cnt.count);
> -               report(cnt.count == 1, "cntr-%d", i);
> +
> +               report(check_irq() == (i % 2), "irq-%d", i);
> +               if (pmu_version() > 1)
> +                       report(cnt.count == 1, "cntr-%d", i);
> +               else
> +                       report(cnt.count < 4, "cntr-%d", i);
> +
> [...]

Sorry I missed this in the previous response. With an upper bound of
4, I see this test failing sometimes for at least one of the six
counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
system. Increasing it further does reduce the probability but I still
see failures. Do you see the same behaviour on systems with Zen 3 and
older processors?

- Sandipan
On 6/9/2022 3:15 pm, Sandipan Das wrote:
> Hi Like,
>
> On 8/19/2022 4:39 PM, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> For most unit tests, the basic framework and use cases which test
>> any PMU counter do not require any changes, except for two things:
>>
>> - No access to registers introduced only in PMU version 2 and above;
>> - Expanded tolerance for testing counter overflows
>>   due to the loss of uniform control of the global_ctrl register
>>
>> Adding some pmu_version() return value checks can seamlessly support
>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 25fafbe..826472c 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> [...]
>> @@ -520,10 +544,13 @@ static void check_emulated_instr(void)
>>                "instruction count");
>>         report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
>>                "branch count");
>> -       // Additionally check that those counters overflowed properly.
>> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> -       report(status & 1, "instruction counter overflow");
>> -       report(status & 2, "branch counter overflow");
>> +
>> +       if (pmu_version() > 1) {
>> +               // Additionally check that those counters overflowed properly.
>> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> +               report(status & 1, "instruction counter overflow");
>> +               report(status & 2, "branch counter overflow");
>> +       }
>>
>
> This should use status bit 1 for instructions and bit 0 for branches.

Yes, this misleading statement stems from
20cf914 ("x86/pmu: Test PMU virtualization on emulated instructions")

I will fix it as part of this patch set. Thanks.

>
>>        report_prefix_pop();
>>  }
>> [...]
>
> - Sandipan
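For reference, a minimal sketch of the corrected check, assuming the counter
assignment the comment implies (branch event programmed on counter 0 and the
instruction event on counter 1, with bit i of GLOBAL_STATUS flagging counter i):

    if (pmu_version() > 1) {
        /* Bit i of GLOBAL_STATUS is the overflow flag of counter i. */
        status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
        report(status & 2, "instruction counter overflow");
        report(status & 1, "branch counter overflow");
    }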
On 6/9/2022 4:16 pm, Sandipan Das wrote:
> Hi Like,
>
> On 8/19/2022 4:39 PM, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> For most unit tests, the basic framework and use cases which test
>> any PMU counter do not require any changes, except for two things:
>>
>> - No access to registers introduced only in PMU version 2 and above;
>> - Expanded tolerance for testing counter overflows
>>   due to the loss of uniform control of the global_ctrl register
>>
>> Adding some pmu_version() return value checks can seamlessly support
>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>
>> [...]
>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>                 cnt.config &= ~EVNTSEL_INT;
>>                 idx = event_to_global_idx(&cnt);
>>                 __measure(&cnt, cnt.count);
>> -               report(cnt.count == 1, "cntr-%d", i);
>> +
>> +               report(check_irq() == (i % 2), "irq-%d", i);
>> +               if (pmu_version() > 1)
>> +                       report(cnt.count == 1, "cntr-%d", i);
>> +               else
>> +                       report(cnt.count < 4, "cntr-%d", i);
>> +
>> [...]
>
> Sorry I missed this in the previous response. With an upper bound of
> 4, I see this test failing sometimes for at least one of the six
> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
> system. Increasing it further does reduce the probability but I still
> see failures. Do you see the same behaviour on systems with Zen 3 and
> older processors?

A hundred runs on my machine did not report a failure.

But I'm not surprised by this, because some AMD platforms do
have hw PMU errata which require BIOS or ucode fixes.

Please help find the right upper bound for all your available AMD boxes.

What makes me most nervous is that AMD's core hardware events run
repeatedly against the same workload, and their count results are erratic.

You may check is_the_count_reproducible() in the test case:
[1] https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@tencent.com/

>
> - Sandipan
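For context, the helper referenced as [1] decides whether a counter's result
for a fixed workload is exactly repeatable. A sketch of the idea, not the
verbatim code from the linked series: measure the same event several times and
report whether every run produces an identical count (pmu_counter_t and
__measure() are the existing x86/pmu.c primitives).

    static bool is_the_count_reproducible(pmu_counter_t *cnt)
    {
        unsigned int i;
        uint64_t count;

        /* Take a baseline measurement of the event. */
        __measure(cnt, 0);
        count = cnt->count;

        /* Re-run the identical workload; any drift means "not reproducible". */
        for (i = 0; i < 10; i++) {
            __measure(cnt, 0);
            if (count != cnt->count)
                return false;
        }

        return true;
    }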
On 9/6/2022 7:05 PM, Like Xu wrote:
> On 6/9/2022 4:16 pm, Sandipan Das wrote:
>> Hi Like,
>>
>> On 8/19/2022 4:39 PM, Like Xu wrote:
>>> From: Like Xu <likexu@tencent.com>
>>>
>>> For most unit tests, the basic framework and use cases which test
>>> any PMU counter do not require any changes, except for two things:
>>>
>>> - No access to registers introduced only in PMU version 2 and above;
>>> - Expanded tolerance for testing counter overflows
>>>   due to the loss of uniform control of the global_ctrl register
>>>
>>> Adding some pmu_version() return value checks can seamlessly support
>>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>>
>>> Signed-off-by: Like Xu <likexu@tencent.com>
>>> ---
>>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>>
>>> [...]
>>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>>                 cnt.config &= ~EVNTSEL_INT;
>>>                 idx = event_to_global_idx(&cnt);
>>>                 __measure(&cnt, cnt.count);
>>> -               report(cnt.count == 1, "cntr-%d", i);
>>> +
>>> +               report(check_irq() == (i % 2), "irq-%d", i);
>>> +               if (pmu_version() > 1)
>>> +                       report(cnt.count == 1, "cntr-%d", i);
>>> +               else
>>> +                       report(cnt.count < 4, "cntr-%d", i);
>>> +
>>> [...]
>>
>> Sorry I missed this in the previous response. With an upper bound of
>> 4, I see this test failing sometimes for at least one of the six
>> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
>> system. Increasing it further does reduce the probability but I still
>> see failures. Do you see the same behaviour on systems with Zen 3 and
>> older processors?
>
> A hundred runs on my machine did not report a failure.
>

Was this on a Zen 4 system?

> But I'm not surprised by this, because some AMD platforms do
> have hw PMU errata which require BIOS or ucode fixes.
>
> Please help find the right upper bound for all your available AMD boxes.
>

Even after updating the microcode, the tests failed just as often in an
overnight loop. However, upon closer inspection, the reason for failure
was different. The variance is well within the bounds now but sometimes,
is_the_count_reproducible() is true. Since this selects the original
verification criteria (cnt.count == 1), the tests fail.

> What makes me most nervous is that AMD's core hardware events run
> repeatedly against the same workload, and their count results are erratic.
>

With that in mind, should we consider having the following change?

diff --git a/x86/pmu.c b/x86/pmu.c
index bb16b3c..39979b8 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -352,7 +352,7 @@ static void check_counter_overflow(void)
                .ctr = gp_counter_base,
                .config = EVNTSEL_OS | EVNTSEL_USR | (*gp_events)[1].unit_sel /* instructions */,
        };
-       bool precise_event = is_the_count_reproducible(&cnt);
+       bool precise_event = is_intel() ? is_the_count_reproducible(&cnt) : false;

        __measure(&cnt, 0);
        count = cnt.count;

With this, the tests always pass. I will run another overnight loop and
report back if I see any errors.

> You may check is_the_count_reproducible() in the test case:
> [1] https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@tencent.com/

On Zen 4 systems, this is always false and the overflow tests always
pass irrespective of whether PerfMonV2 is enabled for the guest or not.

- Sandipan
On 8/9/2022 4:23 pm, Sandipan Das wrote:
> On 9/6/2022 7:05 PM, Like Xu wrote:
>> On 6/9/2022 4:16 pm, Sandipan Das wrote:
>>> Hi Like,
>>>
>>> On 8/19/2022 4:39 PM, Like Xu wrote:
>>>> From: Like Xu <likexu@tencent.com>
>>>>
>>>> For most unit tests, the basic framework and use cases which test
>>>> any PMU counter do not require any changes, except for two things:
>>>>
>>>> - No access to registers introduced only in PMU version 2 and above;
>>>> - Expanded tolerance for testing counter overflows
>>>>   due to the loss of uniform control of the global_ctrl register
>>>>
>>>> Adding some pmu_version() return value checks can seamlessly support
>>>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>>>
>>>> Signed-off-by: Like Xu <likexu@tencent.com>
>>>> ---
>>>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>>>
>>>> [...]
>>>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>>>                 cnt.config &= ~EVNTSEL_INT;
>>>>                 idx = event_to_global_idx(&cnt);
>>>>                 __measure(&cnt, cnt.count);
>>>> -               report(cnt.count == 1, "cntr-%d", i);
>>>> +
>>>> +               report(check_irq() == (i % 2), "irq-%d", i);
>>>> +               if (pmu_version() > 1)
>>>> +                       report(cnt.count == 1, "cntr-%d", i);
>>>> +               else
>>>> +                       report(cnt.count < 4, "cntr-%d", i);
>>>> +
>>>> [...]
>>>
>>> Sorry I missed this in the previous response. With an upper bound of
>>> 4, I see this test failing sometimes for at least one of the six
>>> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
>>> system. Increasing it further does reduce the probability but I still
>>> see failures. Do you see the same behaviour on systems with Zen 3 and
>>> older processors?
>>
>> A hundred runs on my machine did not report a failure.
>>
>
> Was this on a Zen 4 system?
>
>> But I'm not surprised by this, because some AMD platforms do
>> have hw PMU errata which require BIOS or ucode fixes.
>>
>> Please help find the right upper bound for all your available AMD boxes.
>>
>
> Even after updating the microcode, the tests failed just as often in an
> overnight loop. However, upon closer inspection, the reason for failure
> was different. The variance is well within the bounds now but sometimes,
> is_the_count_reproducible() is true. Since this selects the original
> verification criteria (cnt.count == 1), the tests fail.
>
>> What makes me most nervous is that AMD's core hardware events run
>> repeatedly against the same workload, and their count results are erratic.
>>
>
> With that in mind, should we consider having the following change?
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index bb16b3c..39979b8 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -352,7 +352,7 @@ static void check_counter_overflow(void)
>                 .ctr = gp_counter_base,
>                 .config = EVNTSEL_OS | EVNTSEL_USR | (*gp_events)[1].unit_sel /* instructions */,
>         };
> -       bool precise_event = is_the_count_reproducible(&cnt);
> +       bool precise_event = is_intel() ? is_the_count_reproducible(&cnt) : false;
>
>         __measure(&cnt, 0);
>         count = cnt.count;
>
> With this, the tests always pass. I will run another overnight loop and
> report back if I see any errors.
>
>> You may check is_the_count_reproducible() in the test case:
>> [1] https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@tencent.com/
>
> On Zen 4 systems, this is always false and the overflow tests always
> pass irrespective of whether PerfMonV2 is enabled for the guest or not.
>
> - Sandipan

I could change it to:

        if (is_intel())
                report(cnt.count == 1, "cntr-%d", i);
        else
                report(cnt.count < 4, "cntr-%d", i);

but this does not explain the difference, that is, for the same workload:

if a retired hw event like "PMCx0C0 [Retired Instructions] (ExRetInstr)" is
configured, then it's expected to count "the number of instructions retired";
the value depends only on the workload and should remain the same over
multiple measurements,

but there are two hardware counters, one AMD and one Intel, both reset to
an identical value (like "cnt.count = 1 - count"), and when they overflow,
the Intel counter can stay exactly at 1, while the AMD counter cannot.

I know there are underlying hardware micro-arch implementation differences
here, but what AMD is doing violates the semantics of "retired".

Is this behavior normal by design?
I'm not sure what I'm missing; this behavior is reinforced in Zen 4 as you said.
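To spell out the expectation being debated: the test pre-loads the counter
with "1 - count", so after exactly `count` occurrences of the event the
counter wraps past zero and, on precisely counting hardware, reads back
exactly 1. A standalone illustration of the wraparound arithmetic (the 48-bit
width is an assumption for the example; the real width comes from CPUID):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t mask = (1ull << 48) - 1;  /* assumed 48-bit counter */
        uint64_t count = 1000000;                /* events the workload retires */
        uint64_t preload = (1 - count) & mask;   /* the "cnt.count = 1 - count" trick */

        /* After exactly `count` events, a precise counter reads back 1. */
        printf("%llu\n", (unsigned long long)((preload + count) & mask));
        return 0;
    }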
On Fri, Aug 19, 2022, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> For most unit tests, the basic framework and use cases which test
> any PMU counter do not require any changes, except for two things:
>
> - No access to registers introduced only in PMU version 2 and above;
> - Expanded tolerance for testing counter overflows
>   due to the loss of uniform control of the global_ctrl register
>
> Adding some pmu_version() return value checks can seamlessly support
> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.

Phrase this as a command so that it's crystal clear that this is what the patch
does, as opposed to what the patch _can_ do.

> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 21 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 25fafbe..826472c 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -125,14 +125,19 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
>
>  static void global_enable(pmu_counter_t *cnt)
>  {
> -       cnt->idx = event_to_global_idx(cnt);
> +       if (pmu_version() < 2)

Helper please.

> +               return;
>
> +       cnt->idx = event_to_global_idx(cnt);
>         wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) |
>                         (1ull << cnt->idx));
>  }
>
>  static void global_disable(pmu_counter_t *cnt)
>  {
> +       if (pmu_version() < 2)
> +               return;
> +
>         wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) &
>                         ~(1ull << cnt->idx));
>  }
> @@ -301,7 +306,10 @@ static void check_counter_overflow(void)
>         count = cnt.count;
>
>         /* clear status before test */
> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       if (pmu_version() > 1) {

Should be a helper to use from an earlier patch.

Hmm, looking forward, maybe have an upper level helper?  E.g.

	void pmu_clear_global_status_safe(void)
	{
		if (!exists)
			return

		wrmsr(...);
	}

Ignore this suggestion if these checks go away in the future.

> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       }
>
>         report_prefix_push("overflow");
>
> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>                 cnt.config &= ~EVNTSEL_INT;
>                 idx = event_to_global_idx(&cnt);
>                 __measure(&cnt, cnt.count);
> -               report(cnt.count == 1, "cntr-%d", i);
> +
> +               report(check_irq() == (i % 2), "irq-%d", i);
> +               if (pmu_version() > 1)

Helper.

> +                       report(cnt.count == 1, "cntr-%d", i);
> +               else
> +                       report(cnt.count < 4, "cntr-%d", i);
> +
> +               if (pmu_version() < 2)

Helper.

> +                       continue;
> +
>                 status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>                 report(status & (1ull << idx), "status-%d", i);
>                 wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
>                 status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>                 report(!(status & (1ull << idx)), "status clear-%d", i);
> -               report(check_irq() == (i % 2), "irq-%d", i);
>         }
>
>         report_prefix_pop();
> @@ -440,8 +456,10 @@ static void check_running_counter_wrmsr(void)
>         report(evt.count < gp_events[1].min, "cntr");
>
>         /* clear status before overflow test */
> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> -             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       if (pmu_version() > 1) {

Helper.  Curly braces aren't necessary.

> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       }
>
>         start_event(&evt);
>
> @@ -453,8 +471,11 @@ static void check_running_counter_wrmsr(void)
>
>         loop();
>         stop_event(&evt);
> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> -       report(status & 1, "status");
> +
> +       if (pmu_version() > 1) {

Helper.

> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> +               report(status & 1, "status");

Can you opportunistically provide a better message than "status"?

> +       }
>
>         report_prefix_pop();
>  }
> @@ -474,8 +495,10 @@ static void check_emulated_instr(void)
>         };
>         report_prefix_push("emulated instruction");
>
> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> -             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       if (pmu_version() > 1) {

Helper, no curly braces.  Ah, IIRC, kernel perf prefers curly braces if the code
spans multiple lines.  KVM and KUT does not.

> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
> +       }
>
>         start_event(&brnch_cnt);
>         start_event(&instr_cnt);
> @@ -509,7 +532,8 @@ static void check_emulated_instr(void)
>             :
>             : "eax", "ebx", "ecx", "edx");
>
> -       wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> +       if (pmu_version() > 1)

Helper.

> +               wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>
>         stop_event(&brnch_cnt);
>         stop_event(&instr_cnt);
> @@ -520,10 +544,13 @@ static void check_emulated_instr(void)
>                "instruction count");
>         report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
>                "branch count");
> -       // Additionally check that those counters overflowed properly.
> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> -       report(status & 1, "instruction counter overflow");
> -       report(status & 2, "branch counter overflow");
> +
> +       if (pmu_version() > 1) {

Helper?  E.g. if this is a "has architectural PMU".

> +               // Additionally check that those counters overflowed properly.
> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
> +               report(status & 1, "instruction counter overflow");
> +               report(status & 2, "branch counter overflow");
> +       }
>
>         report_prefix_pop();
>  }
> @@ -647,12 +674,7 @@ int main(int ac, char **av)
>         buf = malloc(N*64);
>
>         if (!pmu_version()) {
> -               report_skip("No pmu is detected!");
> -               return report_summary();
> -       }
> -
> -       if (pmu_version() == 1) {
> -               report_skip("PMU version 1 is not supported.");
> +               report_skip("No Intel Arch PMU is detected!");
>                 return report_summary();
>         }
>
> --
> 2.37.2
>
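A minimal sketch of what the suggested helpers might look like once filled in;
pmu_clear_global_status_safe() is the name floated in the review above, while
the predicate name has_perf_global_msrs() and its body are assumptions that
simply mirror the pmu_version() checks in this patch:

    /* Assumed name: the global ctrl/status/ovf_ctrl MSRs exist from v2 on. */
    static inline bool has_perf_global_msrs(void)
    {
        return pmu_version() > 1;
    }

    static void pmu_clear_global_status_safe(void)
    {
        if (!has_perf_global_msrs())
            return;

        /* Ack every pending overflow bit reported in GLOBAL_STATUS. */
        wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
              rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
    }

Each open-coded "if (pmu_version() > 1) { wrmsr(...); }" site in the patch
could then collapse to a single pmu_clear_global_status_safe() call.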
All applied except pmu_clear_global_status_safe(). Thanks.

On 6/10/2022 6:35 am, Sean Christopherson wrote:
> On Fri, Aug 19, 2022, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> For most unit tests, the basic framework and use cases which test
>> any PMU counter do not require any changes, except for two things:
>>
>> - No access to registers introduced only in PMU version 2 and above;
>> - Expanded tolerance for testing counter overflows
>>   due to the loss of uniform control of the global_ctrl register
>>
>> Adding some pmu_version() return value checks can seamlessly support
>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>
> Phrase this as a command so that it's crystal clear that this is what the patch
> does, as opposed to what the patch _can_ do.
>
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 25fafbe..826472c 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -125,14 +125,19 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
>>
>>  static void global_enable(pmu_counter_t *cnt)
>>  {
>> -       cnt->idx = event_to_global_idx(cnt);
>> +       if (pmu_version() < 2)
>
> Helper please.
>
>> +               return;
>>
>> +       cnt->idx = event_to_global_idx(cnt);
>>         wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) |
>>                         (1ull << cnt->idx));
>>  }
>>
>>  static void global_disable(pmu_counter_t *cnt)
>>  {
>> +       if (pmu_version() < 2)
>> +               return;
>> +
>>         wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) &
>>                         ~(1ull << cnt->idx));
>>  }
>> @@ -301,7 +306,10 @@ static void check_counter_overflow(void)
>>         count = cnt.count;
>>
>>         /* clear status before test */
>> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       if (pmu_version() > 1) {
>
> Should be a helper to use from an earlier patch.
>
> Hmm, looking forward, maybe have an upper level helper?  E.g.
>
> 	void pmu_clear_global_status_safe(void)
> 	{
> 		if (!exists)
> 			return
>
> 		wrmsr(...);
> 	}
>
> Ignore this suggestion if these checks go away in the future.
>
>> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       }
>>
>>         report_prefix_push("overflow");
>>
>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>                 cnt.config &= ~EVNTSEL_INT;
>>                 idx = event_to_global_idx(&cnt);
>>                 __measure(&cnt, cnt.count);
>> -               report(cnt.count == 1, "cntr-%d", i);
>> +
>> +               report(check_irq() == (i % 2), "irq-%d", i);
>> +               if (pmu_version() > 1)
>
> Helper.
>
>> +                       report(cnt.count == 1, "cntr-%d", i);
>> +               else
>> +                       report(cnt.count < 4, "cntr-%d", i);
>> +
>> +               if (pmu_version() < 2)
>
> Helper.
>
>> +                       continue;
>> +
>>                 status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>>                 report(status & (1ull << idx), "status-%d", i);
>>                 wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
>>                 status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>>                 report(!(status & (1ull << idx)), "status clear-%d", i);
>> -               report(check_irq() == (i % 2), "irq-%d", i);
>>         }
>>
>>         report_prefix_pop();
>> @@ -440,8 +456,10 @@ static void check_running_counter_wrmsr(void)
>>         report(evt.count < gp_events[1].min, "cntr");
>>
>>         /* clear status before overflow test */
>> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> -             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       if (pmu_version() > 1) {
>
> Helper.  Curly braces aren't necessary.
>
>> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       }
>>
>>         start_event(&evt);
>>
>> @@ -453,8 +471,11 @@ static void check_running_counter_wrmsr(void)
>>
>>         loop();
>>         stop_event(&evt);
>> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> -       report(status & 1, "status");
>> +
>> +       if (pmu_version() > 1) {
>
> Helper.
>
>> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> +               report(status & 1, "status");
>
> Can you opportunistically provide a better message than "status"?
>
>> +       }
>>
>>         report_prefix_pop();
>>  }
>> @@ -474,8 +495,10 @@ static void check_emulated_instr(void)
>>         };
>>         report_prefix_push("emulated instruction");
>>
>> -       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> -             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       if (pmu_version() > 1) {
>
> Helper, no curly braces.  Ah, IIRC, kernel perf prefers curly braces if the code
> spans multiple lines.  KVM and KUT does not.

OK.

>
>> +               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> +                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
>> +       }
>>
>>         start_event(&brnch_cnt);
>>         start_event(&instr_cnt);
>> @@ -509,7 +532,8 @@ static void check_emulated_instr(void)
>>             :
>>             : "eax", "ebx", "ecx", "edx");
>>
>> -       wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> +       if (pmu_version() > 1)
>
> Helper.
>
>> +               wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>
>>         stop_event(&brnch_cnt);
>>         stop_event(&instr_cnt);
>> @@ -520,10 +544,13 @@ static void check_emulated_instr(void)
>>                "instruction count");
>>         report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
>>                "branch count");
>> -       // Additionally check that those counters overflowed properly.
>> -       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> -       report(status & 1, "instruction counter overflow");
>> -       report(status & 2, "branch counter overflow");
>> +
>> +       if (pmu_version() > 1) {
>
> Helper?  E.g. if this is a "has architectural PMU".

Nit, "intel arch pmu" means "version > 0".

>
>> +               // Additionally check that those counters overflowed properly.
>> +               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
>> +               report(status & 1, "instruction counter overflow");
>> +               report(status & 2, "branch counter overflow");
>> +       }
>>
>>         report_prefix_pop();
>>  }
>> @@ -647,12 +674,7 @@ int main(int ac, char **av)
>>         buf = malloc(N*64);
>>
>>         if (!pmu_version()) {
>> -               report_skip("No pmu is detected!");
>> -               return report_summary();
>> -       }
>> -
>> -       if (pmu_version() == 1) {
>> -               report_skip("PMU version 1 is not supported.");
>> +               report_skip("No Intel Arch PMU is detected!");
>>                 return report_summary();
>>         }
>>
>> --
>> 2.37.2
>>
Hi Sandipan,

On 19/9/2022 3:09 pm, Like Xu wrote:
> On 8/9/2022 4:23 pm, Sandipan Das wrote:
>> On 9/6/2022 7:05 PM, Like Xu wrote:
>>> On 6/9/2022 4:16 pm, Sandipan Das wrote:
>>>> Hi Like,
>>>>
>>>> On 8/19/2022 4:39 PM, Like Xu wrote:
>>>>> From: Like Xu <likexu@tencent.com>
>>>>>
>>>>> For most unit tests, the basic framework and use cases which test
>>>>> any PMU counter do not require any changes, except for two things:
>>>>>
>>>>> - No access to registers introduced only in PMU version 2 and above;
>>>>> - Expanded tolerance for testing counter overflows
>>>>>   due to the loss of uniform control of the global_ctrl register
>>>>>
>>>>> Adding some pmu_version() return value checks can seamlessly support
>>>>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>>>>
>>>>> Signed-off-by: Like Xu <likexu@tencent.com>
>>>>> ---
>>>>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>>>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>>>>
>>>>> [...]
>>>>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>>>>                 cnt.config &= ~EVNTSEL_INT;
>>>>>                 idx = event_to_global_idx(&cnt);
>>>>>                 __measure(&cnt, cnt.count);
>>>>> -               report(cnt.count == 1, "cntr-%d", i);
>>>>> +
>>>>> +               report(check_irq() == (i % 2), "irq-%d", i);
>>>>> +               if (pmu_version() > 1)
>>>>> +                       report(cnt.count == 1, "cntr-%d", i);
>>>>> +               else
>>>>> +                       report(cnt.count < 4, "cntr-%d", i);
>>>>> +
>>>>> [...]
>>>>
>>>> Sorry I missed this in the previous response. With an upper bound of
>>>> 4, I see this test failing sometimes for at least one of the six
>>>> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
>>>> system. Increasing it further does reduce the probability but I still
>>>> see failures. Do you see the same behaviour on systems with Zen 3 and
>>>> older processors?
>>>
>>> A hundred runs on my machine did not report a failure.
>>>
>>
>> Was this on a Zen 4 system?
>>
>>> But I'm not surprised by this, because some AMD platforms do
>>> have hw PMU errata which require BIOS or ucode fixes.
>>>
>>> Please help find the right upper bound for all your available AMD boxes.
>>>
>>
>> Even after updating the microcode, the tests failed just as often in an
>> overnight loop. However, upon closer inspection, the reason for failure
>> was different. The variance is well within the bounds now but sometimes,
>> is_the_count_reproducible() is true. Since this selects the original
>> verification criteria (cnt.count == 1), the tests fail.
>>
>>> What makes me most nervous is that AMD's core hardware events run
>>> repeatedly against the same workload, and their count results are erratic.
>>>
>>
>> With that in mind, should we consider having the following change?
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index bb16b3c..39979b8 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -352,7 +352,7 @@ static void check_counter_overflow(void)
>>                 .ctr = gp_counter_base,
>>                 .config = EVNTSEL_OS | EVNTSEL_USR | (*gp_events)[1].unit_sel /* instructions */,
>>         };
>> -       bool precise_event = is_the_count_reproducible(&cnt);
>> +       bool precise_event = is_intel() ? is_the_count_reproducible(&cnt) : false;
>>
>>         __measure(&cnt, 0);
>>         count = cnt.count;
>>
>> With this, the tests always pass. I will run another overnight loop and
>> report back if I see any errors.
>>
>>> You may check is_the_count_reproducible() in the test case:
>>> [1] https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@tencent.com/
>>
>> On Zen 4 systems, this is always false and the overflow tests always
>> pass irrespective of whether PerfMonV2 is enabled for the guest or not.
>>
>> - Sandipan
>
> I could change it to:
>
>         if (is_intel())
>                 report(cnt.count == 1, "cntr-%d", i);
>         else
>                 report(cnt.count < 4, "cntr-%d", i);

On AMD (zen3/zen4) machines this seems to be the only way to ensure that
the test cases don't fail:

        if (is_intel())
                report(cnt.count == 1, "cntr-%d", i);
        else
                report(cnt.count == 0xffffffffffff || cnt.count < 7, "cntr-%d", i);

but it points to some hardware counter defect; can you further confirm
that this hardware behaviour is in line with your expectations?

> but this does not explain the difference, that is, for the same workload:
>
> if a retired hw event like "PMCx0C0 [Retired Instructions] (ExRetInstr)" is
> configured, then it's expected to count "the number of instructions retired";
> the value depends only on the workload and should remain the same over
> multiple measurements,
>
> but there are two hardware counters, one AMD and one Intel, both reset to
> an identical value (like "cnt.count = 1 - count"), and when they overflow,
> the Intel counter can stay exactly at 1, while the AMD counter cannot.
>
> I know there are underlying hardware micro-arch implementation differences
> here, but what AMD is doing violates the semantics of "retired".
>
> Is this behavior normal by design?
> I'm not sure what I'm missing; this behavior is reinforced in Zen 4 as you said.
Hi Like,

On 10/21/2022 1:02 PM, Like Xu wrote:
> Hi Sandipan,
>
> On 19/9/2022 3:09 pm, Like Xu wrote:
>> On 8/9/2022 4:23 pm, Sandipan Das wrote:
>>> On 9/6/2022 7:05 PM, Like Xu wrote:
>>>> On 6/9/2022 4:16 pm, Sandipan Das wrote:
>>>>> Hi Like,
>>>>>
>>>>> On 8/19/2022 4:39 PM, Like Xu wrote:
>>>>>> From: Like Xu <likexu@tencent.com>
>>>>>>
>>>>>> For most unit tests, the basic framework and use cases which test
>>>>>> any PMU counter do not require any changes, except for two things:
>>>>>>
>>>>>> - No access to registers introduced only in PMU version 2 and above;
>>>>>> - Expanded tolerance for testing counter overflows
>>>>>>   due to the loss of uniform control of the global_ctrl register
>>>>>>
>>>>>> Adding some pmu_version() return value checks can seamlessly support
>>>>>> Intel Arch PMU Version 1, while opening the door for AMD PMU tests.
>>>>>>
>>>>>> Signed-off-by: Like Xu <likexu@tencent.com>
>>>>>> ---
>>>>>>  x86/pmu.c | 64 +++++++++++++++++++++++++++++++++++++------------------
>>>>>>  1 file changed, 43 insertions(+), 21 deletions(-)
>>>>>>
>>>>>> [...]
>>>>>> @@ -327,13 +335,21 @@ static void check_counter_overflow(void)
>>>>>>                 cnt.config &= ~EVNTSEL_INT;
>>>>>>                 idx = event_to_global_idx(&cnt);
>>>>>>                 __measure(&cnt, cnt.count);
>>>>>> -               report(cnt.count == 1, "cntr-%d", i);
>>>>>> +
>>>>>> +               report(check_irq() == (i % 2), "irq-%d", i);
>>>>>> +               if (pmu_version() > 1)
>>>>>> +                       report(cnt.count == 1, "cntr-%d", i);
>>>>>> +               else
>>>>>> +                       report(cnt.count < 4, "cntr-%d", i);
>>>>>> +
>>>>>> [...]
>>>>>
>>>>> Sorry I missed this in the previous response. With an upper bound of
>>>>> 4, I see this test failing sometimes for at least one of the six
>>>>> counters (with NMI watchdog disabled on the host) on a Milan (Zen 3)
>>>>> system. Increasing it further does reduce the probability but I still
>>>>> see failures. Do you see the same behaviour on systems with Zen 3 and
>>>>> older processors?
>>>>
>>>> A hundred runs on my machine did not report a failure.
>>>>
>>>
>>> Was this on a Zen 4 system?
>>>
>>>> But I'm not surprised by this, because some AMD platforms do
>>>> have hw PMU errata which require BIOS or ucode fixes.
>>>>
>>>> Please help find the right upper bound for all your available AMD boxes.
>>>>
>>>
>>> Even after updating the microcode, the tests failed just as often in an
>>> overnight loop. However, upon closer inspection, the reason for failure
>>> was different. The variance is well within the bounds now but sometimes,
>>> is_the_count_reproducible() is true. Since this selects the original
>>> verification criteria (cnt.count == 1), the tests fail.
>>>
>>>> What makes me most nervous is that AMD's core hardware events run
>>>> repeatedly against the same workload, and their count results are erratic.
>>>>
>>>
>>> With that in mind, should we consider having the following change?
>>>
>>> diff --git a/x86/pmu.c b/x86/pmu.c
>>> index bb16b3c..39979b8 100644
>>> --- a/x86/pmu.c
>>> +++ b/x86/pmu.c
>>> @@ -352,7 +352,7 @@ static void check_counter_overflow(void)
>>>                 .ctr = gp_counter_base,
>>>                 .config = EVNTSEL_OS | EVNTSEL_USR | (*gp_events)[1].unit_sel /* instructions */,
>>>         };
>>> -       bool precise_event = is_the_count_reproducible(&cnt);
>>> +       bool precise_event = is_intel() ? is_the_count_reproducible(&cnt) : false;
>>>
>>>         __measure(&cnt, 0);
>>>         count = cnt.count;
>>>
>>> With this, the tests always pass. I will run another overnight loop and
>>> report back if I see any errors.
>>>
>>>> You may check is_the_count_reproducible() in the test case:
>>>> [1] https://lore.kernel.org/kvm/20220905123946.95223-7-likexu@tencent.com/
>>>
>>> On Zen 4 systems, this is always false and the overflow tests always
>>> pass irrespective of whether PerfMonV2 is enabled for the guest or not.
>>>
>>> - Sandipan
>>
>> I could change it to:
>>
>>         if (is_intel())
>>                 report(cnt.count == 1, "cntr-%d", i);
>>         else
>>                 report(cnt.count < 4, "cntr-%d", i);
>
> On AMD (zen3/zen4) machines this seems to be the only way to ensure that
> the test cases don't fail:
>
>         if (is_intel())
>                 report(cnt.count == 1, "cntr-%d", i);
>         else
>                 report(cnt.count == 0xffffffffffff || cnt.count < 7, "cntr-%d", i);
>
> but it points to some hardware counter defect; can you further confirm
> that this hardware behaviour is in line with your expectations?
>

I have yet to investigate why there would be a variance in the count but
with this updated test condition, I can confirm that the tests pass on
all my systems.

- Sandipan

>> but this does not explain the difference, that is, for the same workload:
>>
>> if a retired hw event like "PMCx0C0 [Retired Instructions] (ExRetInstr)" is
>> configured, then it's expected to count "the number of instructions retired";
>> the value depends only on the workload and should remain the same over
>> multiple measurements,
>>
>> but there are two hardware counters, one AMD and one Intel, both reset to
>> an identical value (like "cnt.count = 1 - count"), and when they overflow,
>> the Intel counter can stay exactly at 1, while the AMD counter cannot.
>>
>> I know there are underlying hardware micro-arch implementation differences
>> here, but what AMD is doing violates the semantics of "retired".
>>
>> Is this behavior normal by design?
>> I'm not sure what I'm missing; this behavior is reinforced in Zen 4 as you said.
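One way to read the agreed condition: 0xffffffffffff is the all-ones value of
a 48-bit counter, i.e. the counter sits just below the wrap point, so the
check accepts a counter that either stopped just short of overflowing or
wrapped and moved only a few events past zero. A sketch with that tolerance
factored into a helper (the helper name and the 48-bit width are illustrative
assumptions, not part of the posted patches):

    /* Tolerant overflow check for AMD counters, per the thread above. */
    static bool amd_overflow_within_slack(uint64_t count)
    {
        const uint64_t all_ones_48 = (1ull << 48) - 1; /* 0xffffffffffff */

        return count == all_ones_48 || count < 7;
    }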
diff --git a/x86/pmu.c b/x86/pmu.c
index 25fafbe..826472c 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -125,14 +125,19 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
 
 static void global_enable(pmu_counter_t *cnt)
 {
-       cnt->idx = event_to_global_idx(cnt);
+       if (pmu_version() < 2)
+               return;
 
+       cnt->idx = event_to_global_idx(cnt);
        wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) |
                        (1ull << cnt->idx));
 }
 
 static void global_disable(pmu_counter_t *cnt)
 {
+       if (pmu_version() < 2)
+               return;
+
        wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_CTRL) &
                        ~(1ull << cnt->idx));
 }
@@ -301,7 +306,10 @@ static void check_counter_overflow(void)
        count = cnt.count;
 
        /* clear status before test */
-       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       if (pmu_version() > 1) {
+               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       }
 
        report_prefix_push("overflow");
 
@@ -327,13 +335,21 @@ static void check_counter_overflow(void)
                cnt.config &= ~EVNTSEL_INT;
                idx = event_to_global_idx(&cnt);
                __measure(&cnt, cnt.count);
-               report(cnt.count == 1, "cntr-%d", i);
+
+               report(check_irq() == (i % 2), "irq-%d", i);
+               if (pmu_version() > 1)
+                       report(cnt.count == 1, "cntr-%d", i);
+               else
+                       report(cnt.count < 4, "cntr-%d", i);
+
+               if (pmu_version() < 2)
+                       continue;
+
                status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
                report(status & (1ull << idx), "status-%d", i);
                wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
                status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
                report(!(status & (1ull << idx)), "status clear-%d", i);
-               report(check_irq() == (i % 2), "irq-%d", i);
        }
 
        report_prefix_pop();
@@ -440,8 +456,10 @@ static void check_running_counter_wrmsr(void)
        report(evt.count < gp_events[1].min, "cntr");
 
        /* clear status before overflow test */
-       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
-             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       if (pmu_version() > 1) {
+               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       }
 
        start_event(&evt);
 
@@ -453,8 +471,11 @@ static void check_running_counter_wrmsr(void)
 
        loop();
        stop_event(&evt);
-       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
-       report(status & 1, "status");
+
+       if (pmu_version() > 1) {
+               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
+               report(status & 1, "status");
+       }
 
        report_prefix_pop();
 }
@@ -474,8 +495,10 @@ static void check_emulated_instr(void)
        };
        report_prefix_push("emulated instruction");
 
-       wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
-             rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       if (pmu_version() > 1) {
+               wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+                     rdmsr(MSR_CORE_PERF_GLOBAL_STATUS));
+       }
 
        start_event(&brnch_cnt);
        start_event(&instr_cnt);
@@ -509,7 +532,8 @@ static void check_emulated_instr(void)
            :
            : "eax", "ebx", "ecx", "edx");
 
-       wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+       if (pmu_version() > 1)
+               wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
 
        stop_event(&brnch_cnt);
        stop_event(&instr_cnt);
@@ -520,10 +544,13 @@ static void check_emulated_instr(void)
               "instruction count");
        report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
               "branch count");
-       // Additionally check that those counters overflowed properly.
-       status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
-       report(status & 1, "instruction counter overflow");
-       report(status & 2, "branch counter overflow");
+
+       if (pmu_version() > 1) {
+               // Additionally check that those counters overflowed properly.
+               status = rdmsr(MSR_CORE_PERF_GLOBAL_STATUS);
+               report(status & 1, "instruction counter overflow");
+               report(status & 2, "branch counter overflow");
+       }
 
        report_prefix_pop();
 }
@@ -647,12 +674,7 @@ int main(int ac, char **av)
        buf = malloc(N*64);
 
        if (!pmu_version()) {
-               report_skip("No pmu is detected!");
-               return report_summary();
-       }
-
-       if (pmu_version() == 1) {
-               report_skip("PMU version 1 is not supported.");
+               report_skip("No Intel Arch PMU is detected!");
                return report_summary();
        }