diff mbox series

x86: increase NMI timer frequency if necessary

Message ID d5fd3646-18b3-4dae-8da7-6afa187f930e@suse.com (mailing list archive)
State New
Headers show
Series x86: increase NMI timer frequency if necessary | expand

Commit Message

Jan Beulich Jan. 25, 2024, 4:55 p.m. UTC
Since the performance counters used for the NMI watchdog count non-
halted cycles, they may count at a rate higher than cpu_khz. Thus the
watchdog tick may occur more frequently than invocations of the timer
if we don't account for the ratio between nominal and maximum CPU clock
speeds, which would be a problem in particular when "watchdog_timeout=1"
is in effect (for high enough ratios even larger timout values may pose
a problem).

Leverage the so far display-only data we collect on newer Intel and AMD
CPUs. On older CPUs we just have to (continue to) hope that the default
frequency of 1 Hz is okay(-ish) to use.

While adding the new variable, also move the (now adjacent) cpu_khz to
.data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
This renders the "log" in the function names somewhat stale, but I don't
think this strictly warrants renaming the functions right away.

Comments

Andrew Cooper March 19, 2024, 8:51 p.m. UTC | #1
On 25/01/2024 4:55 pm, Jan Beulich wrote:
> Since the performance counters used for the NMI watchdog count non-
> halted cycles, they may count at a rate higher than cpu_khz.

Is this in theory, or observed in practice?

It is my understanding that perf counters count in P0 reference cycles,
and not at the Turbo/CBS rate.

>  Thus the
> watchdog tick may occur more frequently than invocations of the timer
> if we don't account for the ratio between nominal and maximum CPU clock
> speeds, which would be a problem in particular when "watchdog_timeout=1"
> is in effect (for high enough ratios even larger timout values may pose
> a problem).
>
> Leverage the so far display-only data we collect on newer Intel and AMD
> CPUs. On older CPUs we just have to (continue to) hope that the default
> frequency of 1 Hz is okay(-ish) to use.
>
> While adding the new variable, also move the (now adjacent) cpu_khz to
> .data.ro_after_init.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> This renders the "log" in the function names somewhat stale, but I don't
> think this strictly warrants renaming the functions right away.

I'm not comfortable with this change.  It's adding to a complicated
timing problem, rather than simplifying it.

The real problem we've got is that the NMI handler is guessing at the
timeout by counting NMIs, not by actually counting time.  There are
several ways to fix this even with the current rendezvous logic.  When
the NMI handler can actually say "if ( NOW() - last > timeout )", then
the exact frequently of NMIs becomes far less important.

~Andrew
Jan Beulich March 20, 2024, 7:28 a.m. UTC | #2
On 19.03.2024 21:51, Andrew Cooper wrote:
> On 25/01/2024 4:55 pm, Jan Beulich wrote:
>> Since the performance counters used for the NMI watchdog count non-
>> halted cycles, they may count at a rate higher than cpu_khz.
> 
> Is this in theory, or observed in practice?

It's been over two months since doing the experiments, so I can only go
from memory, but my recollection is that I actually observed higher
rates, just not high enough (yet) for the watchdog (without this change)
to start malfunctioning.

> It is my understanding that perf counters count in P0 reference cycles,
> and not at the Turbo/CBS rate.
> 
>>  Thus the
>> watchdog tick may occur more frequently than invocations of the timer
>> if we don't account for the ratio between nominal and maximum CPU clock
>> speeds, which would be a problem in particular when "watchdog_timeout=1"
>> is in effect (for high enough ratios even larger timout values may pose
>> a problem).
>>
>> Leverage the so far display-only data we collect on newer Intel and AMD
>> CPUs. On older CPUs we just have to (continue to) hope that the default
>> frequency of 1 Hz is okay(-ish) to use.
>>
>> While adding the new variable, also move the (now adjacent) cpu_khz to
>> .data.ro_after_init.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> This renders the "log" in the function names somewhat stale, but I don't
>> think this strictly warrants renaming the functions right away.
> 
> I'm not comfortable with this change.  It's adding to a complicated
> timing problem, rather than simplifying it.

The actual change to the watchdog logic is minimal - a build-time constant
is replaced by a boot-time determined value.

> The real problem we've got is that the NMI handler is guessing at the
> timeout by counting NMIs, not by actually counting time.  There are
> several ways to fix this even with the current rendezvous logic.  When
> the NMI handler can actually say "if ( NOW() - last > timeout )", then
> the exact frequently of NMIs becomes far less important.

But that would come with its own downsides: The logic within the NMI
handler should be as simple as possible, involving as little as possible
other code. NOW(), for example, cannot really be used there without
first fiddling with the time rendezvous (to make sure an NMI hitting in
the middle of an update to the scaling values will know how to use
consistent data; that could e.g. be done by flip-flopping between two
instances of the data, with a "selector" always flipped last).

Jan
diff mbox series

Patch

--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -657,12 +657,18 @@  void amd_log_freq(const struct cpuinfo_x
 		                     : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
 	if (idx && idx < h &&
 	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
-	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
+	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63)) {
+		if (c == &boot_cpu_data)
+			cpu_max_mhz = FREQ(hi);
 		printk("CPU%u: %lu (%lu ... %lu) MHz\n",
 		       smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
-	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
+	}
+	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63)) {
+		if (c == &boot_cpu_data)
+			cpu_max_mhz = FREQ(hi);
 		printk("CPU%u: %lu ... %lu MHz\n",
 		       smp_processor_id(), FREQ(lo), FREQ(hi));
+	}
 	else
 		printk("CPU%u: %lu MHz\n", smp_processor_id(), FREQ(lo));
 #undef FREQ
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -456,7 +456,11 @@  static void intel_log_freq(const struct
             if ( eax )
                 printk(" base: %u MHz", eax);
             if ( ebx )
+            {
+                if ( c == &boot_cpu_data )
+                    cpu_max_mhz = ebx;
                 printk(" max: %u MHz", ebx);
+            }
             printk("\n");
         }
     }
@@ -522,6 +526,8 @@  static void intel_log_freq(const struct
     printk("CPU%u: ", smp_processor_id());
     if ( min_ratio )
         printk("%u ... ", (factor * min_ratio + 50) / 100);
+    if ( c == &boot_cpu_data && !cpu_max_mhz )
+        cpu_max_mhz = (factor * max_ratio + 50) / 100;
     printk("%u MHz\n", (factor * max_ratio + 50) / 100);
 }
 
--- a/xen/arch/x86/include/asm/time.h
+++ b/xen/arch/x86/include/asm/time.h
@@ -8,6 +8,8 @@  typedef u64 cycles_t;
 
 extern bool disable_tsc_sync;
 
+extern unsigned int cpu_max_mhz;
+
 static inline cycles_t get_cycles(void)
 {
     return rdtsc_ordered();
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -213,10 +213,12 @@  void __init check_nmi_watchdog(void)
     return;
 }
 
+static unsigned int __ro_after_init timer_gap = MILLISECS(1000);
+
 static void cf_check nmi_timer_fn(void *unused)
 {
     this_cpu(nmi_timer_ticks)++;
-    set_timer(&this_cpu(nmi_timer), NOW() + MILLISECS(1000));
+    set_timer(&this_cpu(nmi_timer), NOW() + timer_gap);
 }
 
 void disable_lapic_nmi_watchdog(void)
@@ -477,8 +479,17 @@  bool watchdog_enabled(void)
 
 int __init watchdog_setup(void)
 {
+    unsigned long cpu_mhz = cpu_khz / 1000;
     unsigned int cpu;
 
+    if ( cpu_max_mhz > cpu_mhz )
+    {
+        timer_gap = timer_gap * cpu_mhz / cpu_max_mhz;
+        /* To be on the safe side, bound to 1ms. */
+        if ( timer_gap < MILLISECS(1) )
+            timer_gap = MILLISECS(1);
+    }
+
     /*
      * Activate periodic heartbeats. We cannot do this earlier during 
      * setup because the timer infrastructure is not available.
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -47,7 +47,9 @@ 
 static char __initdata opt_clocksource[10];
 string_param("clocksource", opt_clocksource);
 
-unsigned long __read_mostly cpu_khz;  /* CPU clock frequency in kHz. */
+unsigned long __ro_after_init cpu_khz;    /* CPU clock frequency in kHz. */
+unsigned int __ro_after_init cpu_max_mhz; /* CPU max (known) clkfreq in MHz. */
+
 DEFINE_SPINLOCK(rtc_lock);
 unsigned long pit0_ticks;