x86: tsc: Rework time_cpufreq_notifier()
diff mbox series

Message ID 38900622.ao2n2t5aPS@kreacher
State Accepted, archived
Delegated to: Rafael Wysocki
Headers show
Series
  • x86: tsc: Rework time_cpufreq_notifier()
Related show

Commit Message

Rafael J. Wysocki April 18, 2019, 2:11 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

There are problems with running time_cpufreq_notifier() on SMP
systems.

First off, the rdtsc() called from there runs on the CPU executing
that code and not necessarily on the CPU whose sched_clock() rate is
updated which is questionable at best.

Second, in the cases when the frequencies of all CPUs in an SMP
system are always in sync, it is not sufficient to update just
one of them or the set associated with a given cpufreq policy on
frequency changes - all CPUs in the system should be updated and
that would require more than a simple transition notifier.

Note, however, that the underlying issue (the TSC rate depending on
the CPU frequency) has not been present in hardware shipping for the
last few years and in quite a few relevant cases (acpi-cpufreq in
particular) running time_cpufreq_notifier() will cause the TSC to
be marked as unstable anyway.

For this reason, make time_cpufreq_notifier() simply mark the TSC
as unstable and give up when run on SMP and only try to carry out
any adjustments otherwise.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/x86/kernel/tsc.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Comments

Borislav Petkov April 18, 2019, 6:40 p.m. UTC | #1
On Thu, Apr 18, 2019 at 04:11:37PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> There are problems with running time_cpufreq_notifier() on SMP
> systems.

Just a stupid question: that must be some old rust you're fixing because
we set X86_FEATURE_CONSTANT_TSC on almost anything that is x86 nowadays
and if that is set, we don't reg that notifier call...
Rafael J. Wysocki April 18, 2019, 7:36 p.m. UTC | #2
On Thu, Apr 18, 2019 at 8:40 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Apr 18, 2019 at 04:11:37PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > There are problems with running time_cpufreq_notifier() on SMP
> > systems.
>
> Just a stupid question: that must be some old rust you're fixing because
> we set X86_FEATURE_CONSTANT_TSC on almost anything that is x86 nowadays
> and if that is set, we don't reg that notifier call...

That's totally correct.

It's never really worked AFAICS and even so, then by accident.
Viresh Kumar April 22, 2019, 8:17 a.m. UTC | #3
On 18-04-19, 16:11, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> There are problems with running time_cpufreq_notifier() on SMP
> systems.
> 
> First off, the rdtsc() called from there runs on the CPU executing
> that code and not necessarily on the CPU whose sched_clock() rate is
> updated which is questionable at best.
> 
> Second, in the cases when the frequencies of all CPUs in an SMP
> system are always in sync, it is not sufficient to update just
> one of them or the set associated with a given cpufreq policy on
> frequency changes - all CPUs in the system should be updated and
> that would require more than a simple transition notifier.
> 
> Note, however, that the underlying issue (the TSC rate depending on
> the CPU frequency) has not been present in hardware shipping for the
> last few years and in quite a few relevant cases (acpi-cpufreq in
> particular) running time_cpufreq_notifier() will cause the TSC to
> be marked as unstable anyway.
> 
> For this reason, make time_cpufreq_notifier() simply mark the TSC
> as unstable and give up when run on SMP and only try to carry out
> any adjustments otherwise.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/kernel/tsc.c |   29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> Index: linux-pm/arch/x86/kernel/tsc.c
> ===================================================================
> --- linux-pm.orig/arch/x86/kernel/tsc.c
> +++ linux-pm/arch/x86/kernel/tsc.c
> @@ -185,8 +185,7 @@ static void __init cyc2ns_init_boot_cpu(
>  /*
>   * Secondary CPUs do not run through tsc_init(), so set up
>   * all the scale factors for all CPUs, assuming the same
> - * speed as the bootup CPU. (cpufreq notifiers will fix this
> - * up if their speed diverges)
> + * speed as the bootup CPU.
>   */
>  static void __init cyc2ns_init_secondary_cpus(void)
>  {
> @@ -937,12 +936,12 @@ void tsc_restore_sched_clock_state(void)
>  }
>  
>  #ifdef CONFIG_CPU_FREQ
> -/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
> +/*
> + * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
>   * changes.
>   *
> - * RED-PEN: On SMP we assume all CPUs run with the same frequency.  It's
> - * not that important because current Opteron setups do not support
> - * scaling on SMP anyroads.
> + * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
> + * as unstable and give up in those cases.
>   *
>   * Should fix up last_tsc too. Currently gettimeofday in the
>   * first tick after the change will be slightly wrong.
> @@ -956,22 +955,22 @@ static int time_cpufreq_notifier(struct
>  				void *data)
>  {
>  	struct cpufreq_freqs *freq = data;
> -	unsigned long *lpj;
>  
> -	lpj = &boot_cpu_data.loops_per_jiffy;
> -#ifdef CONFIG_SMP
> -	if (!(freq->flags & CPUFREQ_CONST_LOOPS))
> -		lpj = &cpu_data(freq->cpu).loops_per_jiffy;
> -#endif
> +	if (num_online_cpus() > 1) {

What about checking num_possible_cpus() instead ? So we reliably quit
everytime even if some CPUs are offlined.

And can we place this check before registering the notifier, so it
never gets called ?

> +		mark_tsc_unstable("cpufreq changes on SMP");
> +		return 0;
> +	}
>  
>  	if (!ref_freq) {
>  		ref_freq = freq->old;
> -		loops_per_jiffy_ref = *lpj;
> +		loops_per_jiffy_ref = boot_cpu_data.loops_per_jiffy;
>  		tsc_khz_ref = tsc_khz;
>  	}
> +
>  	if ((val == CPUFREQ_PRECHANGE  && freq->old < freq->new) ||
> -			(val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
> -		*lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
> +	    (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
> +		boot_cpu_data.loops_per_jiffy =
> +			cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
>  
>  		tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
>  		if (!(freq->flags & CPUFREQ_CONST_LOOPS))
> 
>
Rafael J. Wysocki April 23, 2019, 8:19 a.m. UTC | #4
On Mon, Apr 22, 2019 at 10:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-04-19, 16:11, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > There are problems with running time_cpufreq_notifier() on SMP
> > systems.
> >
> > First off, the rdtsc() called from there runs on the CPU executing
> > that code and not necessarily on the CPU whose sched_clock() rate is
> > updated which is questionable at best.
> >
> > Second, in the cases when the frequencies of all CPUs in an SMP
> > system are always in sync, it is not sufficient to update just
> > one of them or the set associated with a given cpufreq policy on
> > frequency changes - all CPUs in the system should be updated and
> > that would require more than a simple transition notifier.
> >
> > Note, however, that the underlying issue (the TSC rate depending on
> > the CPU frequency) has not been present in hardware shipping for the
> > last few years and in quite a few relevant cases (acpi-cpufreq in
> > particular) running time_cpufreq_notifier() will cause the TSC to
> > be marked as unstable anyway.
> >
> > For this reason, make time_cpufreq_notifier() simply mark the TSC
> > as unstable and give up when run on SMP and only try to carry out
> > any adjustments otherwise.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  arch/x86/kernel/tsc.c |   29 ++++++++++++++---------------
> >  1 file changed, 14 insertions(+), 15 deletions(-)
> >
> > Index: linux-pm/arch/x86/kernel/tsc.c
> > ===================================================================
> > --- linux-pm.orig/arch/x86/kernel/tsc.c
> > +++ linux-pm/arch/x86/kernel/tsc.c
> > @@ -185,8 +185,7 @@ static void __init cyc2ns_init_boot_cpu(
> >  /*
> >   * Secondary CPUs do not run through tsc_init(), so set up
> >   * all the scale factors for all CPUs, assuming the same
> > - * speed as the bootup CPU. (cpufreq notifiers will fix this
> > - * up if their speed diverges)
> > + * speed as the bootup CPU.
> >   */
> >  static void __init cyc2ns_init_secondary_cpus(void)
> >  {
> > @@ -937,12 +936,12 @@ void tsc_restore_sched_clock_state(void)
> >  }
> >
> >  #ifdef CONFIG_CPU_FREQ
> > -/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
> > +/*
> > + * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
> >   * changes.
> >   *
> > - * RED-PEN: On SMP we assume all CPUs run with the same frequency.  It's
> > - * not that important because current Opteron setups do not support
> > - * scaling on SMP anyroads.
> > + * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
> > + * as unstable and give up in those cases.
> >   *
> >   * Should fix up last_tsc too. Currently gettimeofday in the
> >   * first tick after the change will be slightly wrong.
> > @@ -956,22 +955,22 @@ static int time_cpufreq_notifier(struct
> >                               void *data)
> >  {
> >       struct cpufreq_freqs *freq = data;
> > -     unsigned long *lpj;
> >
> > -     lpj = &boot_cpu_data.loops_per_jiffy;
> > -#ifdef CONFIG_SMP
> > -     if (!(freq->flags & CPUFREQ_CONST_LOOPS))
> > -             lpj = &cpu_data(freq->cpu).loops_per_jiffy;
> > -#endif
> > +     if (num_online_cpus() > 1) {
>
> What about checking num_possible_cpus() instead ? So we reliably quit
> everytime even if some CPUs are offlined.

That would work too.

The point here is that the adjustments can work if the "additional"
CPUs never go online.

> And can we place this check before registering the notifier, so it
> never gets called ?

Well, in that case the TSC would need to be marked as unstable
upfront, but that really only would be necessary if cpufreq was
actually used.  If it wasn't used, whatever the reason, marking the
TSC as unstable here would be excessive.

We're talking about old HW, mind you, and I know about systems shipped
around that time frame that didn't support performance scaling at all
and had the TSC (see the Opteron comment removed by this patch, for
instance).

So I'd rather not change that part.
Viresh Kumar April 23, 2019, 8:42 a.m. UTC | #5
On 23-04-19, 10:19, Rafael J. Wysocki wrote:
> On Mon, Apr 22, 2019 at 10:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 18-04-19, 16:11, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > There are problems with running time_cpufreq_notifier() on SMP
> > > systems.
> > >
> > > First off, the rdtsc() called from there runs on the CPU executing
> > > that code and not necessarily on the CPU whose sched_clock() rate is
> > > updated which is questionable at best.
> > >
> > > Second, in the cases when the frequencies of all CPUs in an SMP
> > > system are always in sync, it is not sufficient to update just
> > > one of them or the set associated with a given cpufreq policy on
> > > frequency changes - all CPUs in the system should be updated and
> > > that would require more than a simple transition notifier.
> > >
> > > Note, however, that the underlying issue (the TSC rate depending on
> > > the CPU frequency) has not been present in hardware shipping for the
> > > last few years and in quite a few relevant cases (acpi-cpufreq in
> > > particular) running time_cpufreq_notifier() will cause the TSC to
> > > be marked as unstable anyway.
> > >
> > > For this reason, make time_cpufreq_notifier() simply mark the TSC
> > > as unstable and give up when run on SMP and only try to carry out
> > > any adjustments otherwise.
> > >
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  arch/x86/kernel/tsc.c |   29 ++++++++++++++---------------
> > >  1 file changed, 14 insertions(+), 15 deletions(-)
> > >
> > > Index: linux-pm/arch/x86/kernel/tsc.c
> > > ===================================================================
> > > --- linux-pm.orig/arch/x86/kernel/tsc.c
> > > +++ linux-pm/arch/x86/kernel/tsc.c
> > > @@ -185,8 +185,7 @@ static void __init cyc2ns_init_boot_cpu(
> > >  /*
> > >   * Secondary CPUs do not run through tsc_init(), so set up
> > >   * all the scale factors for all CPUs, assuming the same
> > > - * speed as the bootup CPU. (cpufreq notifiers will fix this
> > > - * up if their speed diverges)
> > > + * speed as the bootup CPU.
> > >   */
> > >  static void __init cyc2ns_init_secondary_cpus(void)
> > >  {
> > > @@ -937,12 +936,12 @@ void tsc_restore_sched_clock_state(void)
> > >  }
> > >
> > >  #ifdef CONFIG_CPU_FREQ
> > > -/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
> > > +/*
> > > + * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
> > >   * changes.
> > >   *
> > > - * RED-PEN: On SMP we assume all CPUs run with the same frequency.  It's
> > > - * not that important because current Opteron setups do not support
> > > - * scaling on SMP anyroads.
> > > + * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
> > > + * as unstable and give up in those cases.
> > >   *
> > >   * Should fix up last_tsc too. Currently gettimeofday in the
> > >   * first tick after the change will be slightly wrong.
> > > @@ -956,22 +955,22 @@ static int time_cpufreq_notifier(struct
> > >                               void *data)
> > >  {
> > >       struct cpufreq_freqs *freq = data;
> > > -     unsigned long *lpj;
> > >
> > > -     lpj = &boot_cpu_data.loops_per_jiffy;
> > > -#ifdef CONFIG_SMP
> > > -     if (!(freq->flags & CPUFREQ_CONST_LOOPS))
> > > -             lpj = &cpu_data(freq->cpu).loops_per_jiffy;
> > > -#endif
> > > +     if (num_online_cpus() > 1) {
> >
> > What about checking num_possible_cpus() instead ? So we reliably quit
> > everytime even if some CPUs are offlined.
> 
> That would work too.
> 
> The point here is that the adjustments can work if the "additional"
> CPUs never go online.

Hmm, okay.

> > And can we place this check before registering the notifier, so it
> > never gets called ?
> 
> Well, in that case the TSC would need to be marked as unstable
> upfront, but that really only would be necessary if cpufreq was
> actually used.  If it wasn't used, whatever the reason, marking the
> TSC as unstable here would be excessive.
> 
> We're talking about old HW, mind you, and I know about systems shipped
> around that time frame that didn't support performance scaling at all
> and had the TSC (see the Opteron comment removed by this patch, for
> instance).
> 
> So I'd rather not change that part.

Fair enough.
Viresh Kumar April 23, 2019, 8:42 a.m. UTC | #6
On 18-04-19, 16:11, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> There are problems with running time_cpufreq_notifier() on SMP
> systems.
> 
> First off, the rdtsc() called from there runs on the CPU executing
> that code and not necessarily on the CPU whose sched_clock() rate is
> updated which is questionable at best.
> 
> Second, in the cases when the frequencies of all CPUs in an SMP
> system are always in sync, it is not sufficient to update just
> one of them or the set associated with a given cpufreq policy on
> frequency changes - all CPUs in the system should be updated and
> that would require more than a simple transition notifier.
> 
> Note, however, that the underlying issue (the TSC rate depending on
> the CPU frequency) has not been present in hardware shipping for the
> last few years and in quite a few relevant cases (acpi-cpufreq in
> particular) running time_cpufreq_notifier() will cause the TSC to
> be marked as unstable anyway.
> 
> For this reason, make time_cpufreq_notifier() simply mark the TSC
> as unstable and give up when run on SMP and only try to carry out
> any adjustments otherwise.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/kernel/tsc.c |   29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)

Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

Patch
diff mbox series

Index: linux-pm/arch/x86/kernel/tsc.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/tsc.c
+++ linux-pm/arch/x86/kernel/tsc.c
@@ -185,8 +185,7 @@  static void __init cyc2ns_init_boot_cpu(
 /*
  * Secondary CPUs do not run through tsc_init(), so set up
  * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
+ * speed as the bootup CPU.
  */
 static void __init cyc2ns_init_secondary_cpus(void)
 {
@@ -937,12 +936,12 @@  void tsc_restore_sched_clock_state(void)
 }
 
 #ifdef CONFIG_CPU_FREQ
-/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
+/*
+ * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
  * changes.
  *
- * RED-PEN: On SMP we assume all CPUs run with the same frequency.  It's
- * not that important because current Opteron setups do not support
- * scaling on SMP anyroads.
+ * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
+ * as unstable and give up in those cases.
  *
  * Should fix up last_tsc too. Currently gettimeofday in the
  * first tick after the change will be slightly wrong.
@@ -956,22 +955,22 @@  static int time_cpufreq_notifier(struct
 				void *data)
 {
 	struct cpufreq_freqs *freq = data;
-	unsigned long *lpj;
 
-	lpj = &boot_cpu_data.loops_per_jiffy;
-#ifdef CONFIG_SMP
-	if (!(freq->flags & CPUFREQ_CONST_LOOPS))
-		lpj = &cpu_data(freq->cpu).loops_per_jiffy;
-#endif
+	if (num_online_cpus() > 1) {
+		mark_tsc_unstable("cpufreq changes on SMP");
+		return 0;
+	}
 
 	if (!ref_freq) {
 		ref_freq = freq->old;
-		loops_per_jiffy_ref = *lpj;
+		loops_per_jiffy_ref = boot_cpu_data.loops_per_jiffy;
 		tsc_khz_ref = tsc_khz;
 	}
+
 	if ((val == CPUFREQ_PRECHANGE  && freq->old < freq->new) ||
-			(val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
-		*lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
+	    (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+		boot_cpu_data.loops_per_jiffy =
+			cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
 
 		tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
 		if (!(freq->flags & CPUFREQ_CONST_LOOPS))