diff mbox series

[v10,03/27] timer: Export next wakeup time of a CPU

Message ID 20181129174700.16585-4-ulf.hansson@linaro.org (mailing list archive)
State Deferred
Headers show
Series PM / Domains: Support hierarchical CPU arrangement (PSCI/ARM) | expand

Commit Message

Ulf Hansson Nov. 29, 2018, 5:46 p.m. UTC
From: Lina Iyer <lina.iyer@linaro.org>

Knowing the sleep duration of CPUs, is known to be needed while selecting
the most energy efficient idle state for a CPU or a group of CPUs.

However, to be able to compute the sleep duration, we need to know at what
time the next expected wakeup is for the CPU. Therefore, let's export this
information via a new function, tick_nohz_get_next_wakeup(). Following
changes make use of it.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---

Changes in v10:
	- Updated function header of tick_nohz_get_next_wakeup().

---
 include/linux/tick.h     |  8 ++++++++
 kernel/time/tick-sched.c | 13 +++++++++++++
 2 files changed, 21 insertions(+)

Comments

Rafael J. Wysocki Jan. 11, 2019, 11:06 a.m. UTC | #1
On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> From: Lina Iyer <lina.iyer@linaro.org>
> 
> Knowing the sleep duration of CPUs, is known to be needed while selecting
> the most energy efficient idle state for a CPU or a group of CPUs.
> 
> However, to be able to compute the sleep duration, we need to know at what
> time the next expected wakeup is for the CPU. Therefore, let's export this
> information via a new function, tick_nohz_get_next_wakeup(). Following
> changes make use of it.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Lina Iyer <ilina@codeaurora.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
> Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
> 
> Changes in v10:
> 	- Updated function header of tick_nohz_get_next_wakeup().
> 
> ---
>  include/linux/tick.h     |  8 ++++++++
>  kernel/time/tick-sched.c | 13 +++++++++++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 55388ab45fd4..e48f6b26b425 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
>  extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
>  extern unsigned long tick_nohz_get_idle_calls(void);
>  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  
> @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
>  	*delta_next = TICK_NSEC;
>  	return *delta_next;
>  }
> +
> +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> +{
> +	/* Next wake up is the tick period, assume it starts now */
> +	return ktime_add(ktime_get(), TICK_NSEC);
> +}
> +
>  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>  
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 69e673b88474..7a9166506503 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
>  	return ts->idle_calls;
>  }
>  
> +/**
> + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> + * @cpu: the particular CPU to get next wake up for
> + *
> + * Called for idle CPUs only.
> + */
> +ktime_t tick_nohz_get_next_wakeup(int cpu)
> +{
> +	struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> +
> +	return dev->next_event;
> +}
> +
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> 

Well, I have concerns regarding this one.

I don't believe it is valid to call this new function for non-idle CPUs and
the kerneldoc kind of says so, but the next patch doesn't actually prevent
it from being called for a non-idle CPU (at the time it is called in there
the target CPU may not be idle any more AFAICS).

In principle, the cpuidle core can store this value, say in struct
cpuidle_device of the given CPU, and expose a helper to access it from
genpd, but that would be extra overhead totally unnecessary on everthing
that doesn't use genpd for cpuidle.

So maybe the driver could store it in its ->enter callback?  After all,
the driver knows that genpd is going to be used later.
Ulf Hansson Jan. 16, 2019, 7:57 a.m. UTC | #2
On Fri, 11 Jan 2019 at 12:07, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> > From: Lina Iyer <lina.iyer@linaro.org>
> >
> > Knowing the sleep duration of CPUs, is known to be needed while selecting
> > the most energy efficient idle state for a CPU or a group of CPUs.
> >
> > However, to be able to compute the sleep duration, we need to know at what
> > time the next expected wakeup is for the CPU. Therefore, let's export this
> > information via a new function, tick_nohz_get_next_wakeup(). Following
> > changes make use of it.
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Lina Iyer <ilina@codeaurora.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
> > Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
> > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > ---
> >
> > Changes in v10:
> >       - Updated function header of tick_nohz_get_next_wakeup().
> >
> > ---
> >  include/linux/tick.h     |  8 ++++++++
> >  kernel/time/tick-sched.c | 13 +++++++++++++
> >  2 files changed, 21 insertions(+)
> >
> > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > index 55388ab45fd4..e48f6b26b425 100644
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> >  extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> >  extern unsigned long tick_nohz_get_idle_calls(void);
> >  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> > +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> >  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> >  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> >
> > @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> >       *delta_next = TICK_NSEC;
> >       return *delta_next;
> >  }
> > +
> > +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> > +{
> > +     /* Next wake up is the tick period, assume it starts now */
> > +     return ktime_add(ktime_get(), TICK_NSEC);
> > +}
> > +
> >  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 69e673b88474..7a9166506503 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
> >       return ts->idle_calls;
> >  }
> >
> > +/**
> > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > + * @cpu: the particular CPU to get next wake up for
> > + *
> > + * Called for idle CPUs only.
> > + */
> > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > +{
> > +     struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > +
> > +     return dev->next_event;
> > +}
> > +
> >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> >  {
> >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> >
>
> Well, I have concerns regarding this one.
>
> I don't believe it is valid to call this new function for non-idle CPUs and
> the kerneldoc kind of says so, but the next patch doesn't actually prevent
> it from being called for a non-idle CPU (at the time it is called in there
> the target CPU may not be idle any more AFAICS).

You are correct, but let me clarify things.

We are calling this new API from the new genpd governor, which may
have a cpumask indicating there is more than one CPU attached to its
PM domain+sub-PM domains. In other words, we may call the API for
another CPU than the one we are executing on.

When the new genpd governor is called, all CPUs in the cpumask of the
genpd in question, are already runtime suspended and will remain so
throughout the decisions made by the governor.

However, because of the race condition, which needs to be manged by
the genpd backend driver and its corresponding FW, one of the CPU in
the genpd cpumask could potentially wake up from idle when the genpd
governor runs. However, as a part of exiting from idle, that CPU needs
to wait for the call to pm_runtime_get_sync() to return before
completing the exit patch of idle. This also means waiting for the
genpd governor to finish.

The point is, no matter what decision the governor takes under these
circumstances, the genpd backend driver and its FW must manage this
race condition during the last man standing. For PSCI OSI mode, it
means that if a cluster idle state is suggested by Linux during these
circumstances, it must be prevented and aborted.

>
> In principle, the cpuidle core can store this value, say in struct
> cpuidle_device of the given CPU, and expose a helper to access it from
> genpd, but that would be extra overhead totally unnecessary on everthing
> that doesn't use genpd for cpuidle.
>
> So maybe the driver could store it in its ->enter callback?  After all,
> the driver knows that genpd is going to be used later.

This would work, but it wouldn't really change much when it comes to
the race condition described above. Of course it would turn the code
into being more cpuidle specific, which seems reasonable to me.

Anyway, if I understand your suggestion, in principle it means
changing $subject patch in such way that the API should not take "int
cpu" as an in-parameter, but instead only use __this_cpu() to read out
the next event for current idle CPU.

Additionally, we need another new cpuidle API, which genpd can call to
retrieve a new per CPU "next event data" stored by the cpuidle driver
from its ->enter() callback. Is this a correct interpretation of your
suggestion?

Kind regards
Uffe
Rafael J. Wysocki Jan. 16, 2019, 10:59 a.m. UTC | #3
On Wed, Jan 16, 2019 at 8:58 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Fri, 11 Jan 2019 at 12:07, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> > > From: Lina Iyer <lina.iyer@linaro.org>
> > >
> > > Knowing the sleep duration of CPUs, is known to be needed while selecting
> > > the most energy efficient idle state for a CPU or a group of CPUs.
> > >
> > > However, to be able to compute the sleep duration, we need to know at what
> > > time the next expected wakeup is for the CPU. Therefore, let's export this
> > > information via a new function, tick_nohz_get_next_wakeup(). Following
> > > changes make use of it.
> > >
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Lina Iyer <ilina@codeaurora.org>
> > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
> > > Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > ---
> > >
> > > Changes in v10:
> > >       - Updated function header of tick_nohz_get_next_wakeup().
> > >
> > > ---
> > >  include/linux/tick.h     |  8 ++++++++
> > >  kernel/time/tick-sched.c | 13 +++++++++++++
> > >  2 files changed, 21 insertions(+)
> > >
> > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > index 55388ab45fd4..e48f6b26b425 100644
> > > --- a/include/linux/tick.h
> > > +++ b/include/linux/tick.h
> > > @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> > >  extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> > >  extern unsigned long tick_nohz_get_idle_calls(void);
> > >  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> > > +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> > >  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> > >  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> > >
> > > @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> > >       *delta_next = TICK_NSEC;
> > >       return *delta_next;
> > >  }
> > > +
> > > +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > +{
> > > +     /* Next wake up is the tick period, assume it starts now */
> > > +     return ktime_add(ktime_get(), TICK_NSEC);
> > > +}
> > > +
> > >  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> > >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> > >
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index 69e673b88474..7a9166506503 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
> > >       return ts->idle_calls;
> > >  }
> > >
> > > +/**
> > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > + * @cpu: the particular CPU to get next wake up for
> > > + *
> > > + * Called for idle CPUs only.
> > > + */
> > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > +{
> > > +     struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > +
> > > +     return dev->next_event;
> > > +}
> > > +
> > >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > >  {
> > >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > >
> >
> > Well, I have concerns regarding this one.
> >
> > I don't believe it is valid to call this new function for non-idle CPUs and
> > the kerneldoc kind of says so, but the next patch doesn't actually prevent
> > it from being called for a non-idle CPU (at the time it is called in there
> > the target CPU may not be idle any more AFAICS).
>
> You are correct, but let me clarify things.
>
> We are calling this new API from the new genpd governor, which may
> have a cpumask indicating there is more than one CPU attached to its
> PM domain+sub-PM domains. In other words, we may call the API for
> another CPU than the one we are executing on.
>
> When the new genpd governor is called, all CPUs in the cpumask of the
> genpd in question, are already runtime suspended and will remain so
> throughout the decisions made by the governor.
>
> However, because of the race condition, which needs to be manged by
> the genpd backend driver and its corresponding FW, one of the CPU in
> the genpd cpumask could potentially wake up from idle when the genpd
> governor runs. However, as a part of exiting from idle, that CPU needs
> to wait for the call to pm_runtime_get_sync() to return before
> completing the exit patch of idle. This also means waiting for the
> genpd governor to finish.

OK, so the CPU spins on a spin lock inside of the idle loop with interrupts off.

> The point is, no matter what decision the governor takes under these
> circumstances, the genpd backend driver and its FW must manage this
> race condition during the last man standing. For PSCI OSI mode, it
> means that if a cluster idle state is suggested by Linux during these
> circumstances, it must be prevented and aborted.

I would suggest putting a comment to explain that somewhere as it is
not really obvious.

> >
> > In principle, the cpuidle core can store this value, say in struct
> > cpuidle_device of the given CPU, and expose a helper to access it from
> > genpd, but that would be extra overhead totally unnecessary on everthing
> > that doesn't use genpd for cpuidle.
> >
> > So maybe the driver could store it in its ->enter callback?  After all,
> > the driver knows that genpd is going to be used later.
>
> This would work, but it wouldn't really change much when it comes to
> the race condition described above.

No, it wouldn't make the race go away.

> Of course it would turn the code
> into being more cpuidle specific, which seems reasonable to me.
>
> Anyway, if I understand your suggestion, in principle it means
> changing $subject patch in such way that the API should not take "int
> cpu" as an in-parameter, but instead only use __this_cpu() to read out
> the next event for current idle CPU.

Yes.

> Additionally, we need another new cpuidle API, which genpd can call to
> retrieve a new per CPU "next event data" stored by the cpuidle driver
> from its ->enter() callback. Is this a correct interpretation of your
> suggestion?

Yes, it is.

Generally, something like "cpuidle, give me the wakeup time of this
CPU".  And it may very well give you 0 if the CPU has woken up
already. :-)
Ulf Hansson Jan. 16, 2019, noon UTC | #4
On Wed, 16 Jan 2019 at 11:59, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Jan 16, 2019 at 8:58 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Fri, 11 Jan 2019 at 12:07, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> > > > From: Lina Iyer <lina.iyer@linaro.org>
> > > >
> > > > Knowing the sleep duration of CPUs, is known to be needed while selecting
> > > > the most energy efficient idle state for a CPU or a group of CPUs.
> > > >
> > > > However, to be able to compute the sleep duration, we need to know at what
> > > > time the next expected wakeup is for the CPU. Therefore, let's export this
> > > > information via a new function, tick_nohz_get_next_wakeup(). Following
> > > > changes make use of it.
> > > >
> > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > > Cc: Lina Iyer <ilina@codeaurora.org>
> > > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
> > > > Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > ---
> > > >
> > > > Changes in v10:
> > > >       - Updated function header of tick_nohz_get_next_wakeup().
> > > >
> > > > ---
> > > >  include/linux/tick.h     |  8 ++++++++
> > > >  kernel/time/tick-sched.c | 13 +++++++++++++
> > > >  2 files changed, 21 insertions(+)
> > > >
> > > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > > index 55388ab45fd4..e48f6b26b425 100644
> > > > --- a/include/linux/tick.h
> > > > +++ b/include/linux/tick.h
> > > > @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> > > >  extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> > > >  extern unsigned long tick_nohz_get_idle_calls(void);
> > > >  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> > > > +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> > > >  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> > > >  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> > > >
> > > > @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> > > >       *delta_next = TICK_NSEC;
> > > >       return *delta_next;
> > > >  }
> > > > +
> > > > +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > +{
> > > > +     /* Next wake up is the tick period, assume it starts now */
> > > > +     return ktime_add(ktime_get(), TICK_NSEC);
> > > > +}
> > > > +
> > > >  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> > > >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> > > >
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > index 69e673b88474..7a9166506503 100644
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
> > > >       return ts->idle_calls;
> > > >  }
> > > >
> > > > +/**
> > > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > > + * @cpu: the particular CPU to get next wake up for
> > > > + *
> > > > + * Called for idle CPUs only.
> > > > + */
> > > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > +{
> > > > +     struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > > +
> > > > +     return dev->next_event;
> > > > +}
> > > > +
> > > >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > > >  {
> > > >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > > >
> > >
> > > Well, I have concerns regarding this one.
> > >
> > > I don't believe it is valid to call this new function for non-idle CPUs and
> > > the kerneldoc kind of says so, but the next patch doesn't actually prevent
> > > it from being called for a non-idle CPU (at the time it is called in there
> > > the target CPU may not be idle any more AFAICS).
> >
> > You are correct, but let me clarify things.
> >
> > We are calling this new API from the new genpd governor, which may
> > have a cpumask indicating there is more than one CPU attached to its
> > PM domain+sub-PM domains. In other words, we may call the API for
> > another CPU than the one we are executing on.
> >
> > When the new genpd governor is called, all CPUs in the cpumask of the
> > genpd in question, are already runtime suspended and will remain so
> > throughout the decisions made by the governor.
> >
> > However, because of the race condition, which needs to be manged by
> > the genpd backend driver and its corresponding FW, one of the CPU in
> > the genpd cpumask could potentially wake up from idle when the genpd
> > governor runs. However, as a part of exiting from idle, that CPU needs
> > to wait for the call to pm_runtime_get_sync() to return before
> > completing the exit patch of idle. This also means waiting for the
> > genpd governor to finish.
>
> OK, so the CPU spins on a spin lock inside of the idle loop with interrupts off.

Correct.

This is the part that is not very nice, but ideally it should be a
rather rare condition as it only happens during the last man standing
point.

>
> > The point is, no matter what decision the governor takes under these
> > circumstances, the genpd backend driver and its FW must manage this
> > race condition during the last man standing. For PSCI OSI mode, it
> > means that if a cluster idle state is suggested by Linux during these
> > circumstances, it must be prevented and aborted.
>
> I would suggest putting a comment to explain that somewhere as it is
> not really obvious.

Let me see if can squeeze in that somewhere, probably it's best suited
in the new genpd governor code somewhere.

>
> > >
> > > In principle, the cpuidle core can store this value, say in struct
> > > cpuidle_device of the given CPU, and expose a helper to access it from
> > > genpd, but that would be extra overhead totally unnecessary on everthing
> > > that doesn't use genpd for cpuidle.
> > >
> > > So maybe the driver could store it in its ->enter callback?  After all,
> > > the driver knows that genpd is going to be used later.
> >
> > This would work, but it wouldn't really change much when it comes to
> > the race condition described above.
>
> No, it wouldn't make the race go away.
>
> > Of course it would turn the code
> > into being more cpuidle specific, which seems reasonable to me.
> >
> > Anyway, if I understand your suggestion, in principle it means
> > changing $subject patch in such way that the API should not take "int
> > cpu" as an in-parameter, but instead only use __this_cpu() to read out
> > the next event for current idle CPU.
>
> Yes.
>
> > Additionally, we need another new cpuidle API, which genpd can call to
> > retrieve a new per CPU "next event data" stored by the cpuidle driver
> > from its ->enter() callback. Is this a correct interpretation of your
> > suggestion?
>
> Yes, it is.

Thanks for confirming!

>
> Generally, something like "cpuidle, give me the wakeup time of this
> CPU".  And it may very well give you 0 if the CPU has woken up
> already. :-)

Yep, I was thinking something like that, so in principle it may
minimize the window of receiving in-correct "next wakeup data" in
genpd for a non-idle CPU, but again it doesn't solve the race
condition.

Alright, I re-spin this according to your suggestions. Thanks for reviewing!

Kind regards
Uffe
Ulf Hansson Jan. 25, 2019, 10:04 a.m. UTC | #5
[...]

> > > > > +/**
> > > > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > > > + * @cpu: the particular CPU to get next wake up for
> > > > > + *
> > > > > + * Called for idle CPUs only.
> > > > > + */
> > > > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > > +{
> > > > > +     struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > > > +
> > > > > +     return dev->next_event;
> > > > > +}
> > > > > +
> > > > >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > > > >  {
> > > > >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > > > >
> > > >
> > > > Well, I have concerns regarding this one.
> > > >
> > > > I don't believe it is valid to call this new function for non-idle CPUs and
> > > > the kerneldoc kind of says so, but the next patch doesn't actually prevent
> > > > it from being called for a non-idle CPU (at the time it is called in there
> > > > the target CPU may not be idle any more AFAICS).
> > >
> > > You are correct, but let me clarify things.
> > >
> > > We are calling this new API from the new genpd governor, which may
> > > have a cpumask indicating there is more than one CPU attached to its
> > > PM domain+sub-PM domains. In other words, we may call the API for
> > > another CPU than the one we are executing on.
> > >
> > > When the new genpd governor is called, all CPUs in the cpumask of the
> > > genpd in question, are already runtime suspended and will remain so
> > > throughout the decisions made by the governor.
> > >
> > > However, because of the race condition, which needs to be manged by
> > > the genpd backend driver and its corresponding FW, one of the CPU in
> > > the genpd cpumask could potentially wake up from idle when the genpd
> > > governor runs. However, as a part of exiting from idle, that CPU needs
> > > to wait for the call to pm_runtime_get_sync() to return before
> > > completing the exit patch of idle. This also means waiting for the
> > > genpd governor to finish.
> >
> > OK, so the CPU spins on a spin lock inside of the idle loop with interrupts off.
>
> Correct.
>
> This is the part that is not very nice, but ideally it should be a
> rather rare condition as it only happens during the last man standing
> point.
>
> >
> > > The point is, no matter what decision the governor takes under these
> > > circumstances, the genpd backend driver and its FW must manage this
> > > race condition during the last man standing. For PSCI OSI mode, it
> > > means that if a cluster idle state is suggested by Linux during these
> > > circumstances, it must be prevented and aborted.
> >
> > I would suggest putting a comment to explain that somewhere as it is
> > not really obvious.
>
> Let me see if can squeeze in that somewhere, probably it's best suited
> in the new genpd governor code somewhere.
>
> >
> > > >
> > > > In principle, the cpuidle core can store this value, say in struct
> > > > cpuidle_device of the given CPU, and expose a helper to access it from
> > > > genpd, but that would be extra overhead totally unnecessary on everthing
> > > > that doesn't use genpd for cpuidle.
> > > >
> > > > So maybe the driver could store it in its ->enter callback?  After all,
> > > > the driver knows that genpd is going to be used later.
> > >
> > > This would work, but it wouldn't really change much when it comes to
> > > the race condition described above.
> >
> > No, it wouldn't make the race go away.
> >
> > > Of course it would turn the code
> > > into being more cpuidle specific, which seems reasonable to me.
> > >
> > > Anyway, if I understand your suggestion, in principle it means
> > > changing $subject patch in such way that the API should not take "int
> > > cpu" as an in-parameter, but instead only use __this_cpu() to read out
> > > the next event for current idle CPU.
> >
> > Yes.

I have looked closer to this and it turned out that it seems that I
should probably not need introduce an entirely new thing here. Instead
I should likely be able to re-factor the current
tick_nohz_get_sleep_length() and tick_nohz_next_event(), as those are
in principle doing the similar things as I need. So I started hacking
on that, when Daniel Lezcano told me that he already have a patch
doing exactly what I want. :-) However, in the context of his "next
wakeup prediction" work, but that shouldn't matter.

If I can make it work, I will fold in his patch in the next version of
the series instead.

Please tell if you already at this point, see any issues with this approach.

> >
> > > Additionally, we need another new cpuidle API, which genpd can call to
> > > retrieve a new per CPU "next event data" stored by the cpuidle driver
> > > from its ->enter() callback. Is this a correct interpretation of your
> > > suggestion?
> >
> > Yes, it is.
>
> Thanks for confirming!
>
> >
> > Generally, something like "cpuidle, give me the wakeup time of this
> > CPU".  And it may very well give you 0 if the CPU has woken up
> > already. :-)
>
> Yep, I was thinking something like that, so in principle it may
> minimize the window of receiving in-correct "next wakeup data" in
> genpd for a non-idle CPU, but again it doesn't solve the race
> condition.
>

Kind regards
Uffe
Rafael J. Wysocki Jan. 25, 2019, 10:18 a.m. UTC | #6
On Fri, Jan 25, 2019 at 11:04 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> [...]
>
> > > > > > +/**
> > > > > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > > > > + * @cpu: the particular CPU to get next wake up for
> > > > > > + *
> > > > > > + * Called for idle CPUs only.
> > > > > > + */
> > > > > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > > > +{
> > > > > > +     struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > > > > +
> > > > > > +     return dev->next_event;
> > > > > > +}
> > > > > > +
> > > > > >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > > > > >  {
> > > > > >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > > > > >
> > > > >
> > > > > Well, I have concerns regarding this one.
> > > > >
> > > > > I don't believe it is valid to call this new function for non-idle CPUs and
> > > > > the kerneldoc kind of says so, but the next patch doesn't actually prevent
> > > > > it from being called for a non-idle CPU (at the time it is called in there
> > > > > the target CPU may not be idle any more AFAICS).
> > > >
> > > > You are correct, but let me clarify things.
> > > >
> > > > We are calling this new API from the new genpd governor, which may
> > > > have a cpumask indicating there is more than one CPU attached to its
> > > > PM domain+sub-PM domains. In other words, we may call the API for
> > > > another CPU than the one we are executing on.
> > > >
> > > > When the new genpd governor is called, all CPUs in the cpumask of the
> > > > genpd in question, are already runtime suspended and will remain so
> > > > throughout the decisions made by the governor.
> > > >
> > > > However, because of the race condition, which needs to be manged by
> > > > the genpd backend driver and its corresponding FW, one of the CPU in
> > > > the genpd cpumask could potentially wake up from idle when the genpd
> > > > governor runs. However, as a part of exiting from idle, that CPU needs
> > > > to wait for the call to pm_runtime_get_sync() to return before
> > > > completing the exit patch of idle. This also means waiting for the
> > > > genpd governor to finish.
> > >
> > > OK, so the CPU spins on a spin lock inside of the idle loop with interrupts off.
> >
> > Correct.
> >
> > This is the part that is not very nice, but ideally it should be a
> > rather rare condition as it only happens during the last man standing
> > point.
> >
> > >
> > > > The point is, no matter what decision the governor takes under these
> > > > circumstances, the genpd backend driver and its FW must manage this
> > > > race condition during the last man standing. For PSCI OSI mode, it
> > > > means that if a cluster idle state is suggested by Linux during these
> > > > circumstances, it must be prevented and aborted.
> > >
> > > I would suggest putting a comment to explain that somewhere as it is
> > > not really obvious.
> >
> > Let me see if can squeeze in that somewhere, probably it's best suited
> > in the new genpd governor code somewhere.
> >
> > >
> > > > >
> > > > > In principle, the cpuidle core can store this value, say in struct
> > > > > cpuidle_device of the given CPU, and expose a helper to access it from
> > > > > genpd, but that would be extra overhead totally unnecessary on everthing
> > > > > that doesn't use genpd for cpuidle.
> > > > >
> > > > > So maybe the driver could store it in its ->enter callback?  After all,
> > > > > the driver knows that genpd is going to be used later.
> > > >
> > > > This would work, but it wouldn't really change much when it comes to
> > > > the race condition described above.
> > >
> > > No, it wouldn't make the race go away.
> > >
> > > > Of course it would turn the code
> > > > into being more cpuidle specific, which seems reasonable to me.
> > > >
> > > > Anyway, if I understand your suggestion, in principle it means
> > > > changing $subject patch in such way that the API should not take "int
> > > > cpu" as an in-parameter, but instead only use __this_cpu() to read out
> > > > the next event for current idle CPU.
> > >
> > > Yes.
>
> I have looked closer to this and it turned out that it seems that I
> should probably not need introduce an entirely new thing here. Instead
> I should likely be able to re-factor the current
> tick_nohz_get_sleep_length() and tick_nohz_next_event(), as those are
> in principle doing the similar things as I need. So I started hacking
> on that, when Daniel Lezcano told me that he already have a patch
> doing exactly what I want. :-) However, in the context of his "next
> wakeup prediction" work, but that shouldn't matter.
>
> If I can make it work, I will fold in his patch in the next version of
> the series instead.
>
> Please tell if you already at this point, see any issues with this approach.

Not in principle as long as you do that in the context of the cpuidle
framework.  That is, I still would like to have this  "cpuidle, give
me the wakeup time of this CPU" I/F to the genpd governor, but you can
do the above to implement it as long as I'm concerned.
diff mbox series

Patch

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 55388ab45fd4..e48f6b26b425 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -125,6 +125,7 @@  extern bool tick_nohz_idle_got_tick(void);
 extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
 extern unsigned long tick_nohz_get_idle_calls(void);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
+extern ktime_t tick_nohz_get_next_wakeup(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 
@@ -151,6 +152,13 @@  static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 	*delta_next = TICK_NSEC;
 	return *delta_next;
 }
+
+static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
+{
+	/* Next wake up is the tick period, assume it starts now */
+	return ktime_add(ktime_get(), TICK_NSEC);
+}
+
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b88474..7a9166506503 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1089,6 +1089,19 @@  unsigned long tick_nohz_get_idle_calls(void)
 	return ts->idle_calls;
 }
 
+/**
+ * tick_nohz_get_next_wakeup - return the next wake up of the CPU
+ * @cpu: the particular CPU to get next wake up for
+ *
+ * Called for idle CPUs only.
+ */
+ktime_t tick_nohz_get_next_wakeup(int cpu)
+{
+	struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
+
+	return dev->next_event;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE