cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

Message ID 1619104049-5118-1-git-send-email-ego@linux.vnet.ibm.com (mailing list archive)
State Not Applicable, archived

Commit Message

Gautham R Shenoy April 22, 2021, 3:07 p.m. UTC
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform.

On some of the POWER9 LPARs, the older firmwares advertise a very low
value of 2us for CEDE1 exit latency on a Dedicated LPAR. However, the
measured value is 5us on average. Due to the low advertised exit
latency, we are entering CEDE(0) more aggressively on such
platforms. While this helps achieve SMT folding faster, we pay the
penalty of having to send an IPI to wake up the CPU when the target
residency is very short. This is showing up as a performance
regression on the newer kernels running on the LPARs with older
firmware.

Hence, set the exit latency of CEDE(0) based on the latency values
advertized by the platform only from POWER10 onwards. The values
advertized on POWER10 platforms are more realistic and informed by the
latency measurements.

For platforms older than POWER10, retain the hardcoded value of exit
latency, which is 10us. Though this is higher than the measured
values, we would be erring on the side of caution.

Reported-by: Enrico Joedecke <joedecke@de.ibm.com>
Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)")
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-pseries.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Vaidyanathan Srinivasan April 22, 2021, 5:57 p.m. UTC | #1
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2021-04-22 20:37:29]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform
> 
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> measured value is 5us on an average. Due to the low advertised exit
> latency, we are entering CEDE(0) more aggressively on such
> platforms. While this helps achieve SMT folding faster, we pay the
> penalty of having to send an IPI to wakeup the CPU when the target
> residency is very short. This is showing up as a performance
> regression on the newer kernels running on the LPARs with older
> firmware.
> 
> Hence, set the exit latency of CEDE(0) based on the latency values
> advertized by platform only from POWER10 onwards. The values
> advertized on POWER10 platforms is more realistic and informed by the
> latency measurements.
> 
> For platforms older than POWER10, retain the hardcoded value of exit
> latency, which is 10us. Though this is higher than the measured
> values, we would be erring on the side of caution.
> 
> Reported-by: Enrico Joedecke <joedecke@de.ibm.com>
> Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)")
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>

> ---
>  drivers/cpuidle/cpuidle-pseries.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index a2b5c6f..7207467 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -419,7 +419,8 @@ static int pseries_idle_probe(void)
>  			cpuidle_state_table = shared_states;
>  			max_idle_state = ARRAY_SIZE(shared_states);
>  		} else {
> -			fixup_cede0_latency();
> +			if (pvr_version_is(PVR_POWER10))
> +				fixup_cede0_latency();
>  			cpuidle_state_table = dedicated_states;
>  			max_idle_state = NR_DEDICATED_STATES;
>  		}

Thanks for the fix.  We should have such safeguards or fallbacks while
running on older platforms.  This fix is needed because of the
unfortunate regression on some older platforms that we failed to
notice while designing and testing the original feature.

--Vaidy
Michal Suchanek April 23, 2021, 7:35 a.m. UTC | #2
On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform
> 
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
Can you be more specific about 'older firmwares'?

Also while this is a performance regression on such firmwares it
should be fixed by updating the firmware to current version.

Having sub-optimal performance on obsolete firmware should not require a
kernel workaround, should it?

It's not like the kernel would crash on the affected firmware.

Thanks

Michal
Vaidyanathan Srinivasan April 23, 2021, 3:59 p.m. UTC | #3
* Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:

> On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform
> > 
> > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> Can you be more specific about 'older firmwares'?

Hi Michal,

This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
key idea behind the original patch was to make the H_CEDE latency, and
hence the target residency, come from firmware instead of being decided
by the kernel.  The advantage is that different types of systems in the
POWER10 generation can adjust this value and have optimal H_CEDE
entry criteria that balance good single-thread performance and
wakeup latency.  Further, we can have additional H_CEDE states to feed
into cpuidle.

> Also while this is a performance regression on such firmwares it
> should be fixed by updating the firmware to current version.
> 
> Having sub-optimal performance on obsolete firmware should not require a
> kernel workaround, should it?

When we designed and tested this change on POWER9 and POWER10 systems
the values that were set in F/w were working out fine with positive
results in all our micro benchmarks and no regression in context
switch tests.  These repeatable results gave us the confidence that we
can go ahead and set the values from F/w and remove the kernel's value
for all future Linux versions.

But where we slipped is that real-world workloads show variations in
performance, and regressions in specific cases, because we are
favouring the H_CEDE state more often than the snooze loop.  The root
cause is that we now have to send more IPIs for wakeups because more
CPUs will be in the H_CEDE state than before.

This is a performance problem on POWER9 systems where we actually
expected good benefit and also proved them with micro benchmarks, but
later it turned out to have an impact for some workloads.  Further the
challenge is not that regressions are severe, it is the fact that on
exact same hardware and firmware end users expect similar or better
performance for everything when updating to a newer kernel and no
regressions.

We have these settings adjusted for POWER10 in F/w, and hence behaviour
will be similar when we move from an old kernel on P9 to a new kernel
on P10.  We did also test the reverse, i.e. that a new kernel on P9
should show a benefit.  But as explained, the benefit came at the cost
of regressions in a few cases which were discovered later.

Hence this fix is to keep exact same behaviour for POWER9 and use this
F/w driven heuristics only from POWER10.

> It's not like the kernel would crash on the affected firmware.

Correct. We do not have a functional issue, but only a performance
regression observable on certain real workloads.

This is a minor change in cpuidle's H_CEDE usage which will show up
only in certain workload patterns where we need idle CPU threads to
wakeup faster to get the job done as compared to keeping busy CPU
threads in single thread mode to get more execution slices.

This fix is primarily to ensure that a kernel update does not change
H_CEDE behaviour on the same hardware generation, thereby causing
performance variation and also regressions in some cases.

Thanks for the questions and comments, I hope this gives additional
context for this fix.

--Vaidy
Michal Suchanek April 23, 2021, 5:45 p.m. UTC | #4
On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> 
> > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > 
> > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > of the Extended CEDE states advertised by the platform
> > > 
> > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > Can you be more specific about 'older firmwares'?
> 
> Hi Michal,
> 
> This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> key idea behind the original patch was to make the H_CEDE latency and
> hence target residency come from firmware instead of being decided by
> the kernel.  The advantage is such that, different type of systems in
> POWER10 generation can adjust this value and have an optimal H_CEDE
> entry criteria which balances good single thread performance and
> wakeup latency.  Further we can have additional H_CEDE state to feed
> into the cpuidle.  

So all POWER9 machines are affected by the firmware bug where firmware
reports CEDE1 exit latency of 2us and the real latency is 5us which
causes the kernel to prefer CEDE1 too much when relying on the values
supplied by the firmware. It is not about 'older firmware'.

I still think it would be preferable to adjust the latency value
reported by the firmware to match reality over a kernel workaround.

Thanks

Michal
Vaidyanathan Srinivasan April 23, 2021, 6:29 p.m. UTC | #5
* Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:

> On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > 
> > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > 
> > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > of the Extended CEDE states advertised by the platform
> > > > 
> > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > Can you be more specific about 'older firmwares'?
> > 
> > Hi Michal,
> > 
> > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > key idea behind the original patch was to make the H_CEDE latency and
> > hence target residency come from firmware instead of being decided by
> > the kernel.  The advantage is such that, different type of systems in
> > POWER10 generation can adjust this value and have an optimal H_CEDE
> > entry criteria which balances good single thread performance and
> > wakeup latency.  Further we can have additional H_CEDE state to feed
> > into the cpuidle.  
> 
> So all POWER9 machines are affected by the firmware bug where firmware
> reports CEDE1 exit latency of 2us and the real latency is 5us which
> causes the kernel to prefer CEDE1 too much when relying on the values
> supplied by the firmware. It is not about 'older firmware'.

Correct.  All POWER9 systems running Linux as guest LPARs will see
extra usage of CEDE idle state, but not baremetal (PowerNV).

The correct definition of the bug, or mismatch in expectation, is that
firmware reports wakeup latency based on core/thread wakeup timing, but
not the end-to-end time from sending a wakeup event like an IPI using
H_calls to receiving the event on the target.  In practice, a few extra
microseconds are needed between deciding to wake up a target
core/thread and the target starting to execute instructions
within the LPAR instance.

> I still think it would be preferrable to adjust the latency value
> reported by the firmware to match reality over a kernel workaround.

Right, practically we can fix for future releases and as such we
targeted this scheme from POWER10 but expected no harm on POWER9 which
proved to be wrong.

We can possibly change this FW value for POWER9, but it is too
expensive and not practical because many release streams exist for
different platforms and further customers are at different streams as
well.  We cannot force all of them to update because that blows up
co-dependency matrix.

From a user/customer's view current kernel worked fine, why is
a kernel update changing my performance :(

Looking back, we should have boxed any kernel-firmware dependent
feature like this one to future releases only.  We have all options
open for a future release like POWER10, but on a released product
stream, we have to manage with kernel changes.  In this specific case
we should have been very conservative and not allowed the kernel to
change behaviour on released products.

Thanks,
Vaidy
Michal Suchanek April 23, 2021, 6:42 p.m. UTC | #6
On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:
> 
> > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > > 
> > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > 
> > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > of the Extended CEDE states advertised by the platform
> > > > > 
> > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > Can you be more specific about 'older firmwares'?
> > > 
> > > Hi Michal,
> > > 
> > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > key idea behind the original patch was to make the H_CEDE latency and
> > > hence target residency come from firmware instead of being decided by
> > > the kernel.  The advantage is such that, different type of systems in
> > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > entry criteria which balances good single thread performance and
> > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > into the cpuidle.  
> > 
> > So all POWER9 machines are affected by the firmware bug where firmware
> > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > causes the kernel to prefer CEDE1 too much when relying on the values
> > supplied by the firmware. It is not about 'older firmware'.
> 
> Correct.  All POWER9 systems running Linux as guest LPARs will see
> extra usage of CEDE idle state, but not baremetal (PowerNV).
> 
> The correct definition of the bug or miss-match in expectation is that
> firmware reports wakeup latency from a core/thread wakeup timing, but
> not end-to-end time from sending a wakeup event like an IPI using
> H_calls and receiving the events on the target.  Practically there are
> few extra micro-seconds needed after deciding to wakeup a target
> core/thread to getting the target to start executing instructions
> within the LPAR instance.

Thanks for the detailed explanation.

Maybe just adding a few microseconds to the reported time would be a
more reasonable workaround than using a blanket fixed value then.

> 
> > I still think it would be preferrable to adjust the latency value
> > reported by the firmware to match reality over a kernel workaround.
> 
> Right, practically we can fix for future releases and as such we
> targeted this scheme from POWER10 but expected no harm on POWER9 which
> proved to be wrong.
> 
> We can possibly change this FW value for POWER9, but it is too
> expensive and not practical because many release streams exist for
> different platforms and further customers are at different streams as
> well.  We cannot force all of them to update because that blows up
> co-dependency matrix.

From the user point of view only a few firmware release streams exist,
but what is packaged in such binaries might be another story.

Thanks

Michal
Vaidyanathan Srinivasan April 24, 2021, 7:37 a.m. UTC | #7
* Michal Suchánek <msuchanek@suse.de> [2021-04-23 20:42:16]:

> On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:
> > 
> > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > > > 
> > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > 
> > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > > of the Extended CEDE states advertised by the platform
> > > > > > 
> > > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > > Can you be more specific about 'older firmwares'?
> > > > 
> > > > Hi Michal,
> > > > 
> > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > > key idea behind the original patch was to make the H_CEDE latency and
> > > > hence target residency come from firmware instead of being decided by
> > > > the kernel.  The advantage is such that, different type of systems in
> > > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > > entry criteria which balances good single thread performance and
> > > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > > into the cpuidle.  
> > > 
> > > So all POWER9 machines are affected by the firmware bug where firmware
> > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > causes the kernel to prefer CEDE1 too much when relying on the values
> > > supplied by the firmware. It is not about 'older firmware'.
> > 
> > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > 
> > The correct definition of the bug or miss-match in expectation is that
> > firmware reports wakeup latency from a core/thread wakeup timing, but
> > not end-to-end time from sending a wakeup event like an IPI using
> > H_calls and receiving the events on the target.  Practically there are
> > few extra micro-seconds needed after deciding to wakeup a target
> > core/thread to getting the target to start executing instructions
> > within the LPAR instance.
> 
> Thanks for the detailed explanation.
> 
> Maybe just adding a few microseconds to the reported time would be a
> more reasonable workaround than using a blanket fixed value then.

Yes, that is an option.  But that may only reduce the difference
between existing kernel and new kernel unless we make it the same
number.  Further we are fixing this in P10 and hence we will have to
add "if(P9) do the compensation" and otherwise take it as is.  That
would not be elegant.  Given that our goal for P9 platform is to not
introduce changes in H_CEDE entry behaviour, we arrived at this
approach (this small patch) and this also makes it easy to backport to
various distro products.

Thanks,
Vaidy
Michal Suchanek April 25, 2021, 11:07 a.m. UTC | #8
On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek <msuchanek@suse.de> [2021-04-23 20:42:16]:
> 
> > On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:
> > > 
> > > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > > > > 
> > > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > > > of the Extended CEDE states advertised by the platform
> > > > > > > 
> > > > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > > > Can you be more specific about 'older firmwares'?
> > > > > 
> > > > > Hi Michal,
> > > > > 
> > > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > > > key idea behind the original patch was to make the H_CEDE latency and
> > > > > hence target residency come from firmware instead of being decided by
> > > > > the kernel.  The advantage is such that, different type of systems in
> > > > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > > > entry criteria which balances good single thread performance and
> > > > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > > > into the cpuidle.  
> > > > 
> > > > So all POWER9 machines are affected by the firmware bug where firmware
> > > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > > causes the kernel to prefer CEDE1 too much when relying on the values
> > > > supplied by the firmware. It is not about 'older firmware'.
> > > 
> > > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > > 
> > > The correct definition of the bug or miss-match in expectation is that
> > > firmware reports wakeup latency from a core/thread wakeup timing, but
> > > not end-to-end time from sending a wakeup event like an IPI using
> > > H_calls and receiving the events on the target.  Practically there are
> > > few extra micro-seconds needed after deciding to wakeup a target
> > > core/thread to getting the target to start executing instructions
> > > within the LPAR instance.
> > 
> > Thanks for the detailed explanation.
> > 
> > Maybe just adding a few microseconds to the reported time would be a
> > more reasonable workaround than using a blanket fixed value then.
> 
> Yes, that is an option.  But that may only reduce the difference
> between existing kernel and new kernel unless we make it the same
> number.  Further we are fixing this in P10 and hence we will have to
> add "if(P9) do the compensation" and otherwise take it as is.  That
> would not be elegant.  Given that our goal for P9 platform is to not
> introduce changes in H_CEDE entry behaviour, we arrived at this
> approach (this small patch) and this also makes it easy to backport to
> various distro products.

I don't see how this is more elegant.

The current patch is

if(p9)
	use fixed value

the suggested patch is

if(p9)
	apply compensation

That is, either approach adds one branch for the affected platform.
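
The two alternatives being compared can be put side by side as a small
sketch. This is illustrative only and is not the driver code: the
helper names, the is_power10 flag (standing in for the platform check
in the patch) and the compensation_us parameter are assumptions made
for this example. Only the 10us default comes from the patch
description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hardcoded CEDE(0) exit latency quoted in the patch description. */
#define CEDE0_DEFAULT_LATENCY_US 10u

/*
 * Posted patch (sketch): trust the firmware-advertised value only on
 * POWER10; otherwise keep the fixed default.
 */
static inline uint32_t cede0_latency_fixed(bool is_power10,
					   uint32_t advertised_us)
{
	return is_power10 ? advertised_us : CEDE0_DEFAULT_LATENCY_US;
}

/*
 * Suggested alternative (sketch): on older platforms, add a
 * compensation on top of the advertised value. compensation_us is a
 * hypothetical parameter; the thread does not settle on a value.
 */
static inline uint32_t cede0_latency_compensated(bool is_power10,
						 uint32_t advertised_us,
						 uint32_t compensation_us)
{
	return is_power10 ? advertised_us
			  : advertised_us + compensation_us;
}
```

Both helpers carry exactly one platform branch. They differ only in
whether the pre-POWER10 result depends on the advertised value.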

But I understand that if you do not have confidence that the
compensation is the same in all cases, and do not have the opportunity
to measure it, it may be simpler to apply one very conservative
adjustment.

Thanks

Michal
Gautham R Shenoy April 28, 2021, 5:58 a.m. UTC | #9
Hello Michal,

On Sun, Apr 25, 2021 at 01:07:14PM +0200, Michal Suchánek wrote:
> On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:
> > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 20:42:16]:
> > 
> > > On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:
> > > > 
> > > > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > > > > > 
> > > > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > > 
> > > > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > > > > of the Extended CEDE states advertised by the platform
> > > > > > > > 
> > > > > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > > > > Can you be more specific about 'older firmwares'?
> > > > > > 
> > > > > > Hi Michal,
> > > > > > 
> > > > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > > > > key idea behind the original patch was to make the H_CEDE latency and
> > > > > > hence target residency come from firmware instead of being decided by
> > > > > > the kernel.  The advantage is such that, different type of systems in
> > > > > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > > > > entry criteria which balances good single thread performance and
> > > > > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > > > > into the cpuidle.  
> > > > > 
> > > > > So all POWER9 machines are affected by the firmware bug where firmware
> > > > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > > > causes the kernel to prefer CEDE1 too much when relying on the values
> > > > > supplied by the firmware. It is not about 'older firmware'.
> > > > 
> > > > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > > > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > > > 
> > > > The correct definition of the bug or miss-match in expectation is that
> > > > firmware reports wakeup latency from a core/thread wakeup timing, but
> > > > not end-to-end time from sending a wakeup event like an IPI using
> > > > H_calls and receiving the events on the target.  Practically there are
> > > > few extra micro-seconds needed after deciding to wakeup a target
> > > > core/thread to getting the target to start executing instructions
> > > > within the LPAR instance.
> > > 
> > > Thanks for the detailed explanation.
> > > 
> > > Maybe just adding a few microseconds to the reported time would be a
> > > more reasonable workaround than using a blanket fixed value then.
> > 
> > Yes, that is an option.  But that may only reduce the difference
> > between existing kernel and new kernel unless we make it the same
> > number.  Further we are fixing this in P10 and hence we will have to
> > add "if(P9) do the compensation" and otherwise take it as is.  That
> > would not be elegant.  Given that our goal for P9 platform is to not
> > introduce changes in H_CEDE entry behaviour, we arrived at this
> > approach (this small patch) and this also makes it easy to backport to
> > various distro products.
> 
> I don't see how this is more elegent.
> 
> The current patch is
> 
> if(p9)
> 	use fixed value
> 
> the suggested patch is
> 
> if(p9)
> 	apply compensation


We could do that, however, from the recent measurements the default
value is closer to the latency value measured using an IPI.

As Vaidy described earlier, on POWER9 and prior platforms, the wakeup
latency advertized by the PHYP hypervisor corresponds to the latency
required to wakeup from the underlying hardware idle state (Nap in
POWER8 and stop0/1/2 on POWER9) into the hypervisor. That's 2us on
POWER9.

We need to apply two kinds of compensation,

1. Compensation for the time taken to transition the CPU from the
   Hypervisor into the LPAR post wakeup from platform idle state

2. Compensation for the time taken to send the IPI from the source CPU
   (waker) to the idle target CPU (wakee).

Compensation 1 can be measured via a timer idle test (I am using
Pratik's cpuidle self-test posted here:
https://lore.kernel.org/lkml/20210412074309.38484-1-psampat@linux.ibm.com/)

We queue a timer, say for 1ms, and enter the CEDE state. When the
timer fires, in the timer handler we compute how much extra time over
the expected 1ms we have consumed. This is what it looks like on a
POWER9 LPAR:

CEDE latency measured using a timer (numbers in ns)
================================================================
N     Min    Median   Avg       90%ile   99%ile   Max    Stddev
400   2601   5677     5668.74   5917     6413     9299   455.01

Considering the average and the 99th percentile values, and
subtracting the 2us hardware wakeup latency, it takes roughly
3.5-4.5us to transition from the hypervisor to the guest VCPU after
the CPU has woken up from the idle state.
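
The measurement idea (arm a timer, go idle, see how late the wakeup
is) can be roughly approximated from userspace. The sketch below is
not Pratik's self-test, which runs in the kernel. It is a hypothetical
analogue built on POSIX clock_nanosleep(), and what it measures also
includes scheduler and syscall overhead, so its numbers are not
comparable to the table above. The function names are made up for
this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_nanosleep(); set before includes */
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert a timespec to nanoseconds. */
static inline int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Sleep until an absolute deadline delay_ns from now, then return how
 * many nanoseconds past that deadline the caller resumed.
 * Returns -1 on error.
 */
static inline int64_t timer_overshoot_ns(int64_t delay_ns)
{
	struct timespec now, deadline, woke;
	int64_t target_ns;
	int rc;

	if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
		return -1;

	target_ns = ts_to_ns(&now) + delay_ns;
	deadline.tv_sec = (time_t)(target_ns / NSEC_PER_SEC);
	deadline.tv_nsec = (long)(target_ns % NSEC_PER_SEC);

	/* clock_nanosleep() returns the error number; retry if interrupted. */
	do {
		rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
				     &deadline, NULL);
	} while (rc == EINTR);
	if (rc != 0)
		return -1;

	if (clock_gettime(CLOCK_MONOTONIC, &woke) != 0)
		return -1;

	/* Overshoot = actual wakeup time minus requested wakeup time. */
	return ts_to_ns(&woke) - target_ns;
}
```

Calling timer_overshoot_ns(1000000) repeatedly and collecting the
results would give a distribution analogous in shape to the
Min/Median/Avg columns above.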

1. and 2. combined can be determined by an IPI latency test (from the
same self-test linked above). We send an IPI to an idle CPU and in the
handler compute the time difference between when the IPI was sent and
when the handler ran. We see the following numbers on POWER9 LPAR.

CEDE latency measured using an IPI (numbers in ns)
================================================================
N     Min    Median   Avg       90%ile   99%ile   Max    Stddev
400   711    7564     7369.43   8559     9514     9698   1200.01

Thus considering the avg and the 99th percentile this compensation
would be 5.4-7.5us.

Suppose, we consider the compensation corresponding to the 99th
percentile latency value measured using the IPI, the compensation will
be 7.5us, which will take the total CEDE latency to 9.5us.

This is in the ballpark of the default value of 10us which we obtain
if we do

if (!p10)
   use default hardcoded value;
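
The arithmetic behind this comparison can be written out as a small
sketch. The 2us advertised latency, the 10us default and the measured
values are the ones quoted in this mail; the helper names are
hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Values quoted in this thread, expressed in nanoseconds. */
#define ADVERTISED_CEDE_LATENCY_NS	2000	/* 2us advertised on POWER9 */
#define DEFAULT_CEDE0_LATENCY_NS	10000	/* 10us hardcoded default */

/* Extra latency seen by the LPAR beyond what firmware advertises. */
static inline int64_t compensation_ns(int64_t measured_ns,
				      int64_t advertised_ns)
{
	return measured_ns - advertised_ns;
}

/* How far the hardcoded default sits above a measured latency. */
static inline int64_t default_margin_ns(int64_t measured_ns)
{
	return DEFAULT_CEDE0_LATENCY_NS - measured_ns;
}
```

With the 99th percentile IPI latency of 9514ns,
compensation_ns(9514, ADVERTISED_CEDE_LATENCY_NS) gives 7514ns (about
7.5us), and default_margin_ns(9514) gives 486ns, i.e. the 10us default
sits about 0.5us above the measured end-to-end value.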


> 
> That is either will add one branch for the affected platform.
>

From POWER10 onwards, the latency value advertized by the hypervisor
will be the latency as observed by the LPAR VCPU, so any new code that
we add would only be applicable to POWER9. We can get the same
effect by using the default value.

Given this, if you feel that it might still be worth pursuing the
compensation approach, I will send out a patch for that.

> But I understand if you do not have confidence that the compensation is
> the same in all cases and do not have the opportunity to measure it it
> may be simpler to apply one very conservative adjustment.
>



> Thanks
> 
> Michal

--
Thanks and Regards
gautham.
Michal Suchanek April 28, 2021, 8:03 a.m. UTC | #10
On Wed, Apr 28, 2021 at 11:28:48AM +0530, Gautham R Shenoy wrote:
> Hello Michal,
> 
> On Sun, Apr 25, 2021 at 01:07:14PM +0200, Michal Suchánek wrote:
> > On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 20:42:16]:
> > > 
> > > > On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 19:45:05]:
> > > > > 
> > > > > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > > > * Michal Suchánek <msuchanek@suse.de> [2021-04-23 09:35:51]:
> > > > > > > 
> > > > > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > > > 
> > > > > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > > > > > of the Extended CEDE states advertised by the platform
> > > > > > > > > 
> > > > > > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > > > > > Can you be more specific about 'older firmwares'?
> > > > > > > 
> > > > > > > Hi Michal,
> > > > > > > 
> > > > > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > > > > > key idea behind the original patch was to make the H_CEDE latency and
> > > > > > > hence target residency come from firmware instead of being decided by
> > > > > > > the kernel.  The advantage is that different types of systems in
> > > > > > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > > > > > entry criteria which balances good single thread performance and
> > > > > > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > > > > > into the cpuidle.  
> > > > > > 
> > > > > > So all POWER9 machines are affected by the firmware bug where firmware
> > > > > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > > > > causes the kernel to prefer CEDE1 too much when relying on the values
> > > > > > supplied by the firmware. It is not about 'older firmware'.
> > > > > 
> > > > > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > > > > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > > > > 
> > > > > The correct definition of the bug or mismatch in expectation is that
> > > > > firmware reports wakeup latency from a core/thread wakeup timing, but
> > > > > not end-to-end time from sending a wakeup event like an IPI using
> > > > > H_calls and receiving the events on the target.  Practically there are
> > > > > few extra micro-seconds needed after deciding to wakeup a target
> > > > > core/thread to getting the target to start executing instructions
> > > > > within the LPAR instance.
> > > > 
> > > > Thanks for the detailed explanation.
> > > > 
> > > > Maybe just adding a few microseconds to the reported time would be a
> > > > more reasonable workaround than using a blanket fixed value then.
> > > 
> > > Yes, that is an option.  But that may only reduce the difference
> > > between existing kernel and new kernel unless we make it the same
> > > number.  Further we are fixing this in P10 and hence we will have to
> > > add "if(P9) do the compensation" and otherwise take it as is.  That
> > > would not be elegant.  Given that our goal for P9 platform is to not
> > > introduce changes in H_CEDE entry behaviour, we arrived at this
> > > approach (this small patch) and this also makes it easy to backport to
> > > various distro products.
> > 
> > I don't see how this is more elegant.
> > 
> > The current patch is
> > 
> > if(p9)
> > 	use fixed value
> > 
> > the suggested patch is
> > 
> > if(p9)
> > 	apply compensation
> 
> 
> We could do that, however, from the recent measurements the default
> value is closer to the latency value measured using an IPI.
> 
> As Vaidy described earlier, on POWER9 and prior platforms, the wakeup
> latency advertized by the PHYP hypervisor corresponds to the latency
> required to wakeup from the underlying hardware idle state (Nap in
> POWER8 and stop0/1/2 on POWER9) into the hypervisor. That's 2us on
> POWER9.
> 
> We need to apply two kinds of compensation,
> 
> 1. Compensation for the time taken to transition the CPU from the
>    Hypervisor into the LPAR post wakeup from platform idle state
> 
> 2. Compensation for the time taken to send the IPI from the source CPU
>    (waker) to the idle target CPU (wakee).
> 
> 1. can be measured via timer idle test (I am using Pratik's
> cpuidle self-test posted here
> https://lore.kernel.org/lkml/20210412074309.38484-1-psampat@linux.ibm.com/)
> 
> We queue a timer, say for 1ms, and enter the CEDE state. When the
> timer fires, the timer handler computes how much extra time beyond
> the expected 1ms we have consumed. This is what it looks like on a
> POWER9 LPAR:
> 
> CEDE latency measured using a timer (numbers in ns)
> ===================================================================
> N       Min      Median   Avg       90%ile  99%ile    Max    Stddev
> 400     2601     5677     5668.74    5917    6413     9299   455.01
> 
> If we consider the avg and the 99th %ile values, it takes somewhere
> between 3.5 and 4.5 us to transition from the Hypervisor to the guest
> VCPU after the CPU has woken up from the idle state.
> 
> 1. and 2. combined can be determined by an IPI latency test (from the
> same self-test linked above). We send an IPI to an idle CPU and in the
> handler compute the time difference between when the IPI was sent and
> when the handler ran. We see the following numbers on POWER9 LPAR.
> 
> CEDE latency measured using an IPI (numbers in ns)
> ==================================================
> N       Min      Median   Avg       90%ile  99%ile    Max    Stddev
> 400     711      7564     7369.43   8559    9514      9698   1200.01
> 
> Thus considering the avg and the 99th percentile this compensation
> would be 5.4-7.5us.
> 
> If we take the compensation corresponding to the 99th percentile
> latency value measured using the IPI, the compensation will be 7.5us,
> which will take the total CEDE latency to 9.5us.
> 
> This is in the ballpark of the default value of 10us which we obtain
> if we do
> 
> if (!p10)
>    use default hardcoded value;
> 
That's a nice detailed explanation. Maybe you could summarize it in the
commit message so that people looking at the patch in the future can
tell where the value comes from.

Thanks

Michal
Gautham R Shenoy April 28, 2021, 11:34 a.m. UTC | #11
Hello Michal,

On Wed, Apr 28, 2021 at 10:03:26AM +0200, Michal Suchánek wrote:

> > 
> That's a nice detailed explanation. Maybe you could summarize it in the
> commit message so that people looking at the patch in the future can
> tell where the value comes from.

Sure, I will do that and send a v2 with the updated commit message.


> 
> Thanks
> 
> Michal

Patch

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index a2b5c6f..7207467 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -419,7 +419,8 @@  static int pseries_idle_probe(void)
 			cpuidle_state_table = shared_states;
 			max_idle_state = ARRAY_SIZE(shared_states);
 		} else {
-			fixup_cede0_latency();
+			if (pvr_version_is(PVR_POWER10))
+				fixup_cede0_latency();
 			cpuidle_state_table = dedicated_states;
 			max_idle_state = NR_DEDICATED_STATES;
 		}