diff mbox series

RFC: disable HPET legacy mode after timer check

Message ID cb408368-077d-edb5-b4ad-f80086db48c1@invisiblethingslab.com (mailing list archive)
State New, archived
Headers show
Series RFC: disable HPET legacy mode after timer check | expand

Commit Message

Simon Gaiser April 11, 2023, 10:30 a.m. UTC
Hi,

I have been recently looking into getting S0ix working on Xen [1].

Thanks to a tip from Andrew I found that the HPET legacy mode was
preventing my test system from reaching a package C-state lower than PC7
and thereby also preventing S0ix residency.

For testing I simply modified check_timer() to disable it again after it
checked the timer irq:


With this [2] I'm able to reach S0ix residency for some time and for short
periods the systems power consumption goes down to the same level as with
native Linux!

It reaches low power states only for a fraction of the suspend to idle
time, so something still makes the CPU/chipset think it should leave the
low power mode, but that's another topic.

I tried to understand how all the timer code interacts with disabling
the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT
is not available (Which is available on my test system and indeed I
didn't run into obvious breakage). 

Is this (disabled PIT && !ARAT) a configuration that exists (and needs
to be supported)?

Did I miss something else? (Very much possible, given that this is way
above my existing experience with X86 and Xen internals.)

Simon

[1]: https://lore.kernel.org/xen-devel/9051e484-b128-715a-9253-48af8e47bb9d@invisiblethingslab.com/
[2]: Plus [3] and some hack to have mwait-idle on Tiger Lake.
[3]: https://lore.kernel.org/xen-devel/20230313134102.3157-1-simon@invisiblethingslab.com/

Comments

Andrew Cooper April 11, 2023, 11:20 a.m. UTC | #1
On 11/04/2023 11:30 am, Simon Gaiser wrote:
> Hi,
>
> I have been recently looking into getting S0ix working on Xen [1].
>
> Thanks to a tip from Andrew I found that the HPET legacy mode was
> preventing my test system from reaching a package C-state lower than PC7
> and thereby also preventing S0ix residency.
>
> For testing I simply modified check_timer() to disable it again after it
> checked the timer irq:
>
> --- a/xen/arch/x86/io_apic.c
> +++ b/xen/arch/x86/io_apic.c
> @@ -1966,6 +1969,8 @@ static void __init check_timer(void)
>  
>              if ( timer_irq_works() )
>              {
> +                hpet_disable_legacy_replacement_mode();
>                  local_irq_restore(flags);
>                  return;
>              }
>
>
> With this [2] I'm able to reach S0ix residency for some time and for short
> periods the systems power consumption goes down to the same level as with
> native Linux!

Excellent progress!

> It reaches low power states only for a fraction of the suspend to idle
> time, so something still makes the CPU/chipset think it should leave the
> low power mode, but that's another topic.

Do you have any further info here?  There are a range of possibilities,
from excess timers in Xen (e.g. PV guests default to a 100Hz timer even
though no guests actually want it AFAICT), or the 1s TSC rendezvous
(which isn't actually needed on modern systems), all the way to the
platform devices not entering d3hot.

>
> I tried to understand how all the timer code interacts with disabling
> the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT
> is not available (Which is available on my test system and indeed I
> didn't run into obvious breakage). 
>
> Is this (disabled PIT && !ARAT) a configuration that exists (and needs
> to be supported)?
>
> Did I miss something else? (Very much possible, given that this is way
> above my existing experience with X86 and Xen internals.)

Xen's code is a mess and needs an overhaul.

Right now, we're using the timer as "a source of interrupts" to try and
check that we've got things set up suitably.  But this doesn't need to
be the PIT, or a timer at all - it just needs to be "an interrupt coming
in from the platform".

Furthermore, there will soon be client platforms with no PIT/HPET/etc at
all.  (HPET is known broken in <PC7, and not used these days anyway in
order to get deeper sleep working), so this logic is going to explode on
us yet again.

AFAICT moving forwards, the expectation is to use TSCs as the
clocksource, and the deadline timer for interrupts.

~Andrew
Roger Pau Monné April 12, 2023, 11:25 a.m. UTC | #2
On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote:
> On 11/04/2023 11:30 am, Simon Gaiser wrote:
> > Hi,
> >
> > I have been recently looking into getting S0ix working on Xen [1].
> >
> > Thanks to a tip from Andrew I found that the HPET legacy mode was
> > preventing my test system from reaching a package C-state lower than PC7
> > and thereby also preventing S0ix residency.
> >
> > For testing I simply modified check_timer() to disable it again after it
> > checked the timer irq:
> >
> > --- a/xen/arch/x86/io_apic.c
> > +++ b/xen/arch/x86/io_apic.c
> > @@ -1966,6 +1969,8 @@ static void __init check_timer(void)
> >  
> >              if ( timer_irq_works() )
> >              {
> > +                hpet_disable_legacy_replacement_mode();
> >                  local_irq_restore(flags);
> >                  return;
> >              }
> >
> >
> > With this [2] I'm able to reach S0ix residency for some time and for short
> > periods the systems power consumption goes down to the same level as with
> > native Linux!
> 
> Excellent progress!
> 
> > It reaches low power states only for a fraction of the suspend to idle
> > time, so something still makes the CPU/chipset think it should leave the
> > low power mode, but that's another topic.
> 
> Do you have any further info here?  There are a range of possibilities,
> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even
> though no guests actually want it AFAICT), or the 1s TSC rendezvous
> (which isn't actually needed on modern systems), all the way to the
> platform devices not entering d3hot.
> 
> >
> > I tried to understand how all the timer code interacts with disabling
> > the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT
> > is not available (Which is available on my test system and indeed I
> > didn't run into obvious breakage). 
> >
> > Is this (disabled PIT && !ARAT) a configuration that exists (and needs
> > to be supported)?
> >
> > Did I miss something else? (Very much possible, given that this is way
> > above my existing experience with X86 and Xen internals.)
> 
> Xen's code is a mess and needs an overhaul.
> 
> Right now, we're using the timer as "a source of interrupts" to try and
> check that we've got things set up suitably.  But this doesn't need to
> be the PIT, or a timer at all - it just needs to be "an interrupt coming
> in from the platform".

I would even question whether that testing is useful overall.  We test
a single IO-APIC pin, which still leaves room for the rest of them to
not be properly configured, and Xen might not be using the PIT timer at
the end.

IOW: I think it's fine to test that the timer is working, but forcing
that to be routed through the IO-APIC is wrong.  HPET for example can
support FSB, which skips the IO-APIC and injects the vector directly
into the local APIC.

We do support the APIC deadline timer, so I'm confused as to why we
also need to use the PIT or HPET.  I don't think I fully understand
the relation between the usage of PIT/HPET/APIC deadline timers on
Xen.

Roger.
Jan Beulich April 17, 2023, 8:49 a.m. UTC | #3
On 12.04.2023 13:25, Roger Pau Monné wrote:
> On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote:
>> On 11/04/2023 11:30 am, Simon Gaiser wrote:
>>> Hi,
>>>
>>> I have been recently looking into getting S0ix working on Xen [1].
>>>
>>> Thanks to a tip from Andrew I found that the HPET legacy mode was
>>> preventing my test system from reaching a package C-state lower than PC7
>>> and thereby also preventing S0ix residency.
>>>
>>> For testing I simply modified check_timer() to disable it again after it
>>> checked the timer irq:
>>>
>>> --- a/xen/arch/x86/io_apic.c
>>> +++ b/xen/arch/x86/io_apic.c
>>> @@ -1966,6 +1969,8 @@ static void __init check_timer(void)
>>>  
>>>              if ( timer_irq_works() )
>>>              {
>>> +                hpet_disable_legacy_replacement_mode();
>>>                  local_irq_restore(flags);
>>>                  return;
>>>              }
>>>
>>>
>>> With this [2] I'm able to reach S0ix residency for some time and for short
>>> periods the systems power consumption goes down to the same level as with
>>> native Linux!
>>
>> Excellent progress!
>>
>>> It reaches low power states only for a fraction of the suspend to idle
>>> time, so something still makes the CPU/chipset think it should leave the
>>> low power mode, but that's another topic.
>>
>> Do you have any further info here?  There are a range of possibilities,
>> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even
>> though no guests actually want it AFAICT), or the 1s TSC rendezvous
>> (which isn't actually needed on modern systems), all the way to the
>> platform devices not entering d3hot.
>>
>>>
>>> I tried to understand how all the timer code interacts with disabling
>>> the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT
>>> is not available (Which is available on my test system and indeed I
>>> didn't run into obvious breakage). 
>>>
>>> Is this (disabled PIT && !ARAT) a configuration that exists (and needs
>>> to be supported)?
>>>
>>> Did I miss something else? (Very much possible, given that this is way
>>> above my existing experience with X86 and Xen internals.)
>>
>> Xen's code is a mess and needs an overhaul.
>>
>> Right now, we're using the timer as "a source of interrupts" to try and
>> check that we've got things set up suitably.  But this doesn't need to
>> be the PIT, or a timer at all - it just needs to be "an interrupt coming
>> in from the platform".
> 
> I would even question whether that testing is useful overall.  We test
> a single IO-APIC pin, which still leaves room for the rest of them to
> not be properly configured, and Xen might not be using the PIT timer at
> the end.

Testing one pin is sufficient for the intended purpose (proving that
the delivery route platform -> IO-APIC -> LAPIC works), leaving aside
firmware possibly configuring multiple IO-APICs inconsistently. Yet
if there are multiple IO-APICs, I'm afraid we have no way of knowing
how to trigger any of the pins of secondary ones. Even if we went to
figure out what devices are connected to it, we'd then still have no
(rudimentary) device drivers knowing how to interact with the devices.

Jan
Roger Pau Monné April 17, 2023, 10:26 a.m. UTC | #4
On Mon, Apr 17, 2023 at 10:49:38AM +0200, Jan Beulich wrote:
> On 12.04.2023 13:25, Roger Pau Monné wrote:
> > On Tue, Apr 11, 2023 at 12:20:13PM +0100, Andrew Cooper wrote:
> >> On 11/04/2023 11:30 am, Simon Gaiser wrote:
> >>> Hi,
> >>>
> >>> I have been recently looking into getting S0ix working on Xen [1].
> >>>
> >>> Thanks to a tip from Andrew I found that the HPET legacy mode was
> >>> preventing my test system from reaching a package C-state lower than PC7
> >>> and thereby also preventing S0ix residency.
> >>>
> >>> For testing I simply modified check_timer() to disable it again after it
> >>> checked the timer irq:
> >>>
> >>> --- a/xen/arch/x86/io_apic.c
> >>> +++ b/xen/arch/x86/io_apic.c
> >>> @@ -1966,6 +1969,8 @@ static void __init check_timer(void)
> >>>  
> >>>              if ( timer_irq_works() )
> >>>              {
> >>> +                hpet_disable_legacy_replacement_mode();
> >>>                  local_irq_restore(flags);
> >>>                  return;
> >>>              }
> >>>
> >>>
> >>> With this [2] I'm able to reach S0ix residency for some time and for short
> >>> periods the systems power consumption goes down to the same level as with
> >>> native Linux!
> >>
> >> Excellent progress!
> >>
> >>> It reaches low power states only for a fraction of the suspend to idle
> >>> time, so something still makes the CPU/chipset think it should leave the
> >>> low power mode, but that's another topic.
> >>
> >> Do you have any further info here?  There are a range of possibilities,
> >> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even
> >> though no guests actually want it AFAICT), or the 1s TSC rendezvous
> >> (which isn't actually needed on modern systems), all the way to the
> >> platform devices not entering d3hot.
> >>
> >>>
> >>> I tried to understand how all the timer code interacts with disabling
> >>> the legacy mode. I think it only would break cpuidle if X86_FEATURE_ARAT
> >>> is not available (Which is available on my test system and indeed I
> >>> didn't run into obvious breakage). 
> >>>
> >>> Is this (disabled PIT && !ARAT) a configuration that exists (and needs
> >>> to be supported)?
> >>>
> >>> Did I miss something else? (Very much possible, given that this is way
> >>> above my existing experience with X86 and Xen internals.)
> >>
> >> Xen's code is a mess and needs an overhaul.
> >>
> >> Right now, we're using the timer as "a source of interrupts" to try and
> >> check that we've got things set up suitably.  But this doesn't need to
> >> be the PIT, or a timer at all - it just needs to be "an interrupt coming
> >> in from the platform".
> > 
> > I would even question whether that testing is useful overall.  We test
> > a single IO-APIC pin, which still leaves room for the rest of them to
> > not be properly configured, and Xen might not be using the PIT timer at
> > the end.
> 
> Testing one pin is sufficient for the intended purpose (proving that
> the delivery route platform -> IO-APIC -> LAPIC works), leaving aside
> firmware possibly configuring multiple IO-APICs inconsistently. Yet
> if there are multiple IO-APICs, I'm afraid we have no way of knowing
> how to trigger any of the pins of secondary ones. Even if we went to
> figure out what devices are connected to it, we'd then still have no
> (rudimentary) device drivers knowing how to interact with the devices.

That's why I think the test is not very useful.  Also the delivery
route not being properly configured will be quite obvious when dom0
attempts to to use any device, as it would get timeouts.

I think it's fine to test that the timer interrupt in use by Xen is
working, but forcing this kind of interrupt delivery test doesn't seem
specially useful to me, the more that we keep accumulation workarounds
to make it work on newer platforms.

Thanks, Roger.
Simon Gaiser April 17, 2023, 12:10 p.m. UTC | #5
Andrew Cooper:
[...]
>> It reaches low power states only for a fraction of the suspend to idle
>> time, so something still makes the CPU/chipset think it should leave the
>> low power mode, but that's another topic.
> 
> Do you have any further info here?  There are a range of possibilities,
> from excess timers in Xen (e.g. PV guests default to a 100Hz timer even
> though no guests actually want it AFAICT), or the 1s TSC rendezvous
> (which isn't actually needed on modern systems), all the way to the
> platform devices not entering d3hot.

So in the meantime I got some progress here.

What helps a lot is setting cpufreq to powersave before going to s2idle.
With that I get residency of about 88 % (everything is still tested with
only dom0 running). Not yet the > 99 % that a native Linux manages, but
much better than before (<< 50 %).

While, based on your and Marek's feedback, I was already looking at
active timers, I first ignored the cpufreq dbs timer since the idle
driver suspend it and I assumed it was active because I wake things up
when triggering the debug key. But turns out disabling the ondemand
governor has a big effect. But not sure if it's the timer itself or some
other part of it.

I tried to disable the time calibration timer. While eyeballing on the
power meter I first thought it brings some improvement there's no
difference according to the residency counters (will need to improve my
power measurement setup).

Other timers I see active:

common/sched/core.c#vcpu_singleshot_timer_fn:
If I understand correctly those are configure by the domain (so dom0
here). So Linux should do this right. But I will have to have a closer
look.

arch/x86/cpu/mcheck/non-fatal.c#mce_work_fn:
Triggers only seldom (something >> 1 s), so unlikely. But will try
disabling.

arch/x86/time.c#plt_overflow:
Dito.

Simon
diff mbox series

Patch

--- a/xen/arch/x86/io_apic.c
+++ b/xen/arch/x86/io_apic.c
@@ -1966,6 +1969,8 @@  static void __init check_timer(void)
 
             if ( timer_irq_works() )
             {
+                hpet_disable_legacy_replacement_mode();
                 local_irq_restore(flags);
                 return;
             }