diff mbox series

[RFC,V2,11/11] x86: tsc: avoid system instability in hibernation

Message ID 20200107234526.GA19034@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com (mailing list archive)
State New, archived
Headers show
Series Enable PM hibernation on guest VMs | expand

Commit Message

Anchal Agarwal Jan. 7, 2020, 11:45 p.m. UTC
From: Eduardo Valentin <eduval@amazon.com>

System instability are seen during resume from hibernation when system
is under heavy CPU load. This is due to the lack of update of sched
clock data, and the scheduler would then think that heavy CPU hog
tasks need more time in CPU, causing the system to freeze
during the unfreezing of tasks. For example, threaded irqs,
and kernel processes servicing network interface may be delayed
for several tens of seconds, causing the system to be unreachable.

Situation like this can be reported by using lockup detectors
such as workqueue lockup detectors:

[root@ip-172-31-67-114 ec2-user]# echo disk > /sys/power/state

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=3 node=0 flags=0x1 nice=0 stuck for 57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:29:06 ...
 kernel:BUG: workqueue lockup - pool cpus=3 node=0 flags=0x1 nice=0 stuck for 403s!

The fix for this situation is to mark the sched clock as unstable
as early as possible in the resume path, leaving it unstable
for the duration of the resume process. This will force the
scheduler to attempt to align the sched clock across CPUs using
the delta with time of day, updating sched clock data. In a post
hibernation event, we can then mark the sched clock as stable
again, avoiding unnecessary syncs with time of day on systems
in which TSC is reliable.

Reviewed-by: Erik Quanstrom <quanstro@amazon.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Reviewed-by: Munehisa Kamata <kamatam@amazon.com>
Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Eduardo Valentin <eduval@amazon.com>
---
 arch/x86/kernel/tsc.c       | 29 +++++++++++++++++++++++++++++
 include/linux/sched/clock.h |  5 +++++
 kernel/sched/clock.c        |  4 ++--
 3 files changed, 36 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra Jan. 8, 2020, 10:50 a.m. UTC | #1
On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> From: Eduardo Valentin <eduval@amazon.com>
> 
> System instability are seen during resume from hibernation when system
> is under heavy CPU load. This is due to the lack of update of sched
> clock data, and the scheduler would then think that heavy CPU hog
> tasks need more time in CPU, causing the system to freeze
> during the unfreezing of tasks. For example, threaded irqs,
> and kernel processes servicing network interface may be delayed
> for several tens of seconds, causing the system to be unreachable.

> The fix for this situation is to mark the sched clock as unstable
> as early as possible in the resume path, leaving it unstable
> for the duration of the resume process. This will force the
> scheduler to attempt to align the sched clock across CPUs using
> the delta with time of day, updating sched clock data. In a post
> hibernation event, we can then mark the sched clock as stable
> again, avoiding unnecessary syncs with time of day on systems
> in which TSC is reliable.

This makes no frigging sense what so bloody ever. If the clock is
stable, we don't care about sched_clock_data. When it is stable you get
a linear function of the TSC without complicated bits on.

When it is unstable, only then do we care about the sched_clock_data.

> Reviewed-by: Erik Quanstrom <quanstro@amazon.com>
> Reviewed-by: Frank van der Linden <fllinden@amazon.com>
> Reviewed-by: Balbir Singh <sblbir@amazon.com>
> Reviewed-by: Munehisa Kamata <kamatam@amazon.com>
> Tested-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> ---

NAK, the code very much relies on never getting marked stable again
after it gets set to unstable.

> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 1152259a4ca0..374d40e5b1a2 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd)
>  	scd->tick_raw = sched_clock();
>  }
>  
> -static void __set_sched_clock_stable(void)
> +void set_sched_clock_stable(void)
>  {
>  	struct sched_clock_data *scd;
>  
> @@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void)
>  	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
>  
>  	if (__sched_clock_stable_early)
> -		__set_sched_clock_stable();
> +		set_sched_clock_stable();
>  
>  	return 0;
>  }
> -- 
> 2.15.3.AMZN
>
Eduardo Valentin Jan. 10, 2020, 3:35 p.m. UTC | #2
Hey Peter,

On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > From: Eduardo Valentin <eduval@amazon.com>
> > 
> > System instability are seen during resume from hibernation when system
> > is under heavy CPU load. This is due to the lack of update of sched
> > clock data, and the scheduler would then think that heavy CPU hog
> > tasks need more time in CPU, causing the system to freeze
> > during the unfreezing of tasks. For example, threaded irqs,
> > and kernel processes servicing network interface may be delayed
> > for several tens of seconds, causing the system to be unreachable.
> 
> > The fix for this situation is to mark the sched clock as unstable
> > as early as possible in the resume path, leaving it unstable
> > for the duration of the resume process. This will force the
> > scheduler to attempt to align the sched clock across CPUs using
> > the delta with time of day, updating sched clock data. In a post
> > hibernation event, we can then mark the sched clock as stable
> > again, avoiding unnecessary syncs with time of day on systems
> > in which TSC is reliable.
> 
> This makes no frigging sense what so bloody ever. If the clock is
> stable, we don't care about sched_clock_data. When it is stable you get
> a linear function of the TSC without complicated bits on.
> 
> When it is unstable, only then do we care about the sched_clock_data.
> 

Yeah, maybe what is not clear here is that we covering for situation
where clock stability changes over time, e.g. at regular boot clock is
stable, hibernation happens, then restore happens in a non-stable clock.

> > Reviewed-by: Erik Quanstrom <quanstro@amazon.com>
> > Reviewed-by: Frank van der Linden <fllinden@amazon.com>
> > Reviewed-by: Balbir Singh <sblbir@amazon.com>
> > Reviewed-by: Munehisa Kamata <kamatam@amazon.com>
> > Tested-by: Anchal Agarwal <anchalag@amazon.com>
> > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > ---
> 
> NAK, the code very much relies on never getting marked stable again
> after it gets set to unstable.
> 

Well actually, at the PM_POST_HIBERNATION, we do the check and set stable if
known to be stable.

The issue only really happens during the restoration path under scheduling pressure,
which takes forever to finish, as described in the commit.

Do you see a better solution for this issue?


> > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> > index 1152259a4ca0..374d40e5b1a2 100644
> > --- a/kernel/sched/clock.c
> > +++ b/kernel/sched/clock.c
> > @@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd)
> >  	scd->tick_raw = sched_clock();
> >  }
> >  
> > -static void __set_sched_clock_stable(void)
> > +void set_sched_clock_stable(void)
> >  {
> >  	struct sched_clock_data *scd;
> >  
> > @@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void)
> >  	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
> >  
> >  	if (__sched_clock_stable_early)
> > -		__set_sched_clock_stable();
> > +		set_sched_clock_stable();
> >  
> >  	return 0;
> >  }
> > -- 
> > 2.15.3.AMZN
> >
Peter Zijlstra Jan. 13, 2020, 10:16 a.m. UTC | #3
On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
> Hey Peter,
> 
> On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > > From: Eduardo Valentin <eduval@amazon.com>
> > > 
> > > System instability are seen during resume from hibernation when system
> > > is under heavy CPU load. This is due to the lack of update of sched
> > > clock data, and the scheduler would then think that heavy CPU hog
> > > tasks need more time in CPU, causing the system to freeze
> > > during the unfreezing of tasks. For example, threaded irqs,
> > > and kernel processes servicing network interface may be delayed
> > > for several tens of seconds, causing the system to be unreachable.
> > 
> > > The fix for this situation is to mark the sched clock as unstable
> > > as early as possible in the resume path, leaving it unstable
> > > for the duration of the resume process. This will force the
> > > scheduler to attempt to align the sched clock across CPUs using
> > > the delta with time of day, updating sched clock data. In a post
> > > hibernation event, we can then mark the sched clock as stable
> > > again, avoiding unnecessary syncs with time of day on systems
> > > in which TSC is reliable.
> > 
> > This makes no frigging sense what so bloody ever. If the clock is
> > stable, we don't care about sched_clock_data. When it is stable you get
> > a linear function of the TSC without complicated bits on.
> > 
> > When it is unstable, only then do we care about the sched_clock_data.
> > 
> 
> Yeah, maybe what is not clear here is that we covering for situation
> where clock stability changes over time, e.g. at regular boot clock is
> stable, hibernation happens, then restore happens in a non-stable clock.

Still confused, who marks the thing unstable? The patch seems to suggest
you do yourself, but it is not at all clear why.

If TSC really is unstable, then it needs to remain unstable. If the TSC
really is stable then there is no point in marking is unstable.

Either way something is off, and you're not telling me what.

> > > Reviewed-by: Erik Quanstrom <quanstro@amazon.com>
> > > Reviewed-by: Frank van der Linden <fllinden@amazon.com>
> > > Reviewed-by: Balbir Singh <sblbir@amazon.com>
> > > Reviewed-by: Munehisa Kamata <kamatam@amazon.com>
> > > Tested-by: Anchal Agarwal <anchalag@amazon.com>
> > > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > > ---
> > 
> > NAK, the code very much relies on never getting marked stable again
> > after it gets set to unstable.
> > 
> 
> Well actually, at the PM_POST_HIBERNATION, we do the check and set stable if
> known to be stable.
> 
> The issue only really happens during the restoration path under scheduling pressure,
> which takes forever to finish, as described in the commit.
> 
> Do you see a better solution for this issue?

I still have no clue what your actual problem is. You say scheduling
goes wobbly because sched_clock_data is stale, but when stable that
doesn't matter.

So what is the actual problem?
Singh, Balbir Jan. 13, 2020, 11:43 a.m. UTC | #4
On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
> > Hey Peter,
> > 
> > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > 
> > > > System instability are seen during resume from hibernation when system
> > > > is under heavy CPU load. This is due to the lack of update of sched
> > > > clock data, and the scheduler would then think that heavy CPU hog
> > > > tasks need more time in CPU, causing the system to freeze
> > > > during the unfreezing of tasks. For example, threaded irqs,
> > > > and kernel processes servicing network interface may be delayed
> > > > for several tens of seconds, causing the system to be unreachable.
> > > > The fix for this situation is to mark the sched clock as unstable
> > > > as early as possible in the resume path, leaving it unstable
> > > > for the duration of the resume process. This will force the
> > > > scheduler to attempt to align the sched clock across CPUs using
> > > > the delta with time of day, updating sched clock data. In a post
> > > > hibernation event, we can then mark the sched clock as stable
> > > > again, avoiding unnecessary syncs with time of day on systems
> > > > in which TSC is reliable.
> > > 
> > > This makes no frigging sense what so bloody ever. If the clock is
> > > stable, we don't care about sched_clock_data. When it is stable you get
> > > a linear function of the TSC without complicated bits on.
> > > 
> > > When it is unstable, only then do we care about the sched_clock_data.
> > > 
> > 
> > Yeah, maybe what is not clear here is that we covering for situation
> > where clock stability changes over time, e.g. at regular boot clock is
> > stable, hibernation happens, then restore happens in a non-stable clock.
> 
> Still confused, who marks the thing unstable? The patch seems to suggest
> you do yourself, but it is not at all clear why.
> 
> If TSC really is unstable, then it needs to remain unstable. If the TSC
> really is stable then there is no point in marking is unstable.
> 
> Either way something is off, and you're not telling me what.
> 

Hi, Peter

For your original comment, just wanted to clarify the following:

1. After hibernation, the machine can be resumed on a different but compatible
host (these are VM images hibernated)
2. This means the clock between host1 and host2 can/will be different

In your comments are you making the assumption that the host(s) is/are the
same? Just checking the assumptions being made and being on the same page with
them.

Balbir Singh.

> > > > Reviewed-by: Erik Quanstrom <quanstro@amazon.com>
> > > > Reviewed-by: Frank van der Linden <fllinden@amazon.com>
> > > > Reviewed-by: Balbir Singh <sblbir@amazon.com>
> > > > Reviewed-by: Munehisa Kamata <kamatam@amazon.com>
> > > > Tested-by: Anchal Agarwal <anchalag@amazon.com>
> > > > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > > > ---
> > > 
> > > NAK, the code very much relies on never getting marked stable again
> > > after it gets set to unstable.
> > > 
> > 
> > Well actually, at the PM_POST_HIBERNATION, we do the check and set stable
> > if
> > known to be stable.
> > 
> > The issue only really happens during the restoration path under scheduling
> > pressure,
> > which takes forever to finish, as described in the commit.
> > 
> > Do you see a better solution for this issue?
> 
> I still have no clue what your actual problem is. You say scheduling
> goes wobbly because sched_clock_data is stale, but when stable that
> doesn't matter.
> 
> So what is the actual problem?
Rafael J. Wysocki Jan. 13, 2020, 11:48 a.m. UTC | #5
On Mon, Jan 13, 2020 at 12:43 PM Singh, Balbir <sblbir@amazon.com> wrote:
>
> On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
> > > Hey Peter,
> > >
> > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > >
> > > > > System instability are seen during resume from hibernation when system
> > > > > is under heavy CPU load. This is due to the lack of update of sched
> > > > > clock data, and the scheduler would then think that heavy CPU hog
> > > > > tasks need more time in CPU, causing the system to freeze
> > > > > during the unfreezing of tasks. For example, threaded irqs,
> > > > > and kernel processes servicing network interface may be delayed
> > > > > for several tens of seconds, causing the system to be unreachable.
> > > > > The fix for this situation is to mark the sched clock as unstable
> > > > > as early as possible in the resume path, leaving it unstable
> > > > > for the duration of the resume process. This will force the
> > > > > scheduler to attempt to align the sched clock across CPUs using
> > > > > the delta with time of day, updating sched clock data. In a post
> > > > > hibernation event, we can then mark the sched clock as stable
> > > > > again, avoiding unnecessary syncs with time of day on systems
> > > > > in which TSC is reliable.
> > > >
> > > > This makes no frigging sense what so bloody ever. If the clock is
> > > > stable, we don't care about sched_clock_data. When it is stable you get
> > > > a linear function of the TSC without complicated bits on.
> > > >
> > > > When it is unstable, only then do we care about the sched_clock_data.
> > > >
> > >
> > > Yeah, maybe what is not clear here is that we covering for situation
> > > where clock stability changes over time, e.g. at regular boot clock is
> > > stable, hibernation happens, then restore happens in a non-stable clock.
> >
> > Still confused, who marks the thing unstable? The patch seems to suggest
> > you do yourself, but it is not at all clear why.
> >
> > If TSC really is unstable, then it needs to remain unstable. If the TSC
> > really is stable then there is no point in marking is unstable.
> >
> > Either way something is off, and you're not telling me what.
> >
>
> Hi, Peter
>
> For your original comment, just wanted to clarify the following:
>
> 1. After hibernation, the machine can be resumed on a different but compatible
> host (these are VM images hibernated)
> 2. This means the clock between host1 and host2 can/will be different

So the problem is specific to this particular use case.

I'm not sure why to impose this hack on hibernation in all cases.
Peter Zijlstra Jan. 13, 2020, 12:42 p.m. UTC | #6
On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote:
> For your original comment, just wanted to clarify the following:
> 
> 1. After hibernation, the machine can be resumed on a different but compatible
> host (these are VM images hibernated)
> 2. This means the clock between host1 and host2 can/will be different
> 
> In your comments are you making the assumption that the host(s) is/are the
> same? Just checking the assumptions being made and being on the same page with
> them.

I would expect this to be the same problem we have as regular suspend,
after power off the TSC will have been reset, so resume will have to
somehow bridge that gap. I've no idea if/how it does that.

I remember some BIOSes had crazy TSC ideas for suspend2ram, and we grew
tsc_restore_sched_clock_state() for it.

Playing crazy games like what you're doing just isn't it though.
Andrew Cooper Jan. 13, 2020, 1:01 p.m. UTC | #7
On 13/01/2020 11:43, Singh, Balbir wrote:
> On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote:
>> On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
>>> Hey Peter,
>>>
>>> On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
>>>> On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
>>>>> From: Eduardo Valentin <eduval@amazon.com>
>>>>>
>>>>> System instability are seen during resume from hibernation when system
>>>>> is under heavy CPU load. This is due to the lack of update of sched
>>>>> clock data, and the scheduler would then think that heavy CPU hog
>>>>> tasks need more time in CPU, causing the system to freeze
>>>>> during the unfreezing of tasks. For example, threaded irqs,
>>>>> and kernel processes servicing network interface may be delayed
>>>>> for several tens of seconds, causing the system to be unreachable.
>>>>> The fix for this situation is to mark the sched clock as unstable
>>>>> as early as possible in the resume path, leaving it unstable
>>>>> for the duration of the resume process. This will force the
>>>>> scheduler to attempt to align the sched clock across CPUs using
>>>>> the delta with time of day, updating sched clock data. In a post
>>>>> hibernation event, we can then mark the sched clock as stable
>>>>> again, avoiding unnecessary syncs with time of day on systems
>>>>> in which TSC is reliable.
>>>> This makes no frigging sense what so bloody ever. If the clock is
>>>> stable, we don't care about sched_clock_data. When it is stable you get
>>>> a linear function of the TSC without complicated bits on.
>>>>
>>>> When it is unstable, only then do we care about the sched_clock_data.
>>>>
>>> Yeah, maybe what is not clear here is that we covering for situation
>>> where clock stability changes over time, e.g. at regular boot clock is
>>> stable, hibernation happens, then restore happens in a non-stable clock.
>> Still confused, who marks the thing unstable? The patch seems to suggest
>> you do yourself, but it is not at all clear why.
>>
>> If TSC really is unstable, then it needs to remain unstable. If the TSC
>> really is stable then there is no point in marking is unstable.
>>
>> Either way something is off, and you're not telling me what.
>>
> Hi, Peter
>
> For your original comment, just wanted to clarify the following:
>
> 1. After hibernation, the machine can be resumed on a different but compatible
> host (these are VM images hibernated)
> 2. This means the clock between host1 and host2 can/will be different

The guests TSC value is part of all save/migrate/resume state.  Given
this bug, I presume you've actually discarded all register state on
hibernate, and the TSC is starting again from 0?

The frequency of the new TSC might very likely be different, but the
scale/offset in the paravirtual clock information should let Linux's
view of time stay consistent.

> In your comments are you making the assumption that the host(s) is/are the
> same? Just checking the assumptions being made and being on the same page with
> them.

TSCs are a massive source of "fun".  I'm not surprised that there are
yet more bugs around.

Does anyone actually know what does/should happen to the real TSC on
native S4?  The default course of action should be for virtualisation to
follow suit.

~Andrew
David Woodhouse Jan. 13, 2020, 1:54 p.m. UTC | #8
On Mon, 2020-01-13 at 13:01 +0000, Andrew Cooper wrote:
> On 13/01/2020 11:43, Singh, Balbir wrote:
> > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
> > > > Hey Peter,
> > > > 
> > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> > > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > > > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > > > 
> > > > > > System instability are seen during resume from hibernation when system
> > > > > > is under heavy CPU load. This is due to the lack of update of sched
> > > > > > clock data, and the scheduler would then think that heavy CPU hog
> > > > > > tasks need more time in CPU, causing the system to freeze
> > > > > > during the unfreezing of tasks. For example, threaded irqs,
> > > > > > and kernel processes servicing network interface may be delayed
> > > > > > for several tens of seconds, causing the system to be unreachable.
> > > > > > The fix for this situation is to mark the sched clock as unstable
> > > > > > as early as possible in the resume path, leaving it unstable
> > > > > > for the duration of the resume process. This will force the
> > > > > > scheduler to attempt to align the sched clock across CPUs using
> > > > > > the delta with time of day, updating sched clock data. In a post
> > > > > > hibernation event, we can then mark the sched clock as stable
> > > > > > again, avoiding unnecessary syncs with time of day on systems
> > > > > > in which TSC is reliable.
> > > > > 
> > > > > This makes no frigging sense what so bloody ever. If the clock is
> > > > > stable, we don't care about sched_clock_data. When it is stable you get
> > > > > a linear function of the TSC without complicated bits on.
> > > > > 
> > > > > When it is unstable, only then do we care about the sched_clock_data.
> > > > > 
> > > > 
> > > > Yeah, maybe what is not clear here is that we covering for situation
> > > > where clock stability changes over time, e.g. at regular boot clock is
> > > > stable, hibernation happens, then restore happens in a non-stable clock.
> > > 
> > > Still confused, who marks the thing unstable? The patch seems to suggest
> > > you do yourself, but it is not at all clear why.
> > > 
> > > If TSC really is unstable, then it needs to remain unstable. If the TSC
> > > really is stable then there is no point in marking is unstable.
> > > 
> > > Either way something is off, and you're not telling me what.
> > > 
> > 
> > Hi, Peter
> > 
> > For your original comment, just wanted to clarify the following:
> > 
> > 1. After hibernation, the machine can be resumed on a different but compatible
> > host (these are VM images hibernated)
> > 2. This means the clock between host1 and host2 can/will be different
> 
> The guests TSC value is part of all save/migrate/resume state.  Given
> this bug, I presume you've actually discarded all register state on
> hibernate, and the TSC is starting again from 0?

Right. This is a guest-driven suspend to disk, followed by starting up
later on a different — but identical — host. There is no guest state
being saved as part of a Xen save/restore.

> The frequency of the new TSC might very likely be different, but the
> scale/offset in the paravirtual clock information should let Linux's
> view of time stay consistent.

The frequency as seen by the guest really needs to be the same. That
hibernated instance may only be booted again on a host which would have
been suitable for live migration. That's either because the TSC
frequency *is* the same, or with TSC scaling to make it appear that
way.

If the environment doesn't provide that then all bets are off and we
shouldn't be trying to hack around it in the guest kernel.

Across the hibernation we do expect a single step change in the TSC
value, just as on real hardware. Like Peter, I assume that the resume
code does cope with that but haven't checked precisely how/where it
does so.
Singh, Balbir Jan. 13, 2020, 3:02 p.m. UTC | #9
On Mon, 2020-01-13 at 13:01 +0000, Andrew Cooper wrote:
> On 13/01/2020 11:43, Singh, Balbir wrote:
> > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote:
> > > > Hey Peter,
> > > > 
> > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote:
> > > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote:
> > > > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > > > 
> > > > > > System instability are seen during resume from hibernation when
> > > > > > system
> > > > > > is under heavy CPU load. This is due to the lack of update of
> > > > > > sched
> > > > > > clock data, and the scheduler would then think that heavy CPU hog
> > > > > > tasks need more time in CPU, causing the system to freeze
> > > > > > during the unfreezing of tasks. For example, threaded irqs,
> > > > > > and kernel processes servicing network interface may be delayed
> > > > > > for several tens of seconds, causing the system to be unreachable.
> > > > > > The fix for this situation is to mark the sched clock as unstable
> > > > > > as early as possible in the resume path, leaving it unstable
> > > > > > for the duration of the resume process. This will force the
> > > > > > scheduler to attempt to align the sched clock across CPUs using
> > > > > > the delta with time of day, updating sched clock data. In a post
> > > > > > hibernation event, we can then mark the sched clock as stable
> > > > > > again, avoiding unnecessary syncs with time of day on systems
> > > > > > in which TSC is reliable.
> > > > > 
> > > > > This makes no frigging sense what so bloody ever. If the clock is
> > > > > stable, we don't care about sched_clock_data. When it is stable you
> > > > > get
> > > > > a linear function of the TSC without complicated bits on.
> > > > > 
> > > > > When it is unstable, only then do we care about the
> > > > > sched_clock_data.
> > > > > 
> > > > 
> > > > Yeah, maybe what is not clear here is that we covering for situation
> > > > where clock stability changes over time, e.g. at regular boot clock is
> > > > stable, hibernation happens, then restore happens in a non-stable
> > > > clock.
> > > 
> > > Still confused, who marks the thing unstable? The patch seems to suggest
> > > you do yourself, but it is not at all clear why.
> > > 
> > > If TSC really is unstable, then it needs to remain unstable. If the TSC
> > > really is stable then there is no point in marking is unstable.
> > > 
> > > Either way something is off, and you're not telling me what.
> > > 
> > 
> > Hi, Peter
> > 
> > For your original comment, just wanted to clarify the following:
> > 
> > 1. After hibernation, the machine can be resumed on a different but
> > compatible
> > host (these are VM images hibernated)
> > 2. This means the clock between host1 and host2 can/will be different
> 
> The guests TSC value is part of all save/migrate/resume state.  Given
> this bug, I presume you've actually discarded all register state on
> hibernate, and the TSC is starting again from 0?
> 
> The frequency of the new TSC might very likely be different, but the
> scale/offset in the paravirtual clock information should let Linux's
> view of time stay consistent.
> 

I am looking at my old dmesg logs, which I seem to have lost to revalidate,
but I think Eduardo had a different point. I should point out that I was
adding to the list of potentially missed assumptions


> > In your comments are you making the assumption that the host(s) is/are the
> > same? Just checking the assumptions being made and being on the same page
> > with
> > them.
> 
> TSCs are a massive source of "fun".  I'm not surprised that there are
> yet more bugs around.
> 
> Does anyone actually know what does/should happen to the real TSC on
> native S4?  The default course of action should be for virtualisation to
> follow suit.
> 
> ~Andrew

Balbir
Rafael J. Wysocki Jan. 13, 2020, 9:50 p.m. UTC | #10
On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote:
> > For your original comment, just wanted to clarify the following:
> >
> > 1. After hibernation, the machine can be resumed on a different but compatible
> > host (these are VM images hibernated)
> > 2. This means the clock between host1 and host2 can/will be different
> >
> > In your comments are you making the assumption that the host(s) is/are the
> > same? Just checking the assumptions being made and being on the same page with
> > them.
>
> I would expect this to be the same problem we have as regular suspend,
> after power off the TSC will have been reset, so resume will have to
> somehow bridge that gap. I've no idea if/how it does that.

In general, this is done by timekeeping_resume() and the only special
thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
call in tsc_resume().

> I remember some BIOSes had crazy TSC ideas for suspend2ram, and we grew
> tsc_restore_sched_clock_state() for it.
>
> Playing crazy games like what you're doing just isn't it though.

Right.
Rafael J. Wysocki Jan. 13, 2020, 11:30 p.m. UTC | #11
On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote:
> > > For your original comment, just wanted to clarify the following:
> > >
> > > 1. After hibernation, the machine can be resumed on a different but compatible
> > > host (these are VM images hibernated)
> > > 2. This means the clock between host1 and host2 can/will be different
> > >
> > > In your comments are you making the assumption that the host(s) is/are the
> > > same? Just checking the assumptions being made and being on the same page with
> > > them.
> >
> > I would expect this to be the same problem we have as regular suspend,
> > after power off the TSC will have been reset, so resume will have to
> > somehow bridge that gap. I've no idea if/how it does that.
>
> In general, this is done by timekeeping_resume() and the only special
> thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
> call in tsc_resume().

And I forgot about tsc_restore_sched_clock_state() that gets called
via restore_processor_state() on x86, before calling
timekeeping_resume().
Anchal Agarwal Jan. 14, 2020, 7:29 p.m. UTC | #12
On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote:
> On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote:
> > > > For your original comment, just wanted to clarify the following:
> > > >
> > > > 1. After hibernation, the machine can be resumed on a different but compatible
> > > > host (these are VM images hibernated)
> > > > 2. This means the clock between host1 and host2 can/will be different
> > > >
> > > > In your comments are you making the assumption that the host(s) is/are the
> > > > same? Just checking the assumptions being made and being on the same page with
> > > > them.
> > >
> > > I would expect this to be the same problem we have as regular suspend,
> > > after power off the TSC will have been reset, so resume will have to
> > > somehow bridge that gap. I've no idea if/how it does that.
> >
> > In general, this is done by timekeeping_resume() and the only special
> > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
> > call in tsc_resume().
> 
> And I forgot about tsc_restore_sched_clock_state() that gets called
> via restore_processor_state() on x86, before calling
> timekeeping_resume().
>
In this case tsc_verify_tsc_adjust(true) this does nothing as
feature bit X86_FEATURE_TSC_ADJUST is not available to guest. 
I am no expert in this area, but could this be messing things up?

Thanks,
Anchal
Anchal Agarwal Jan. 22, 2020, 8:07 p.m. UTC | #13
On Tue, Jan 14, 2020 at 07:29:52PM +0000, Anchal Agarwal wrote:
> On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote:
> > > > > For your original comment, just wanted to clarify the following:
> > > > >
> > > > > 1. After hibernation, the machine can be resumed on a different but compatible
> > > > > host (these are VM images hibernated)
> > > > > 2. This means the clock between host1 and host2 can/will be different
> > > > >
> > > > > In your comments are you making the assumption that the host(s) is/are the
> > > > > same? Just checking the assumptions being made and being on the same page with
> > > > > them.
> > > >
> > > > I would expect this to be the same problem we have as regular suspend,
> > > > after power off the TSC will have been reset, so resume will have to
> > > > somehow bridge that gap. I've no idea if/how it does that.
> > >
> > > In general, this is done by timekeeping_resume() and the only special
> > > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
> > > call in tsc_resume().
> > 
> > And I forgot about tsc_restore_sched_clock_state() that gets called
> > via restore_processor_state() on x86, before calling
> > timekeeping_resume().
> >
> In this case tsc_verify_tsc_adjust(true) this does nothing as
> feature bit X86_FEATURE_TSC_ADJUST is not available to guest. 
> I am no expert in this area, but could this be messing things up?
> 
> Thanks,
> Anchal
Gentle nudge on this. I will add more data here in case that helps.

1. Before this patch, tsc is stable but hibernation does not work
100% of the time. I agree if tsc is stable it should not be marked
unstable however, in this case if I run a cpu intensive workload
in the background and trigger reboot-hibernation loop I see a 
workqueue lockup. 

2. The lockup does not hose the system completely,
the reboot-hibernation carries out and system recovers. 
However, as mentioned in the commit message system does 
become unreachable for couple of seconds.

3. Xen suspend/resume seems to save/restore time_memory area in its
xen_arch_pre_suspend and xen_arch_post_suspend. The xen clock value
is saved. xen_sched_clock_offset is set at resume time to ensure a
monotonic clock value

4. Also, the instances do not have InvariantTSC exposed. Feature bit
X86_FEATURE_TSC_ADJUST is not available to guest and xen clocksource
is used by guests.

I am not sure if something needs to be fixed on hibernate path itself
or its very much ties to time handling on xen guest hibernation

Here is a part of log from last hibernation exit to next hibernation
entry. The loop was running for a while so boot to lockup log will be
huge. I am specifically including the timestamps.

...
01h 57m 15.627s(  16ms): [    5.822701] OOM killer enabled.
01h 57m 15.627s(   0ms): [    5.824981] Restarting tasks ... done.
01h 57m 15.627s(   0ms): [    5.836397] PM: hibernation exit
01h 57m 17.636s(2009ms): [    7.844471] PM: hibernation entry
01h 57m 52.725s(35089ms): [   42.934542] BUG: workqueue lockup - pool cpus=0
node=0 flags=0x0 nice=0 stuck for 37s!
01h 57m 52.730s(   5ms): [   42.941468] Showing busy workqueues and worker
pools:
01h 57m 52.734s(   4ms): [   42.945088] workqueue events: flags=0x0
01h 57m 52.737s(   3ms): [   42.948385]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=2/256
01h 57m 52.742s(   5ms): [   42.952838]     pending: vmstat_shepherd,
check_corruption
01h 57m 52.746s(   4ms): [   42.956927] workqueue events_power_efficient:
flags=0x80
01h 57m 52.749s(   3ms): [   42.960731]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=4/256
01h 57m 52.754s(   5ms): [   42.964835]     pending: neigh_periodic_work,
do_cache_clean [sunrpc], neigh_periodic_work, check_lifetime
01h 57m 52.781s(  27ms): [   42.971419] workqueue mm_percpu_wq: flags=0x8
01h 57m 52.781s(   0ms): [   42.974628]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/256
01h 57m 52.781s(   0ms): [   42.978901]     pending: vmstat_update
01h 57m 52.781s(   0ms): [   42.981822] workqueue ipv6_addrconf: flags=0x40008
01h 57m 52.781s(   0ms): [   42.985524]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/1
01h 57m 52.781s(   0ms): [   42.989670]     pending: addrconf_verify_work [ipv6]
01h 57m 52.782s(   1ms): [   42.993282] workqueue xfs-conv/xvda1: flags=0xc
01h 57m 52.786s(   4ms): [   42.996708]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=3/256
01h 57m 52.790s(   4ms): [   43.000954]     pending: xfs_end_io [xfs],
xfs_end_io [xfs], xfs_end_io [xfs]
01h 57m 52.795s(   5ms): [   43.005610] workqueue xfs-reclaim/xvda1: flags=0xc
01h 57m 52.798s(   3ms): [   43.008945]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/256
01h 57m 52.802s(   4ms): [   43.012675]     pending: xfs_reclaim_worker [xfs]
01h 57m 52.805s(   3ms): [   43.015741] workqueue xfs-sync/xvda1: flags=0x4
01h 57m 52.808s(   3ms): [   43.018723]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/256
01h 57m 52.811s(   3ms): [   43.022436]     pending: xfs_log_worker [xfs]
01h 57m 52.814s(   3ms): [   43.043519] Filesystems sync: 35.234 seconds
01h 57m 52.837s(  23ms): [   43.048133] Freezing user space processes ...
(elapsed 0.001 seconds) done.
01h 57m 52.844s(   7ms): [   43.055996] OOM killer disabled.
01h 57m 53.838s( 994ms): [   43.061512] PM: Preallocating image memory... done
(allocated 385859 pages)
01h 57m 53.843s(   5ms): [   44.054720] PM: Allocated 1543436 kbytes in 1.06
seconds (1456.07 MB/s)
01h 57m 53.861s(  18ms): [   44.060885] Freezing remaining freezable tasks ...
(elapsed 0.001 seconds) done.
01h 57m 53.861s(   0ms): [   44.069715] printk: Suspending console(s) (use
no_console_suspend to debug)
01h 57m 56.278s(2417ms): [   44.116601] Disabling non-boot CPUs ...
.....
hibernate-resume loop continues after this. As mentioned above, I loose
connectivity for a while.


Thanks,
Anchal
Boris Ostrovsky Jan. 23, 2020, 4:27 p.m. UTC | #14
On 1/22/20 3:07 PM, Anchal Agarwal wrote:
>> In this case tsc_verify_tsc_adjust(true) this does nothing as
>> feature bit X86_FEATURE_TSC_ADJUST is not available to guest. 


Is it not available to your specific guest? Because AFAICT it is
available in general (to HVM/PVH guests).


> 4. Also, the instances do not have InvariantTSC exposed.

If you specify "nomigrate=true" in your configuration file you should
have that feature enabled.


-boris
diff mbox series

Patch

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7e322e2daaf5..ae77b8bc4e46 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -14,6 +14,7 @@ 
 #include <linux/percpu.h>
 #include <linux/timex.h>
 #include <linux/static_key.h>
+#include <linux/suspend.h>
 
 #include <asm/hpet.h>
 #include <asm/timer.h>
@@ -1534,3 +1535,31 @@  unsigned long calibrate_delay_is_known(void)
 	return 0;
 }
 #endif
+
+static int tsc_pm_notifier(struct notifier_block *notifier,
+			    unsigned long pm_event, void *unused)
+{
+	switch (pm_event) {
+	case PM_HIBERNATION_PREPARE:
+		clear_sched_clock_stable();
+		break;
+	case PM_POST_HIBERNATION:
+		/* Set back to the default */
+		if (!check_tsc_unstable())
+			set_sched_clock_stable();
+		break;
+	}
+
+	return 0;
+};
+
+static struct notifier_block tsc_pm_notifier_block = {
+       .notifier_call = tsc_pm_notifier,
+};
+
+static int tsc_setup_pm_notifier(void)
+{
+       return register_pm_notifier(&tsc_pm_notifier_block);
+}
+
+subsys_initcall(tsc_setup_pm_notifier);
diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index 867d588314e0..902654ac5f7e 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -32,6 +32,10 @@  static inline void clear_sched_clock_stable(void)
 {
 }
 
+static inline void set_sched_clock_stable(void)
+{
+}
+
 static inline void sched_clock_idle_sleep_event(void)
 {
 }
@@ -51,6 +55,7 @@  static inline u64 local_clock(void)
 }
 #else
 extern int sched_clock_stable(void);
+extern void set_sched_clock_stable(void);
 extern void clear_sched_clock_stable(void);
 
 /*
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 1152259a4ca0..374d40e5b1a2 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -116,7 +116,7 @@  static void __scd_stamp(struct sched_clock_data *scd)
 	scd->tick_raw = sched_clock();
 }
 
-static void __set_sched_clock_stable(void)
+void set_sched_clock_stable(void)
 {
 	struct sched_clock_data *scd;
 
@@ -236,7 +236,7 @@  static int __init sched_clock_init_late(void)
 	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
 
 	if (__sched_clock_stable_early)
-		__set_sched_clock_stable();
+		set_sched_clock_stable();
 
 	return 0;
 }