Message ID | 20200107234526.GA19034@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com (mailing list archive) |
---|---|
State | Rejected, archived |
Headers | show |
Series | Enable PM hibernation on guest VMs | expand |
On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > From: Eduardo Valentin <eduval@amazon.com> > > System instability are seen during resume from hibernation when system > is under heavy CPU load. This is due to the lack of update of sched > clock data, and the scheduler would then think that heavy CPU hog > tasks need more time in CPU, causing the system to freeze > during the unfreezing of tasks. For example, threaded irqs, > and kernel processes servicing network interface may be delayed > for several tens of seconds, causing the system to be unreachable. > The fix for this situation is to mark the sched clock as unstable > as early as possible in the resume path, leaving it unstable > for the duration of the resume process. This will force the > scheduler to attempt to align the sched clock across CPUs using > the delta with time of day, updating sched clock data. In a post > hibernation event, we can then mark the sched clock as stable > again, avoiding unnecessary syncs with time of day on systems > in which TSC is reliable. This makes no frigging sense what so bloody ever. If the clock is stable, we don't care about sched_clock_data. When it is stable you get a linear function of the TSC without complicated bits on. When it is unstable, only then do we care about the sched_clock_data. > Reviewed-by: Erik Quanstrom <quanstro@amazon.com> > Reviewed-by: Frank van der Linden <fllinden@amazon.com> > Reviewed-by: Balbir Singh <sblbir@amazon.com> > Reviewed-by: Munehisa Kamata <kamatam@amazon.com> > Tested-by: Anchal Agarwal <anchalag@amazon.com> > Signed-off-by: Eduardo Valentin <eduval@amazon.com> > --- NAK, the code very much relies on never getting marked stable again after it gets set to unstable. > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c > index 1152259a4ca0..374d40e5b1a2 100644 > --- a/kernel/sched/clock.c > +++ b/kernel/sched/clock.c > @@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd) > scd->tick_raw = sched_clock(); > } > > -static void __set_sched_clock_stable(void) > +void set_sched_clock_stable(void) > { > struct sched_clock_data *scd; > > @@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void) > smp_mb(); /* matches {set,clear}_sched_clock_stable() */ > > if (__sched_clock_stable_early) > - __set_sched_clock_stable(); > + set_sched_clock_stable(); > > return 0; > } > -- > 2.15.3.AMZN >
Hey Peter, On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > From: Eduardo Valentin <eduval@amazon.com> > > > > System instability are seen during resume from hibernation when system > > is under heavy CPU load. This is due to the lack of update of sched > > clock data, and the scheduler would then think that heavy CPU hog > > tasks need more time in CPU, causing the system to freeze > > during the unfreezing of tasks. For example, threaded irqs, > > and kernel processes servicing network interface may be delayed > > for several tens of seconds, causing the system to be unreachable. > > > The fix for this situation is to mark the sched clock as unstable > > as early as possible in the resume path, leaving it unstable > > for the duration of the resume process. This will force the > > scheduler to attempt to align the sched clock across CPUs using > > the delta with time of day, updating sched clock data. In a post > > hibernation event, we can then mark the sched clock as stable > > again, avoiding unnecessary syncs with time of day on systems > > in which TSC is reliable. > > This makes no frigging sense what so bloody ever. If the clock is > stable, we don't care about sched_clock_data. When it is stable you get > a linear function of the TSC without complicated bits on. > > When it is unstable, only then do we care about the sched_clock_data. > Yeah, maybe what is not clear here is that we covering for situation where clock stability changes over time, e.g. at regular boot clock is stable, hibernation happens, then restore happens in a non-stable clock. > > Reviewed-by: Erik Quanstrom <quanstro@amazon.com> > > Reviewed-by: Frank van der Linden <fllinden@amazon.com> > > Reviewed-by: Balbir Singh <sblbir@amazon.com> > > Reviewed-by: Munehisa Kamata <kamatam@amazon.com> > > Tested-by: Anchal Agarwal <anchalag@amazon.com> > > Signed-off-by: Eduardo Valentin <eduval@amazon.com> > > --- > > NAK, the code very much relies on never getting marked stable again > after it gets set to unstable. > Well actually, at the PM_POST_HIBERNATION, we do the check and set stable if known to be stable. The issue only really happens during the restoration path under scheduling pressure, which takes forever to finish, as described in the commit. Do you see a better solution for this issue? > > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c > > index 1152259a4ca0..374d40e5b1a2 100644 > > --- a/kernel/sched/clock.c > > +++ b/kernel/sched/clock.c > > @@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd) > > scd->tick_raw = sched_clock(); > > } > > > > -static void __set_sched_clock_stable(void) > > +void set_sched_clock_stable(void) > > { > > struct sched_clock_data *scd; > > > > @@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void) > > smp_mb(); /* matches {set,clear}_sched_clock_stable() */ > > > > if (__sched_clock_stable_early) > > - __set_sched_clock_stable(); > > + set_sched_clock_stable(); > > > > return 0; > > } > > -- > > 2.15.3.AMZN > >
On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: > Hey Peter, > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > System instability are seen during resume from hibernation when system > > > is under heavy CPU load. This is due to the lack of update of sched > > > clock data, and the scheduler would then think that heavy CPU hog > > > tasks need more time in CPU, causing the system to freeze > > > during the unfreezing of tasks. For example, threaded irqs, > > > and kernel processes servicing network interface may be delayed > > > for several tens of seconds, causing the system to be unreachable. > > > > > The fix for this situation is to mark the sched clock as unstable > > > as early as possible in the resume path, leaving it unstable > > > for the duration of the resume process. This will force the > > > scheduler to attempt to align the sched clock across CPUs using > > > the delta with time of day, updating sched clock data. In a post > > > hibernation event, we can then mark the sched clock as stable > > > again, avoiding unnecessary syncs with time of day on systems > > > in which TSC is reliable. > > > > This makes no frigging sense what so bloody ever. If the clock is > > stable, we don't care about sched_clock_data. When it is stable you get > > a linear function of the TSC without complicated bits on. > > > > When it is unstable, only then do we care about the sched_clock_data. > > > > Yeah, maybe what is not clear here is that we covering for situation > where clock stability changes over time, e.g. at regular boot clock is > stable, hibernation happens, then restore happens in a non-stable clock. Still confused, who marks the thing unstable? The patch seems to suggest you do yourself, but it is not at all clear why. If TSC really is unstable, then it needs to remain unstable. If the TSC really is stable then there is no point in marking is unstable. Either way something is off, and you're not telling me what. > > > Reviewed-by: Erik Quanstrom <quanstro@amazon.com> > > > Reviewed-by: Frank van der Linden <fllinden@amazon.com> > > > Reviewed-by: Balbir Singh <sblbir@amazon.com> > > > Reviewed-by: Munehisa Kamata <kamatam@amazon.com> > > > Tested-by: Anchal Agarwal <anchalag@amazon.com> > > > Signed-off-by: Eduardo Valentin <eduval@amazon.com> > > > --- > > > > NAK, the code very much relies on never getting marked stable again > > after it gets set to unstable. > > > > Well actually, at the PM_POST_HIBERNATION, we do the check and set stable if > known to be stable. > > The issue only really happens during the restoration path under scheduling pressure, > which takes forever to finish, as described in the commit. > > Do you see a better solution for this issue? I still have no clue what your actual problem is. You say scheduling goes wobbly because sched_clock_data is stale, but when stable that doesn't matter. So what is the actual problem?
On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote: > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: > > Hey Peter, > > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > System instability are seen during resume from hibernation when system > > > > is under heavy CPU load. This is due to the lack of update of sched > > > > clock data, and the scheduler would then think that heavy CPU hog > > > > tasks need more time in CPU, causing the system to freeze > > > > during the unfreezing of tasks. For example, threaded irqs, > > > > and kernel processes servicing network interface may be delayed > > > > for several tens of seconds, causing the system to be unreachable. > > > > The fix for this situation is to mark the sched clock as unstable > > > > as early as possible in the resume path, leaving it unstable > > > > for the duration of the resume process. This will force the > > > > scheduler to attempt to align the sched clock across CPUs using > > > > the delta with time of day, updating sched clock data. In a post > > > > hibernation event, we can then mark the sched clock as stable > > > > again, avoiding unnecessary syncs with time of day on systems > > > > in which TSC is reliable. > > > > > > This makes no frigging sense what so bloody ever. If the clock is > > > stable, we don't care about sched_clock_data. When it is stable you get > > > a linear function of the TSC without complicated bits on. > > > > > > When it is unstable, only then do we care about the sched_clock_data. > > > > > > > Yeah, maybe what is not clear here is that we covering for situation > > where clock stability changes over time, e.g. at regular boot clock is > > stable, hibernation happens, then restore happens in a non-stable clock. > > Still confused, who marks the thing unstable? The patch seems to suggest > you do yourself, but it is not at all clear why. > > If TSC really is unstable, then it needs to remain unstable. If the TSC > really is stable then there is no point in marking is unstable. > > Either way something is off, and you're not telling me what. > Hi, Peter For your original comment, just wanted to clarify the following: 1. After hibernation, the machine can be resumed on a different but compatible host (these are VM images hibernated) 2. This means the clock between host1 and host2 can/will be different In your comments are you making the assumption that the host(s) is/are the same? Just checking the assumptions being made and being on the same page with them. Balbir Singh. > > > > Reviewed-by: Erik Quanstrom <quanstro@amazon.com> > > > > Reviewed-by: Frank van der Linden <fllinden@amazon.com> > > > > Reviewed-by: Balbir Singh <sblbir@amazon.com> > > > > Reviewed-by: Munehisa Kamata <kamatam@amazon.com> > > > > Tested-by: Anchal Agarwal <anchalag@amazon.com> > > > > Signed-off-by: Eduardo Valentin <eduval@amazon.com> > > > > --- > > > > > > NAK, the code very much relies on never getting marked stable again > > > after it gets set to unstable. > > > > > > > Well actually, at the PM_POST_HIBERNATION, we do the check and set stable > > if > > known to be stable. > > > > The issue only really happens during the restoration path under scheduling > > pressure, > > which takes forever to finish, as described in the commit. > > > > Do you see a better solution for this issue? > > I still have no clue what your actual problem is. You say scheduling > goes wobbly because sched_clock_data is stale, but when stable that > doesn't matter. > > So what is the actual problem?
On Mon, Jan 13, 2020 at 12:43 PM Singh, Balbir <sblbir@amazon.com> wrote: > > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote: > > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: > > > Hey Peter, > > > > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > > > System instability are seen during resume from hibernation when system > > > > > is under heavy CPU load. This is due to the lack of update of sched > > > > > clock data, and the scheduler would then think that heavy CPU hog > > > > > tasks need more time in CPU, causing the system to freeze > > > > > during the unfreezing of tasks. For example, threaded irqs, > > > > > and kernel processes servicing network interface may be delayed > > > > > for several tens of seconds, causing the system to be unreachable. > > > > > The fix for this situation is to mark the sched clock as unstable > > > > > as early as possible in the resume path, leaving it unstable > > > > > for the duration of the resume process. This will force the > > > > > scheduler to attempt to align the sched clock across CPUs using > > > > > the delta with time of day, updating sched clock data. In a post > > > > > hibernation event, we can then mark the sched clock as stable > > > > > again, avoiding unnecessary syncs with time of day on systems > > > > > in which TSC is reliable. > > > > > > > > This makes no frigging sense what so bloody ever. If the clock is > > > > stable, we don't care about sched_clock_data. When it is stable you get > > > > a linear function of the TSC without complicated bits on. > > > > > > > > When it is unstable, only then do we care about the sched_clock_data. > > > > > > > > > > Yeah, maybe what is not clear here is that we covering for situation > > > where clock stability changes over time, e.g. at regular boot clock is > > > stable, hibernation happens, then restore happens in a non-stable clock. > > > > Still confused, who marks the thing unstable? The patch seems to suggest > > you do yourself, but it is not at all clear why. > > > > If TSC really is unstable, then it needs to remain unstable. If the TSC > > really is stable then there is no point in marking is unstable. > > > > Either way something is off, and you're not telling me what. > > > > Hi, Peter > > For your original comment, just wanted to clarify the following: > > 1. After hibernation, the machine can be resumed on a different but compatible > host (these are VM images hibernated) > 2. This means the clock between host1 and host2 can/will be different So the problem is specific to this particular use case. I'm not sure why to impose this hack on hibernation in all cases.
On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote: > For your original comment, just wanted to clarify the following: > > 1. After hibernation, the machine can be resumed on a different but compatible > host (these are VM images hibernated) > 2. This means the clock between host1 and host2 can/will be different > > In your comments are you making the assumption that the host(s) is/are the > same? Just checking the assumptions being made and being on the same page with > them. I would expect this to be the same problem we have as regular suspend, after power off the TSC will have been reset, so resume will have to somehow bridge that gap. I've no idea if/how it does that. I remember some BIOSes had crazy TSC ideas for suspend2ram, and we grew tsc_restore_sched_clock_state() for it. Playing crazy games like what you're doing just isn't it though.
On 13/01/2020 11:43, Singh, Balbir wrote: > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote: >> On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: >>> Hey Peter, >>> >>> On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: >>>> On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: >>>>> From: Eduardo Valentin <eduval@amazon.com> >>>>> >>>>> System instability are seen during resume from hibernation when system >>>>> is under heavy CPU load. This is due to the lack of update of sched >>>>> clock data, and the scheduler would then think that heavy CPU hog >>>>> tasks need more time in CPU, causing the system to freeze >>>>> during the unfreezing of tasks. For example, threaded irqs, >>>>> and kernel processes servicing network interface may be delayed >>>>> for several tens of seconds, causing the system to be unreachable. >>>>> The fix for this situation is to mark the sched clock as unstable >>>>> as early as possible in the resume path, leaving it unstable >>>>> for the duration of the resume process. This will force the >>>>> scheduler to attempt to align the sched clock across CPUs using >>>>> the delta with time of day, updating sched clock data. In a post >>>>> hibernation event, we can then mark the sched clock as stable >>>>> again, avoiding unnecessary syncs with time of day on systems >>>>> in which TSC is reliable. >>>> This makes no frigging sense what so bloody ever. If the clock is >>>> stable, we don't care about sched_clock_data. When it is stable you get >>>> a linear function of the TSC without complicated bits on. >>>> >>>> When it is unstable, only then do we care about the sched_clock_data. >>>> >>> Yeah, maybe what is not clear here is that we covering for situation >>> where clock stability changes over time, e.g. at regular boot clock is >>> stable, hibernation happens, then restore happens in a non-stable clock. >> Still confused, who marks the thing unstable? The patch seems to suggest >> you do yourself, but it is not at all clear why. >> >> If TSC really is unstable, then it needs to remain unstable. If the TSC >> really is stable then there is no point in marking is unstable. >> >> Either way something is off, and you're not telling me what. >> > Hi, Peter > > For your original comment, just wanted to clarify the following: > > 1. After hibernation, the machine can be resumed on a different but compatible > host (these are VM images hibernated) > 2. This means the clock between host1 and host2 can/will be different The guests TSC value is part of all save/migrate/resume state. Given this bug, I presume you've actually discarded all register state on hibernate, and the TSC is starting again from 0? The frequency of the new TSC might very likely be different, but the scale/offset in the paravirtual clock information should let Linux's view of time stay consistent. > In your comments are you making the assumption that the host(s) is/are the > same? Just checking the assumptions being made and being on the same page with > them. TSCs are a massive source of "fun". I'm not surprised that there are yet more bugs around. Does anyone actually know what does/should happen to the real TSC on native S4? The default course of action should be for virtualisation to follow suit. ~Andrew
On Mon, 2020-01-13 at 13:01 +0000, Andrew Cooper wrote: > On 13/01/2020 11:43, Singh, Balbir wrote: > > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote: > > > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: > > > > Hey Peter, > > > > > > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > > > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > > > > > System instability are seen during resume from hibernation when system > > > > > > is under heavy CPU load. This is due to the lack of update of sched > > > > > > clock data, and the scheduler would then think that heavy CPU hog > > > > > > tasks need more time in CPU, causing the system to freeze > > > > > > during the unfreezing of tasks. For example, threaded irqs, > > > > > > and kernel processes servicing network interface may be delayed > > > > > > for several tens of seconds, causing the system to be unreachable. > > > > > > The fix for this situation is to mark the sched clock as unstable > > > > > > as early as possible in the resume path, leaving it unstable > > > > > > for the duration of the resume process. This will force the > > > > > > scheduler to attempt to align the sched clock across CPUs using > > > > > > the delta with time of day, updating sched clock data. In a post > > > > > > hibernation event, we can then mark the sched clock as stable > > > > > > again, avoiding unnecessary syncs with time of day on systems > > > > > > in which TSC is reliable. > > > > > > > > > > This makes no frigging sense what so bloody ever. If the clock is > > > > > stable, we don't care about sched_clock_data. When it is stable you get > > > > > a linear function of the TSC without complicated bits on. > > > > > > > > > > When it is unstable, only then do we care about the sched_clock_data. > > > > > > > > > > > > > Yeah, maybe what is not clear here is that we covering for situation > > > > where clock stability changes over time, e.g. at regular boot clock is > > > > stable, hibernation happens, then restore happens in a non-stable clock. > > > > > > Still confused, who marks the thing unstable? The patch seems to suggest > > > you do yourself, but it is not at all clear why. > > > > > > If TSC really is unstable, then it needs to remain unstable. If the TSC > > > really is stable then there is no point in marking is unstable. > > > > > > Either way something is off, and you're not telling me what. > > > > > > > Hi, Peter > > > > For your original comment, just wanted to clarify the following: > > > > 1. After hibernation, the machine can be resumed on a different but compatible > > host (these are VM images hibernated) > > 2. This means the clock between host1 and host2 can/will be different > > The guests TSC value is part of all save/migrate/resume state. Given > this bug, I presume you've actually discarded all register state on > hibernate, and the TSC is starting again from 0? Right. This is a guest-driven suspend to disk, followed by starting up later on a different — but identical — host. There is no guest state being saved as part of a Xen save/restore. > The frequency of the new TSC might very likely be different, but the > scale/offset in the paravirtual clock information should let Linux's > view of time stay consistent. The frequency as seen by the guest really needs to be the same. That hibernated instance may only be booted again on a host which would have been suitable for live migration. That's either because the TSC frequency *is* the same, or with TSC scaling to make it appear that way. If the environment doesn't provide that then all bets are off and we shouldn't be trying to hack around it in the guest kernel. Across the hibernation we do expect a single step change in the TSC value, just as on real hardware. Like Peter, I assume that the resume code does cope with that but haven't checked precisely how/where it does so.
On Mon, 2020-01-13 at 13:01 +0000, Andrew Cooper wrote: > On 13/01/2020 11:43, Singh, Balbir wrote: > > On Mon, 2020-01-13 at 11:16 +0100, Peter Zijlstra wrote: > > > On Fri, Jan 10, 2020 at 07:35:20AM -0800, Eduardo Valentin wrote: > > > > Hey Peter, > > > > > > > > On Wed, Jan 08, 2020 at 11:50:11AM +0100, Peter Zijlstra wrote: > > > > > On Tue, Jan 07, 2020 at 11:45:26PM +0000, Anchal Agarwal wrote: > > > > > > From: Eduardo Valentin <eduval@amazon.com> > > > > > > > > > > > > System instability are seen during resume from hibernation when > > > > > > system > > > > > > is under heavy CPU load. This is due to the lack of update of > > > > > > sched > > > > > > clock data, and the scheduler would then think that heavy CPU hog > > > > > > tasks need more time in CPU, causing the system to freeze > > > > > > during the unfreezing of tasks. For example, threaded irqs, > > > > > > and kernel processes servicing network interface may be delayed > > > > > > for several tens of seconds, causing the system to be unreachable. > > > > > > The fix for this situation is to mark the sched clock as unstable > > > > > > as early as possible in the resume path, leaving it unstable > > > > > > for the duration of the resume process. This will force the > > > > > > scheduler to attempt to align the sched clock across CPUs using > > > > > > the delta with time of day, updating sched clock data. In a post > > > > > > hibernation event, we can then mark the sched clock as stable > > > > > > again, avoiding unnecessary syncs with time of day on systems > > > > > > in which TSC is reliable. > > > > > > > > > > This makes no frigging sense what so bloody ever. If the clock is > > > > > stable, we don't care about sched_clock_data. When it is stable you > > > > > get > > > > > a linear function of the TSC without complicated bits on. > > > > > > > > > > When it is unstable, only then do we care about the > > > > > sched_clock_data. > > > > > > > > > > > > > Yeah, maybe what is not clear here is that we covering for situation > > > > where clock stability changes over time, e.g. at regular boot clock is > > > > stable, hibernation happens, then restore happens in a non-stable > > > > clock. > > > > > > Still confused, who marks the thing unstable? The patch seems to suggest > > > you do yourself, but it is not at all clear why. > > > > > > If TSC really is unstable, then it needs to remain unstable. If the TSC > > > really is stable then there is no point in marking is unstable. > > > > > > Either way something is off, and you're not telling me what. > > > > > > > Hi, Peter > > > > For your original comment, just wanted to clarify the following: > > > > 1. After hibernation, the machine can be resumed on a different but > > compatible > > host (these are VM images hibernated) > > 2. This means the clock between host1 and host2 can/will be different > > The guests TSC value is part of all save/migrate/resume state. Given > this bug, I presume you've actually discarded all register state on > hibernate, and the TSC is starting again from 0? > > The frequency of the new TSC might very likely be different, but the > scale/offset in the paravirtual clock information should let Linux's > view of time stay consistent. > I am looking at my old dmesg logs, which I seem to have lost to revalidate, but I think Eduardo had a different point. I should point out that I was adding to the list of potentially missed assumptions > > In your comments are you making the assumption that the host(s) is/are the > > same? Just checking the assumptions being made and being on the same page > > with > > them. > > TSCs are a massive source of "fun". I'm not surprised that there are > yet more bugs around. > > Does anyone actually know what does/should happen to the real TSC on > native S4? The default course of action should be for virtualisation to > follow suit. > > ~Andrew Balbir
On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote: > > For your original comment, just wanted to clarify the following: > > > > 1. After hibernation, the machine can be resumed on a different but compatible > > host (these are VM images hibernated) > > 2. This means the clock between host1 and host2 can/will be different > > > > In your comments are you making the assumption that the host(s) is/are the > > same? Just checking the assumptions being made and being on the same page with > > them. > > I would expect this to be the same problem we have as regular suspend, > after power off the TSC will have been reset, so resume will have to > somehow bridge that gap. I've no idea if/how it does that. In general, this is done by timekeeping_resume() and the only special thing done for the TSC appears to be the tsc_verify_tsc_adjust(true) call in tsc_resume(). > I remember some BIOSes had crazy TSC ideas for suspend2ram, and we grew > tsc_restore_sched_clock_state() for it. > > Playing crazy games like what you're doing just isn't it though. Right.
On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote: > > > For your original comment, just wanted to clarify the following: > > > > > > 1. After hibernation, the machine can be resumed on a different but compatible > > > host (these are VM images hibernated) > > > 2. This means the clock between host1 and host2 can/will be different > > > > > > In your comments are you making the assumption that the host(s) is/are the > > > same? Just checking the assumptions being made and being on the same page with > > > them. > > > > I would expect this to be the same problem we have as regular suspend, > > after power off the TSC will have been reset, so resume will have to > > somehow bridge that gap. I've no idea if/how it does that. > > In general, this is done by timekeeping_resume() and the only special > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true) > call in tsc_resume(). And I forgot about tsc_restore_sched_clock_state() that gets called via restore_processor_state() on x86, before calling timekeeping_resume().
On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote: > On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote: > > > > For your original comment, just wanted to clarify the following: > > > > > > > > 1. After hibernation, the machine can be resumed on a different but compatible > > > > host (these are VM images hibernated) > > > > 2. This means the clock between host1 and host2 can/will be different > > > > > > > > In your comments are you making the assumption that the host(s) is/are the > > > > same? Just checking the assumptions being made and being on the same page with > > > > them. > > > > > > I would expect this to be the same problem we have as regular suspend, > > > after power off the TSC will have been reset, so resume will have to > > > somehow bridge that gap. I've no idea if/how it does that. > > > > In general, this is done by timekeeping_resume() and the only special > > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true) > > call in tsc_resume(). > > And I forgot about tsc_restore_sched_clock_state() that gets called > via restore_processor_state() on x86, before calling > timekeeping_resume(). > In this case tsc_verify_tsc_adjust(true) this does nothing as feature bit X86_FEATURE_TSC_ADJUST is not available to guest. I am no expert in this area, but could this be messing things up? Thanks, Anchal
On Tue, Jan 14, 2020 at 07:29:52PM +0000, Anchal Agarwal wrote: > On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote: > > On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > On Mon, Jan 13, 2020 at 11:43:18AM +0000, Singh, Balbir wrote: > > > > > For your original comment, just wanted to clarify the following: > > > > > > > > > > 1. After hibernation, the machine can be resumed on a different but compatible > > > > > host (these are VM images hibernated) > > > > > 2. This means the clock between host1 and host2 can/will be different > > > > > > > > > > In your comments are you making the assumption that the host(s) is/are the > > > > > same? Just checking the assumptions being made and being on the same page with > > > > > them. > > > > > > > > I would expect this to be the same problem we have as regular suspend, > > > > after power off the TSC will have been reset, so resume will have to > > > > somehow bridge that gap. I've no idea if/how it does that. > > > > > > In general, this is done by timekeeping_resume() and the only special > > > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true) > > > call in tsc_resume(). > > > > And I forgot about tsc_restore_sched_clock_state() that gets called > > via restore_processor_state() on x86, before calling > > timekeeping_resume(). > > > In this case tsc_verify_tsc_adjust(true) this does nothing as > feature bit X86_FEATURE_TSC_ADJUST is not available to guest. > I am no expert in this area, but could this be messing things up? > > Thanks, > Anchal Gentle nudge on this. I will add more data here in case that helps. 1. Before this patch, tsc is stable but hibernation does not work 100% of the time. I agree if tsc is stable it should not be marked unstable however, in this case if I run a cpu intensive workload in the background and trigger reboot-hibernation loop I see a workqueue lockup. 2. The lockup does not hose the system completely, the reboot-hibernation carries out and system recovers. However, as mentioned in the commit message system does become unreachable for couple of seconds. 3. Xen suspend/resume seems to save/restore time_memory area in its xen_arch_pre_suspend and xen_arch_post_suspend. The xen clock value is saved. xen_sched_clock_offset is set at resume time to ensure a monotonic clock value 4. Also, the instances do not have InvariantTSC exposed. Feature bit X86_FEATURE_TSC_ADJUST is not available to guest and xen clocksource is used by guests. I am not sure if something needs to be fixed on hibernate path itself or its very much ties to time handling on xen guest hibernation Here is a part of log from last hibernation exit to next hibernation entry. The loop was running for a while so boot to lockup log will be huge. I am specifically including the timestamps. ... 01h 57m 15.627s( 16ms): [ 5.822701] OOM killer enabled. 01h 57m 15.627s( 0ms): [ 5.824981] Restarting tasks ... done. 01h 57m 15.627s( 0ms): [ 5.836397] PM: hibernation exit 01h 57m 17.636s(2009ms): [ 7.844471] PM: hibernation entry 01h 57m 52.725s(35089ms): [ 42.934542] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 37s! 01h 57m 52.730s( 5ms): [ 42.941468] Showing busy workqueues and worker pools: 01h 57m 52.734s( 4ms): [ 42.945088] workqueue events: flags=0x0 01h 57m 52.737s( 3ms): [ 42.948385] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 01h 57m 52.742s( 5ms): [ 42.952838] pending: vmstat_shepherd, check_corruption 01h 57m 52.746s( 4ms): [ 42.956927] workqueue events_power_efficient: flags=0x80 01h 57m 52.749s( 3ms): [ 42.960731] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 01h 57m 52.754s( 5ms): [ 42.964835] pending: neigh_periodic_work, do_cache_clean [sunrpc], neigh_periodic_work, check_lifetime 01h 57m 52.781s( 27ms): [ 42.971419] workqueue mm_percpu_wq: flags=0x8 01h 57m 52.781s( 0ms): [ 42.974628] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 01h 57m 52.781s( 0ms): [ 42.978901] pending: vmstat_update 01h 57m 52.781s( 0ms): [ 42.981822] workqueue ipv6_addrconf: flags=0x40008 01h 57m 52.781s( 0ms): [ 42.985524] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 01h 57m 52.781s( 0ms): [ 42.989670] pending: addrconf_verify_work [ipv6] 01h 57m 52.782s( 1ms): [ 42.993282] workqueue xfs-conv/xvda1: flags=0xc 01h 57m 52.786s( 4ms): [ 42.996708] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256 01h 57m 52.790s( 4ms): [ 43.000954] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] 01h 57m 52.795s( 5ms): [ 43.005610] workqueue xfs-reclaim/xvda1: flags=0xc 01h 57m 52.798s( 3ms): [ 43.008945] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 01h 57m 52.802s( 4ms): [ 43.012675] pending: xfs_reclaim_worker [xfs] 01h 57m 52.805s( 3ms): [ 43.015741] workqueue xfs-sync/xvda1: flags=0x4 01h 57m 52.808s( 3ms): [ 43.018723] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 01h 57m 52.811s( 3ms): [ 43.022436] pending: xfs_log_worker [xfs] 01h 57m 52.814s( 3ms): [ 43.043519] Filesystems sync: 35.234 seconds 01h 57m 52.837s( 23ms): [ 43.048133] Freezing user space processes ... (elapsed 0.001 seconds) done. 01h 57m 52.844s( 7ms): [ 43.055996] OOM killer disabled. 01h 57m 53.838s( 994ms): [ 43.061512] PM: Preallocating image memory... done (allocated 385859 pages) 01h 57m 53.843s( 5ms): [ 44.054720] PM: Allocated 1543436 kbytes in 1.06 seconds (1456.07 MB/s) 01h 57m 53.861s( 18ms): [ 44.060885] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. 01h 57m 53.861s( 0ms): [ 44.069715] printk: Suspending console(s) (use no_console_suspend to debug) 01h 57m 56.278s(2417ms): [ 44.116601] Disabling non-boot CPUs ... ..... hibernate-resume loop continues after this. As mentioned above, I loose connectivity for a while. Thanks, Anchal
On 1/22/20 3:07 PM, Anchal Agarwal wrote: >> In this case tsc_verify_tsc_adjust(true) this does nothing as >> feature bit X86_FEATURE_TSC_ADJUST is not available to guest. Is it not available to your specific guest? Because AFAICT it is available in general (to HVM/PVH guests). > 4. Also, the instances do not have InvariantTSC exposed. If you specify "nomigrate=true" in your configuration file you should have that feature enabled. -boris
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 7e322e2daaf5..ae77b8bc4e46 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -14,6 +14,7 @@ #include <linux/percpu.h> #include <linux/timex.h> #include <linux/static_key.h> +#include <linux/suspend.h> #include <asm/hpet.h> #include <asm/timer.h> @@ -1534,3 +1535,31 @@ unsigned long calibrate_delay_is_known(void) return 0; } #endif + +static int tsc_pm_notifier(struct notifier_block *notifier, + unsigned long pm_event, void *unused) +{ + switch (pm_event) { + case PM_HIBERNATION_PREPARE: + clear_sched_clock_stable(); + break; + case PM_POST_HIBERNATION: + /* Set back to the default */ + if (!check_tsc_unstable()) + set_sched_clock_stable(); + break; + } + + return 0; +}; + +static struct notifier_block tsc_pm_notifier_block = { + .notifier_call = tsc_pm_notifier, +}; + +static int tsc_setup_pm_notifier(void) +{ + return register_pm_notifier(&tsc_pm_notifier_block); +} + +subsys_initcall(tsc_setup_pm_notifier); diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h index 867d588314e0..902654ac5f7e 100644 --- a/include/linux/sched/clock.h +++ b/include/linux/sched/clock.h @@ -32,6 +32,10 @@ static inline void clear_sched_clock_stable(void) { } +static inline void set_sched_clock_stable(void) +{ +} + static inline void sched_clock_idle_sleep_event(void) { } @@ -51,6 +55,7 @@ static inline u64 local_clock(void) } #else extern int sched_clock_stable(void); +extern void set_sched_clock_stable(void); extern void clear_sched_clock_stable(void); /* diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index 1152259a4ca0..374d40e5b1a2 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd) scd->tick_raw = sched_clock(); } -static void __set_sched_clock_stable(void) +void set_sched_clock_stable(void) { struct sched_clock_data *scd; @@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void) smp_mb(); /* matches {set,clear}_sched_clock_stable() */ if (__sched_clock_stable_early) - __set_sched_clock_stable(); + set_sched_clock_stable(); return 0; }