mbox series

[v3,00/11] Fix PM hibernation in Xen guests

Message ID cover.1598042152.git.anchalag@amazon.com (mailing list archive)
Headers show
Series Fix PM hibernation in Xen guests | expand

Message

Anchal Agarwal Aug. 21, 2020, 10:22 p.m. UTC
Hello,
This series fixes PM hibernation for hvm guests running on xen hypervisor.
The running guest could now be hibernated and resumed successfully at a
later time. The fixes for PM hibernation are added to block and
network device drivers i.e xen-blkfront and xen-netfront. Any other driver
that needs to add S4 support if not already, can follow same method of
introducing freeze/thaw/restore callbacks.
The patches had been tested against upstream kernel and xen4.11. Large
scale testing is also done on Xen based Amazon EC2 instances. All this testing
involved running memory exhausting workload in the background.
  
Doing guest hibernation does not involve any support from hypervisor and
this way guest has complete control over its state. Infrastructure
restrictions for saving up guest state can be overcome by guest initiated
hibernation.
  
These patches were send out as RFC before and all the feedback had been
incorporated in the patches. The last v1 & v2 could be found here:
  
[v1]: https://lkml.org/lkml/2020/5/19/1312
[v2]: https://lkml.org/lkml/2020/7/2/995
All comments and feedback from v2 had been incorporated in v3 series.

Known issues:
1.KASLR causes intermittent hibernation failures. VM fails to resumes and
has to be restarted. I will investigate this issue separately and shouldn't
be a blocker for this patch series.
2. During hibernation, I observed sometimes that freezing of tasks fails due
to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
may work. Also, this is a known issue with hibernation and some
filesystems like XFS has been discussed by the community for years with not an
effectve resolution at this point.

Testing How to:
---------------
1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
xen-4.11]
2. Bring up a HVM guest w/t kernel compiled with hibernation patches
[I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
3. Create a swap file size=RAM size
4. Update grub parameters and reboot
5. Trigger pm-hibernation from within the VM

Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1

Execute:
--------
sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (4):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
    Resume
  x86/xen: save and restore steal clock during PM hibernation
  xen: Introduce wrapper for save/restore sched clock offset
  xen: Update sched clock offset to avoid system instability in
    hibernation

Munehisa Kamata (5):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen-netfront: add callbacks for PM suspend and hibernation

Thomas Gleixner (1):
  genirq: Shutdown irq chips in suspend/resume during hibernation

 arch/x86/xen/enlighten_hvm.c      |   7 +++
 arch/x86/xen/suspend.c            |  63 ++++++++++++++++++++
 arch/x86/xen/time.c               |  15 ++++-
 arch/x86/xen/xen-ops.h            |   3 +
 drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++++++++++--
 drivers/net/xen-netfront.c        |  96 +++++++++++++++++++++++++++++-
 drivers/xen/events/events_base.c  |   1 +
 drivers/xen/manage.c              |  46 ++++++++++++++
 drivers/xen/xenbus/xenbus_probe.c |  96 +++++++++++++++++++++++++-----
 include/linux/irq.h               |   2 +
 include/xen/xen-ops.h             |   3 +
 include/xen/xenbus.h              |   3 +
 kernel/irq/chip.c                 |   2 +-
 kernel/irq/internals.h            |   1 +
 kernel/irq/pm.c                   |  31 +++++++---
 kernel/power/user.c               |   7 ++-
 16 files changed, 464 insertions(+), 34 deletions(-)

Comments

Thomas Gleixner Aug. 22, 2020, 12:36 a.m. UTC | #1
On Fri, Aug 21 2020 at 22:27, Thomas Gleixner wrote:
> Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> it the core interrupt suspend/resume paths.
>
> Changelog:
> v1->v2: Corrected the author's name to tglx@

Can you please move that Changelog part below the --- seperator next
time because that's really not part of the final commit messaage and the
maintainer has then to strip it off manually
 
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

These SOB lines are just wrongly ordered as they suggest:

     Anchal has authored the patch and Thomas transported it

which is clearly not the case. So the right order is:

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>

And that needs another tweak at the top of the change log. The first
line in the mail body wants to be:

From: Thomas Gleixner <tglx@linutronix.de>

followed by an empty new line before the actual changelog text
starts. That way the attribution of the patch when applying it will be
correct.

Documentation/process/ is there for a reason and following the few
simple rules to get that straight is not rocket science.

Thanks,

        tglx
Anchal Agarwal Aug. 24, 2020, 5:25 p.m. UTC | #2
On Sat, Aug 22, 2020 at 02:36:37AM +0200, Thomas Gleixner wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Fri, Aug 21 2020 at 22:27, Thomas Gleixner wrote:
> > Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> > it the core interrupt suspend/resume paths.
> >
> > Changelog:
> > v1->v2: Corrected the author's name to tglx@
> 
> Can you please move that Changelog part below the --- seperator next
> time because that's really not part of the final commit messaage and the
> maintainer has then to strip it off manually
> 
Ack.
> > Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> These SOB lines are just wrongly ordered as they suggest:
> 
>      Anchal has authored the patch and Thomas transported it
> 
> which is clearly not the case. So the right order is:
> 
I must admit I wasn't aware of that. Will fix.
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> 
> And that needs another tweak at the top of the change log. The first
> line in the mail body wants to be:
> 
> From: Thomas Gleixner <tglx@linutronix.de>
Yes I accidentally missed that in this patch.
Others have that line on all the patches and even v2 for this patch
has. Will fix.
> 
> followed by an empty new line before the actual changelog text
> starts. That way the attribution of the patch when applying it will be
> correct.
> 
> Documentation/process/ is there for a reason and following the few
> simple rules to get that straight is not rocket science.
> 
> Thanks,
> 
>         tglx
> 
> 
Anchal
Christoph Hellwig Aug. 25, 2020, 1:20 p.m. UTC | #3
On Sat, Aug 22, 2020 at 02:36:37AM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> followed by an empty new line before the actual changelog text
> starts. That way the attribution of the patch when applying it will be
> correct.

The way he sent it attribution will be correct as he managed to get his
MTU to send out the mail claiming to be from you.  But yes, it needs
the second From line, _and_ the first from line needs to be fixed to
be from him.
Thomas Gleixner Aug. 25, 2020, 3:25 p.m. UTC | #4
On Tue, Aug 25 2020 at 14:20, Christoph Hellwig wrote:
> On Sat, Aug 22, 2020 at 02:36:37AM +0200, Thomas Gleixner wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> 
>> followed by an empty new line before the actual changelog text
>> starts. That way the attribution of the patch when applying it will be
>> correct.
>
> The way he sent it attribution will be correct as he managed to get his
> MTU to send out the mail claiming to be from you.

Which is even worse as that spammed my inbox with mail delivery rejects
for SPF and whatever violations. And those came mostly from Amazon
servers which sent out that wrong stuff in the first place ....

> But yes, it needs the second From line, _and_ the first from line
> needs to be fixed to be from him.

Thanks,

        tglx
Anchal Agarwal Aug. 28, 2020, 6:26 p.m. UTC | #5
On Fri, Aug 21, 2020 at 10:22:43PM +0000, Anchal Agarwal wrote:
> Hello,
> This series fixes PM hibernation for hvm guests running on xen hypervisor.
> The running guest could now be hibernated and resumed successfully at a
> later time. The fixes for PM hibernation are added to block and
> network device drivers i.e xen-blkfront and xen-netfront. Any other driver
> that needs to add S4 support if not already, can follow same method of
> introducing freeze/thaw/restore callbacks.
> The patches had been tested against upstream kernel and xen4.11. Large
> scale testing is also done on Xen based Amazon EC2 instances. All this testing
> involved running memory exhausting workload in the background.
>   
> Doing guest hibernation does not involve any support from hypervisor and
> this way guest has complete control over its state. Infrastructure
> restrictions for saving up guest state can be overcome by guest initiated
> hibernation.
>   
> These patches were send out as RFC before and all the feedback had been
> incorporated in the patches. The last v1 & v2 could be found here:
>   
> [v1]: https://lkml.org/lkml/2020/5/19/1312
> [v2]: https://lkml.org/lkml/2020/7/2/995
> All comments and feedback from v2 had been incorporated in v3 series.
> 
> Known issues:
> 1.KASLR causes intermittent hibernation failures. VM fails to resumes and
> has to be restarted. I will investigate this issue separately and shouldn't
> be a blocker for this patch series.
> 2. During hibernation, I observed sometimes that freezing of tasks fails due
> to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
> out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
> may work. Also, this is a known issue with hibernation and some
> filesystems like XFS has been discussed by the community for years with not an
> effectve resolution at this point.
> 
> Testing How to:
> ---------------
> 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
> xen-4.11]
> 2. Bring up a HVM guest w/t kernel compiled with hibernation patches
> [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
> 3. Create a swap file size=RAM size
> 4. Update grub parameters and reboot
> 5. Trigger pm-hibernation from within the VM
> 
> Example:
> Set up a file-backed swap space. Swap file size>=Total memory on the system
> sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
> sudo chmod 600 /swap
> sudo mkswap /swap
> sudo swapon /swap
> 
> Update resume device/resume offset in grub if using swap file:
> resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
> 
> Execute:
> --------
> sudo pm-hibernate
> OR
> echo disk > /sys/power/state && echo reboot > /sys/power/disk
> 
> Compute resume offset code:
> "
> #!/usr/bin/env python
> import sys
> import array
> import fcntl
> 
> #swap file
> f = open(sys.argv[1], 'r')
> buf = array.array('L', [0])
> 
> #FIBMAP
> ret = fcntl.ioctl(f.fileno(), 0x01, buf)
> print buf[0]
> "
> 
> Aleksei Besogonov (1):
>   PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> 
> Anchal Agarwal (4):
>   x86/xen: Introduce new function to map HYPERVISOR_shared_info on
>     Resume
>   x86/xen: save and restore steal clock during PM hibernation
>   xen: Introduce wrapper for save/restore sched clock offset
>   xen: Update sched clock offset to avoid system instability in
>     hibernation
> 
> Munehisa Kamata (5):
>   xen/manage: keep track of the on-going suspend mode
>   xenbus: add freeze/thaw/restore callbacks support
>   x86/xen: add system core suspend and resume callbacks
>   xen-blkfront: add callbacks for PM suspend and hibernation
>   xen-netfront: add callbacks for PM suspend and hibernation
> 
> Thomas Gleixner (1):
>   genirq: Shutdown irq chips in suspend/resume during hibernation
> 
>  arch/x86/xen/enlighten_hvm.c      |   7 +++
>  arch/x86/xen/suspend.c            |  63 ++++++++++++++++++++
>  arch/x86/xen/time.c               |  15 ++++-
>  arch/x86/xen/xen-ops.h            |   3 +
>  drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++++++++++--
>  drivers/net/xen-netfront.c        |  96 +++++++++++++++++++++++++++++-
>  drivers/xen/events/events_base.c  |   1 +
>  drivers/xen/manage.c              |  46 ++++++++++++++
>  drivers/xen/xenbus/xenbus_probe.c |  96 +++++++++++++++++++++++++-----
>  include/linux/irq.h               |   2 +
>  include/xen/xen-ops.h             |   3 +
>  include/xen/xenbus.h              |   3 +
>  kernel/irq/chip.c                 |   2 +-
>  kernel/irq/internals.h            |   1 +
>  kernel/irq/pm.c                   |  31 +++++++---
>  kernel/power/user.c               |   7 ++-
>  16 files changed, 464 insertions(+), 34 deletions(-)
> 
> -- 
> 2.16.6
>
A gentle ping on the series in case there is any more feedback or can we plan to
merge this? I can then send the series with minor fixes pointed by tglx@

Thanks,
Anchal
Rafael J. Wysocki Aug. 28, 2020, 6:29 p.m. UTC | #6
On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal <anchalag@amazon.com> wrote:
>
> On Fri, Aug 21, 2020 at 10:22:43PM +0000, Anchal Agarwal wrote:
> > Hello,
> > This series fixes PM hibernation for hvm guests running on xen hypervisor.
> > The running guest could now be hibernated and resumed successfully at a
> > later time. The fixes for PM hibernation are added to block and
> > network device drivers i.e xen-blkfront and xen-netfront. Any other driver
> > that needs to add S4 support if not already, can follow same method of
> > introducing freeze/thaw/restore callbacks.
> > The patches had been tested against upstream kernel and xen4.11. Large
> > scale testing is also done on Xen based Amazon EC2 instances. All this testing
> > involved running memory exhausting workload in the background.
> >
> > Doing guest hibernation does not involve any support from hypervisor and
> > this way guest has complete control over its state. Infrastructure
> > restrictions for saving up guest state can be overcome by guest initiated
> > hibernation.
> >
> > These patches were send out as RFC before and all the feedback had been
> > incorporated in the patches. The last v1 & v2 could be found here:
> >
> > [v1]: https://lkml.org/lkml/2020/5/19/1312
> > [v2]: https://lkml.org/lkml/2020/7/2/995
> > All comments and feedback from v2 had been incorporated in v3 series.
> >
> > Known issues:
> > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and
> > has to be restarted. I will investigate this issue separately and shouldn't
> > be a blocker for this patch series.
> > 2. During hibernation, I observed sometimes that freezing of tasks fails due
> > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
> > out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
> > may work. Also, this is a known issue with hibernation and some
> > filesystems like XFS has been discussed by the community for years with not an
> > effectve resolution at this point.
> >
> > Testing How to:
> > ---------------
> > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
> > xen-4.11]
> > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches
> > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
> > 3. Create a swap file size=RAM size
> > 4. Update grub parameters and reboot
> > 5. Trigger pm-hibernation from within the VM
> >
> > Example:
> > Set up a file-backed swap space. Swap file size>=Total memory on the system
> > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
> > sudo chmod 600 /swap
> > sudo mkswap /swap
> > sudo swapon /swap
> >
> > Update resume device/resume offset in grub if using swap file:
> > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
> >
> > Execute:
> > --------
> > sudo pm-hibernate
> > OR
> > echo disk > /sys/power/state && echo reboot > /sys/power/disk
> >
> > Compute resume offset code:
> > "
> > #!/usr/bin/env python
> > import sys
> > import array
> > import fcntl
> >
> > #swap file
> > f = open(sys.argv[1], 'r')
> > buf = array.array('L', [0])
> >
> > #FIBMAP
> > ret = fcntl.ioctl(f.fileno(), 0x01, buf)
> > print buf[0]
> > "
> >
> > Aleksei Besogonov (1):
> >   PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> >
> > Anchal Agarwal (4):
> >   x86/xen: Introduce new function to map HYPERVISOR_shared_info on
> >     Resume
> >   x86/xen: save and restore steal clock during PM hibernation
> >   xen: Introduce wrapper for save/restore sched clock offset
> >   xen: Update sched clock offset to avoid system instability in
> >     hibernation
> >
> > Munehisa Kamata (5):
> >   xen/manage: keep track of the on-going suspend mode
> >   xenbus: add freeze/thaw/restore callbacks support
> >   x86/xen: add system core suspend and resume callbacks
> >   xen-blkfront: add callbacks for PM suspend and hibernation
> >   xen-netfront: add callbacks for PM suspend and hibernation
> >
> > Thomas Gleixner (1):
> >   genirq: Shutdown irq chips in suspend/resume during hibernation
> >
> >  arch/x86/xen/enlighten_hvm.c      |   7 +++
> >  arch/x86/xen/suspend.c            |  63 ++++++++++++++++++++
> >  arch/x86/xen/time.c               |  15 ++++-
> >  arch/x86/xen/xen-ops.h            |   3 +
> >  drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++++++++++--
> >  drivers/net/xen-netfront.c        |  96 +++++++++++++++++++++++++++++-
> >  drivers/xen/events/events_base.c  |   1 +
> >  drivers/xen/manage.c              |  46 ++++++++++++++
> >  drivers/xen/xenbus/xenbus_probe.c |  96 +++++++++++++++++++++++++-----
> >  include/linux/irq.h               |   2 +
> >  include/xen/xen-ops.h             |   3 +
> >  include/xen/xenbus.h              |   3 +
> >  kernel/irq/chip.c                 |   2 +-
> >  kernel/irq/internals.h            |   1 +
> >  kernel/irq/pm.c                   |  31 +++++++---
> >  kernel/power/user.c               |   7 ++-
> >  16 files changed, 464 insertions(+), 34 deletions(-)
> >
> > --
> > 2.16.6
> >
> A gentle ping on the series in case there is any more feedback or can we plan to
> merge this? I can then send the series with minor fixes pointed by tglx@

Some more time, please!
Anchal Agarwal Aug. 28, 2020, 6:39 p.m. UTC | #7
On Fri, Aug 28, 2020 at 08:29:24PM +0200, Rafael J. Wysocki wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal <anchalag@amazon.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 10:22:43PM +0000, Anchal Agarwal wrote:
> > > Hello,
> > > This series fixes PM hibernation for hvm guests running on xen hypervisor.
> > > The running guest could now be hibernated and resumed successfully at a
> > > later time. The fixes for PM hibernation are added to block and
> > > network device drivers i.e xen-blkfront and xen-netfront. Any other driver
> > > that needs to add S4 support if not already, can follow same method of
> > > introducing freeze/thaw/restore callbacks.
> > > The patches had been tested against upstream kernel and xen4.11. Large
> > > scale testing is also done on Xen based Amazon EC2 instances. All this testing
> > > involved running memory exhausting workload in the background.
> > >
> > > Doing guest hibernation does not involve any support from hypervisor and
> > > this way guest has complete control over its state. Infrastructure
> > > restrictions for saving up guest state can be overcome by guest initiated
> > > hibernation.
> > >
> > > These patches were send out as RFC before and all the feedback had been
> > > incorporated in the patches. The last v1 & v2 could be found here:
> > >
> > > [v1]: https://lkml.org/lkml/2020/5/19/1312
> > > [v2]: https://lkml.org/lkml/2020/7/2/995
> > > All comments and feedback from v2 had been incorporated in v3 series.
> > >
> > > Known issues:
> > > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and
> > > has to be restarted. I will investigate this issue separately and shouldn't
> > > be a blocker for this patch series.
> > > 2. During hibernation, I observed sometimes that freezing of tasks fails due
> > > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
> > > out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
> > > may work. Also, this is a known issue with hibernation and some
> > > filesystems like XFS has been discussed by the community for years with not an
> > > effectve resolution at this point.
> > >
> > > Testing How to:
> > > ---------------
> > > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
> > > xen-4.11]
> > > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches
> > > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
> > > 3. Create a swap file size=RAM size
> > > 4. Update grub parameters and reboot
> > > 5. Trigger pm-hibernation from within the VM
> > >
> > > Example:
> > > Set up a file-backed swap space. Swap file size>=Total memory on the system
> > > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
> > > sudo chmod 600 /swap
> > > sudo mkswap /swap
> > > sudo swapon /swap
> > >
> > > Update resume device/resume offset in grub if using swap file:
> > > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
> > >
> > > Execute:
> > > --------
> > > sudo pm-hibernate
> > > OR
> > > echo disk > /sys/power/state && echo reboot > /sys/power/disk
> > >
> > > Compute resume offset code:
> > > "
> > > #!/usr/bin/env python
> > > import sys
> > > import array
> > > import fcntl
> > >
> > > #swap file
> > > f = open(sys.argv[1], 'r')
> > > buf = array.array('L', [0])
> > >
> > > #FIBMAP
> > > ret = fcntl.ioctl(f.fileno(), 0x01, buf)
> > > print buf[0]
> > > "
> > >
> > > Aleksei Besogonov (1):
> > >   PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> > >
> > > Anchal Agarwal (4):
> > >   x86/xen: Introduce new function to map HYPERVISOR_shared_info on
> > >     Resume
> > >   x86/xen: save and restore steal clock during PM hibernation
> > >   xen: Introduce wrapper for save/restore sched clock offset
> > >   xen: Update sched clock offset to avoid system instability in
> > >     hibernation
> > >
> > > Munehisa Kamata (5):
> > >   xen/manage: keep track of the on-going suspend mode
> > >   xenbus: add freeze/thaw/restore callbacks support
> > >   x86/xen: add system core suspend and resume callbacks
> > >   xen-blkfront: add callbacks for PM suspend and hibernation
> > >   xen-netfront: add callbacks for PM suspend and hibernation
> > >
> > > Thomas Gleixner (1):
> > >   genirq: Shutdown irq chips in suspend/resume during hibernation
> > >
> > >  arch/x86/xen/enlighten_hvm.c      |   7 +++
> > >  arch/x86/xen/suspend.c            |  63 ++++++++++++++++++++
> > >  arch/x86/xen/time.c               |  15 ++++-
> > >  arch/x86/xen/xen-ops.h            |   3 +
> > >  drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++++++++++--
> > >  drivers/net/xen-netfront.c        |  96 +++++++++++++++++++++++++++++-
> > >  drivers/xen/events/events_base.c  |   1 +
> > >  drivers/xen/manage.c              |  46 ++++++++++++++
> > >  drivers/xen/xenbus/xenbus_probe.c |  96 +++++++++++++++++++++++++-----
> > >  include/linux/irq.h               |   2 +
> > >  include/xen/xen-ops.h             |   3 +
> > >  include/xen/xenbus.h              |   3 +
> > >  kernel/irq/chip.c                 |   2 +-
> > >  kernel/irq/internals.h            |   1 +
> > >  kernel/irq/pm.c                   |  31 +++++++---
> > >  kernel/power/user.c               |   7 ++-
> > >  16 files changed, 464 insertions(+), 34 deletions(-)
> > >
> > > --
> > > 2.16.6
> > >
> > A gentle ping on the series in case there is any more feedback or can we plan to
> > merge this? I can then send the series with minor fixes pointed by tglx@
> 
> Some more time, please!
>
Sure happy to answer any more questions and fix more BUGS!!

--
Anchal
Boris Ostrovsky Sept. 11, 2020, 3:19 p.m. UTC | #8
On 8/21/20 6:22 PM, Anchal Agarwal wrote:
>
> Known issues:
> 1.KASLR causes intermittent hibernation failures. VM fails to resumes and
> has to be restarted. I will investigate this issue separately and shouldn't
> be a blocker for this patch series.


Is there any change in status for this? This has been noted since January.


-boris


> 2. During hibernation, I observed sometimes that freezing of tasks fails due
> to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
> out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
> may work. Also, this is a known issue with hibernation and some
> filesystems like XFS has been discussed by the community for years with not an
> effectve resolution at this point.
>
Anchal Agarwal Sept. 11, 2020, 8:44 p.m. UTC | #9
On Fri, Aug 28, 2020 at 06:39:45PM +0000, Anchal Agarwal wrote:
> On Fri, Aug 28, 2020 at 08:29:24PM +0200, Rafael J. Wysocki wrote:
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > 
> > 
> > 
> > On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal <anchalag@amazon.com> wrote:
> > >
> > > On Fri, Aug 21, 2020 at 10:22:43PM +0000, Anchal Agarwal wrote:
> > > > Hello,
> > > > This series fixes PM hibernation for hvm guests running on xen hypervisor.
> > > > The running guest could now be hibernated and resumed successfully at a
> > > > later time. The fixes for PM hibernation are added to block and
> > > > network device drivers i.e xen-blkfront and xen-netfront. Any other driver
> > > > that needs to add S4 support if not already, can follow same method of
> > > > introducing freeze/thaw/restore callbacks.
> > > > The patches had been tested against upstream kernel and xen4.11. Large
> > > > scale testing is also done on Xen based Amazon EC2 instances. All this testing
> > > > involved running memory exhausting workload in the background.
> > > >
> > > > Doing guest hibernation does not involve any support from hypervisor and
> > > > this way guest has complete control over its state. Infrastructure
> > > > restrictions for saving up guest state can be overcome by guest initiated
> > > > hibernation.
> > > >
> > > > These patches were send out as RFC before and all the feedback had been
> > > > incorporated in the patches. The last v1 & v2 could be found here:
> > > >
> > > > [v1]: https://lkml.org/lkml/2020/5/19/1312
> > > > [v2]: https://lkml.org/lkml/2020/7/2/995
> > > > All comments and feedback from v2 had been incorporated in v3 series.
> > > >
> > > > Known issues:
> > > > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and
> > > > has to be restarted. I will investigate this issue separately and shouldn't
> > > > be a blocker for this patch series.
> > > > 2. During hibernation, I observed sometimes that freezing of tasks fails due
> > > > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
> > > > out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
> > > > may work. Also, this is a known issue with hibernation and some
> > > > filesystems like XFS has been discussed by the community for years with not an
> > > > effectve resolution at this point.
> > > >
> > > > Testing How to:
> > > > ---------------
> > > > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
> > > > xen-4.11]
> > > > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches
> > > > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
> > > > 3. Create a swap file size=RAM size
> > > > 4. Update grub parameters and reboot
> > > > 5. Trigger pm-hibernation from within the VM
> > > >
> > > > Example:
> > > > Set up a file-backed swap space. Swap file size>=Total memory on the system
> > > > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
> > > > sudo chmod 600 /swap
> > > > sudo mkswap /swap
> > > > sudo swapon /swap
> > > >
> > > > Update resume device/resume offset in grub if using swap file:
> > > > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
> > > >
> > > > Execute:
> > > > --------
> > > > sudo pm-hibernate
> > > > OR
> > > > echo disk > /sys/power/state && echo reboot > /sys/power/disk
> > > >
> > > > Compute resume offset code:
> > > > "
> > > > #!/usr/bin/env python
> > > > import sys
> > > > import array
> > > > import fcntl
> > > >
> > > > #swap file
> > > > f = open(sys.argv[1], 'r')
> > > > buf = array.array('L', [0])
> > > >
> > > > #FIBMAP
> > > > ret = fcntl.ioctl(f.fileno(), 0x01, buf)
> > > > print buf[0]
> > > > "
> > > >
> > > > Aleksei Besogonov (1):
> > > >   PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> > > >
> > > > Anchal Agarwal (4):
> > > >   x86/xen: Introduce new function to map HYPERVISOR_shared_info on
> > > >     Resume
> > > >   x86/xen: save and restore steal clock during PM hibernation
> > > >   xen: Introduce wrapper for save/restore sched clock offset
> > > >   xen: Update sched clock offset to avoid system instability in
> > > >     hibernation
> > > >
> > > > Munehisa Kamata (5):
> > > >   xen/manage: keep track of the on-going suspend mode
> > > >   xenbus: add freeze/thaw/restore callbacks support
> > > >   x86/xen: add system core suspend and resume callbacks
> > > >   xen-blkfront: add callbacks for PM suspend and hibernation
> > > >   xen-netfront: add callbacks for PM suspend and hibernation
> > > >
> > > > Thomas Gleixner (1):
> > > >   genirq: Shutdown irq chips in suspend/resume during hibernation
> > > >
> > > >  arch/x86/xen/enlighten_hvm.c      |   7 +++
> > > >  arch/x86/xen/suspend.c            |  63 ++++++++++++++++++++
> > > >  arch/x86/xen/time.c               |  15 ++++-
> > > >  arch/x86/xen/xen-ops.h            |   3 +
> > > >  drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++++++++++--
> > > >  drivers/net/xen-netfront.c        |  96 +++++++++++++++++++++++++++++-
> > > >  drivers/xen/events/events_base.c  |   1 +
> > > >  drivers/xen/manage.c              |  46 ++++++++++++++
> > > >  drivers/xen/xenbus/xenbus_probe.c |  96 +++++++++++++++++++++++++-----
> > > >  include/linux/irq.h               |   2 +
> > > >  include/xen/xen-ops.h             |   3 +
> > > >  include/xen/xenbus.h              |   3 +
> > > >  kernel/irq/chip.c                 |   2 +-
> > > >  kernel/irq/internals.h            |   1 +
> > > >  kernel/irq/pm.c                   |  31 +++++++---
> > > >  kernel/power/user.c               |   7 ++-
> > > >  16 files changed, 464 insertions(+), 34 deletions(-)
> > > >
> > > > --
> > > > 2.16.6
> > > >
> > > A gentle ping on the series in case there is any more feedback or can we plan to
> > > merge this? I can then send the series with minor fixes pointed by tglx@
> > 
> > Some more time, please!
> >
> Sure happy to answer any more questions and fix more BUGS!!
> 
> --
> Anchal
>
A gentle ping on this one again!

--
Anchal