mbox series

[V1,00/32] Live Update

Message ID 1596122076-341293-1-git-send-email-steven.sistare@oracle.com (mailing list archive)
Headers show
Series Live Update | expand

Message

Steven Sistare July 30, 2020, 3:14 p.m. UTC
Improve and extend the qemu functions that save and restore VM state so a
guest may be suspended and resumed with minimal pause time.  qemu may be
updated to a new version in between.

The first set of patches adds the cprsave and cprload commands to save and
restore VM state, and allow the host kernel to be updated and rebooted in
between.  The VM must create guest RAM in a persistent shared memory file,
such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

cprsave stops the VCPUs and saves VM device state in a simple file, and
thus supports any type of guest image and block device.  The caller must
not modify the VM's block devices between cprsave and cprload.

cprsave and cprload support guests with vfio devices if the caller first
suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
The guest drivers suspend methods flush outstanding requests and re-
initialize the devices, and thus there is no device state to save and
restore.

   1 savevm: add vmstate handler iterators
   2 savevm: VM handlers mode mask
   3 savevm: QMP command for cprsave
   4 savevm: HMP Command for cprsave
   5 savevm: QMP command for cprload
   6 savevm: HMP Command for cprload
   7 savevm: QMP command for cprinfo
   8 savevm: HMP command for cprinfo
   9 savevm: prevent cprsave if memory is volatile
  10 kvmclock: restore paused KVM clock
  11 cpu: disable ticks when suspended
  12 vl: pause option
  13 gdbstub: gdb support for suspended state

The next patches add a restart method that eliminates the persistent memory
constraint, and allows qemu to be updated across the restart, but does not
allow host reboot.  Anonymous memory segments used by the guest are
preserved across a re-exec of qemu, mapped at the same VA, via a proposed
madvise(MADV_DOEXEC) option in the Linux kernel.  See
https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/

  14 savevm: VMS_RESTART and cprsave restart
  15 vl: QEMU_START_FREEZE env var
  16 oslib: add qemu_clr_cloexec
  17 util: env var helpers
  18 osdep: import MADV_DOEXEC
  19 memory: ram_block_add cosmetic changes
  20 vl: add helper to request re-exec
  21 exec, memory: exec(3) to restart
  22 char: qio_channel_socket_accept reuse fd
  23 char: save/restore chardev socket fds
  24 ui: save/restore vnc socket fds
  25 char: save/restore chardev pty fds
  26 monitor: save/restore QMP negotiation status
  27 vhost: reset vhost devices upon cprsave
  28 char: restore terminal on restart

The next patches extend the restart method to save and restore vfio-pci
state, eliminating the requirement for a guest agent.  The vfio container,
group, and device descriptors are preserved across the qemu re-exec.

  29 pci: export pci_update_mappings
  30 vfio-pci: save and restore
  31 vfio-pci: trace pci config
  32 vfio-pci: improved tracing

Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
"cprload restart".  The software update is performed while the guest is
running to minimize downtime.

window 1				| window 2
					|
# qemu-system-x86_64 ... 		|
QEMU 4.2.0 monitor - type 'help' ...	|
(qemu) info status			|
VM status: running			|
					| # yum update qemu
(qemu) cprsave /tmp/qemu.sav restart	|
QEMU 4.2.1 monitor - type 'help' ...	|
(qemu) info status			|
VM status: paused (prelaunch)		|
(qemu) cprload /tmp/qemu.sav		|
(qemu) info status			|
VM status: running			|


Here is an example of updating the host kernel using "cprload reboot"

window 1					| window 2
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: running				|
						| # yum update kernel-uek
(qemu) cprsave /tmp/qemu.sav restart		|
						|
# systemctl kexec				|
kexec_core: Starting new kernel			|
...						|
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: paused (prelaunch)			|
(qemu) cprload /tmp/qemu.sav			|
(qemu) info status				|
VM status: running				|


Mark Kanda (5):
  char: qio_channel_socket_accept reuse fd
  char: save/restore chardev socket fds
  ui: save/restore vnc socket fds
  monitor: save/restore QMP negotiation status
  vhost: reset vhost devices upon cprsave

Steve Sistare (27):
  savevm: add vmstate handler iterators
  savevm: VM handlers mode mask
  savevm: QMP command for cprsave
  savevm: HMP Command for cprsave
  savevm: QMP command for cprload
  savevm: HMP Command for cprload
  savevm: QMP command for cprinfo
  savevm: HMP command for cprinfo
  savevm: prevent cprsave if memory is volatile
  kvmclock: restore paused KVM clock
  cpu: disable ticks when suspended
  vl: pause option
  gdbstub: gdb support for suspended state
  savevm: VMS_RESTART and cprsave restart
  vl: QEMU_START_FREEZE env var
  oslib: add qemu_clr_cloexec
  util: env var helpers
  osdep: import MADV_DOEXEC
  memory: ram_block_add cosmetic changes
  vl: add helper to request re-exec
  exec, memory: exec(3) to restart
  char: save/restore chardev pty fds
  char: restore terminal on restart
  pci: export pci_update_mappings
  vfio-pci: save and restore
  vfio-pci: trace pci config
  vfio-pci: improved tracing

 MAINTAINERS                    |   7 ++
 accel/kvm/kvm-all.c            |   8 +-
 accel/kvm/trace-events         |   3 +-
 chardev/char-pty.c             |  38 +++++--
 chardev/char-socket.c          |  35 ++++++
 chardev/char-stdio.c           |   7 ++
 chardev/char.c                 |  16 +++
 exec.c                         |  88 +++++++++++++--
 gdbstub.c                      |  11 +-
 hmp-commands.hx                |  46 ++++++++
 hw/i386/kvm/clock.c            |   6 +-
 hw/pci/msix.c                  |   1 +
 hw/pci/pci.c                   |  17 +--
 hw/pci/trace-events            |   5 +-
 hw/vfio/common.c               | 115 ++++++++++++++++----
 hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
 hw/vfio/platform.c             |   2 +-
 hw/vfio/trace-events           |  11 +-
 hw/virtio/vhost.c              |  12 +++
 include/chardev/char.h         |   8 ++
 include/exec/memory.h          |   4 +
 include/hw/pci/pci.h           |   2 +
 include/hw/vfio/vfio-common.h  |   4 +-
 include/io/channel-socket.h    |   3 +-
 include/migration/register.h   |   3 +
 include/migration/vmstate.h    |  11 ++
 include/monitor/hmp.h          |   3 +
 include/qemu/cutils.h          |   1 +
 include/qemu/env.h             |  31 ++++++
 include/qemu/osdep.h           |   8 ++
 include/sysemu/sysemu.h        |  10 ++
 io/channel-socket.c            |  12 ++-
 io/net-listener.c              |   4 +-
 migration/block.c              |   1 +
 migration/migration.c          |   4 +-
 migration/ram.c                |   1 +
 migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
 migration/savevm.h             |   4 +-
 monitor/hmp-cmds.c             |  28 +++++
 monitor/qmp-cmds.c             |  16 +++
 monitor/qmp.c                  |  42 ++++++++
 qapi/migration.json            |  35 ++++++
 qapi/pragma.json               |   1 +
 qemu-options.hx                |   9 ++
 scsi/qemu-pr-helper.c          |   2 +-
 softmmu/vl.c                   |  65 ++++++++++-
 tests/qtest/tpm-emu.c          |   2 +-
 tests/test-char.c              |   2 +-
 tests/test-io-channel-socket.c |   4 +-
 trace-events                   |   2 +
 ui/vnc.c                       | 153 +++++++++++++++++++++-----
 util/Makefile.objs             |   2 +-
 util/env.c                     | 132 +++++++++++++++++++++++
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 55 files changed, 1331 insertions(+), 135 deletions(-)
 create mode 100644 include/qemu/env.h
 create mode 100644 util/env.c

Comments

Daniel P. Berrangé July 30, 2020, 4:52 p.m. UTC | #1
On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.
> 
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.
> 
> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds

Keeping FDs open across re-exec is a nice trick, but how are you dealing
with the state associated with them, most especially the TLS encryption
state ? AFAIK, there's no way to serialize/deserialize the TLS state that
GNUTLS maintains, and the patches don't show any sign of dealing with
this. IOW it looks like while the FD will be preserved, any TLS session
running on it will fail.

I'm going to presume that you're probably just considering the TLS features
out of scope for your patch series.  It would be useful if you have any
info about this and other things you've considered out of scope for this
patch series.

I'm not seeing anything in the block layer about preserving open FDs, so
I presume you're just letting the block layer close and then re-open any
FDs it has ?  This would have the side effect that any locks held on the
FDs are lost, so there's a potential race condition where another process
could acquire the lock and prevent the re-exec completing. That said this
is unavoidable, because Linux kernel is completely broken wrt keeping
fnctl() locks held across a re-exec, always throwing away the locks if
more than 1 thread is running [1].

Regards,
Daniel

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621
Paolo Bonzini July 30, 2020, 5:15 p.m. UTC | #2
On 30/07/20 17:14, Steve Sistare wrote:
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.

Stupid question, what does cpr stand for?  If it is checkpoint/restore,
please spell it out.  Also, how does the functionality compare to
xen-save-devices-state and xen-load-devices-state?

> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.

This probably should be allowed even for regular migration.  Can you
generalize the code as a separate series?

Thanks,

Paolo
Dr. David Alan Gilbert July 30, 2020, 5:49 p.m. UTC | #3
* Steve Sistare (steven.sistare@oracle.com) wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.

Nice.

> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.

can I ask why you don't just add a migration flag to skip the devices
you don't want, and then do a migrate to a file?
(i.e. migrate "exec:cat > afile")
We already have the 'x-ignore-shared' capability that's used for doing
RAM snapshots of VMs; primarily I think for being able to start a VM
from a RAM snapshot as a fast VM start trick.
(There's also a xen_save_devices that does something similar).
If you backed the RAM as you say, enabled x-ignore-shared and then did:

   migrate "exec:cat > afile"

and restarted the destination with:

    migrate_incoming "exec:cat afile"

what is different (except the later stuff about the vfio magic and
chardevs).

Dave

> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds
>   26 monitor: save/restore QMP negotiation status
>   27 vhost: reset vhost devices upon cprsave
>   28 char: restore terminal on restart
> 
> The next patches extend the restart method to save and restore vfio-pci
> state, eliminating the requirement for a guest agent.  The vfio container,
> group, and device descriptors are preserved across the qemu re-exec.
> 
>   29 pci: export pci_update_mappings
>   30 vfio-pci: save and restore
>   31 vfio-pci: trace pci config
>   32 vfio-pci: improved tracing
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
> "cprload restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprload reboot"
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav restart		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> 
> Mark Kanda (5):
>   char: qio_channel_socket_accept reuse fd
>   char: save/restore chardev socket fds
>   ui: save/restore vnc socket fds
>   monitor: save/restore QMP negotiation status
>   vhost: reset vhost devices upon cprsave
> 
> Steve Sistare (27):
>   savevm: add vmstate handler iterators
>   savevm: VM handlers mode mask
>   savevm: QMP command for cprsave
>   savevm: HMP Command for cprsave
>   savevm: QMP command for cprload
>   savevm: HMP Command for cprload
>   savevm: QMP command for cprinfo
>   savevm: HMP command for cprinfo
>   savevm: prevent cprsave if memory is volatile
>   kvmclock: restore paused KVM clock
>   cpu: disable ticks when suspended
>   vl: pause option
>   gdbstub: gdb support for suspended state
>   savevm: VMS_RESTART and cprsave restart
>   vl: QEMU_START_FREEZE env var
>   oslib: add qemu_clr_cloexec
>   util: env var helpers
>   osdep: import MADV_DOEXEC
>   memory: ram_block_add cosmetic changes
>   vl: add helper to request re-exec
>   exec, memory: exec(3) to restart
>   char: save/restore chardev pty fds
>   char: restore terminal on restart
>   pci: export pci_update_mappings
>   vfio-pci: save and restore
>   vfio-pci: trace pci config
>   vfio-pci: improved tracing
> 
>  MAINTAINERS                    |   7 ++
>  accel/kvm/kvm-all.c            |   8 +-
>  accel/kvm/trace-events         |   3 +-
>  chardev/char-pty.c             |  38 +++++--
>  chardev/char-socket.c          |  35 ++++++
>  chardev/char-stdio.c           |   7 ++
>  chardev/char.c                 |  16 +++
>  exec.c                         |  88 +++++++++++++--
>  gdbstub.c                      |  11 +-
>  hmp-commands.hx                |  46 ++++++++
>  hw/i386/kvm/clock.c            |   6 +-
>  hw/pci/msix.c                  |   1 +
>  hw/pci/pci.c                   |  17 +--
>  hw/pci/trace-events            |   5 +-
>  hw/vfio/common.c               | 115 ++++++++++++++++----
>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>  hw/vfio/platform.c             |   2 +-
>  hw/vfio/trace-events           |  11 +-
>  hw/virtio/vhost.c              |  12 +++
>  include/chardev/char.h         |   8 ++
>  include/exec/memory.h          |   4 +
>  include/hw/pci/pci.h           |   2 +
>  include/hw/vfio/vfio-common.h  |   4 +-
>  include/io/channel-socket.h    |   3 +-
>  include/migration/register.h   |   3 +
>  include/migration/vmstate.h    |  11 ++
>  include/monitor/hmp.h          |   3 +
>  include/qemu/cutils.h          |   1 +
>  include/qemu/env.h             |  31 ++++++
>  include/qemu/osdep.h           |   8 ++
>  include/sysemu/sysemu.h        |  10 ++
>  io/channel-socket.c            |  12 ++-
>  io/net-listener.c              |   4 +-
>  migration/block.c              |   1 +
>  migration/migration.c          |   4 +-
>  migration/ram.c                |   1 +
>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>  migration/savevm.h             |   4 +-
>  monitor/hmp-cmds.c             |  28 +++++
>  monitor/qmp-cmds.c             |  16 +++
>  monitor/qmp.c                  |  42 ++++++++
>  qapi/migration.json            |  35 ++++++
>  qapi/pragma.json               |   1 +
>  qemu-options.hx                |   9 ++
>  scsi/qemu-pr-helper.c          |   2 +-
>  softmmu/vl.c                   |  65 ++++++++++-
>  tests/qtest/tpm-emu.c          |   2 +-
>  tests/test-char.c              |   2 +-
>  tests/test-io-channel-socket.c |   4 +-
>  trace-events                   |   2 +
>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>  util/Makefile.objs             |   2 +-
>  util/env.c                     | 132 +++++++++++++++++++++++
>  util/oslib-posix.c             |   9 ++
>  util/oslib-win32.c             |   4 +
>  55 files changed, 1331 insertions(+), 135 deletions(-)
>  create mode 100644 include/qemu/env.h
>  create mode 100644 util/env.c
> 
> -- 
> 1.8.3.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Steven Sistare July 30, 2020, 6:48 p.m. UTC | #4
On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>> Improve and extend the qemu functions that save and restore VM state so a
>> guest may be suspended and resumed with minimal pause time.  qemu may be
>> updated to a new version in between.
>>
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
>>
>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
>>
>>    1 savevm: add vmstate handler iterators
>>    2 savevm: VM handlers mode mask
>>    3 savevm: QMP command for cprsave
>>    4 savevm: HMP Command for cprsave
>>    5 savevm: QMP command for cprload
>>    6 savevm: HMP Command for cprload
>>    7 savevm: QMP command for cprinfo
>>    8 savevm: HMP command for cprinfo
>>    9 savevm: prevent cprsave if memory is volatile
>>   10 kvmclock: restore paused KVM clock
>>   11 cpu: disable ticks when suspended
>>   12 vl: pause option
>>   13 gdbstub: gdb support for suspended state
>>
>> The next patches add a restart method that eliminates the persistent memory
>> constraint, and allows qemu to be updated across the restart, but does not
>> allow host reboot.  Anonymous memory segments used by the guest are
>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>
>>   14 savevm: VMS_RESTART and cprsave restart
>>   15 vl: QEMU_START_FREEZE env var
>>   16 oslib: add qemu_clr_cloexec
>>   17 util: env var helpers
>>   18 osdep: import MADV_DOEXEC
>>   19 memory: ram_block_add cosmetic changes
>>   20 vl: add helper to request re-exec
>>   21 exec, memory: exec(3) to restart
>>   22 char: qio_channel_socket_accept reuse fd
>>   23 char: save/restore chardev socket fds
>>   24 ui: save/restore vnc socket fds
>>   25 char: save/restore chardev pty fds
> 
> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> with the state associated with them, most especially the TLS encryption
> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> GNUTLS maintains, and the patches don't show any sign of dealing with
> this. IOW it looks like while the FD will be preserved, any TLS session
> running on it will fail.

I had not considered TLS.  If a non-qemu library maintains connection state, then
we won't be able to support it for live update until the library provides interfaces
to serialize the state.

For qemu objects, so far vmstate has been adequate to represent the devices with
descriptors that we preserve.

> I'm going to presume that you're probably just considering the TLS features
> out of scope for your patch series.  It would be useful if you have any
> info about this and other things you've considered out of scope for this
> patch series.

The descriptors covered in these patches are needed for our use case.  I realize
there are others that could perhaps be preserved, but we have not tried them.
Those descriptors are closed on exec as usual, and are reopened after exec. I
expect that we or others will support more over time.

> I'm not seeing anything in the block layer about preserving open FDs, so
> I presume you're just letting the block layer close and then re-open any
> FDs it has ?  

Correct.

> This would have the side effect that any locks held on the
> FDs are lost, so there's a potential race condition where another process
> could acquire the lock and prevent the re-exec completing. That said this
> is unavoidable, because Linux kernel is completely broken wrt keeping
> fnctl() locks held across a re-exec, always throwing away the locks if
> more than 1 thread is running [1].

Ouch.

- Steve

> 
> Regards,
> Daniel
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621
>
Steven Sistare July 30, 2020, 7:09 p.m. UTC | #5
On 7/30/2020 1:15 PM, Paolo Bonzini wrote:
> On 30/07/20 17:14, Steve Sistare wrote:
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
> 
> Stupid question, what does cpr stand for?  If it is checkpoint/restore,

Checkpoint/restart.  An acronym from my HPC days.  I will spell it out.

> please spell it out.  Also, how does the functionality compare to
> xen-save-devices-state and xen-load-devices-state?

qmp_xen_save_devices_state serializes device state to a file which is loaded 
on the target for a live migration.  It performs some of the same actions
as cprsave/cprload but does not support live update-in-place.

>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
> 
> This probably should be allowed even for regular migration.  Can you
> generalize the code as a separate series?

Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
thought.

- Steve

> 
> Thanks,
> 
> Paolo
>
Steven Sistare July 30, 2020, 7:31 p.m. UTC | #6
On 7/30/2020 1:49 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Improve and extend the qemu functions that save and restore VM state so a
>> guest may be suspended and resumed with minimal pause time.  qemu may be
>> updated to a new version in between.
> 
> Nice.
> 
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
> 
> can I ask why you don't just add a migration flag to skip the devices
> you don't want, and then do a migrate to a file?
> (i.e. migrate "exec:cat > afile")
> We already have the 'x-ignore-shared' capability that's used for doing
> RAM snapshots of VMs; primarily I think for being able to start a VM
> from a RAM snapshot as a fast VM start trick.
> (There's also a xen_save_devices that does something similar).
> If you backed the RAM as you say, enabled x-ignore-shared and then did:
> 
>    migrate "exec:cat > afile"
> 
> and restarted the destination with:
> 
>     migrate_incoming "exec:cat afile"
> 
> what is different (except the later stuff about the vfio magic and
> chardevs).
> 
> Dave

Yes, I did consider whether to extend the migration syntax and implemention in
save_vmstate and load_vmstate, versus creating something new.  Those functions 
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr 
use case, and the cpr functions handle state that is n/a for the migration case. 
I judged that a single function handling both would be less readable and 
maintainable.  At their core all these routines call qemu_loadvm_state() and 
qemu_savevm_state().
 The surrounding code is mostly different.


Take a look at 
  savevm.c:save_vmstate()   vs   save_cpr_snapshot() attached
and
  savevm.c:load_vmstate()   vs   load_cpr_snapshot() attached

I attached the complete versions of the cpr functions because they are built up
over multiple patches in this series, thus hard to visualize in patch form.

- Steve

> 
>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
>>
>>    1 savevm: add vmstate handler iterators
>>    2 savevm: VM handlers mode mask
>>    3 savevm: QMP command for cprsave
>>    4 savevm: HMP Command for cprsave
>>    5 savevm: QMP command for cprload
>>    6 savevm: HMP Command for cprload
>>    7 savevm: QMP command for cprinfo
>>    8 savevm: HMP command for cprinfo
>>    9 savevm: prevent cprsave if memory is volatile
>>   10 kvmclock: restore paused KVM clock
>>   11 cpu: disable ticks when suspended
>>   12 vl: pause option
>>   13 gdbstub: gdb support for suspended state
>>
>> The next patches add a restart method that eliminates the persistent memory
>> constraint, and allows qemu to be updated across the restart, but does not
>> allow host reboot.  Anonymous memory segments used by the guest are
>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>
>>   14 savevm: VMS_RESTART and cprsave restart
>>   15 vl: QEMU_START_FREEZE env var
>>   16 oslib: add qemu_clr_cloexec
>>   17 util: env var helpers
>>   18 osdep: import MADV_DOEXEC
>>   19 memory: ram_block_add cosmetic changes
>>   20 vl: add helper to request re-exec
>>   21 exec, memory: exec(3) to restart
>>   22 char: qio_channel_socket_accept reuse fd
>>   23 char: save/restore chardev socket fds
>>   24 ui: save/restore vnc socket fds
>>   25 char: save/restore chardev pty fds
>>   26 monitor: save/restore QMP negotiation status
>>   27 vhost: reset vhost devices upon cprsave
>>   28 char: restore terminal on restart
>>
>> The next patches extend the restart method to save and restore vfio-pci
>> state, eliminating the requirement for a guest agent.  The vfio container,
>> group, and device descriptors are preserved across the qemu re-exec.
>>
>>   29 pci: export pci_update_mappings
>>   30 vfio-pci: save and restore
>>   31 vfio-pci: trace pci config
>>   32 vfio-pci: improved tracing
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
>> "cprload restart".  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1				| window 2
>> 					|
>> # qemu-system-x86_64 ... 		|
>> QEMU 4.2.0 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: running			|
>> 					| # yum update qemu
>> (qemu) cprsave /tmp/qemu.sav restart	|
>> QEMU 4.2.1 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: paused (prelaunch)		|
>> (qemu) cprload /tmp/qemu.sav		|
>> (qemu) info status			|
>> VM status: running			|
>>
>>
>> Here is an example of updating the host kernel using "cprload reboot"
>>
>> window 1					| window 2
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: running				|
>> 						| # yum update kernel-uek
>> (qemu) cprsave /tmp/qemu.sav restart		|
>> 						|
>> # systemctl kexec				|
>> kexec_core: Starting new kernel			|
>> ...						|
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: paused (prelaunch)			|
>> (qemu) cprload /tmp/qemu.sav			|
>> (qemu) info status				|
>> VM status: running				|
>>
>>
>> Mark Kanda (5):
>>   char: qio_channel_socket_accept reuse fd
>>   char: save/restore chardev socket fds
>>   ui: save/restore vnc socket fds
>>   monitor: save/restore QMP negotiation status
>>   vhost: reset vhost devices upon cprsave
>>
>> Steve Sistare (27):
>>   savevm: add vmstate handler iterators
>>   savevm: VM handlers mode mask
>>   savevm: QMP command for cprsave
>>   savevm: HMP Command for cprsave
>>   savevm: QMP command for cprload
>>   savevm: HMP Command for cprload
>>   savevm: QMP command for cprinfo
>>   savevm: HMP command for cprinfo
>>   savevm: prevent cprsave if memory is volatile
>>   kvmclock: restore paused KVM clock
>>   cpu: disable ticks when suspended
>>   vl: pause option
>>   gdbstub: gdb support for suspended state
>>   savevm: VMS_RESTART and cprsave restart
>>   vl: QEMU_START_FREEZE env var
>>   oslib: add qemu_clr_cloexec
>>   util: env var helpers
>>   osdep: import MADV_DOEXEC
>>   memory: ram_block_add cosmetic changes
>>   vl: add helper to request re-exec
>>   exec, memory: exec(3) to restart
>>   char: save/restore chardev pty fds
>>   char: restore terminal on restart
>>   pci: export pci_update_mappings
>>   vfio-pci: save and restore
>>   vfio-pci: trace pci config
>>   vfio-pci: improved tracing
>>
>>  MAINTAINERS                    |   7 ++
>>  accel/kvm/kvm-all.c            |   8 +-
>>  accel/kvm/trace-events         |   3 +-
>>  chardev/char-pty.c             |  38 +++++--
>>  chardev/char-socket.c          |  35 ++++++
>>  chardev/char-stdio.c           |   7 ++
>>  chardev/char.c                 |  16 +++
>>  exec.c                         |  88 +++++++++++++--
>>  gdbstub.c                      |  11 +-
>>  hmp-commands.hx                |  46 ++++++++
>>  hw/i386/kvm/clock.c            |   6 +-
>>  hw/pci/msix.c                  |   1 +
>>  hw/pci/pci.c                   |  17 +--
>>  hw/pci/trace-events            |   5 +-
>>  hw/vfio/common.c               | 115 ++++++++++++++++----
>>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>>  hw/vfio/platform.c             |   2 +-
>>  hw/vfio/trace-events           |  11 +-
>>  hw/virtio/vhost.c              |  12 +++
>>  include/chardev/char.h         |   8 ++
>>  include/exec/memory.h          |   4 +
>>  include/hw/pci/pci.h           |   2 +
>>  include/hw/vfio/vfio-common.h  |   4 +-
>>  include/io/channel-socket.h    |   3 +-
>>  include/migration/register.h   |   3 +
>>  include/migration/vmstate.h    |  11 ++
>>  include/monitor/hmp.h          |   3 +
>>  include/qemu/cutils.h          |   1 +
>>  include/qemu/env.h             |  31 ++++++
>>  include/qemu/osdep.h           |   8 ++
>>  include/sysemu/sysemu.h        |  10 ++
>>  io/channel-socket.c            |  12 ++-
>>  io/net-listener.c              |   4 +-
>>  migration/block.c              |   1 +
>>  migration/migration.c          |   4 +-
>>  migration/ram.c                |   1 +
>>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>>  migration/savevm.h             |   4 +-
>>  monitor/hmp-cmds.c             |  28 +++++
>>  monitor/qmp-cmds.c             |  16 +++
>>  monitor/qmp.c                  |  42 ++++++++
>>  qapi/migration.json            |  35 ++++++
>>  qapi/pragma.json               |   1 +
>>  qemu-options.hx                |   9 ++
>>  scsi/qemu-pr-helper.c          |   2 +-
>>  softmmu/vl.c                   |  65 ++++++++++-
>>  tests/qtest/tpm-emu.c          |   2 +-
>>  tests/test-char.c              |   2 +-
>>  tests/test-io-channel-socket.c |   4 +-
>>  trace-events                   |   2 +
>>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>>  util/Makefile.objs             |   2 +-
>>  util/env.c                     | 132 +++++++++++++++++++++++
>>  util/oslib-posix.c             |   9 ++
>>  util/oslib-win32.c             |   4 +
>>  55 files changed, 1331 insertions(+), 135 deletions(-)
>>  create mode 100644 include/qemu/env.h
>>  create mode 100644 util/env.c
>>
>> -- 
>> 1.8.3.1
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
{
    int ret = 0;
    QEMUFile *f;
    VMStateMode op;

    if (!strcmp(mode, "reboot")) {
        op = VMS_REBOOT;
    } else if (!strcmp(mode, "restart")) {
        op = VMS_RESTART;
    } else {
        error_setg(errp, "cprsave: bad mode %s", mode);
        return;
    }

    if (op == VMS_REBOOT && qemu_ram_volatile(errp)) {
        return;
    }

    if (op == VMS_RESTART && QEMU_MADV_DOEXEC == QEMU_MADV_INVALID) {
        error_setg(errp, "kernel does not support MADV_DOEXEC.");
        return;
    }

    if (op == VMS_RESTART && xen_enabled()) {
        error_setg(errp, "xen does not support cprsave restart");
        return;
    }

    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
    if (!f) {
        return;
    }

    ret = global_state_store();
    if (ret) {
        error_setg(errp, "Error saving global state");
        qemu_fclose(f);
        return;
    }

    /* Update timers_state before saving.  Suspend did not so do. */
    if (runstate_check(RUN_STATE_SUSPENDED)) {
        cpu_disable_ticks();
    }

    vm_stop(RUN_STATE_SAVE_VM);

    ret = qemu_savevm_state(f, op, errp);
    if ((ret < 0) && !*errp) {
        error_setg(errp, "qemu_savevm_state failed");
    }
    qemu_fclose(f);

    if (op == VMS_REBOOT) {
        no_shutdown = 0;
        qemu_system_shutdown_request();
    } else if (op == VMS_RESTART) {
        if (qemu_preserve_ram(errp)) {
            return;
        }
        save_chardev_fds();
        save_vnc_fds();
        save_named_fd("mntfd");          /* was received from qemu-cpr */
        save_named_fd("ctlfd");          /* was received from qemu-cpr */
        walkenv(FD_PREFIX, preserve_fd, 0);
        reset_vhost_devices();
        save_qmp_negotiation_status();
        qemu_term_exit();
        qemu_system_exec_request();
        putenv((char *)"QEMU_START_FREEZE=");
    }
}
void load_cpr_snapshot(const char *file, Error **errp)
{
    QEMUFile *f;
    int ret;
    RunState state;

    if (runstate_is_running()) {
        error_setg(errp, "cprload called for a running VM");
        return;
    }

    f = qf_file_open(file, O_RDONLY, 0, errp);
    if (!f) {
        return;
    }

    ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART);
    qemu_fclose(f);
    if (ret < 0) {
        error_setg(errp, "Error %d while loading VM state", ret);
        return;
    }

    state = global_state_get_runstate();
    if (state == RUN_STATE_RUNNING) {
        vm_start();
    } else {
        runstate_set(state);
        if (runstate_check(RUN_STATE_SUSPENDED)) {
            start_on_wake = 1;
        }
    }

    load_vnc_fds();
}
Paolo Bonzini July 30, 2020, 9:39 p.m. UTC | #7
On 30/07/20 21:09, Steven Sistare wrote:
>> please spell it out.  Also, how does the functionality compare to
>> xen-save-devices-state and xen-load-devices-state?
>
> qmp_xen_save_devices_state serializes device state to a file which is loaded 
> on the target for a live migration.  It performs some of the same actions
> as cprsave/cprload but does not support live update-in-place.

So it is a subset, can code be reused across both?  Also, live migration
across versions is supported, so can you describe the special
update-in-place support more precisely?  I am confused about the use
cases, which require (or try) to keep file descriptors across re-exec,
which are for kexec, and so on.

>>> cprsave and cprload support guests with vfio devices if the caller first
>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>> The guest drivers suspend methods flush outstanding requests and re-
>>> initialize the devices, and thus there is no device state to save and
>>> restore.
>> This probably should be allowed even for regular migration.  Can you
>> generalize the code as a separate series?
>
> Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
> if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
> thought.

The agent already supports suspend, so that should be relatively easy.
Only the code to add/remove the VFIO migration blocker from a VM state
change notifier, or something like that, would be needed.

Paolo
Daniel P. Berrangé July 31, 2020, 8:53 a.m. UTC | #8
On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> > On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> >> Improve and extend the qemu functions that save and restore VM state so a
> >> guest may be suspended and resumed with minimal pause time.  qemu may be
> >> updated to a new version in between.
> >>
> >> The first set of patches adds the cprsave and cprload commands to save and
> >> restore VM state, and allow the host kernel to be updated and rebooted in
> >> between.  The VM must create guest RAM in a persistent shared memory file,
> >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> >>
> >> cprsave stops the VCPUs and saves VM device state in a simple file, and
> >> thus supports any type of guest image and block device.  The caller must
> >> not modify the VM's block devices between cprsave and cprload.
> >>
> >> cprsave and cprload support guests with vfio devices if the caller first
> >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> >> The guest drivers suspend methods flush outstanding requests and re-
> >> initialize the devices, and thus there is no device state to save and
> >> restore.
> >>
> >>    1 savevm: add vmstate handler iterators
> >>    2 savevm: VM handlers mode mask
> >>    3 savevm: QMP command for cprsave
> >>    4 savevm: HMP Command for cprsave
> >>    5 savevm: QMP command for cprload
> >>    6 savevm: HMP Command for cprload
> >>    7 savevm: QMP command for cprinfo
> >>    8 savevm: HMP command for cprinfo
> >>    9 savevm: prevent cprsave if memory is volatile
> >>   10 kvmclock: restore paused KVM clock
> >>   11 cpu: disable ticks when suspended
> >>   12 vl: pause option
> >>   13 gdbstub: gdb support for suspended state
> >>
> >> The next patches add a restart method that eliminates the persistent memory
> >> constraint, and allows qemu to be updated across the restart, but does not
> >> allow host reboot.  Anonymous memory segments used by the guest are
> >> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> >> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> >> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>
> >>   14 savevm: VMS_RESTART and cprsave restart
> >>   15 vl: QEMU_START_FREEZE env var
> >>   16 oslib: add qemu_clr_cloexec
> >>   17 util: env var helpers
> >>   18 osdep: import MADV_DOEXEC
> >>   19 memory: ram_block_add cosmetic changes
> >>   20 vl: add helper to request re-exec
> >>   21 exec, memory: exec(3) to restart
> >>   22 char: qio_channel_socket_accept reuse fd
> >>   23 char: save/restore chardev socket fds
> >>   24 ui: save/restore vnc socket fds
> >>   25 char: save/restore chardev pty fds
> > 
> > Keeping FDs open across re-exec is a nice trick, but how are you dealing
> > with the state associated with them, most especially the TLS encryption
> > state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> > GNUTLS maintains, and the patches don't show any sign of dealing with
> > this. IOW it looks like while the FD will be preserved, any TLS session
> > running on it will fail.
> 
> I had not considered TLS.  If a non-qemu library maintains connection state, then
> we won't be able to support it for live update until the library provides interfaces
> to serialize the state.
> 
> For qemu objects, so far vmstate has been adequate to represent the devices with
> descriptors that we preserve.

My main concern about this series is that there is an implicit assumption
that QEMU is *not* configured with certain features that are not handled
If QEMU is using one of the unsupported features, I don't see anything in
the series which attempts to prevent the actions.

IOW, users can have an arbitrary QEMU config, attempt to use these new features,
the commands may well succeed, but the user is silently left with a broken QEMU.
Such silent failure modes are really undesirable as they'll lead to a never
ending stream of hard to diagnose bug reports for QEMU maintainers.

TLS is one example of this, the live upgrade  will "succeed", but the TLS
connections will be totally non-functional.

Regards,
Daniel
Steven Sistare July 31, 2020, 3:27 p.m. UTC | #9
On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
>> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
>>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>>>> Improve and extend the qemu functions that save and restore VM state so a
>>>> guest may be suspended and resumed with minimal pause time.  qemu may be
>>>> updated to a new version in between.
>>>>
>>>> The first set of patches adds the cprsave and cprload commands to save and
>>>> restore VM state, and allow the host kernel to be updated and rebooted in
>>>> between.  The VM must create guest RAM in a persistent shared memory file,
>>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>>>
>>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>>>> thus supports any type of guest image and block device.  The caller must
>>>> not modify the VM's block devices between cprsave and cprload.
>>>>
>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>> The guest drivers suspend methods flush outstanding requests and re-
>>>> initialize the devices, and thus there is no device state to save and
>>>> restore.
>>>>
>>>>    1 savevm: add vmstate handler iterators
>>>>    2 savevm: VM handlers mode mask
>>>>    3 savevm: QMP command for cprsave
>>>>    4 savevm: HMP Command for cprsave
>>>>    5 savevm: QMP command for cprload
>>>>    6 savevm: HMP Command for cprload
>>>>    7 savevm: QMP command for cprinfo
>>>>    8 savevm: HMP command for cprinfo
>>>>    9 savevm: prevent cprsave if memory is volatile
>>>>   10 kvmclock: restore paused KVM clock
>>>>   11 cpu: disable ticks when suspended
>>>>   12 vl: pause option
>>>>   13 gdbstub: gdb support for suspended state
>>>>
>>>> The next patches add a restart method that eliminates the persistent memory
>>>> constraint, and allows qemu to be updated across the restart, but does not
>>>> allow host reboot.  Anonymous memory segments used by the guest are
>>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>
>>>>   14 savevm: VMS_RESTART and cprsave restart
>>>>   15 vl: QEMU_START_FREEZE env var
>>>>   16 oslib: add qemu_clr_cloexec
>>>>   17 util: env var helpers
>>>>   18 osdep: import MADV_DOEXEC
>>>>   19 memory: ram_block_add cosmetic changes
>>>>   20 vl: add helper to request re-exec
>>>>   21 exec, memory: exec(3) to restart
>>>>   22 char: qio_channel_socket_accept reuse fd
>>>>   23 char: save/restore chardev socket fds
>>>>   24 ui: save/restore vnc socket fds
>>>>   25 char: save/restore chardev pty fds
>>>
>>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
>>> with the state associated with them, most especially the TLS encryption
>>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
>>> GNUTLS maintains, and the patches don't show any sign of dealing with
>>> this. IOW it looks like while the FD will be preserved, any TLS session
>>> running on it will fail.
>>
>> I had not considered TLS.  If a non-qemu library maintains connection state, then
>> we won't be able to support it for live update until the library provides interfaces
>> to serialize the state.
>>
>> For qemu objects, so far vmstate has been adequate to represent the devices with
>> descriptors that we preserve.
> 
> My main concern about this series is that there is an implicit assumption
> that QEMU is *not* configured with certain features that are not handled
> If QEMU is using one of the unsupported features, I don't see anything in
> the series which attempts to prevent the actions.
> 
> IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> the commands may well succeed, but the user is silently left with a broken QEMU.
> Such silent failure modes are really undesirable as they'll lead to a never
> ending stream of hard to diagnose bug reports for QEMU maintainers.
> 
> TLS is one example of this, the live upgrade  will "succeed", but the TLS
> connections will be totally non-functional.

I agree with all your points and would like to do better in this area.  Other than hunting for 
every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
process, and block the cpr if we find an unrecognized fd.

- Steve
Daniel P. Berrangé July 31, 2020, 3:52 p.m. UTC | #10
On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
> On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> >>>> Improve and extend the qemu functions that save and restore VM state so a
> >>>> guest may be suspended and resumed with minimal pause time.  qemu may be
> >>>> updated to a new version in between.
> >>>>
> >>>> The first set of patches adds the cprsave and cprload commands to save and
> >>>> restore VM state, and allow the host kernel to be updated and rebooted in
> >>>> between.  The VM must create guest RAM in a persistent shared memory file,
> >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>
> >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
> >>>> thus supports any type of guest image and block device.  The caller must
> >>>> not modify the VM's block devices between cprsave and cprload.
> >>>>
> >>>> cprsave and cprload support guests with vfio devices if the caller first
> >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> >>>> The guest drivers suspend methods flush outstanding requests and re-
> >>>> initialize the devices, and thus there is no device state to save and
> >>>> restore.
> >>>>
> >>>>    1 savevm: add vmstate handler iterators
> >>>>    2 savevm: VM handlers mode mask
> >>>>    3 savevm: QMP command for cprsave
> >>>>    4 savevm: HMP Command for cprsave
> >>>>    5 savevm: QMP command for cprload
> >>>>    6 savevm: HMP Command for cprload
> >>>>    7 savevm: QMP command for cprinfo
> >>>>    8 savevm: HMP command for cprinfo
> >>>>    9 savevm: prevent cprsave if memory is volatile
> >>>>   10 kvmclock: restore paused KVM clock
> >>>>   11 cpu: disable ticks when suspended
> >>>>   12 vl: pause option
> >>>>   13 gdbstub: gdb support for suspended state
> >>>>
> >>>> The next patches add a restart method that eliminates the persistent memory
> >>>> constraint, and allows qemu to be updated across the restart, but does not
> >>>> allow host reboot.  Anonymous memory segments used by the guest are
> >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> >>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>
> >>>>   14 savevm: VMS_RESTART and cprsave restart
> >>>>   15 vl: QEMU_START_FREEZE env var
> >>>>   16 oslib: add qemu_clr_cloexec
> >>>>   17 util: env var helpers
> >>>>   18 osdep: import MADV_DOEXEC
> >>>>   19 memory: ram_block_add cosmetic changes
> >>>>   20 vl: add helper to request re-exec
> >>>>   21 exec, memory: exec(3) to restart
> >>>>   22 char: qio_channel_socket_accept reuse fd
> >>>>   23 char: save/restore chardev socket fds
> >>>>   24 ui: save/restore vnc socket fds
> >>>>   25 char: save/restore chardev pty fds
> >>>
> >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> >>> with the state associated with them, most especially the TLS encryption
> >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> >>> GNUTLS maintains, and the patches don't show any sign of dealing with
> >>> this. IOW it looks like while the FD will be preserved, any TLS session
> >>> running on it will fail.
> >>
> >> I had not considered TLS.  If a non-qemu library maintains connection state, then
> >> we won't be able to support it for live update until the library provides interfaces
> >> to serialize the state.
> >>
> >> For qemu objects, so far vmstate has been adequate to represent the devices with
> >> descriptors that we preserve.
> > 
> > My main concern about this series is that there is an implicit assumption
> > that QEMU is *not* configured with certain features that are not handled
> > If QEMU is using one of the unsupported features, I don't see anything in
> > the series which attempts to prevent the actions.
> > 
> > IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> > the commands may well succeed, but the user is silently left with a broken QEMU.
> > Such silent failure modes are really undesirable as they'll lead to a never
> > ending stream of hard to diagnose bug reports for QEMU maintainers.
> > 
> > TLS is one example of this, the live upgrade  will "succeed", but the TLS
> > connections will be totally non-functional.
> 
> I agree with all your points and would like to do better in this area.  Other than hunting for 
> every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
> Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
> process, and block the cpr if we find an unrecognized fd.

There's no magic easy answer to this problem. Conceptually it is similar to
the problem of reliably migrating guest device state, but in this case we're
primarily concerned about the backends instead.

For migration we've got standardized interfaces that devices must implement
in order to correctly support migration serialization. There is also support
for devices to register migration "blockers" which prevent any use of the
migration feature when the device is present.

We lack this kind of concept for the backend, and that's what I think needs
to be tackled in a more thorough way.  There are quite alot of backends,
but they're grouped into a reasonable small number of sets (UIs, chardevs,
blockdevs, net devs, etc). We need some standard interface that we can
plumb into all the backends, along with providing backends the ability to
block the re-exec. If we plumb the generic infrastructure into each of the
different types of backend, and make the default behaviour be to reject
the re-exec. Then we need to carefull consider specific  backend impls
and allow the re-exec only in the very precise cases we can demonstrate
to be safe.

IOW, have a presumption that re-exec will *not* be permitted. Over time
we can make it work for an ever expanding set of use cases. 


Regards,
Daniel
Steven Sistare July 31, 2020, 5:20 p.m. UTC | #11
On 7/31/2020 11:52 AM, Daniel P. Berrangé wrote:
> On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
>> On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
>>> On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
>>>> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
>>>>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>>>>>> Improve and extend the qemu functions that save and restore VM state so a
>>>>>> guest may be suspended and resumed with minimal pause time.  qemu may be
>>>>>> updated to a new version in between.
>>>>>>
>>>>>> The first set of patches adds the cprsave and cprload commands to save and
>>>>>> restore VM state, and allow the host kernel to be updated and rebooted in
>>>>>> between.  The VM must create guest RAM in a persistent shared memory file,
>>>>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>>>>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>>>>>> thus supports any type of guest image and block device.  The caller must
>>>>>> not modify the VM's block devices between cprsave and cprload.
>>>>>>
>>>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>>>> The guest drivers suspend methods flush outstanding requests and re-
>>>>>> initialize the devices, and thus there is no device state to save and
>>>>>> restore.
>>>>>>
>>>>>>    1 savevm: add vmstate handler iterators
>>>>>>    2 savevm: VM handlers mode mask
>>>>>>    3 savevm: QMP command for cprsave
>>>>>>    4 savevm: HMP Command for cprsave
>>>>>>    5 savevm: QMP command for cprload
>>>>>>    6 savevm: HMP Command for cprload
>>>>>>    7 savevm: QMP command for cprinfo
>>>>>>    8 savevm: HMP command for cprinfo
>>>>>>    9 savevm: prevent cprsave if memory is volatile
>>>>>>   10 kvmclock: restore paused KVM clock
>>>>>>   11 cpu: disable ticks when suspended
>>>>>>   12 vl: pause option
>>>>>>   13 gdbstub: gdb support for suspended state
>>>>>>
>>>>>> The next patches add a restart method that eliminates the persistent memory
>>>>>> constraint, and allows qemu to be updated across the restart, but does not
>>>>>> allow host reboot.  Anonymous memory segments used by the guest are
>>>>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>>>>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>>   14 savevm: VMS_RESTART and cprsave restart
>>>>>>   15 vl: QEMU_START_FREEZE env var
>>>>>>   16 oslib: add qemu_clr_cloexec
>>>>>>   17 util: env var helpers
>>>>>>   18 osdep: import MADV_DOEXEC
>>>>>>   19 memory: ram_block_add cosmetic changes
>>>>>>   20 vl: add helper to request re-exec
>>>>>>   21 exec, memory: exec(3) to restart
>>>>>>   22 char: qio_channel_socket_accept reuse fd
>>>>>>   23 char: save/restore chardev socket fds
>>>>>>   24 ui: save/restore vnc socket fds
>>>>>>   25 char: save/restore chardev pty fds
>>>>>
>>>>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
>>>>> with the state associated with them, most especially the TLS encryption
>>>>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
>>>>> GNUTLS maintains, and the patches don't show any sign of dealing with
>>>>> this. IOW it looks like while the FD will be preserved, any TLS session
>>>>> running on it will fail.
>>>>
>>>> I had not considered TLS.  If a non-qemu library maintains connection state, then
>>>> we won't be able to support it for live update until the library provides interfaces
>>>> to serialize the state.
>>>>
>>>> For qemu objects, so far vmstate has been adequate to represent the devices with
>>>> descriptors that we preserve.
>>>
>>> My main concern about this series is that there is an implicit assumption
>>> that QEMU is *not* configured with certain features that are not handled
>>> If QEMU is using one of the unsupported features, I don't see anything in
>>> the series which attempts to prevent the actions.
>>>
>>> IOW, users can have an arbitrary QEMU config, attempt to use these new features,
>>> the commands may well succeed, but the user is silently left with a broken QEMU.
>>> Such silent failure modes are really undesirable as they'll lead to a never
>>> ending stream of hard to diagnose bug reports for QEMU maintainers.
>>>
>>> TLS is one example of this, the live upgrade  will "succeed", but the TLS
>>> connections will be totally non-functional.
>>
>> I agree with all your points and would like to do better in this area.  Other than hunting for 
>> every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
>> Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
>> process, and block the cpr if we find an unrecognized fd.
> 
> There's no magic easy answer to this problem. Conceptually it is similar to
> the problem of reliably migrating guest device state, but in this case we're
> primarily concerned about the backends instead.
> 
> For migration we've got standardized interfaces that devices must implement
> in order to correctly support migration serialization. There is also support
> for devices to register migration "blockers" which prevent any use of the
> migration feature when the device is present.
> 
> We lack this kind of concept for the backend, and that's what I think needs
> to be tackled in a more thorough way.  There are quite alot of backends,
> but they're grouped into a reasonable small number of sets (UIs, chardevs,
> blockdevs, net devs, etc). We need some standard interface that we can
> plumb into all the backends, along with providing backends the ability to
> block the re-exec. If we plumb the generic infrastructure into each of the
> different types of backend, and make the default behaviour be to reject
> the re-exec. Then we need to carefull consider specific  backend impls
> and allow the re-exec only in the very precise cases we can demonstrate
> to be safe.
> 
> IOW, have a presumption that re-exec will *not* be permitted. Over time
> we can make it work for an ever expanding set of use cases. 

Actually, we could use the vmstate mode_mask field added in patch 2, and only allow the restart
mode for vmstate objects that have been vetted.  Currently an uninitialized mask (value 0)
enables the object for all modes, but we could change that.

- Steve
Steven Sistare July 31, 2020, 7:22 p.m. UTC | #12
On 7/30/2020 5:39 PM, Paolo Bonzini wrote:
> On 30/07/20 21:09, Steven Sistare wrote:
>>> please spell it out.  Also, how does the functionality compare to
>>> xen-save-devices-state and xen-load-devices-state?
>>
>> qmp_xen_save_devices_state serializes device state to a file which is loaded 
>> on the target for a live migration.  It performs some of the same actions
>> as cprsave/cprload but does not support live update-in-place.
> 
> So it is a subset, can code be reused across both?  

They use common subroutines, but their bodies check different conditions, so I
don't think merging would be an improvement.  We do provide a new helper 
qf_file_open() which could replace a handful of lines in both qmp_xen_save_devices_state 
and qmp_xen_load_devices_state.

> Also, live migration
> across versions is supported, so can you describe the special
> update-in-place support more precisely?  I am confused about the use
> cases, which require (or try) to keep file descriptors across re-exec,
> which are for kexec, and so on.

Sure. The first use case allows you to kexec reboot the host and update host
software and/or qemu.  It does not preserve descriptors, and guest ram must be
backed by persistant shared memory.  Guest pause time depends on host reboot
time, which can be seconds to 10's of seconds.

The second case allows you to update qemu in place, but not update the host.
Guest ram can be in shared or anonymous memory.  We call madvise(MADV_DOEXEC)
to tell the kernel to preserve anon memory across the exec.  Open descriptors
are preserved.  Addresses and lengths of saved memory segments are saved in
the environment, and the values of descriptors are saved.  When new qemu
restarts, it finds those values in the environment and uses them when the
various objects are created.  Memory is not realloc'd, it is already present,
and the address and lengths are saved in the ram objects.  Guest pause time
is in the 100 to 200 msec range.  It is less resource intensive than live
migration, and is appropriate if your only goal is to update qemu, as opposed
to evacuating a host.

>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>> The guest drivers suspend methods flush outstanding requests and re-
>>>> initialize the devices, and thus there is no device state to save and
>>>> restore.
>>> This probably should be allowed even for regular migration.  Can you
>>> generalize the code as a separate series?
>>
>> Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
>> if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
>> thought.
> 
> The agent already supports suspend, so that should be relatively easy.
> Only the code to add/remove the VFIO migration blocker from a VM state
> change notifier, or something like that, would be needed.

Yes, I have experimented with the guest's suspend method.

- Steve
Steven Sistare Aug. 4, 2020, 6:18 p.m. UTC | #13
Hi folks, any questions or comments on the vfio and pci changes in 
patch 30?  Or on the means of preserving anonymous memory and re-exec'ing 
in patches 14 - 21?

- Steve

On 7/30/2020 11:14 AM, Steve Sistare wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.
> 
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.
> 
> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds
>   26 monitor: save/restore QMP negotiation status
>   27 vhost: reset vhost devices upon cprsave
>   28 char: restore terminal on restart
> 
> The next patches extend the restart method to save and restore vfio-pci
> state, eliminating the requirement for a guest agent.  The vfio container,
> group, and device descriptors are preserved across the qemu re-exec.
> 
>   29 pci: export pci_update_mappings
>   30 vfio-pci: save and restore
>   31 vfio-pci: trace pci config
>   32 vfio-pci: improved tracing
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
> "cprload restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprload reboot"
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav restart		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> 
> Mark Kanda (5):
>   char: qio_channel_socket_accept reuse fd
>   char: save/restore chardev socket fds
>   ui: save/restore vnc socket fds
>   monitor: save/restore QMP negotiation status
>   vhost: reset vhost devices upon cprsave
> 
> Steve Sistare (27):
>   savevm: add vmstate handler iterators
>   savevm: VM handlers mode mask
>   savevm: QMP command for cprsave
>   savevm: HMP Command for cprsave
>   savevm: QMP command for cprload
>   savevm: HMP Command for cprload
>   savevm: QMP command for cprinfo
>   savevm: HMP command for cprinfo
>   savevm: prevent cprsave if memory is volatile
>   kvmclock: restore paused KVM clock
>   cpu: disable ticks when suspended
>   vl: pause option
>   gdbstub: gdb support for suspended state
>   savevm: VMS_RESTART and cprsave restart
>   vl: QEMU_START_FREEZE env var
>   oslib: add qemu_clr_cloexec
>   util: env var helpers
>   osdep: import MADV_DOEXEC
>   memory: ram_block_add cosmetic changes
>   vl: add helper to request re-exec
>   exec, memory: exec(3) to restart
>   char: save/restore chardev pty fds
>   char: restore terminal on restart
>   pci: export pci_update_mappings
>   vfio-pci: save and restore
>   vfio-pci: trace pci config
>   vfio-pci: improved tracing
> 
>  MAINTAINERS                    |   7 ++
>  accel/kvm/kvm-all.c            |   8 +-
>  accel/kvm/trace-events         |   3 +-
>  chardev/char-pty.c             |  38 +++++--
>  chardev/char-socket.c          |  35 ++++++
>  chardev/char-stdio.c           |   7 ++
>  chardev/char.c                 |  16 +++
>  exec.c                         |  88 +++++++++++++--
>  gdbstub.c                      |  11 +-
>  hmp-commands.hx                |  46 ++++++++
>  hw/i386/kvm/clock.c            |   6 +-
>  hw/pci/msix.c                  |   1 +
>  hw/pci/pci.c                   |  17 +--
>  hw/pci/trace-events            |   5 +-
>  hw/vfio/common.c               | 115 ++++++++++++++++----
>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>  hw/vfio/platform.c             |   2 +-
>  hw/vfio/trace-events           |  11 +-
>  hw/virtio/vhost.c              |  12 +++
>  include/chardev/char.h         |   8 ++
>  include/exec/memory.h          |   4 +
>  include/hw/pci/pci.h           |   2 +
>  include/hw/vfio/vfio-common.h  |   4 +-
>  include/io/channel-socket.h    |   3 +-
>  include/migration/register.h   |   3 +
>  include/migration/vmstate.h    |  11 ++
>  include/monitor/hmp.h          |   3 +
>  include/qemu/cutils.h          |   1 +
>  include/qemu/env.h             |  31 ++++++
>  include/qemu/osdep.h           |   8 ++
>  include/sysemu/sysemu.h        |  10 ++
>  io/channel-socket.c            |  12 ++-
>  io/net-listener.c              |   4 +-
>  migration/block.c              |   1 +
>  migration/migration.c          |   4 +-
>  migration/ram.c                |   1 +
>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>  migration/savevm.h             |   4 +-
>  monitor/hmp-cmds.c             |  28 +++++
>  monitor/qmp-cmds.c             |  16 +++
>  monitor/qmp.c                  |  42 ++++++++
>  qapi/migration.json            |  35 ++++++
>  qapi/pragma.json               |   1 +
>  qemu-options.hx                |   9 ++
>  scsi/qemu-pr-helper.c          |   2 +-
>  softmmu/vl.c                   |  65 ++++++++++-
>  tests/qtest/tpm-emu.c          |   2 +-
>  tests/test-char.c              |   2 +-
>  tests/test-io-channel-socket.c |   4 +-
>  trace-events                   |   2 +
>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>  util/Makefile.objs             |   2 +-
>  util/env.c                     | 132 +++++++++++++++++++++++
>  util/oslib-posix.c             |   9 ++
>  util/oslib-win32.c             |   4 +
>  55 files changed, 1331 insertions(+), 135 deletions(-)
>  create mode 100644 include/qemu/env.h
>  create mode 100644 util/env.c
>
Dr. David Alan Gilbert Aug. 11, 2020, 7:08 p.m. UTC | #14
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
> > On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> > > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> > >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> > >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> > >>>> Improve and extend the qemu functions that save and restore VM state so a
> > >>>> guest may be suspended and resumed with minimal pause time.  qemu may be
> > >>>> updated to a new version in between.
> > >>>>
> > >>>> The first set of patches adds the cprsave and cprload commands to save and
> > >>>> restore VM state, and allow the host kernel to be updated and rebooted in
> > >>>> between.  The VM must create guest RAM in a persistent shared memory file,
> > >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> > >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>>
> > >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
> > >>>> thus supports any type of guest image and block device.  The caller must
> > >>>> not modify the VM's block devices between cprsave and cprload.
> > >>>>
> > >>>> cprsave and cprload support guests with vfio devices if the caller first
> > >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> > >>>> The guest drivers suspend methods flush outstanding requests and re-
> > >>>> initialize the devices, and thus there is no device state to save and
> > >>>> restore.
> > >>>>
> > >>>>    1 savevm: add vmstate handler iterators
> > >>>>    2 savevm: VM handlers mode mask
> > >>>>    3 savevm: QMP command for cprsave
> > >>>>    4 savevm: HMP Command for cprsave
> > >>>>    5 savevm: QMP command for cprload
> > >>>>    6 savevm: HMP Command for cprload
> > >>>>    7 savevm: QMP command for cprinfo
> > >>>>    8 savevm: HMP command for cprinfo
> > >>>>    9 savevm: prevent cprsave if memory is volatile
> > >>>>   10 kvmclock: restore paused KVM clock
> > >>>>   11 cpu: disable ticks when suspended
> > >>>>   12 vl: pause option
> > >>>>   13 gdbstub: gdb support for suspended state
> > >>>>
> > >>>> The next patches add a restart method that eliminates the persistent memory
> > >>>> constraint, and allows qemu to be updated across the restart, but does not
> > >>>> allow host reboot.  Anonymous memory segments used by the guest are
> > >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> > >>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> > >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>>
> > >>>>   14 savevm: VMS_RESTART and cprsave restart
> > >>>>   15 vl: QEMU_START_FREEZE env var
> > >>>>   16 oslib: add qemu_clr_cloexec
> > >>>>   17 util: env var helpers
> > >>>>   18 osdep: import MADV_DOEXEC
> > >>>>   19 memory: ram_block_add cosmetic changes
> > >>>>   20 vl: add helper to request re-exec
> > >>>>   21 exec, memory: exec(3) to restart
> > >>>>   22 char: qio_channel_socket_accept reuse fd
> > >>>>   23 char: save/restore chardev socket fds
> > >>>>   24 ui: save/restore vnc socket fds
> > >>>>   25 char: save/restore chardev pty fds
> > >>>
> > >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> > >>> with the state associated with them, most especially the TLS encryption
> > >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> > >>> GNUTLS maintains, and the patches don't show any sign of dealing with
> > >>> this. IOW it looks like while the FD will be preserved, any TLS session
> > >>> running on it will fail.
> > >>
> > >> I had not considered TLS.  If a non-qemu library maintains connection state, then
> > >> we won't be able to support it for live update until the library provides interfaces
> > >> to serialize the state.
> > >>
> > >> For qemu objects, so far vmstate has been adequate to represent the devices with
> > >> descriptors that we preserve.
> > > 
> > > My main concern about this series is that there is an implicit assumption
> > > that QEMU is *not* configured with certain features that are not handled
> > > If QEMU is using one of the unsupported features, I don't see anything in
> > > the series which attempts to prevent the actions.
> > > 
> > > IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> > > the commands may well succeed, but the user is silently left with a broken QEMU.
> > > Such silent failure modes are really undesirable as they'll lead to a never
> > > ending stream of hard to diagnose bug reports for QEMU maintainers.
> > > 
> > > TLS is one example of this, the live upgrade  will "succeed", but the TLS
> > > connections will be totally non-functional.
> > 
> > I agree with all your points and would like to do better in this area.  Other than hunting for 
> > every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
> > Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
> > process, and block the cpr if we find an unrecognized fd.
> 
> There's no magic easy answer to this problem. Conceptually it is similar to
> the problem of reliably migrating guest device state, but in this case we're
> primarily concerned about the backends instead.
> 
> For migration we've got standardized interfaces that devices must implement
> in order to correctly support migration serialization. There is also support
> for devices to register migration "blockers" which prevent any use of the
> migration feature when the device is present.
> 
> We lack this kind of concept for the backend, and that's what I think needs
> to be tackled in a more thorough way.  There are quite alot of backends,
> but they're grouped into a reasonable small number of sets (UIs, chardevs,
> blockdevs, net devs, etc). We need some standard interface that we can
> plumb into all the backends, along with providing backends the ability to
> block the re-exec. If we plumb the generic infrastructure into each of the
> different types of backend, and make the default behaviour be to reject
> the re-exec. Then we need to carefull consider specific  backend impls
> and allow the re-exec only in the very precise cases we can demonstrate
> to be safe.
> 
> IOW, have a presumption that re-exec will *not* be permitted. Over time
> we can make it work for an ever expanding set of use cases. 

Yes, it does feel like an interface that needs to be implemented on the
chardev; then you don't need to worry about handling them all
individually.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|