Message ID | 1596122076-341293-1-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | Live Update | expand |
On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: > Improve and extend the qemu functions that save and restore VM state so a > guest may be suspended and resumed with minimal pause time. qemu may be > updated to a new version in between. > > The first set of patches adds the cprsave and cprload commands to save and > restore VM state, and allow the host kernel to be updated and rebooted in > between. The VM must create guest RAM in a persistent shared memory file, > such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > cprsave stops the VCPUs and saves VM device state in a simple file, and > thus supports any type of guest image and block device. The caller must > not modify the VM's block devices between cprsave and cprload. > > cprsave and cprload support guests with vfio devices if the caller first > suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > The guest drivers suspend methods flush outstanding requests and re- > initialize the devices, and thus there is no device state to save and > restore. > > 1 savevm: add vmstate handler iterators > 2 savevm: VM handlers mode mask > 3 savevm: QMP command for cprsave > 4 savevm: HMP Command for cprsave > 5 savevm: QMP command for cprload > 6 savevm: HMP Command for cprload > 7 savevm: QMP command for cprinfo > 8 savevm: HMP command for cprinfo > 9 savevm: prevent cprsave if memory is volatile > 10 kvmclock: restore paused KVM clock > 11 cpu: disable ticks when suspended > 12 vl: pause option > 13 gdbstub: gdb support for suspended state > > The next patches add a restart method that eliminates the persistent memory > constraint, and allows qemu to be updated across the restart, but does not > allow host reboot. Anonymous memory segments used by the guest are > preserved across a re-exec of qemu, mapped at the same VA, via a proposed > madvise(MADV_DOEXEC) option in the Linux kernel. See > https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > > 14 savevm: VMS_RESTART and cprsave restart > 15 vl: QEMU_START_FREEZE env var > 16 oslib: add qemu_clr_cloexec > 17 util: env var helpers > 18 osdep: import MADV_DOEXEC > 19 memory: ram_block_add cosmetic changes > 20 vl: add helper to request re-exec > 21 exec, memory: exec(3) to restart > 22 char: qio_channel_socket_accept reuse fd > 23 char: save/restore chardev socket fds > 24 ui: save/restore vnc socket fds > 25 char: save/restore chardev pty fds Keeping FDs open across re-exec is a nice trick, but how are you dealing with the state associated with them, most especially the TLS encryption state ? AFAIK, there's no way to serialize/deserialize the TLS state that GNUTLS maintains, and the patches don't show any sign of dealing with this. IOW it looks like while the FD will be preserved, any TLS session running on it will fail. I'm going to presume that you're probably just considering the TLS features out of scope for your patch series. It would be useful if you have any info about this and other things you've considered out of scope for this patch series. I'm not seeing anything in the block layer about preserving open FDs, so I presume you're just letting the block layer close and then re-open any FDs it has ? This would have the side effect that any locks held on the FDs are lost, so there's a potential race condition where another process could acquire the lock and prevent the re-exec completing. That said this is unavoidable, because Linux kernel is completely broken wrt keeping fnctl() locks held across a re-exec, always throwing away the locks if more than 1 thread is running [1]. Regards, Daniel [1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621
On 30/07/20 17:14, Steve Sistare wrote: > The first set of patches adds the cprsave and cprload commands to save and > restore VM state, and allow the host kernel to be updated and rebooted in > between. The VM must create guest RAM in a persistent shared memory file, > such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > cprsave stops the VCPUs and saves VM device state in a simple file, and > thus supports any type of guest image and block device. The caller must > not modify the VM's block devices between cprsave and cprload. Stupid question, what does cpr stand for? If it is checkpoint/restore, please spell it out. Also, how does the functionality compare to xen-save-devices-state and xen-load-devices-state? > cprsave and cprload support guests with vfio devices if the caller first > suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > The guest drivers suspend methods flush outstanding requests and re- > initialize the devices, and thus there is no device state to save and > restore. This probably should be allowed even for regular migration. Can you generalize the code as a separate series? Thanks, Paolo
* Steve Sistare (steven.sistare@oracle.com) wrote: > Improve and extend the qemu functions that save and restore VM state so a > guest may be suspended and resumed with minimal pause time. qemu may be > updated to a new version in between. Nice. > The first set of patches adds the cprsave and cprload commands to save and > restore VM state, and allow the host kernel to be updated and rebooted in > between. The VM must create guest RAM in a persistent shared memory file, > such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > cprsave stops the VCPUs and saves VM device state in a simple file, and > thus supports any type of guest image and block device. The caller must > not modify the VM's block devices between cprsave and cprload. can I ask why you don't just add a migration flag to skip the devices you don't want, and then do a migrate to a file? (i.e. migrate "exec:cat > afile") We already have the 'x-ignore-shared' capability that's used for doing RAM snapshots of VMs; primarily I think for being able to start a VM from a RAM snapshot as a fast VM start trick. (There's also a xen_save_devices that does something similar). If you backed the RAM as you say, enabled x-ignore-shared and then did: migrate "exec:cat > afile" and restarted the destination with: migrate_incoming "exec:cat afile" what is different (except the later stuff about the vfio magic and chardevs). Dave > cprsave and cprload support guests with vfio devices if the caller first > suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > The guest drivers suspend methods flush outstanding requests and re- > initialize the devices, and thus there is no device state to save and > restore. > > 1 savevm: add vmstate handler iterators > 2 savevm: VM handlers mode mask > 3 savevm: QMP command for cprsave > 4 savevm: HMP Command for cprsave > 5 savevm: QMP command for cprload > 6 savevm: HMP Command for cprload > 7 savevm: QMP command for cprinfo > 8 savevm: HMP command for cprinfo > 9 savevm: prevent cprsave if memory is volatile > 10 kvmclock: restore paused KVM clock > 11 cpu: disable ticks when suspended > 12 vl: pause option > 13 gdbstub: gdb support for suspended state > > The next patches add a restart method that eliminates the persistent memory > constraint, and allows qemu to be updated across the restart, but does not > allow host reboot. Anonymous memory segments used by the guest are > preserved across a re-exec of qemu, mapped at the same VA, via a proposed > madvise(MADV_DOEXEC) option in the Linux kernel. See > https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > > 14 savevm: VMS_RESTART and cprsave restart > 15 vl: QEMU_START_FREEZE env var > 16 oslib: add qemu_clr_cloexec > 17 util: env var helpers > 18 osdep: import MADV_DOEXEC > 19 memory: ram_block_add cosmetic changes > 20 vl: add helper to request re-exec > 21 exec, memory: exec(3) to restart > 22 char: qio_channel_socket_accept reuse fd > 23 char: save/restore chardev socket fds > 24 ui: save/restore vnc socket fds > 25 char: save/restore chardev pty fds > 26 monitor: save/restore QMP negotiation status > 27 vhost: reset vhost devices upon cprsave > 28 char: restore terminal on restart > > The next patches extend the restart method to save and restore vfio-pci > state, eliminating the requirement for a guest agent. The vfio container, > group, and device descriptors are preserved across the qemu re-exec. > > 29 pci: export pci_update_mappings > 30 vfio-pci: save and restore > 31 vfio-pci: trace pci config > 32 vfio-pci: improved tracing > > Here is an example of updating qemu from v4.2.0 to v4.2.1 using > "cprload restart". The software update is performed while the guest is > running to minimize downtime. > > window 1 | window 2 > | > # qemu-system-x86_64 ... | > QEMU 4.2.0 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update qemu > (qemu) cprsave /tmp/qemu.sav restart | > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > > Here is an example of updating the host kernel using "cprload reboot" > > window 1 | window 2 > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update kernel-uek > (qemu) cprsave /tmp/qemu.sav restart | > | > # systemctl kexec | > kexec_core: Starting new kernel | > ... | > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > > Mark Kanda (5): > char: qio_channel_socket_accept reuse fd > char: save/restore chardev socket fds > ui: save/restore vnc socket fds > monitor: save/restore QMP negotiation status > vhost: reset vhost devices upon cprsave > > Steve Sistare (27): > savevm: add vmstate handler iterators > savevm: VM handlers mode mask > savevm: QMP command for cprsave > savevm: HMP Command for cprsave > savevm: QMP command for cprload > savevm: HMP Command for cprload > savevm: QMP command for cprinfo > savevm: HMP command for cprinfo > savevm: prevent cprsave if memory is volatile > kvmclock: restore paused KVM clock > cpu: disable ticks when suspended > vl: pause option > gdbstub: gdb support for suspended state > savevm: VMS_RESTART and cprsave restart > vl: QEMU_START_FREEZE env var > oslib: add qemu_clr_cloexec > util: env var helpers > osdep: import MADV_DOEXEC > memory: ram_block_add cosmetic changes > vl: add helper to request re-exec > exec, memory: exec(3) to restart > char: save/restore chardev pty fds > char: restore terminal on restart > pci: export pci_update_mappings > vfio-pci: save and restore > vfio-pci: trace pci config > vfio-pci: improved tracing > > MAINTAINERS | 7 ++ > accel/kvm/kvm-all.c | 8 +- > accel/kvm/trace-events | 3 +- > chardev/char-pty.c | 38 +++++-- > chardev/char-socket.c | 35 ++++++ > chardev/char-stdio.c | 7 ++ > chardev/char.c | 16 +++ > exec.c | 88 +++++++++++++-- > gdbstub.c | 11 +- > hmp-commands.hx | 46 ++++++++ > hw/i386/kvm/clock.c | 6 +- > hw/pci/msix.c | 1 + > hw/pci/pci.c | 17 +-- > hw/pci/trace-events | 5 +- > hw/vfio/common.c | 115 ++++++++++++++++---- > hw/vfio/pci.c | 179 ++++++++++++++++++++++++++++++- > hw/vfio/platform.c | 2 +- > hw/vfio/trace-events | 11 +- > hw/virtio/vhost.c | 12 +++ > include/chardev/char.h | 8 ++ > include/exec/memory.h | 4 + > include/hw/pci/pci.h | 2 + > include/hw/vfio/vfio-common.h | 4 +- > include/io/channel-socket.h | 3 +- > include/migration/register.h | 3 + > include/migration/vmstate.h | 11 ++ > include/monitor/hmp.h | 3 + > include/qemu/cutils.h | 1 + > include/qemu/env.h | 31 ++++++ > include/qemu/osdep.h | 8 ++ > include/sysemu/sysemu.h | 10 ++ > io/channel-socket.c | 12 ++- > io/net-listener.c | 4 +- > migration/block.c | 1 + > migration/migration.c | 4 +- > migration/ram.c | 1 + > migration/savevm.c | 237 ++++++++++++++++++++++++++++++++++++----- > migration/savevm.h | 4 +- > monitor/hmp-cmds.c | 28 +++++ > monitor/qmp-cmds.c | 16 +++ > monitor/qmp.c | 42 ++++++++ > qapi/migration.json | 35 ++++++ > qapi/pragma.json | 1 + > qemu-options.hx | 9 ++ > scsi/qemu-pr-helper.c | 2 +- > softmmu/vl.c | 65 ++++++++++- > tests/qtest/tpm-emu.c | 2 +- > tests/test-char.c | 2 +- > tests/test-io-channel-socket.c | 4 +- > trace-events | 2 + > ui/vnc.c | 153 +++++++++++++++++++++----- > util/Makefile.objs | 2 +- > util/env.c | 132 +++++++++++++++++++++++ > util/oslib-posix.c | 9 ++ > util/oslib-win32.c | 4 + > 55 files changed, 1331 insertions(+), 135 deletions(-) > create mode 100644 include/qemu/env.h > create mode 100644 util/env.c > > -- > 1.8.3.1 > > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: > On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: >> Improve and extend the qemu functions that save and restore VM state so a >> guest may be suspended and resumed with minimal pause time. qemu may be >> updated to a new version in between. >> >> The first set of patches adds the cprsave and cprload commands to save and >> restore VM state, and allow the host kernel to be updated and rebooted in >> between. The VM must create guest RAM in a persistent shared memory file, >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ >> >> cprsave stops the VCPUs and saves VM device state in a simple file, and >> thus supports any type of guest image and block device. The caller must >> not modify the VM's block devices between cprsave and cprload. >> >> cprsave and cprload support guests with vfio devices if the caller first >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >> The guest drivers suspend methods flush outstanding requests and re- >> initialize the devices, and thus there is no device state to save and >> restore. >> >> 1 savevm: add vmstate handler iterators >> 2 savevm: VM handlers mode mask >> 3 savevm: QMP command for cprsave >> 4 savevm: HMP Command for cprsave >> 5 savevm: QMP command for cprload >> 6 savevm: HMP Command for cprload >> 7 savevm: QMP command for cprinfo >> 8 savevm: HMP command for cprinfo >> 9 savevm: prevent cprsave if memory is volatile >> 10 kvmclock: restore paused KVM clock >> 11 cpu: disable ticks when suspended >> 12 vl: pause option >> 13 gdbstub: gdb support for suspended state >> >> The next patches add a restart method that eliminates the persistent memory >> constraint, and allows qemu to be updated across the restart, but does not >> allow host reboot. Anonymous memory segments used by the guest are >> preserved across a re-exec of qemu, mapped at the same VA, via a proposed >> madvise(MADV_DOEXEC) option in the Linux kernel. See >> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ >> >> 14 savevm: VMS_RESTART and cprsave restart >> 15 vl: QEMU_START_FREEZE env var >> 16 oslib: add qemu_clr_cloexec >> 17 util: env var helpers >> 18 osdep: import MADV_DOEXEC >> 19 memory: ram_block_add cosmetic changes >> 20 vl: add helper to request re-exec >> 21 exec, memory: exec(3) to restart >> 22 char: qio_channel_socket_accept reuse fd >> 23 char: save/restore chardev socket fds >> 24 ui: save/restore vnc socket fds >> 25 char: save/restore chardev pty fds > > Keeping FDs open across re-exec is a nice trick, but how are you dealing > with the state associated with them, most especially the TLS encryption > state ? AFAIK, there's no way to serialize/deserialize the TLS state that > GNUTLS maintains, and the patches don't show any sign of dealing with > this. IOW it looks like while the FD will be preserved, any TLS session > running on it will fail. I had not considered TLS. If a non-qemu library maintains connection state, then we won't be able to support it for live update until the library provides interfaces to serialize the state. For qemu objects, so far vmstate has been adequate to represent the devices with descriptors that we preserve. > I'm going to presume that you're probably just considering the TLS features > out of scope for your patch series. It would be useful if you have any > info about this and other things you've considered out of scope for this > patch series. The descriptors covered in these patches are needed for our use case. I realize there are others that could perhaps be preserved, but we have not tried them. Those descriptors are closed on exec as usual, and are reopened after exec. I expect that we or others will support more over time. > I'm not seeing anything in the block layer about preserving open FDs, so > I presume you're just letting the block layer close and then re-open any > FDs it has ? Correct. > This would have the side effect that any locks held on the > FDs are lost, so there's a potential race condition where another process > could acquire the lock and prevent the re-exec completing. That said this > is unavoidable, because Linux kernel is completely broken wrt keeping > fnctl() locks held across a re-exec, always throwing away the locks if > more than 1 thread is running [1]. Ouch. - Steve > > Regards, > Daniel > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621 >
On 7/30/2020 1:15 PM, Paolo Bonzini wrote: > On 30/07/20 17:14, Steve Sistare wrote: >> The first set of patches adds the cprsave and cprload commands to save and >> restore VM state, and allow the host kernel to be updated and rebooted in >> between. The VM must create guest RAM in a persistent shared memory file, >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ >> >> cprsave stops the VCPUs and saves VM device state in a simple file, and >> thus supports any type of guest image and block device. The caller must >> not modify the VM's block devices between cprsave and cprload. > > Stupid question, what does cpr stand for? If it is checkpoint/restore, Checkpoint/restart. An acronym from my HPC days. I will spell it out. > please spell it out. Also, how does the functionality compare to > xen-save-devices-state and xen-load-devices-state? qmp_xen_save_devices_state serializes device state to a file which is loaded on the target for a live migration. It performs some of the same actions as cprsave/cprload but does not support live update-in-place. >> cprsave and cprload support guests with vfio devices if the caller first >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >> The guest drivers suspend methods flush outstanding requests and re- >> initialize the devices, and thus there is no device state to save and >> restore. > > This probably should be allowed even for regular migration. Can you > generalize the code as a separate series? Maybe. I think that would be a distinct patch that ignores the vfio migration blocker if the state is suspended. Plus a qemu agent call to do the suspend. Needs more thought. - Steve > > Thanks, > > Paolo >
On 7/30/2020 1:49 PM, Dr. David Alan Gilbert wrote: > * Steve Sistare (steven.sistare@oracle.com) wrote: >> Improve and extend the qemu functions that save and restore VM state so a >> guest may be suspended and resumed with minimal pause time. qemu may be >> updated to a new version in between. > > Nice. > >> The first set of patches adds the cprsave and cprload commands to save and >> restore VM state, and allow the host kernel to be updated and rebooted in >> between. The VM must create guest RAM in a persistent shared memory file, >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ >> >> cprsave stops the VCPUs and saves VM device state in a simple file, and >> thus supports any type of guest image and block device. The caller must >> not modify the VM's block devices between cprsave and cprload. > > can I ask why you don't just add a migration flag to skip the devices > you don't want, and then do a migrate to a file? > (i.e. migrate "exec:cat > afile") > We already have the 'x-ignore-shared' capability that's used for doing > RAM snapshots of VMs; primarily I think for being able to start a VM > from a RAM snapshot as a fast VM start trick. > (There's also a xen_save_devices that does something similar). > If you backed the RAM as you say, enabled x-ignore-shared and then did: > > migrate "exec:cat > afile" > > and restarted the destination with: > > migrate_incoming "exec:cat afile" > > what is different (except the later stuff about the vfio magic and > chardevs). > > Dave Yes, I did consider whether to extend the migration syntax and implemention in save_vmstate and load_vmstate, versus creating something new. Those functions handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr use case, and the cpr functions handle state that is n/a for the migration case. I judged that a single function handling both would be less readable and maintainable. At their core all these routines call qemu_loadvm_state() and qemu_savevm_state(). The surrounding code is mostly different. Take a look at savevm.c:save_vmstate() vs save_cpr_snapshot() attached and savevm.c:load_vmstate() vs load_cpr_snapshot() attached I attached the complete versions of the cpr functions because they are built up over multiple patches in this series, thus hard to visualize in patch form. - Steve > >> cprsave and cprload support guests with vfio devices if the caller first >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >> The guest drivers suspend methods flush outstanding requests and re- >> initialize the devices, and thus there is no device state to save and >> restore. >> >> 1 savevm: add vmstate handler iterators >> 2 savevm: VM handlers mode mask >> 3 savevm: QMP command for cprsave >> 4 savevm: HMP Command for cprsave >> 5 savevm: QMP command for cprload >> 6 savevm: HMP Command for cprload >> 7 savevm: QMP command for cprinfo >> 8 savevm: HMP command for cprinfo >> 9 savevm: prevent cprsave if memory is volatile >> 10 kvmclock: restore paused KVM clock >> 11 cpu: disable ticks when suspended >> 12 vl: pause option >> 13 gdbstub: gdb support for suspended state >> >> The next patches add a restart method that eliminates the persistent memory >> constraint, and allows qemu to be updated across the restart, but does not >> allow host reboot. Anonymous memory segments used by the guest are >> preserved across a re-exec of qemu, mapped at the same VA, via a proposed >> madvise(MADV_DOEXEC) option in the Linux kernel. See >> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ >> >> 14 savevm: VMS_RESTART and cprsave restart >> 15 vl: QEMU_START_FREEZE env var >> 16 oslib: add qemu_clr_cloexec >> 17 util: env var helpers >> 18 osdep: import MADV_DOEXEC >> 19 memory: ram_block_add cosmetic changes >> 20 vl: add helper to request re-exec >> 21 exec, memory: exec(3) to restart >> 22 char: qio_channel_socket_accept reuse fd >> 23 char: save/restore chardev socket fds >> 24 ui: save/restore vnc socket fds >> 25 char: save/restore chardev pty fds >> 26 monitor: save/restore QMP negotiation status >> 27 vhost: reset vhost devices upon cprsave >> 28 char: restore terminal on restart >> >> The next patches extend the restart method to save and restore vfio-pci >> state, eliminating the requirement for a guest agent. The vfio container, >> group, and device descriptors are preserved across the qemu re-exec. >> >> 29 pci: export pci_update_mappings >> 30 vfio-pci: save and restore >> 31 vfio-pci: trace pci config >> 32 vfio-pci: improved tracing >> >> Here is an example of updating qemu from v4.2.0 to v4.2.1 using >> "cprload restart". The software update is performed while the guest is >> running to minimize downtime. >> >> window 1 | window 2 >> | >> # qemu-system-x86_64 ... | >> QEMU 4.2.0 monitor - type 'help' ... | >> (qemu) info status | >> VM status: running | >> | # yum update qemu >> (qemu) cprsave /tmp/qemu.sav restart | >> QEMU 4.2.1 monitor - type 'help' ... | >> (qemu) info status | >> VM status: paused (prelaunch) | >> (qemu) cprload /tmp/qemu.sav | >> (qemu) info status | >> VM status: running | >> >> >> Here is an example of updating the host kernel using "cprload reboot" >> >> window 1 | window 2 >> | >> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| >> QEMU 4.2.1 monitor - type 'help' ... | >> (qemu) info status | >> VM status: running | >> | # yum update kernel-uek >> (qemu) cprsave /tmp/qemu.sav restart | >> | >> # systemctl kexec | >> kexec_core: Starting new kernel | >> ... | >> | >> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| >> QEMU 4.2.1 monitor - type 'help' ... | >> (qemu) info status | >> VM status: paused (prelaunch) | >> (qemu) cprload /tmp/qemu.sav | >> (qemu) info status | >> VM status: running | >> >> >> Mark Kanda (5): >> char: qio_channel_socket_accept reuse fd >> char: save/restore chardev socket fds >> ui: save/restore vnc socket fds >> monitor: save/restore QMP negotiation status >> vhost: reset vhost devices upon cprsave >> >> Steve Sistare (27): >> savevm: add vmstate handler iterators >> savevm: VM handlers mode mask >> savevm: QMP command for cprsave >> savevm: HMP Command for cprsave >> savevm: QMP command for cprload >> savevm: HMP Command for cprload >> savevm: QMP command for cprinfo >> savevm: HMP command for cprinfo >> savevm: prevent cprsave if memory is volatile >> kvmclock: restore paused KVM clock >> cpu: disable ticks when suspended >> vl: pause option >> gdbstub: gdb support for suspended state >> savevm: VMS_RESTART and cprsave restart >> vl: QEMU_START_FREEZE env var >> oslib: add qemu_clr_cloexec >> util: env var helpers >> osdep: import MADV_DOEXEC >> memory: ram_block_add cosmetic changes >> vl: add helper to request re-exec >> exec, memory: exec(3) to restart >> char: save/restore chardev pty fds >> char: restore terminal on restart >> pci: export pci_update_mappings >> vfio-pci: save and restore >> vfio-pci: trace pci config >> vfio-pci: improved tracing >> >> MAINTAINERS | 7 ++ >> accel/kvm/kvm-all.c | 8 +- >> accel/kvm/trace-events | 3 +- >> chardev/char-pty.c | 38 +++++-- >> chardev/char-socket.c | 35 ++++++ >> chardev/char-stdio.c | 7 ++ >> chardev/char.c | 16 +++ >> exec.c | 88 +++++++++++++-- >> gdbstub.c | 11 +- >> hmp-commands.hx | 46 ++++++++ >> hw/i386/kvm/clock.c | 6 +- >> hw/pci/msix.c | 1 + >> hw/pci/pci.c | 17 +-- >> hw/pci/trace-events | 5 +- >> hw/vfio/common.c | 115 ++++++++++++++++---- >> hw/vfio/pci.c | 179 ++++++++++++++++++++++++++++++- >> hw/vfio/platform.c | 2 +- >> hw/vfio/trace-events | 11 +- >> hw/virtio/vhost.c | 12 +++ >> include/chardev/char.h | 8 ++ >> include/exec/memory.h | 4 + >> include/hw/pci/pci.h | 2 + >> include/hw/vfio/vfio-common.h | 4 +- >> include/io/channel-socket.h | 3 +- >> include/migration/register.h | 3 + >> include/migration/vmstate.h | 11 ++ >> include/monitor/hmp.h | 3 + >> include/qemu/cutils.h | 1 + >> include/qemu/env.h | 31 ++++++ >> include/qemu/osdep.h | 8 ++ >> include/sysemu/sysemu.h | 10 ++ >> io/channel-socket.c | 12 ++- >> io/net-listener.c | 4 +- >> migration/block.c | 1 + >> migration/migration.c | 4 +- >> migration/ram.c | 1 + >> migration/savevm.c | 237 ++++++++++++++++++++++++++++++++++++----- >> migration/savevm.h | 4 +- >> monitor/hmp-cmds.c | 28 +++++ >> monitor/qmp-cmds.c | 16 +++ >> monitor/qmp.c | 42 ++++++++ >> qapi/migration.json | 35 ++++++ >> qapi/pragma.json | 1 + >> qemu-options.hx | 9 ++ >> scsi/qemu-pr-helper.c | 2 +- >> softmmu/vl.c | 65 ++++++++++- >> tests/qtest/tpm-emu.c | 2 +- >> tests/test-char.c | 2 +- >> tests/test-io-channel-socket.c | 4 +- >> trace-events | 2 + >> ui/vnc.c | 153 +++++++++++++++++++++----- >> util/Makefile.objs | 2 +- >> util/env.c | 132 +++++++++++++++++++++++ >> util/oslib-posix.c | 9 ++ >> util/oslib-win32.c | 4 + >> 55 files changed, 1331 insertions(+), 135 deletions(-) >> create mode 100644 include/qemu/env.h >> create mode 100644 util/env.c >> >> -- >> 1.8.3.1 >> >> > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > void save_cpr_snapshot(const char *file, const char *mode, Error **errp) { int ret = 0; QEMUFile *f; VMStateMode op; if (!strcmp(mode, "reboot")) { op = VMS_REBOOT; } else if (!strcmp(mode, "restart")) { op = VMS_RESTART; } else { error_setg(errp, "cprsave: bad mode %s", mode); return; } if (op == VMS_REBOOT && qemu_ram_volatile(errp)) { return; } if (op == VMS_RESTART && QEMU_MADV_DOEXEC == QEMU_MADV_INVALID) { error_setg(errp, "kernel does not support MADV_DOEXEC."); return; } if (op == VMS_RESTART && xen_enabled()) { error_setg(errp, "xen does not support cprsave restart"); return; } f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp); if (!f) { return; } ret = global_state_store(); if (ret) { error_setg(errp, "Error saving global state"); qemu_fclose(f); return; } /* Update timers_state before saving. Suspend did not so do. */ if (runstate_check(RUN_STATE_SUSPENDED)) { cpu_disable_ticks(); } vm_stop(RUN_STATE_SAVE_VM); ret = qemu_savevm_state(f, op, errp); if ((ret < 0) && !*errp) { error_setg(errp, "qemu_savevm_state failed"); } qemu_fclose(f); if (op == VMS_REBOOT) { no_shutdown = 0; qemu_system_shutdown_request(); } else if (op == VMS_RESTART) { if (qemu_preserve_ram(errp)) { return; } save_chardev_fds(); save_vnc_fds(); save_named_fd("mntfd"); /* was received from qemu-cpr */ save_named_fd("ctlfd"); /* was received from qemu-cpr */ walkenv(FD_PREFIX, preserve_fd, 0); reset_vhost_devices(); save_qmp_negotiation_status(); qemu_term_exit(); qemu_system_exec_request(); putenv((char *)"QEMU_START_FREEZE="); } } void load_cpr_snapshot(const char *file, Error **errp) { QEMUFile *f; int ret; RunState state; if (runstate_is_running()) { error_setg(errp, "cprload called for a running VM"); return; } f = qf_file_open(file, O_RDONLY, 0, errp); if (!f) { return; } ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART); qemu_fclose(f); if (ret < 0) { error_setg(errp, "Error %d while loading VM state", ret); return; } state = global_state_get_runstate(); if (state == RUN_STATE_RUNNING) { vm_start(); } else { runstate_set(state); if (runstate_check(RUN_STATE_SUSPENDED)) { start_on_wake = 1; } } load_vnc_fds(); }
On 30/07/20 21:09, Steven Sistare wrote: >> please spell it out. Also, how does the functionality compare to >> xen-save-devices-state and xen-load-devices-state? > > qmp_xen_save_devices_state serializes device state to a file which is loaded > on the target for a live migration. It performs some of the same actions > as cprsave/cprload but does not support live update-in-place. So it is a subset, can code be reused across both? Also, live migration across versions is supported, so can you describe the special update-in-place support more precisely? I am confused about the use cases, which require (or try) to keep file descriptors across re-exec, which are for kexec, and so on. >>> cprsave and cprload support guests with vfio devices if the caller first >>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >>> The guest drivers suspend methods flush outstanding requests and re- >>> initialize the devices, and thus there is no device state to save and >>> restore. >> This probably should be allowed even for regular migration. Can you >> generalize the code as a separate series? > > Maybe. I think that would be a distinct patch that ignores the vfio migration blocker > if the state is suspended. Plus a qemu agent call to do the suspend. Needs more > thought. The agent already supports suspend, so that should be relatively easy. Only the code to add/remove the VFIO migration blocker from a VM state change notifier, or something like that, would be needed. Paolo
On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote: > On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: > > On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: > >> Improve and extend the qemu functions that save and restore VM state so a > >> guest may be suspended and resumed with minimal pause time. qemu may be > >> updated to a new version in between. > >> > >> The first set of patches adds the cprsave and cprload commands to save and > >> restore VM state, and allow the host kernel to be updated and rebooted in > >> between. The VM must create guest RAM in a persistent shared memory file, > >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > >> > >> cprsave stops the VCPUs and saves VM device state in a simple file, and > >> thus supports any type of guest image and block device. The caller must > >> not modify the VM's block devices between cprsave and cprload. > >> > >> cprsave and cprload support guests with vfio devices if the caller first > >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > >> The guest drivers suspend methods flush outstanding requests and re- > >> initialize the devices, and thus there is no device state to save and > >> restore. > >> > >> 1 savevm: add vmstate handler iterators > >> 2 savevm: VM handlers mode mask > >> 3 savevm: QMP command for cprsave > >> 4 savevm: HMP Command for cprsave > >> 5 savevm: QMP command for cprload > >> 6 savevm: HMP Command for cprload > >> 7 savevm: QMP command for cprinfo > >> 8 savevm: HMP command for cprinfo > >> 9 savevm: prevent cprsave if memory is volatile > >> 10 kvmclock: restore paused KVM clock > >> 11 cpu: disable ticks when suspended > >> 12 vl: pause option > >> 13 gdbstub: gdb support for suspended state > >> > >> The next patches add a restart method that eliminates the persistent memory > >> constraint, and allows qemu to be updated across the restart, but does not > >> allow host reboot. Anonymous memory segments used by the guest are > >> preserved across a re-exec of qemu, mapped at the same VA, via a proposed > >> madvise(MADV_DOEXEC) option in the Linux kernel. See > >> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > >> > >> 14 savevm: VMS_RESTART and cprsave restart > >> 15 vl: QEMU_START_FREEZE env var > >> 16 oslib: add qemu_clr_cloexec > >> 17 util: env var helpers > >> 18 osdep: import MADV_DOEXEC > >> 19 memory: ram_block_add cosmetic changes > >> 20 vl: add helper to request re-exec > >> 21 exec, memory: exec(3) to restart > >> 22 char: qio_channel_socket_accept reuse fd > >> 23 char: save/restore chardev socket fds > >> 24 ui: save/restore vnc socket fds > >> 25 char: save/restore chardev pty fds > > > > Keeping FDs open across re-exec is a nice trick, but how are you dealing > > with the state associated with them, most especially the TLS encryption > > state ? AFAIK, there's no way to serialize/deserialize the TLS state that > > GNUTLS maintains, and the patches don't show any sign of dealing with > > this. IOW it looks like while the FD will be preserved, any TLS session > > running on it will fail. > > I had not considered TLS. If a non-qemu library maintains connection state, then > we won't be able to support it for live update until the library provides interfaces > to serialize the state. > > For qemu objects, so far vmstate has been adequate to represent the devices with > descriptors that we preserve. My main concern about this series is that there is an implicit assumption that QEMU is *not* configured with certain features that are not handled If QEMU is using one of the unsupported features, I don't see anything in the series which attempts to prevent the actions. IOW, users can have an arbitrary QEMU config, attempt to use these new features, the commands may well succeed, but the user is silently left with a broken QEMU. Such silent failure modes are really undesirable as they'll lead to a never ending stream of hard to diagnose bug reports for QEMU maintainers. TLS is one example of this, the live upgrade will "succeed", but the TLS connections will be totally non-functional. Regards, Daniel
On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote: > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote: >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: >>>> Improve and extend the qemu functions that save and restore VM state so a >>>> guest may be suspended and resumed with minimal pause time. qemu may be >>>> updated to a new version in between. >>>> >>>> The first set of patches adds the cprsave and cprload commands to save and >>>> restore VM state, and allow the host kernel to be updated and rebooted in >>>> between. The VM must create guest RAM in a persistent shared memory file, >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ >>>> >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and >>>> thus supports any type of guest image and block device. The caller must >>>> not modify the VM's block devices between cprsave and cprload. >>>> >>>> cprsave and cprload support guests with vfio devices if the caller first >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >>>> The guest drivers suspend methods flush outstanding requests and re- >>>> initialize the devices, and thus there is no device state to save and >>>> restore. >>>> >>>> 1 savevm: add vmstate handler iterators >>>> 2 savevm: VM handlers mode mask >>>> 3 savevm: QMP command for cprsave >>>> 4 savevm: HMP Command for cprsave >>>> 5 savevm: QMP command for cprload >>>> 6 savevm: HMP Command for cprload >>>> 7 savevm: QMP command for cprinfo >>>> 8 savevm: HMP command for cprinfo >>>> 9 savevm: prevent cprsave if memory is volatile >>>> 10 kvmclock: restore paused KVM clock >>>> 11 cpu: disable ticks when suspended >>>> 12 vl: pause option >>>> 13 gdbstub: gdb support for suspended state >>>> >>>> The next patches add a restart method that eliminates the persistent memory >>>> constraint, and allows qemu to be updated across the restart, but does not >>>> allow host reboot. Anonymous memory segments used by the guest are >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed >>>> madvise(MADV_DOEXEC) option in the Linux kernel. See >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ >>>> >>>> 14 savevm: VMS_RESTART and cprsave restart >>>> 15 vl: QEMU_START_FREEZE env var >>>> 16 oslib: add qemu_clr_cloexec >>>> 17 util: env var helpers >>>> 18 osdep: import MADV_DOEXEC >>>> 19 memory: ram_block_add cosmetic changes >>>> 20 vl: add helper to request re-exec >>>> 21 exec, memory: exec(3) to restart >>>> 22 char: qio_channel_socket_accept reuse fd >>>> 23 char: save/restore chardev socket fds >>>> 24 ui: save/restore vnc socket fds >>>> 25 char: save/restore chardev pty fds >>> >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing >>> with the state associated with them, most especially the TLS encryption >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that >>> GNUTLS maintains, and the patches don't show any sign of dealing with >>> this. IOW it looks like while the FD will be preserved, any TLS session >>> running on it will fail. >> >> I had not considered TLS. If a non-qemu library maintains connection state, then >> we won't be able to support it for live update until the library provides interfaces >> to serialize the state. >> >> For qemu objects, so far vmstate has been adequate to represent the devices with >> descriptors that we preserve. > > My main concern about this series is that there is an implicit assumption > that QEMU is *not* configured with certain features that are not handled > If QEMU is using one of the unsupported features, I don't see anything in > the series which attempts to prevent the actions. > > IOW, users can have an arbitrary QEMU config, attempt to use these new features, > the commands may well succeed, but the user is silently left with a broken QEMU. > Such silent failure modes are really undesirable as they'll lead to a never > ending stream of hard to diagnose bug reports for QEMU maintainers. > > TLS is one example of this, the live upgrade will "succeed", but the TLS > connections will be totally non-functional. I agree with all your points and would like to do better in this area. Other than hunting for every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions? Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the process, and block the cpr if we find an unrecognized fd. - Steve
On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote: > On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote: > > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote: > >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: > >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: > >>>> Improve and extend the qemu functions that save and restore VM state so a > >>>> guest may be suspended and resumed with minimal pause time. qemu may be > >>>> updated to a new version in between. > >>>> > >>>> The first set of patches adds the cprsave and cprload commands to save and > >>>> restore VM state, and allow the host kernel to be updated and rebooted in > >>>> between. The VM must create guest RAM in a persistent shared memory file, > >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > >>>> > >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and > >>>> thus supports any type of guest image and block device. The caller must > >>>> not modify the VM's block devices between cprsave and cprload. > >>>> > >>>> cprsave and cprload support guests with vfio devices if the caller first > >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > >>>> The guest drivers suspend methods flush outstanding requests and re- > >>>> initialize the devices, and thus there is no device state to save and > >>>> restore. > >>>> > >>>> 1 savevm: add vmstate handler iterators > >>>> 2 savevm: VM handlers mode mask > >>>> 3 savevm: QMP command for cprsave > >>>> 4 savevm: HMP Command for cprsave > >>>> 5 savevm: QMP command for cprload > >>>> 6 savevm: HMP Command for cprload > >>>> 7 savevm: QMP command for cprinfo > >>>> 8 savevm: HMP command for cprinfo > >>>> 9 savevm: prevent cprsave if memory is volatile > >>>> 10 kvmclock: restore paused KVM clock > >>>> 11 cpu: disable ticks when suspended > >>>> 12 vl: pause option > >>>> 13 gdbstub: gdb support for suspended state > >>>> > >>>> The next patches add a restart method that eliminates the persistent memory > >>>> constraint, and allows qemu to be updated across the restart, but does not > >>>> allow host reboot. Anonymous memory segments used by the guest are > >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed > >>>> madvise(MADV_DOEXEC) option in the Linux kernel. See > >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > >>>> > >>>> 14 savevm: VMS_RESTART and cprsave restart > >>>> 15 vl: QEMU_START_FREEZE env var > >>>> 16 oslib: add qemu_clr_cloexec > >>>> 17 util: env var helpers > >>>> 18 osdep: import MADV_DOEXEC > >>>> 19 memory: ram_block_add cosmetic changes > >>>> 20 vl: add helper to request re-exec > >>>> 21 exec, memory: exec(3) to restart > >>>> 22 char: qio_channel_socket_accept reuse fd > >>>> 23 char: save/restore chardev socket fds > >>>> 24 ui: save/restore vnc socket fds > >>>> 25 char: save/restore chardev pty fds > >>> > >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing > >>> with the state associated with them, most especially the TLS encryption > >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that > >>> GNUTLS maintains, and the patches don't show any sign of dealing with > >>> this. IOW it looks like while the FD will be preserved, any TLS session > >>> running on it will fail. > >> > >> I had not considered TLS. If a non-qemu library maintains connection state, then > >> we won't be able to support it for live update until the library provides interfaces > >> to serialize the state. > >> > >> For qemu objects, so far vmstate has been adequate to represent the devices with > >> descriptors that we preserve. > > > > My main concern about this series is that there is an implicit assumption > > that QEMU is *not* configured with certain features that are not handled > > If QEMU is using one of the unsupported features, I don't see anything in > > the series which attempts to prevent the actions. > > > > IOW, users can have an arbitrary QEMU config, attempt to use these new features, > > the commands may well succeed, but the user is silently left with a broken QEMU. > > Such silent failure modes are really undesirable as they'll lead to a never > > ending stream of hard to diagnose bug reports for QEMU maintainers. > > > > TLS is one example of this, the live upgrade will "succeed", but the TLS > > connections will be totally non-functional. > > I agree with all your points and would like to do better in this area. Other than hunting for > every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions? > Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the > process, and block the cpr if we find an unrecognized fd. There's no magic easy answer to this problem. Conceptually it is similar to the problem of reliably migrating guest device state, but in this case we're primarily concerned about the backends instead. For migration we've got standardized interfaces that devices must implement in order to correctly support migration serialization. There is also support for devices to register migration "blockers" which prevent any use of the migration feature when the device is present. We lack this kind of concept for the backend, and that's what I think needs to be tackled in a more thorough way. There are quite alot of backends, but they're grouped into a reasonable small number of sets (UIs, chardevs, blockdevs, net devs, etc). We need some standard interface that we can plumb into all the backends, along with providing backends the ability to block the re-exec. If we plumb the generic infrastructure into each of the different types of backend, and make the default behaviour be to reject the re-exec. Then we need to carefull consider specific backend impls and allow the re-exec only in the very precise cases we can demonstrate to be safe. IOW, have a presumption that re-exec will *not* be permitted. Over time we can make it work for an ever expanding set of use cases. Regards, Daniel
On 7/31/2020 11:52 AM, Daniel P. Berrangé wrote: > On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote: >> On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote: >>> On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote: >>>> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: >>>>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: >>>>>> Improve and extend the qemu functions that save and restore VM state so a >>>>>> guest may be suspended and resumed with minimal pause time. qemu may be >>>>>> updated to a new version in between. >>>>>> >>>>>> The first set of patches adds the cprsave and cprload commands to save and >>>>>> restore VM state, and allow the host kernel to be updated and rebooted in >>>>>> between. The VM must create guest RAM in a persistent shared memory file, >>>>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in >>>>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ >>>>>> >>>>>> cprsave stops the VCPUs and saves VM device state in a simple file, and >>>>>> thus supports any type of guest image and block device. The caller must >>>>>> not modify the VM's block devices between cprsave and cprload. >>>>>> >>>>>> cprsave and cprload support guests with vfio devices if the caller first >>>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >>>>>> The guest drivers suspend methods flush outstanding requests and re- >>>>>> initialize the devices, and thus there is no device state to save and >>>>>> restore. >>>>>> >>>>>> 1 savevm: add vmstate handler iterators >>>>>> 2 savevm: VM handlers mode mask >>>>>> 3 savevm: QMP command for cprsave >>>>>> 4 savevm: HMP Command for cprsave >>>>>> 5 savevm: QMP command for cprload >>>>>> 6 savevm: HMP Command for cprload >>>>>> 7 savevm: QMP command for cprinfo >>>>>> 8 savevm: HMP command for cprinfo >>>>>> 9 savevm: prevent cprsave if memory is volatile >>>>>> 10 kvmclock: restore paused KVM clock >>>>>> 11 cpu: disable ticks when suspended >>>>>> 12 vl: pause option >>>>>> 13 gdbstub: gdb support for suspended state >>>>>> >>>>>> The next patches add a restart method that eliminates the persistent memory >>>>>> constraint, and allows qemu to be updated across the restart, but does not >>>>>> allow host reboot. Anonymous memory segments used by the guest are >>>>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed >>>>>> madvise(MADV_DOEXEC) option in the Linux kernel. See >>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ >>>>>> >>>>>> 14 savevm: VMS_RESTART and cprsave restart >>>>>> 15 vl: QEMU_START_FREEZE env var >>>>>> 16 oslib: add qemu_clr_cloexec >>>>>> 17 util: env var helpers >>>>>> 18 osdep: import MADV_DOEXEC >>>>>> 19 memory: ram_block_add cosmetic changes >>>>>> 20 vl: add helper to request re-exec >>>>>> 21 exec, memory: exec(3) to restart >>>>>> 22 char: qio_channel_socket_accept reuse fd >>>>>> 23 char: save/restore chardev socket fds >>>>>> 24 ui: save/restore vnc socket fds >>>>>> 25 char: save/restore chardev pty fds >>>>> >>>>> Keeping FDs open across re-exec is a nice trick, but how are you dealing >>>>> with the state associated with them, most especially the TLS encryption >>>>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that >>>>> GNUTLS maintains, and the patches don't show any sign of dealing with >>>>> this. IOW it looks like while the FD will be preserved, any TLS session >>>>> running on it will fail. >>>> >>>> I had not considered TLS. If a non-qemu library maintains connection state, then >>>> we won't be able to support it for live update until the library provides interfaces >>>> to serialize the state. >>>> >>>> For qemu objects, so far vmstate has been adequate to represent the devices with >>>> descriptors that we preserve. >>> >>> My main concern about this series is that there is an implicit assumption >>> that QEMU is *not* configured with certain features that are not handled >>> If QEMU is using one of the unsupported features, I don't see anything in >>> the series which attempts to prevent the actions. >>> >>> IOW, users can have an arbitrary QEMU config, attempt to use these new features, >>> the commands may well succeed, but the user is silently left with a broken QEMU. >>> Such silent failure modes are really undesirable as they'll lead to a never >>> ending stream of hard to diagnose bug reports for QEMU maintainers. >>> >>> TLS is one example of this, the live upgrade will "succeed", but the TLS >>> connections will be totally non-functional. >> >> I agree with all your points and would like to do better in this area. Other than hunting for >> every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions? >> Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the >> process, and block the cpr if we find an unrecognized fd. > > There's no magic easy answer to this problem. Conceptually it is similar to > the problem of reliably migrating guest device state, but in this case we're > primarily concerned about the backends instead. > > For migration we've got standardized interfaces that devices must implement > in order to correctly support migration serialization. There is also support > for devices to register migration "blockers" which prevent any use of the > migration feature when the device is present. > > We lack this kind of concept for the backend, and that's what I think needs > to be tackled in a more thorough way. There are quite alot of backends, > but they're grouped into a reasonable small number of sets (UIs, chardevs, > blockdevs, net devs, etc). We need some standard interface that we can > plumb into all the backends, along with providing backends the ability to > block the re-exec. If we plumb the generic infrastructure into each of the > different types of backend, and make the default behaviour be to reject > the re-exec. Then we need to carefull consider specific backend impls > and allow the re-exec only in the very precise cases we can demonstrate > to be safe. > > IOW, have a presumption that re-exec will *not* be permitted. Over time > we can make it work for an ever expanding set of use cases. Actually, we could use the vmstate mode_mask field added in patch 2, and only allow the restart mode for vmstate objects that have been vetted. Currently an uninitialized mask (value 0) enables the object for all modes, but we could change that. - Steve
On 7/30/2020 5:39 PM, Paolo Bonzini wrote: > On 30/07/20 21:09, Steven Sistare wrote: >>> please spell it out. Also, how does the functionality compare to >>> xen-save-devices-state and xen-load-devices-state? >> >> qmp_xen_save_devices_state serializes device state to a file which is loaded >> on the target for a live migration. It performs some of the same actions >> as cprsave/cprload but does not support live update-in-place. > > So it is a subset, can code be reused across both? They use common subroutines, but their bodies check different conditions, so I don't think merging would be an improvement. We do provide a new helper qf_file_open() which could replace a handful of lines in both qmp_xen_save_devices_state and qmp_xen_load_devices_state. > Also, live migration > across versions is supported, so can you describe the special > update-in-place support more precisely? I am confused about the use > cases, which require (or try) to keep file descriptors across re-exec, > which are for kexec, and so on. Sure. The first use case allows you to kexec reboot the host and update host software and/or qemu. It does not preserve descriptors, and guest ram must be backed by persistant shared memory. Guest pause time depends on host reboot time, which can be seconds to 10's of seconds. The second case allows you to update qemu in place, but not update the host. Guest ram can be in shared or anonymous memory. We call madvise(MADV_DOEXEC) to tell the kernel to preserve anon memory across the exec. Open descriptors are preserved. Addresses and lengths of saved memory segments are saved in the environment, and the values of descriptors are saved. When new qemu restarts, it finds those values in the environment and uses them when the various objects are created. Memory is not realloc'd, it is already present, and the address and lengths are saved in the ram objects. Guest pause time is in the 100 to 200 msec range. It is less resource intensive than live migration, and is appropriate if your only goal is to update qemu, as opposed to evacuating a host. >>>> cprsave and cprload support guests with vfio devices if the caller first >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. >>>> The guest drivers suspend methods flush outstanding requests and re- >>>> initialize the devices, and thus there is no device state to save and >>>> restore. >>> This probably should be allowed even for regular migration. Can you >>> generalize the code as a separate series? >> >> Maybe. I think that would be a distinct patch that ignores the vfio migration blocker >> if the state is suspended. Plus a qemu agent call to do the suspend. Needs more >> thought. > > The agent already supports suspend, so that should be relatively easy. > Only the code to add/remove the VFIO migration blocker from a VM state > change notifier, or something like that, would be needed. Yes, I have experimented with the guest's suspend method. - Steve
Hi folks, any questions or comments on the vfio and pci changes in patch 30? Or on the means of preserving anonymous memory and re-exec'ing in patches 14 - 21? - Steve On 7/30/2020 11:14 AM, Steve Sistare wrote: > Improve and extend the qemu functions that save and restore VM state so a > guest may be suspended and resumed with minimal pause time. qemu may be > updated to a new version in between. > > The first set of patches adds the cprsave and cprload commands to save and > restore VM state, and allow the host kernel to be updated and rebooted in > between. The VM must create guest RAM in a persistent shared memory file, > such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > cprsave stops the VCPUs and saves VM device state in a simple file, and > thus supports any type of guest image and block device. The caller must > not modify the VM's block devices between cprsave and cprload. > > cprsave and cprload support guests with vfio devices if the caller first > suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > The guest drivers suspend methods flush outstanding requests and re- > initialize the devices, and thus there is no device state to save and > restore. > > 1 savevm: add vmstate handler iterators > 2 savevm: VM handlers mode mask > 3 savevm: QMP command for cprsave > 4 savevm: HMP Command for cprsave > 5 savevm: QMP command for cprload > 6 savevm: HMP Command for cprload > 7 savevm: QMP command for cprinfo > 8 savevm: HMP command for cprinfo > 9 savevm: prevent cprsave if memory is volatile > 10 kvmclock: restore paused KVM clock > 11 cpu: disable ticks when suspended > 12 vl: pause option > 13 gdbstub: gdb support for suspended state > > The next patches add a restart method that eliminates the persistent memory > constraint, and allows qemu to be updated across the restart, but does not > allow host reboot. Anonymous memory segments used by the guest are > preserved across a re-exec of qemu, mapped at the same VA, via a proposed > madvise(MADV_DOEXEC) option in the Linux kernel. See > https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > > 14 savevm: VMS_RESTART and cprsave restart > 15 vl: QEMU_START_FREEZE env var > 16 oslib: add qemu_clr_cloexec > 17 util: env var helpers > 18 osdep: import MADV_DOEXEC > 19 memory: ram_block_add cosmetic changes > 20 vl: add helper to request re-exec > 21 exec, memory: exec(3) to restart > 22 char: qio_channel_socket_accept reuse fd > 23 char: save/restore chardev socket fds > 24 ui: save/restore vnc socket fds > 25 char: save/restore chardev pty fds > 26 monitor: save/restore QMP negotiation status > 27 vhost: reset vhost devices upon cprsave > 28 char: restore terminal on restart > > The next patches extend the restart method to save and restore vfio-pci > state, eliminating the requirement for a guest agent. The vfio container, > group, and device descriptors are preserved across the qemu re-exec. > > 29 pci: export pci_update_mappings > 30 vfio-pci: save and restore > 31 vfio-pci: trace pci config > 32 vfio-pci: improved tracing > > Here is an example of updating qemu from v4.2.0 to v4.2.1 using > "cprload restart". The software update is performed while the guest is > running to minimize downtime. > > window 1 | window 2 > | > # qemu-system-x86_64 ... | > QEMU 4.2.0 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update qemu > (qemu) cprsave /tmp/qemu.sav restart | > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > > Here is an example of updating the host kernel using "cprload reboot" > > window 1 | window 2 > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update kernel-uek > (qemu) cprsave /tmp/qemu.sav restart | > | > # systemctl kexec | > kexec_core: Starting new kernel | > ... | > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > > Mark Kanda (5): > char: qio_channel_socket_accept reuse fd > char: save/restore chardev socket fds > ui: save/restore vnc socket fds > monitor: save/restore QMP negotiation status > vhost: reset vhost devices upon cprsave > > Steve Sistare (27): > savevm: add vmstate handler iterators > savevm: VM handlers mode mask > savevm: QMP command for cprsave > savevm: HMP Command for cprsave > savevm: QMP command for cprload > savevm: HMP Command for cprload > savevm: QMP command for cprinfo > savevm: HMP command for cprinfo > savevm: prevent cprsave if memory is volatile > kvmclock: restore paused KVM clock > cpu: disable ticks when suspended > vl: pause option > gdbstub: gdb support for suspended state > savevm: VMS_RESTART and cprsave restart > vl: QEMU_START_FREEZE env var > oslib: add qemu_clr_cloexec > util: env var helpers > osdep: import MADV_DOEXEC > memory: ram_block_add cosmetic changes > vl: add helper to request re-exec > exec, memory: exec(3) to restart > char: save/restore chardev pty fds > char: restore terminal on restart > pci: export pci_update_mappings > vfio-pci: save and restore > vfio-pci: trace pci config > vfio-pci: improved tracing > > MAINTAINERS | 7 ++ > accel/kvm/kvm-all.c | 8 +- > accel/kvm/trace-events | 3 +- > chardev/char-pty.c | 38 +++++-- > chardev/char-socket.c | 35 ++++++ > chardev/char-stdio.c | 7 ++ > chardev/char.c | 16 +++ > exec.c | 88 +++++++++++++-- > gdbstub.c | 11 +- > hmp-commands.hx | 46 ++++++++ > hw/i386/kvm/clock.c | 6 +- > hw/pci/msix.c | 1 + > hw/pci/pci.c | 17 +-- > hw/pci/trace-events | 5 +- > hw/vfio/common.c | 115 ++++++++++++++++---- > hw/vfio/pci.c | 179 ++++++++++++++++++++++++++++++- > hw/vfio/platform.c | 2 +- > hw/vfio/trace-events | 11 +- > hw/virtio/vhost.c | 12 +++ > include/chardev/char.h | 8 ++ > include/exec/memory.h | 4 + > include/hw/pci/pci.h | 2 + > include/hw/vfio/vfio-common.h | 4 +- > include/io/channel-socket.h | 3 +- > include/migration/register.h | 3 + > include/migration/vmstate.h | 11 ++ > include/monitor/hmp.h | 3 + > include/qemu/cutils.h | 1 + > include/qemu/env.h | 31 ++++++ > include/qemu/osdep.h | 8 ++ > include/sysemu/sysemu.h | 10 ++ > io/channel-socket.c | 12 ++- > io/net-listener.c | 4 +- > migration/block.c | 1 + > migration/migration.c | 4 +- > migration/ram.c | 1 + > migration/savevm.c | 237 ++++++++++++++++++++++++++++++++++++----- > migration/savevm.h | 4 +- > monitor/hmp-cmds.c | 28 +++++ > monitor/qmp-cmds.c | 16 +++ > monitor/qmp.c | 42 ++++++++ > qapi/migration.json | 35 ++++++ > qapi/pragma.json | 1 + > qemu-options.hx | 9 ++ > scsi/qemu-pr-helper.c | 2 +- > softmmu/vl.c | 65 ++++++++++- > tests/qtest/tpm-emu.c | 2 +- > tests/test-char.c | 2 +- > tests/test-io-channel-socket.c | 4 +- > trace-events | 2 + > ui/vnc.c | 153 +++++++++++++++++++++----- > util/Makefile.objs | 2 +- > util/env.c | 132 +++++++++++++++++++++++ > util/oslib-posix.c | 9 ++ > util/oslib-win32.c | 4 + > 55 files changed, 1331 insertions(+), 135 deletions(-) > create mode 100644 include/qemu/env.h > create mode 100644 util/env.c >
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote: > > On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote: > > > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote: > > >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote: > > >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote: > > >>>> Improve and extend the qemu functions that save and restore VM state so a > > >>>> guest may be suspended and resumed with minimal pause time. qemu may be > > >>>> updated to a new version in between. > > >>>> > > >>>> The first set of patches adds the cprsave and cprload commands to save and > > >>>> restore VM state, and allow the host kernel to be updated and rebooted in > > >>>> between. The VM must create guest RAM in a persistent shared memory file, > > >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in > > >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > >>>> > > >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and > > >>>> thus supports any type of guest image and block device. The caller must > > >>>> not modify the VM's block devices between cprsave and cprload. > > >>>> > > >>>> cprsave and cprload support guests with vfio devices if the caller first > > >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent. > > >>>> The guest drivers suspend methods flush outstanding requests and re- > > >>>> initialize the devices, and thus there is no device state to save and > > >>>> restore. > > >>>> > > >>>> 1 savevm: add vmstate handler iterators > > >>>> 2 savevm: VM handlers mode mask > > >>>> 3 savevm: QMP command for cprsave > > >>>> 4 savevm: HMP Command for cprsave > > >>>> 5 savevm: QMP command for cprload > > >>>> 6 savevm: HMP Command for cprload > > >>>> 7 savevm: QMP command for cprinfo > > >>>> 8 savevm: HMP command for cprinfo > > >>>> 9 savevm: prevent cprsave if memory is volatile > > >>>> 10 kvmclock: restore paused KVM clock > > >>>> 11 cpu: disable ticks when suspended > > >>>> 12 vl: pause option > > >>>> 13 gdbstub: gdb support for suspended state > > >>>> > > >>>> The next patches add a restart method that eliminates the persistent memory > > >>>> constraint, and allows qemu to be updated across the restart, but does not > > >>>> allow host reboot. Anonymous memory segments used by the guest are > > >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed > > >>>> madvise(MADV_DOEXEC) option in the Linux kernel. See > > >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/ > > >>>> > > >>>> 14 savevm: VMS_RESTART and cprsave restart > > >>>> 15 vl: QEMU_START_FREEZE env var > > >>>> 16 oslib: add qemu_clr_cloexec > > >>>> 17 util: env var helpers > > >>>> 18 osdep: import MADV_DOEXEC > > >>>> 19 memory: ram_block_add cosmetic changes > > >>>> 20 vl: add helper to request re-exec > > >>>> 21 exec, memory: exec(3) to restart > > >>>> 22 char: qio_channel_socket_accept reuse fd > > >>>> 23 char: save/restore chardev socket fds > > >>>> 24 ui: save/restore vnc socket fds > > >>>> 25 char: save/restore chardev pty fds > > >>> > > >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing > > >>> with the state associated with them, most especially the TLS encryption > > >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that > > >>> GNUTLS maintains, and the patches don't show any sign of dealing with > > >>> this. IOW it looks like while the FD will be preserved, any TLS session > > >>> running on it will fail. > > >> > > >> I had not considered TLS. If a non-qemu library maintains connection state, then > > >> we won't be able to support it for live update until the library provides interfaces > > >> to serialize the state. > > >> > > >> For qemu objects, so far vmstate has been adequate to represent the devices with > > >> descriptors that we preserve. > > > > > > My main concern about this series is that there is an implicit assumption > > > that QEMU is *not* configured with certain features that are not handled > > > If QEMU is using one of the unsupported features, I don't see anything in > > > the series which attempts to prevent the actions. > > > > > > IOW, users can have an arbitrary QEMU config, attempt to use these new features, > > > the commands may well succeed, but the user is silently left with a broken QEMU. > > > Such silent failure modes are really undesirable as they'll lead to a never > > > ending stream of hard to diagnose bug reports for QEMU maintainers. > > > > > > TLS is one example of this, the live upgrade will "succeed", but the TLS > > > connections will be totally non-functional. > > > > I agree with all your points and would like to do better in this area. Other than hunting for > > every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions? > > Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the > > process, and block the cpr if we find an unrecognized fd. > > There's no magic easy answer to this problem. Conceptually it is similar to > the problem of reliably migrating guest device state, but in this case we're > primarily concerned about the backends instead. > > For migration we've got standardized interfaces that devices must implement > in order to correctly support migration serialization. There is also support > for devices to register migration "blockers" which prevent any use of the > migration feature when the device is present. > > We lack this kind of concept for the backend, and that's what I think needs > to be tackled in a more thorough way. There are quite alot of backends, > but they're grouped into a reasonable small number of sets (UIs, chardevs, > blockdevs, net devs, etc). We need some standard interface that we can > plumb into all the backends, along with providing backends the ability to > block the re-exec. If we plumb the generic infrastructure into each of the > different types of backend, and make the default behaviour be to reject > the re-exec. Then we need to carefull consider specific backend impls > and allow the re-exec only in the very precise cases we can demonstrate > to be safe. > > IOW, have a presumption that re-exec will *not* be permitted. Over time > we can make it work for an ever expanding set of use cases. Yes, it does feel like an interface that needs to be implemented on the chardev; then you don't need to worry about handling them all individually. Dave > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|