
[RFC,v1,00/26] migration: File based migration with multifd and fixed-ram

Message ID 20230330180336.2791-1-farosas@suse.de (mailing list archive)

Message

Fabiano Rosas March 30, 2023, 6:03 p.m. UTC
Hi folks,

I'm continuing the work done last year to add a new format of
migration stream that can be used to migrate large guests to a single
file in a performant way.

This is an early RFC with the previous code + my additions to support
multifd and direct IO. Let me know what you think!

Here are the reference links for previous discussions:

https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html

The series has 4 main parts:

1) File migration: A new "file:" migration URI. So "file:mig" does the
   same as "exec:cat > mig". Patches 1-4 implement this;

2) Fixed-ram format: A new format for the migration stream. Puts guest
   pages at their relative offsets in the migration file. This saves
   space in the worst case of RAM utilization, because every page has a
   fixed offset in the migration file, and (potentially) saves us time
   because we can write pages independently in parallel. It also
   gives alignment guarantees so we can use O_DIRECT. Patches 5-13
   implement this;

With patches 1-13, these two parts can be used with:

(qemu) migrate_set_capability fixed-ram on
(qemu) migrate[_incoming] file:mig
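
To make the fixed-ram layout from part 2 concrete, here is a minimal sketch
of what writing a page at its fixed offset boils down to. The names and the
layout are simplified and illustrative, not the actual stream format:

/* Fixed-ram sketch: every guest page has a fixed offset in the migration
 * file, so a page can be rewritten in place and pages can be written
 * independently of each other. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Hypothetical per-RAMBlock descriptor. */
typedef struct {
    void     *host_addr;    /* start of the block in QEMU's address space */
    uint64_t  file_offset;  /* where this block's pages live in the file */
} RAMBlockSketch;

/* Write one guest page at its fixed location in the migration file. */
static ssize_t write_fixed_ram_page(int fd, const RAMBlockSketch *b,
                                    uint64_t page_index)
{
    const void *page = (const char *)b->host_addr + page_index * PAGE_SIZE;
    off_t off = b->file_offset + page_index * PAGE_SIZE;

    /* pwrite() is positional: no shared stream cursor is needed, and
     * rewriting the same page when it is dirtied again does not grow
     * the file. */
    return pwrite(fd, page, PAGE_SIZE, off);
}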

--> new in this series:

3) MultiFD support: This is about making use of the parallelism
   allowed by the new format. We just need the threading and page
   queuing infrastructure that is already in place for
   multifd. Patches 14-24 implement this;

(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 4
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate[_incoming] file:mig
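
Because every page already has its own offset, the parallelism is conceptually
simple. A rough sketch of the thread model, building on the hypothetical
write_fixed_ram_page() above (this is not the actual multifd code):

/* Sketch only: N writer threads, each covering a disjoint page range of a
 * RAMBlock.  No ordering or per-page headers are needed between threads. */
#include <pthread.h>

typedef struct {
    int fd;
    const RAMBlockSketch *block;
    uint64_t first_page;
    uint64_t npages;
} WriterArg;

static void *writer_thread(void *opaque)
{
    WriterArg *a = opaque;

    for (uint64_t i = 0; i < a->npages; i++) {
        write_fixed_ram_page(a->fd, a->block, a->first_page + i);
    }
    return NULL;
}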

4) Add a new "direct-io" parameter and enable O_DIRECT for the
   properly aligned segments of the migration (mostly ram). Patch 25.

(qemu) migrate_set_parameter direct-io on
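
The alignment guarantees of fixed-ram are what make O_DIRECT viable. A hedged
sketch of the idea (not the actual implementation in the patch; unaligned
metadata would still go through a regular buffered descriptor):

/* O_DIRECT sketch: page-aligned buffers written at page-aligned file
 * offsets satisfy the usual O_DIRECT constraints, so the RAM portion of
 * the stream can bypass the page cache.  Error handling omitted. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

int open_migration_file_direct(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
}

void *alloc_page_aligned(size_t size)
{
    void *buf = NULL;

    /* Buffers handed to an O_DIRECT fd must be aligned as well. */
    if (posix_memalign(&buf, 4096, size) != 0) {
        return NULL;
    }
    return buf;
}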

Thanks! Some data below:
=====

Outgoing migration to file. NVMe disk. XFS filesystem.

- Single migration runs of a stopped 32G guest with ~90% RAM usage. Guest
  running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
  10m -v`:

migration type  | MB/s | pages/s |  ms
----------------+------+---------+------
savevm io_uring |  434 |  102294 | 71473
file:           | 3017 |  855862 | 10301
fixed-ram       | 1982 |  330686 | 15637
----------------+------+---------+------
fixed-ram + multifd + O_DIRECT
         2 ch.  | 5565 | 1500882 |  5576
         4 ch.  | 5735 | 1991549 |  5412
         8 ch.  | 5650 | 1769650 |  5489
        16 ch.  | 6071 | 1832407 |  5114
        32 ch.  | 6147 | 1809588 |  5050
        64 ch.  | 6344 | 1841728 |  4895
       128 ch.  | 6120 | 1915669 |  5085
----------------+------+---------+------

- Average of 10 migration runs of guestperf.py --mem 32 --cpus 4:

migration type | #ch. | MB/s | ms
---------------+------+------+-----
fixed-ram +    |    2 | 4132 | 8388
multifd        |    4 | 4273 | 8082
               |    8 | 4094 | 8441
               |   16 | 4204 | 8217
               |   32 | 4048 | 8528
               |   64 | 3861 | 8946
               |  128 | 3777 | 9147
---------------+------+------+-----
fixed-ram +    |    2 | 6031 | 5754
multifd +      |    4 | 6377 | 5421
O_DIRECT       |    8 | 6386 | 5416
               |   16 | 6321 | 5466
               |   32 | 5911 | 5321
               |   64 | 6375 | 5433
               |  128 | 6400 | 5412
---------------+------+------+-----

Fabiano Rosas (13):
  migration: Add completion tracepoint
  migration/multifd: Remove direct "socket" references
  migration/multifd: Allow multifd without packets
  migration/multifd: Add outgoing QIOChannelFile support
  migration/multifd: Add incoming QIOChannelFile support
  migration/multifd: Add pages to the receiving side
  io: Add a pwritev/preadv version that takes a discontiguous iovec
  migration/ram: Add a wrapper for fixed-ram shadow bitmap
  migration/multifd: Support outgoing fixed-ram stream format
  migration/multifd: Support incoming fixed-ram stream format
  tests/qtest: Add a multifd + fixed-ram migration test
  migration: Add direct-io parameter
  tests/migration/guestperf: Add file, fixed-ram and direct-io support

Nikolay Borisov (13):
  migration: Add support for 'file:' uri for source migration
  migration: Add support for 'file:' uri for incoming migration
  tests/qtest: migration: Add migrate_incoming_qmp helper
  tests/qtest: migration-test: Add tests for file-based migration
  migration: Initial support of fixed-ram feature for
    analyze-migration.py
  io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  io: Add generic pwritev/preadv interface
  io: implement io_pwritev/preadv for QIOChannelFile
  migration/qemu-file: add utility methods for working with seekable
    channels
  migration/ram: Introduce 'fixed-ram' migration stream capability
  migration: Refactor precopy ram loading code
  migration: Add support for 'fixed-ram' migration restore
  tests/qtest: migration-test: Add tests for fixed-ram file-based
    migration

 docs/devel/migration.rst              |  38 +++
 include/exec/ramblock.h               |   8 +
 include/io/channel-file.h             |   1 +
 include/io/channel.h                  | 133 ++++++++++
 include/migration/qemu-file-types.h   |   2 +
 include/qemu/osdep.h                  |   2 +
 io/channel-file.c                     |  60 +++++
 io/channel.c                          | 140 +++++++++++
 migration/file.c                      | 130 ++++++++++
 migration/file.h                      |  14 ++
 migration/meson.build                 |   1 +
 migration/migration-hmp-cmds.c        |   9 +
 migration/migration.c                 | 108 +++++++-
 migration/migration.h                 |  11 +-
 migration/multifd.c                   | 327 ++++++++++++++++++++----
 migration/multifd.h                   |  13 +
 migration/qemu-file.c                 |  80 ++++++
 migration/qemu-file.h                 |   4 +
 migration/ram.c                       | 349 ++++++++++++++++++++------
 migration/ram.h                       |   1 +
 migration/savevm.c                    |  23 +-
 migration/trace-events                |   1 +
 qapi/migration.json                   |  19 +-
 scripts/analyze-migration.py          |  51 +++-
 tests/migration/guestperf/engine.py   |  38 ++-
 tests/migration/guestperf/scenario.py |  14 +-
 tests/migration/guestperf/shell.py    |  18 +-
 tests/qtest/migration-helpers.c       |  19 ++
 tests/qtest/migration-helpers.h       |   4 +
 tests/qtest/migration-test.c          |  73 ++++++
 util/osdep.c                          |   9 +
 31 files changed, 1546 insertions(+), 154 deletions(-)
 create mode 100644 migration/file.c
 create mode 100644 migration/file.h

Comments

Peter Xu March 30, 2023, 9:41 p.m. UTC | #1
On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote:
> Hi folks,

Hi,

> 
> I'm continuing the work done last year to add a new format of
> migration stream that can be used to migrate large guests to a single
> file in a performant way.
> 
> This is an early RFC with the previous code + my additions to support
> multifd and direct IO. Let me know what you think!
> 
> Here are the reference links for previous discussions:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
> 
> The series has 4 main parts:
> 
> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>    same as "exec:cat > mig". Patches 1-4 implement this;
> 
> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>    pages at their relative offsets in the migration file. This saves
>    space on the worst case of RAM utilization because every page has a
>    fixed offset in the migration file and (potentially) saves us time
>    because we could write pages independently in parallel. It also
>    gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>    implement this;
> 
> With patches 1-13 these two^ can be used with:
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate[_incoming] file:mig

Have you considered enabling the new fixed-ram format with postcopy when
loading?

Due to the linear offsetting of pages, I think it could achieve super fast VM
loads thanks to O(1) lookup of pages and local page fault resolution.
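
Roughly, resolving a fault would then be a single positional read at a
computable offset. A sketch of that lookup, with all names hypothetical:

/* Sketch only: with fixed-ram, the file offset of a faulted page is a
 * pure function of its address, so the load side needs no stream
 * scanning.  Real postcopy would place the page via userfaultfd
 * (UFFDIO_COPY) rather than writing into the mapping directly. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

static ssize_t read_faulted_page(int fd, void *block_host_start,
                                 uint64_t block_file_offset, void *fault_addr)
{
    uintptr_t page = (uintptr_t)fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    off_t file_off = block_file_offset + (page - (uintptr_t)block_host_start);

    return pread(fd, (void *)page, PAGE_SIZE, file_off);
}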

> 
> --> new in this series:
> 
> 3) MultiFD support: This is about making use of the parallelism
>    allowed by the new format. We just need the threading and page
>    queuing infrastructure that is already in place for
>    multifd. Patches 14-24 implement this;
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate_set_capability multifd on
> (qemu) migrate_set_parameter multifd-channels 4
> (qemu) migrate_set_parameter max-bandwith 0
> (qemu) migrate[_incoming] file:mig
> 
> 4) Add a new "direct_io" parameter and enable O_DIRECT for the
>    properly aligned segments of the migration (mostly ram). Patch 25.
> 
> (qemu) migrate_set_parameter direct-io on
> 
> Thanks! Some data below:
> =====
> 
> Outgoing migration to file. NVMe disk. XFS filesystem.
> 
> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>   10m -v`:
> 
> migration type  | MB/s | pages/s |  ms
> ----------------+------+---------+------
> savevm io_uring |  434 |  102294 | 71473

So I assume this is the non-live migration scenario.  Could you explain
what io_uring means here?

> file:           | 3017 |  855862 | 10301
> fixed-ram       | 1982 |  330686 | 15637
> ----------------+------+---------+------
> fixed-ram + multifd + O_DIRECT
>          2 ch.  | 5565 | 1500882 |  5576
>          4 ch.  | 5735 | 1991549 |  5412
>          8 ch.  | 5650 | 1769650 |  5489
>         16 ch.  | 6071 | 1832407 |  5114
>         32 ch.  | 6147 | 1809588 |  5050
>         64 ch.  | 6344 | 1841728 |  4895
>        128 ch.  | 6120 | 1915669 |  5085
> ----------------+------+---------+------

Thanks,
Fabiano Rosas March 31, 2023, 2:37 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote:
>> Hi folks,
>
> Hi,
>
>> 
>> I'm continuing the work done last year to add a new format of
>> migration stream that can be used to migrate large guests to a single
>> file in a performant way.
>> 
>> This is an early RFC with the previous code + my additions to support
>> multifd and direct IO. Let me know what you think!
>> 
>> Here are the reference links for previous discussions:
>> 
>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>> 
>> The series has 4 main parts:
>> 
>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>    same as "exec:cat > mig". Patches 1-4 implement this;
>> 
>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>    pages at their relative offsets in the migration file. This saves
>>    space on the worst case of RAM utilization because every page has a
>>    fixed offset in the migration file and (potentially) saves us time
>>    because we could write pages independently in parallel. It also
>>    gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>    implement this;
>> 
>> With patches 1-13 these two^ can be used with:
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate[_incoming] file:mig
>
> Have you considered enabling the new fixed-ram format with postcopy when
> loading?
>
> Due to the linear offseting of pages, I think it can achieve super fast vm
> loads due to O(1) lookup of pages and local page fault resolutions.
>

I don't think we have looked that much at the loading side yet. Good to
know that it has potential to be faster. I'll look into it. Thanks for
the suggestion.

>> 
>> --> new in this series:
>> 
>> 3) MultiFD support: This is about making use of the parallelism
>>    allowed by the new format. We just need the threading and page
>>    queuing infrastructure that is already in place for
>>    multifd. Patches 14-24 implement this;
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate_set_capability multifd on
>> (qemu) migrate_set_parameter multifd-channels 4
>> (qemu) migrate_set_parameter max-bandwith 0
>> (qemu) migrate[_incoming] file:mig
>> 
>> 4) Add a new "direct_io" parameter and enable O_DIRECT for the
>>    properly aligned segments of the migration (mostly ram). Patch 25.
>> 
>> (qemu) migrate_set_parameter direct-io on
>> 
>> Thanks! Some data below:
>> =====
>> 
>> Outgoing migration to file. NVMe disk. XFS filesystem.
>> 
>> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>>   10m -v`:
>> 
>> migration type  | MB/s | pages/s |  ms
>> ----------------+------+---------+------
>> savevm io_uring |  434 |  102294 | 71473
>
> So I assume this is the non-live migration scenario.  Could you explain
> what does io_uring mean here?
>

This table is all non-live migration. This particular line is a snapshot
(hmp_savevm->save_snapshot). I thought it could be relevant because it
is another way by which we write RAM to disk.

The io_uring part is noise; I was initially under the impression that the
block device AIO configuration affected this scenario.

>> file:           | 3017 |  855862 | 10301
>> fixed-ram       | 1982 |  330686 | 15637
>> ----------------+------+---------+------
>> fixed-ram + multifd + O_DIRECT
>>          2 ch.  | 5565 | 1500882 |  5576
>>          4 ch.  | 5735 | 1991549 |  5412
>>          8 ch.  | 5650 | 1769650 |  5489
>>         16 ch.  | 6071 | 1832407 |  5114
>>         32 ch.  | 6147 | 1809588 |  5050
>>         64 ch.  | 6344 | 1841728 |  4895
>>        128 ch.  | 6120 | 1915669 |  5085
>> ----------------+------+---------+------
>
> Thanks,
Peter Xu March 31, 2023, 2:52 p.m. UTC | #3
On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> 
> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >>   10m -v`:
> >> 
> >> migration type  | MB/s | pages/s |  ms
> >> ----------------+------+---------+------
> >> savevm io_uring |  434 |  102294 | 71473
> >
> > So I assume this is the non-live migration scenario.  Could you explain
> > what does io_uring mean here?
> >
> 
> This table is all non-live migration. This particular line is a snapshot
> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> is another way by which we write RAM into disk.

I see, so it being all non-live explains that; I was curious what the
relationship is between this feature and the live snapshot that QEMU also
supports.

I also don't immediately see why savevm would be much slower, do you have an
answer?  Maybe it's somewhere but I just overlooked it..

IIUC this is the "vm suspend" case, so there's the extra benefit of knowing
"we can stop the VM".  It smells slightly weird to build this on top of
"migrate" from that pov, rather than "savevm", though.  Any thoughts on
this aspect (on why not build this on top of "savevm")?

Thanks,

> 
> The io_uring is noise, I was initially under the impression that the
> block device aio configuration affected this scenario.
> 
> >> file:           | 3017 |  855862 | 10301
> >> fixed-ram       | 1982 |  330686 | 15637
> >> ----------------+------+---------+------
> >> fixed-ram + multifd + O_DIRECT
> >>          2 ch.  | 5565 | 1500882 |  5576
> >>          4 ch.  | 5735 | 1991549 |  5412
> >>          8 ch.  | 5650 | 1769650 |  5489
> >>         16 ch.  | 6071 | 1832407 |  5114
> >>         32 ch.  | 6147 | 1809588 |  5050
> >>         64 ch.  | 6344 | 1841728 |  4895
> >>        128 ch.  | 6120 | 1915669 |  5085
> >> ----------------+------+---------+------
> >
> > Thanks,
>
Fabiano Rosas March 31, 2023, 3:30 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>> >> Outgoing migration to file. NVMe disk. XFS filesystem.
>> >> 
>> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>> >>   10m -v`:
>> >> 
>> >> migration type  | MB/s | pages/s |  ms
>> >> ----------------+------+---------+------
>> >> savevm io_uring |  434 |  102294 | 71473
>> >
>> > So I assume this is the non-live migration scenario.  Could you explain
>> > what does io_uring mean here?
>> >
>> 
>> This table is all non-live migration. This particular line is a snapshot
>> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>> is another way by which we write RAM into disk.
>
> I see, so if all non-live that explains, because I was curious what's the
> relationship between this feature and the live snapshot that QEMU also
> supports.
>
> I also don't immediately see why savevm will be much slower, do you have an
> answer?  Maybe it's somewhere but I just overlooked..
>

I don't have a concrete answer. I could take a stab and maybe blame the
extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
of bandwidth limits?

> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> "we can stop the VM".  It smells slightly weird to build this on top of
> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> this aspect (on why not building this on top of "savevm")?
>

I share the same perception. I have done initial experiments with
savevm, but I decided to carry on the work that was already started by
others because my understanding of the problem was still incomplete.

One point that has been raised is that the fixed-ram format alone does
not bring that many performance improvements. So we'll need
multi-threading and direct-io on top of it. Re-using multifd
infrastructure seems like it could be a good idea.

> Thanks,
>
>> 
>> The io_uring is noise, I was initially under the impression that the
>> block device aio configuration affected this scenario.
>> 
>> >> file:           | 3017 |  855862 | 10301
>> >> fixed-ram       | 1982 |  330686 | 15637
>> >> ----------------+------+---------+------
>> >> fixed-ram + multifd + O_DIRECT
>> >>          2 ch.  | 5565 | 1500882 |  5576
>> >>          4 ch.  | 5735 | 1991549 |  5412
>> >>          8 ch.  | 5650 | 1769650 |  5489
>> >>         16 ch.  | 6071 | 1832407 |  5114
>> >>         32 ch.  | 6147 | 1809588 |  5050
>> >>         64 ch.  | 6344 | 1841728 |  4895
>> >>        128 ch.  | 6120 | 1915669 |  5085
>> >> ----------------+------+---------+------
>> >
>> > Thanks,
>>
Daniel P. Berrangé March 31, 2023, 3:46 p.m. UTC | #5
On Fri, Mar 31, 2023 at 10:52:09AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > >> 
> > >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > >>   10m -v`:
> > >> 
> > >> migration type  | MB/s | pages/s |  ms
> > >> ----------------+------+---------+------
> > >> savevm io_uring |  434 |  102294 | 71473
> > >
> > > So I assume this is the non-live migration scenario.  Could you explain
> > > what does io_uring mean here?
> > >
> > 
> > This table is all non-live migration. This particular line is a snapshot
> > (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > is another way by which we write RAM into disk.
> 
> I see, so if all non-live that explains, because I was curious what's the
> relationship between this feature and the live snapshot that QEMU also
> supports.
> 
> I also don't immediately see why savevm will be much slower, do you have an
> answer?  Maybe it's somewhere but I just overlooked..
> 
> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> "we can stop the VM".  It smells slightly weird to build this on top of
> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> this aspect (on why not building this on top of "savevm")?

Currently savevm covers memory, device state and disk snapshots,
saving them into the VM's disks, which basically means it only works
with qcow2.

Libvirt's save logic only cares about saving memory and device
state, and supports saving guests regardless of what storage is
used, storing the result externally from the disks.

This is only possible with 'migrate' today, and so 'savevm' isn't
useful for this task from libvirt's POV.

In the past it has been suggested that the 'savevm' command as a
concept is actually redundant, and that we could in fact layer it
on top of a combination of migration and block snapshot APIs,
e.g. if we had a 'blockdev:' migration protocol for saving the
vmstate.

With regards,
Daniel
Peter Xu March 31, 2023, 3:55 p.m. UTC | #6
On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> >> 
> >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >> >>   10m -v`:
> >> >> 
> >> >> migration type  | MB/s | pages/s |  ms
> >> >> ----------------+------+---------+------
> >> >> savevm io_uring |  434 |  102294 | 71473
> >> >
> >> > So I assume this is the non-live migration scenario.  Could you explain
> >> > what does io_uring mean here?
> >> >
> >> 
> >> This table is all non-live migration. This particular line is a snapshot
> >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> >> is another way by which we write RAM into disk.
> >
> > I see, so if all non-live that explains, because I was curious what's the
> > relationship between this feature and the live snapshot that QEMU also
> > supports.
> >
> > I also don't immediately see why savevm will be much slower, do you have an
> > answer?  Maybe it's somewhere but I just overlooked..
> >
> 
> I don't have a concrete answer. I could take a jab and maybe blame the
> extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> of bandwidth limits?

IMHO it would be great if this can be investigated and reasons provided in
the next cover letter.

> 
> > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > "we can stop the VM".  It smells slightly weird to build this on top of
> > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > this aspect (on why not building this on top of "savevm")?
> >
> 
> I share the same perception. I have done initial experiments with
> savevm, but I decided to carry on the work that was already started by
> others because my understanding of the problem was yet incomplete.
> 
> One point that has been raised is that the fixed-ram format alone does
> not bring that many performance improvements. So we'll need
> multi-threading and direct-io on top of it. Re-using multifd
> infrastructure seems like it could be a good idea.

The thing is, IMHO concurrency is not as hard if the VM is stopped, and when
we're 100% sure locally where the page will go.

IOW, I think multifd provides a lot of features that may not really be
useful for this effort, while using those features means already paying
the overhead of supporting them.

For example, a major benefit of multifd is that it allows pages to be sent
out of order, so it indexes each page with a header.  I didn't read the
follow up patches, but I assume that's not needed in this effort.

What I understand so far with fixed-ram is that we dump the whole ramblock
memory into a chunk at an offset of a file.  Couldn't concurrency of that be
achieved easily by creating a bunch of threads dumping together during the
savevm, with different offsets of guest ram & file passed to each one?

It's very possible that I overlooked a lot of things, but IMHO my point is
it'll always be great to have a small section in the cover letter discussing
the pros and cons of the decision to use the "migrate" infra rather than
"savevm".  Because it's still against the intuition of at least some
reviewers (like me..).  What I worry about is that this could be implemented
more efficiently and with fewer LOCs in savevm (and perhaps also benefit
normal savevm too!  so there's a chance current savevm users could already
benefit from this) but we didn't do so because the project simply started
with using QMP migrate.  Any investigation into figuring more of this out
would be greatly helpful.

Thanks,
Daniel P. Berrangé March 31, 2023, 4:10 p.m. UTC | #7
On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > >> >> 
> > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > >> >>   10m -v`:
> > >> >> 
> > >> >> migration type  | MB/s | pages/s |  ms
> > >> >> ----------------+------+---------+------
> > >> >> savevm io_uring |  434 |  102294 | 71473
> > >> >
> > >> > So I assume this is the non-live migration scenario.  Could you explain
> > >> > what does io_uring mean here?
> > >> >
> > >> 
> > >> This table is all non-live migration. This particular line is a snapshot
> > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > >> is another way by which we write RAM into disk.
> > >
> > > I see, so if all non-live that explains, because I was curious what's the
> > > relationship between this feature and the live snapshot that QEMU also
> > > supports.
> > >
> > > I also don't immediately see why savevm will be much slower, do you have an
> > > answer?  Maybe it's somewhere but I just overlooked..
> > >
> > 
> > I don't have a concrete answer. I could take a jab and maybe blame the
> > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > of bandwidth limits?
> 
> IMHO it would be great if this can be investigated and reasons provided in
> the next cover letter.
> 
> > 
> > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > this aspect (on why not building this on top of "savevm")?
> > >
> > 
> > I share the same perception. I have done initial experiments with
> > savevm, but I decided to carry on the work that was already started by
> > others because my understanding of the problem was yet incomplete.
> > 
> > One point that has been raised is that the fixed-ram format alone does
> > not bring that many performance improvements. So we'll need
> > multi-threading and direct-io on top of it. Re-using multifd
> > infrastructure seems like it could be a good idea.
> 
> The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> 100% sure locally on where the page will go.

We shouldn't assume the VM is stopped though. When saving to the file
the VM may still be active. The fixed-ram format lets us re-write the
same memory location on disk multiple times in this case, thus avoiding
growth of the file size.

> IOW, I think multifd provides a lot of features that may not really be
> useful for this effort, meanwhile using those features may need to already
> pay for the overhead to support those features.
> 
> For example, a major benefit of multifd is it allows pages sent out of
> order, so it indexes the page as a header.  I didn't read the follow up
> patches, but I assume that's not needed in this effort.
> 
> What I understand so far with fixes-ram is we dump the whole ramblock
> memory into a chunk at offset of a file.  Can concurrency of that
> achievable easily by creating a bunch of threads dumping altogether during
> the savevm, with different offsets of guest ram & file passed over?

I feel like the migration code is already insanely complicated and
the many threads involved have caused no end of subtle bugs. 

It was Juan, I believe, who expressed a desire to entirely remove the
non-multifd code in the future, in order to reduce the maintenance burden.
IOW, ideally we would be pushing mgmt apps towards always using
multifd, even if they only ask it to create a single thread.

That would in turn suggest against creating new concurrency
mechanisms on top of non-multifd code, both to avoid adding yet
more complexity and also because it would make it harder to later
delete the non-multifd code.

On the libvirt side wrt fixed-ram, we could just use multifd
exclusively, as there should be no downside to it even for a
single FD.

With regards,
Daniel
Peter Xu March 31, 2023, 4:27 p.m. UTC | #8
On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > > 
> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > >> >> 
> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > >> >>   10m -v`:
> > > >> >> 
> > > >> >> migration type  | MB/s | pages/s |  ms
> > > >> >> ----------------+------+---------+------
> > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > >> >
> > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > >> > what does io_uring mean here?
> > > >> >
> > > >> 
> > > >> This table is all non-live migration. This particular line is a snapshot
> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > >> is another way by which we write RAM into disk.
> > > >
> > > > I see, so if all non-live that explains, because I was curious what's the
> > > > relationship between this feature and the live snapshot that QEMU also
> > > > supports.
> > > >
> > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > answer?  Maybe it's somewhere but I just overlooked..
> > > >
> > > 
> > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > of bandwidth limits?
> > 
> > IMHO it would be great if this can be investigated and reasons provided in
> > the next cover letter.
> > 
> > > 
> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > this aspect (on why not building this on top of "savevm")?
> > > >
> > > 
> > > I share the same perception. I have done initial experiments with
> > > savevm, but I decided to carry on the work that was already started by
> > > others because my understanding of the problem was yet incomplete.
> > > 
> > > One point that has been raised is that the fixed-ram format alone does
> > > not bring that many performance improvements. So we'll need
> > > multi-threading and direct-io on top of it. Re-using multifd
> > > infrastructure seems like it could be a good idea.
> > 
> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > 100% sure locally on where the page will go.
> 
> We shouldn't assume the VM is stopped though. When saving to the file
> the VM may still be active. The fixed-ram format lets us re-write the
> same memory location on disk multiple times in this case, thus avoiding
> growth of the file size.

Before discussing reusing multifd below, I now have a major confusion about
the use case of the feature..

The question is whether we would like to stop the VM after fixed-ram
migration completes.  I'm asking because:

  1. If it will stop, then it looks like a "VM suspend" to me. If so, could
     anyone help explain why we don't stop the VM first then migrate?
     Because it avoids copying single pages multiple times, no fiddling
     with dirty tracking at all - we just don't ever track anything.  In
     short, we'll stop the VM anyway, then why not stop it slightly
     earlier?

  2. If it will not stop, then it's a "VM live snapshot" to me.  We already
     have that, don't we?  That's more efficient because it'll wr-protect all
     guest pages, any write triggers a CoW and we only copy the guest pages
     once and for all.

Either way, there's no need to copy any page more than once.  Did I
perhaps miss anything very important?

I would guess it's option (1) above, because it seems we don't snapshot the
disk alongside.  But I am really not sure now..

> 
> > IOW, I think multifd provides a lot of features that may not really be
> > useful for this effort, meanwhile using those features may need to already
> > pay for the overhead to support those features.
> > 
> > For example, a major benefit of multifd is it allows pages sent out of
> > order, so it indexes the page as a header.  I didn't read the follow up
> > patches, but I assume that's not needed in this effort.
> > 
> > What I understand so far with fixes-ram is we dump the whole ramblock
> > memory into a chunk at offset of a file.  Can concurrency of that
> > achievable easily by creating a bunch of threads dumping altogether during
> > the savevm, with different offsets of guest ram & file passed over?
> 
> I feel like the migration code is already insanely complicated and
> the many threads involved have caused no end of subtle bugs. 
> 
> It was Juan I believe who expressed a desire to entirely remove
> non-multifd code in the future, in order to reduce the maint burden.
> IOW, ideally we would be pushing mgmt apps towards always using
> multifd at all times, even if they only ask it to create 1 single
> thread.
> 
> That would in turn suggest against creating new concurrency
> mechanisms on top of non-multifd code, both to avoid adding yet
> more complexity and also because it would make it harder to later
> delete the non-multifd code.
> 
> On the libvirt side wrt fixed-ram, we could just use multifd
> exclusively, as there should be no downside to it even for a
> single FD.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 

Thanks,
Fabiano Rosas March 31, 2023, 6:18 p.m. UTC | #9
Peter Xu <peterx@redhat.com> writes:

> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
>> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
>> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
>> > > Peter Xu <peterx@redhat.com> writes:
>> > > 
>> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
>> > > >> >> 
>> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>> > > >> >>   10m -v`:
>> > > >> >> 
>> > > >> >> migration type  | MB/s | pages/s |  ms
>> > > >> >> ----------------+------+---------+------
>> > > >> >> savevm io_uring |  434 |  102294 | 71473
>> > > >> >
>> > > >> > So I assume this is the non-live migration scenario.  Could you explain
>> > > >> > what does io_uring mean here?
>> > > >> >
>> > > >> 
>> > > >> This table is all non-live migration. This particular line is a snapshot
>> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>> > > >> is another way by which we write RAM into disk.
>> > > >
>> > > > I see, so if all non-live that explains, because I was curious what's the
>> > > > relationship between this feature and the live snapshot that QEMU also
>> > > > supports.
>> > > >
>> > > > I also don't immediately see why savevm will be much slower, do you have an
>> > > > answer?  Maybe it's somewhere but I just overlooked..
>> > > >
>> > > 
>> > > I don't have a concrete answer. I could take a jab and maybe blame the
>> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
>> > > of bandwidth limits?
>> > 
>> > IMHO it would be great if this can be investigated and reasons provided in
>> > the next cover letter.
>> > 
>> > > 
>> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
>> > > > "we can stop the VM".  It smells slightly weird to build this on top of
>> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
>> > > > this aspect (on why not building this on top of "savevm")?
>> > > >
>> > > 
>> > > I share the same perception. I have done initial experiments with
>> > > savevm, but I decided to carry on the work that was already started by
>> > > others because my understanding of the problem was yet incomplete.
>> > > 
>> > > One point that has been raised is that the fixed-ram format alone does
>> > > not bring that many performance improvements. So we'll need
>> > > multi-threading and direct-io on top of it. Re-using multifd
>> > > infrastructure seems like it could be a good idea.
>> > 
>> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
>> > 100% sure locally on where the page will go.
>> 
>> We shouldn't assume the VM is stopped though. When saving to the file
>> the VM may still be active. The fixed-ram format lets us re-write the
>> same memory location on disk multiple times in this case, thus avoiding
>> growth of the file size.
>
> Before discussing on reusing multifd below, now I have a major confusing on
> the use case of the feature..
>
> The question is whether we would like to stop the VM after fixed-ram
> migration completes.  I'm asking because:
>

We would.

>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>      anyone help explain why we don't stop the VM first then migrate?
>      Because it avoids copying single pages multiple times, no fiddling
>      with dirty tracking at all - we just don't ever track anything.  In
>      short, we'll stop the VM anyway, then why not stop it slightly
>      earlier?
>

Looking at the previous discussions I don't see explicit mentions of a
requirement either way (stop before or stop after). I agree it makes
more sense to stop the guest first and then migrate without having to
deal with dirty pages.

I presume libvirt just migrates without altering the guest run state so
we implemented this to work in both scenarios. But even then, it seems
QEMU could store the current VM state, stop it, migrate and restore the
state on the destination.

I might be missing context here since I wasn't around when this work
started. Someone correct me if I'm wrong please.

>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>      that, aren't we?  That's more efficient because it'll wr-protect all
>      guest pages, any write triggers a CoW and we only copy the guest pages
>      once and for all.
>
> Either way to go, there's no need to copy any page more than once.  Did I
> miss anything perhaps very important?
>
> I would guess it's option (1) above, because it seems we don't snapshot the
> disk alongside.  But I am really not sure now..
>
Peter Xu March 31, 2023, 9:52 p.m. UTC | #10
On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> >> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> >> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> >> > > Peter Xu <peterx@redhat.com> writes:
> >> > > 
> >> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> >> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> >> > > >> >> 
> >> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> >> > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> >> > > >> >>   10m -v`:
> >> > > >> >> 
> >> > > >> >> migration type  | MB/s | pages/s |  ms
> >> > > >> >> ----------------+------+---------+------
> >> > > >> >> savevm io_uring |  434 |  102294 | 71473
> >> > > >> >
> >> > > >> > So I assume this is the non-live migration scenario.  Could you explain
> >> > > >> > what does io_uring mean here?
> >> > > >> >
> >> > > >> 
> >> > > >> This table is all non-live migration. This particular line is a snapshot
> >> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> >> > > >> is another way by which we write RAM into disk.
> >> > > >
> >> > > > I see, so if all non-live that explains, because I was curious what's the
> >> > > > relationship between this feature and the live snapshot that QEMU also
> >> > > > supports.
> >> > > >
> >> > > > I also don't immediately see why savevm will be much slower, do you have an
> >> > > > answer?  Maybe it's somewhere but I just overlooked..
> >> > > >
> >> > > 
> >> > > I don't have a concrete answer. I could take a jab and maybe blame the
> >> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> >> > > of bandwidth limits?
> >> > 
> >> > IMHO it would be great if this can be investigated and reasons provided in
> >> > the next cover letter.
> >> > 
> >> > > 
> >> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> >> > > > "we can stop the VM".  It smells slightly weird to build this on top of
> >> > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> >> > > > this aspect (on why not building this on top of "savevm")?
> >> > > >
> >> > > 
> >> > > I share the same perception. I have done initial experiments with
> >> > > savevm, but I decided to carry on the work that was already started by
> >> > > others because my understanding of the problem was yet incomplete.
> >> > > 
> >> > > One point that has been raised is that the fixed-ram format alone does
> >> > > not bring that many performance improvements. So we'll need
> >> > > multi-threading and direct-io on top of it. Re-using multifd
> >> > > infrastructure seems like it could be a good idea.
> >> > 
> >> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> >> > 100% sure locally on where the page will go.
> >> 
> >> We shouldn't assume the VM is stopped though. When saving to the file
> >> the VM may still be active. The fixed-ram format lets us re-write the
> >> same memory location on disk multiple times in this case, thus avoiding
> >> growth of the file size.
> >
> > Before discussing on reusing multifd below, now I have a major confusing on
> > the use case of the feature..
> >
> > The question is whether we would like to stop the VM after fixed-ram
> > migration completes.  I'm asking because:
> >
> 
> We would.
> 
> >   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
> >      anyone help explain why we don't stop the VM first then migrate?
> >      Because it avoids copying single pages multiple times, no fiddling
> >      with dirty tracking at all - we just don't ever track anything.  In
> >      short, we'll stop the VM anyway, then why not stop it slightly
> >      earlier?
> >
> 
> Looking at the previous discussions I don't see explicit mentions of a
> requirement either way (stop before or stop after). I agree it makes
> more sense to stop the guest first and then migrate without having to
> deal with dirty pages.
> 
> I presume libvirt just migrates without altering the guest run state so
> we implemented this to work in both scenarios. But even then, it seems
> QEMU could store the current VM state, stop it, migrate and restore the
> state on the destination.

Yes, I can understand having a unified interface for libvirt would be great
in this case.  So I am personally not against reusing qmp command "migrate"
if that would help in any case from libvirt pov.

However this is an important question to be answered with certainty before
building more things on top.  IOW, even if we reuse QMP migrate, we could
consider a totally different impl (e.g. not reuse the migration thread model).

As I mentioned above, it seems ideal to always stop the VM, so that could
be part of the command (unlike normal QMP migrate); then it's getting more
like save_snapshot(), as there's the vm_stop().  We should make sure that
when the user uses the new cmd it'll always do that, because that's the most
performant option (compared to enabling dirty tracking and live migrating).

> 
> I might be missing context here since I wasn't around when this work
> started. Someone correct me if I'm wrong please.

Yes, it would be great if someone can help clarify.

Thanks,

> 
> >   2. If it will not stop, then it's "VM live snapshot" to me.  We have
> >      that, aren't we?  That's more efficient because it'll wr-protect all
> >      guest pages, any write triggers a CoW and we only copy the guest pages
> >      once and for all.
> >
> > Either way to go, there's no need to copy any page more than once.  Did I
> > miss anything perhaps very important?
> >
> > I would guess it's option (1) above, because it seems we don't snapshot the
> > disk alongside.  But I am really not sure now..
> >
>
David Hildenbrand April 3, 2023, 7:38 a.m. UTC | #11
On 30.03.23 20:03, Fabiano Rosas wrote:
> Hi folks,
> 
> I'm continuing the work done last year to add a new format of
> migration stream that can be used to migrate large guests to a single
> file in a performant way.
> 
> This is an early RFC with the previous code + my additions to support
> multifd and direct IO. Let me know what you think!
> 
> Here are the reference links for previous discussions:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
> 
> The series has 4 main parts:
> 
> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>     same as "exec:cat > mig". Patches 1-4 implement this;
> 
> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>     pages at their relative offsets in the migration file. This saves
>     space on the worst case of RAM utilization because every page has a
>     fixed offset in the migration file and (potentially) saves us time
>     because we could write pages independently in parallel. It also
>     gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>     implement this;
> 
> With patches 1-13 these two^ can be used with:
> 
> (qemu) migrate_set_capability fixed-ram on
> (qemu) migrate[_incoming] file:mig

There are some use cases (especially virtio-mem, but also virtio-balloon 
with free-page-hinting) where we end up having very sparse guest RAM. We 
don't want to have such "memory without meaning" in the migration stream 
nor restore it on the destination.

Would that still be supported with the new format? For example, have a 
sparse VM savefile and remember which ranges actually contain reasonable 
data?
Claudio Fontana April 3, 2023, 7:47 a.m. UTC | #12
On 3/31/23 23:52, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
>>>> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
>>>>> On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
>>>>>> Peter Xu <peterx@redhat.com> writes:
>>>>>>
>>>>>>> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
>>>>>>>>>> Outgoing migration to file. NVMe disk. XFS filesystem.
>>>>>>>>>>
>>>>>>>>>> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
>>>>>>>>>>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
>>>>>>>>>>   10m -v`:
>>>>>>>>>>
>>>>>>>>>> migration type  | MB/s | pages/s |  ms
>>>>>>>>>> ----------------+------+---------+------
>>>>>>>>>> savevm io_uring |  434 |  102294 | 71473
>>>>>>>>>
>>>>>>>>> So I assume this is the non-live migration scenario.  Could you explain
>>>>>>>>> what does io_uring mean here?
>>>>>>>>>
>>>>>>>>
>>>>>>>> This table is all non-live migration. This particular line is a snapshot
>>>>>>>> (hmp_savevm->save_snapshot). I thought it could be relevant because it
>>>>>>>> is another way by which we write RAM into disk.
>>>>>>>
>>>>>>> I see, so if all non-live that explains, because I was curious what's the
>>>>>>> relationship between this feature and the live snapshot that QEMU also
>>>>>>> supports.
>>>>>>>
>>>>>>> I also don't immediately see why savevm will be much slower, do you have an
>>>>>>> answer?  Maybe it's somewhere but I just overlooked..
>>>>>>>
>>>>>>
>>>>>> I don't have a concrete answer. I could take a jab and maybe blame the
>>>>>> extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
>>>>>> of bandwidth limits?
>>>>>
>>>>> IMHO it would be great if this can be investigated and reasons provided in
>>>>> the next cover letter.
>>>>>
>>>>>>
>>>>>>> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
>>>>>>> "we can stop the VM".  It smells slightly weird to build this on top of
>>>>>>> "migrate" from that pov, rather than "savevm", though.  Any thoughts on
>>>>>>> this aspect (on why not building this on top of "savevm")?
>>>>>>>
>>>>>>
>>>>>> I share the same perception. I have done initial experiments with
>>>>>> savevm, but I decided to carry on the work that was already started by
>>>>>> others because my understanding of the problem was yet incomplete.
>>>>>>
>>>>>> One point that has been raised is that the fixed-ram format alone does
>>>>>> not bring that many performance improvements. So we'll need
>>>>>> multi-threading and direct-io on top of it. Re-using multifd
>>>>>> infrastructure seems like it could be a good idea.
>>>>>
>>>>> The thing is IMHO concurrency is not as hard if VM stopped, and when we're
>>>>> 100% sure locally on where the page will go.
>>>>
>>>> We shouldn't assume the VM is stopped though. When saving to the file
>>>> the VM may still be active. The fixed-ram format lets us re-write the
>>>> same memory location on disk multiple times in this case, thus avoiding
>>>> growth of the file size.
>>>
>>> Before discussing on reusing multifd below, now I have a major confusing on
>>> the use case of the feature..
>>>
>>> The question is whether we would like to stop the VM after fixed-ram
>>> migration completes.  I'm asking because:
>>>
>>
>> We would.
>>
>>>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>>>      anyone help explain why we don't stop the VM first then migrate?
>>>      Because it avoids copying single pages multiple times, no fiddling
>>>      with dirty tracking at all - we just don't ever track anything.  In
>>>      short, we'll stop the VM anyway, then why not stop it slightly
>>>      earlier?
>>>
>>
>> Looking at the previous discussions I don't see explicit mentions of a
>> requirement either way (stop before or stop after). I agree it makes
>> more sense to stop the guest first and then migrate without having to
>> deal with dirty pages.
>>
>> I presume libvirt just migrates without altering the guest run state so
>> we implemented this to work in both scenarios. But even then, it seems
>> QEMU could store the current VM state, stop it, migrate and restore the
>> state on the destination.
> 
> Yes, I can understand having a unified interface for libvirt would be great
> in this case.  So I am personally not against reusing qmp command "migrate"
> if that would help in any case from libvirt pov.
> 
> However this is an important question to be answered very sure before
> building more things on top.  IOW, even if reusing QMP migrate, we could
> consider a totally different impl (e.g. don't reuse migration thread model).
> 
> As I mentioned above it seems just ideal we always stop the VM so it could
> be part of the command (unlike normal QMP migrate), then it's getting more
> like save_snapshot() as there's the vm_stop().  We should make sure when
> the user uses the new cmd it'll always do that because that's the most
> performant (comparing to enabling dirty tracking and live migrate).
> 
>>
>> I might be missing context here since I wasn't around when this work
>> started. Someone correct me if I'm wrong please.


Hi, I'm not sure whether what is being asked for here is context on the previous upstream discussions or on the specific requirement we are trying to bring upstream.

In terms of the specific requirement we are trying to bring upstream, we need to get libvirt+QEMU VM save and restore functionality to be able to transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.
When an event trigger happens, the VM needs to be quickly paused and saved to disk safely, including datasync, and another VM needs to be restored, also in ~5 secs.
For our specific requirement, the VM is never running when its data (mostly consisting of RAM) is saved.

I understand that the need to also handle the "live" case comes from upstream discussions about solving the "general case",
where someone might want to do this for "live" VMs, but if helpful I want to highlight that it is not part of the specific requirement we are trying to address,
and for this specific case it won't be in the future either, as the whole point of the trigger is to replace the running VM with another VM, so it cannot be kept running.

The reason we are using "migrate" here likely stems from the fact that existing libvirt code currently uses QMP migrate to implement the save and restore commands.
And in my personal view, I think that reusing the existing building blocks (migration, multifd) would be preferable, to avoid having to maintain two separate ways to do the same thing.

That said, it could be done in a different way, if the performance can keep up. Just thinking of reducing the overall effort and also maintenance surface.

Ciao,

Claudio

> 
> Yes, it would be great if someone can help clarify.
> 
> Thanks,
> 
>>
>>>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>>>      that, aren't we?  That's more efficient because it'll wr-protect all
>>>      guest pages, any write triggers a CoW and we only copy the guest pages
>>>      once and for all.
>>>
>>> Either way to go, there's no need to copy any page more than once.  Did I
>>> miss anything perhaps very important?
>>>
>>> I would guess it's option (1) above, because it seems we don't snapshot the
>>> disk alongside.  But I am really not sure now..
>>>
>>
>
Fabiano Rosas April 3, 2023, 2:41 p.m. UTC | #13
David Hildenbrand <david@redhat.com> writes:

> On 30.03.23 20:03, Fabiano Rosas wrote:
>> Hi folks,
>> 
>> I'm continuing the work done last year to add a new format of
>> migration stream that can be used to migrate large guests to a single
>> file in a performant way.
>> 
>> This is an early RFC with the previous code + my additions to support
>> multifd and direct IO. Let me know what you think!
>> 
>> Here are the reference links for previous discussions:
>> 
>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>> 
>> The series has 4 main parts:
>> 
>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>     same as "exec:cat > mig". Patches 1-4 implement this;
>> 
>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>     pages at their relative offsets in the migration file. This saves
>>     space on the worst case of RAM utilization because every page has a
>>     fixed offset in the migration file and (potentially) saves us time
>>     because we could write pages independently in parallel. It also
>>     gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>     implement this;
>> 
>> With patches 1-13 these two^ can be used with:
>> 
>> (qemu) migrate_set_capability fixed-ram on
>> (qemu) migrate[_incoming] file:mig
>
> There are some use cases (especially virtio-mem, but also virtio-balloon 
> with free-page-hinting) where we end up having very sparse guest RAM. We 
> don't want to have such "memory without meaning" in the migration stream 
> nor restore it on the destination.
>

Is that what is currently defined by ramblock_page_is_discarded ->
virtio_mem_rdm_is_populated ?

> Would that still be supported with the new format? For example, have a 
> sparse VM savefile and remember which ranges actually contain reasonable 
> data?

We do ignore zero pages, so I don't think it would be an issue to have
another criterion for ignoring pages. It seems that if we enable postcopy
load w/ fixed-ram, that would already be handled in postcopy_request_page.
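
Just to make the "another criterion" part concrete, here is a rough standalone
sketch (not the series code; apart from pwrite()/memcmp() every name below is
a made-up stand-in) of how a "discarded" check could sit next to the zero-page
check when deciding whether a page gets written at its fixed offset:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SAVE_PAGE_SIZE 4096

    /* Stand-in for QEMU's zero-page detection. */
    static bool page_is_zero(const uint8_t *page)
    {
        static const uint8_t zero[SAVE_PAGE_SIZE];
        return memcmp(page, zero, SAVE_PAGE_SIZE) == 0;
    }

    /*
     * Returns true if the page was written; discarded and zero pages are
     * skipped and simply leave a hole at their fixed offset in the file.
     */
    static bool fixed_ram_maybe_write_page(int fd, off_t block_file_offset,
                                           uint64_t page_index,
                                           const uint8_t *page,
                                           bool page_is_discarded)
    {
        if (page_is_discarded || page_is_zero(page)) {
            return false;
        }
        return pwrite(fd, page, SAVE_PAGE_SIZE,
                      block_file_offset + (off_t)page_index * SAVE_PAGE_SIZE)
               == SAVE_PAGE_SIZE;
    }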
David Hildenbrand April 3, 2023, 4:24 p.m. UTC | #14
On 03.04.23 16:41, Fabiano Rosas wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 30.03.23 20:03, Fabiano Rosas wrote:
>>> Hi folks,
>>>
>>> I'm continuing the work done last year to add a new format of
>>> migration stream that can be used to migrate large guests to a single
>>> file in a performant way.
>>>
>>> This is an early RFC with the previous code + my additions to support
>>> multifd and direct IO. Let me know what you think!
>>>
>>> Here are the reference links for previous discussions:
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>>>
>>> The series has 4 main parts:
>>>
>>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>>      same as "exec:cat > mig". Patches 1-4 implement this;
>>>
>>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>>      pages at their relative offsets in the migration file. This saves
>>>      space on the worst case of RAM utilization because every page has a
>>>      fixed offset in the migration file and (potentially) saves us time
>>>      because we could write pages independently in parallel. It also
>>>      gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>>      implement this;
>>>
>>> With patches 1-13 these two^ can be used with:
>>>
>>> (qemu) migrate_set_capability fixed-ram on
>>> (qemu) migrate[_incoming] file:mig
>>
>> There are some use cases (especially virtio-mem, but also virtio-balloon
>> with free-page-hinting) where we end up having very sparse guest RAM. We
>> don't want to have such "memory without meaning" in the migration stream
>> nor restore it on the destination.
>>
> 
> Is that what is currently defined by ramblock_page_is_discarded ->
> virtio_mem_rdm_is_populated ?

For virtio-mem, yes. For virtio-balloon we communicate that information 
via qemu_guest_free_page_hint().

> 
>> Would that still be supported with the new format? For example, have a
>> sparse VM savefile and remember which ranges actually contain reasonable
>> data?
> 
> We do ignore zero pages, so I don't think it would be an issue to have
> another criteria for ignoring pages. It seems if we do enable postcopy
> load w/ fixed-ram that would be already handled in postcopy_request_page.

Ok, good. Just to note that we do have migration of sparse RAM blocks
working, and if fixed-ram were incompatible with that we'd have to fence it.
Fabiano Rosas April 3, 2023, 4:36 p.m. UTC | #15
David Hildenbrand <david@redhat.com> writes:

> On 03.04.23 16:41, Fabiano Rosas wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 30.03.23 20:03, Fabiano Rosas wrote:
>>>> Hi folks,
>>>>
>>>> I'm continuing the work done last year to add a new format of
>>>> migration stream that can be used to migrate large guests to a single
>>>> file in a performant way.
>>>>
>>>> This is an early RFC with the previous code + my additions to support
>>>> multifd and direct IO. Let me know what you think!
>>>>
>>>> Here are the reference links for previous discussions:
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html
>>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html
>>>>
>>>> The series has 4 main parts:
>>>>
>>>> 1) File migration: A new "file:" migration URI. So "file:mig" does the
>>>>      same as "exec:cat > mig". Patches 1-4 implement this;
>>>>
>>>> 2) Fixed-ram format: A new format for the migration stream. Puts guest
>>>>      pages at their relative offsets in the migration file. This saves
>>>>      space on the worst case of RAM utilization because every page has a
>>>>      fixed offset in the migration file and (potentially) saves us time
>>>>      because we could write pages independently in parallel. It also
>>>>      gives alignment guarantees so we could use O_DIRECT. Patches 5-13
>>>>      implement this;
>>>>
>>>> With patches 1-13 these two^ can be used with:
>>>>
>>>> (qemu) migrate_set_capability fixed-ram on
>>>> (qemu) migrate[_incoming] file:mig
>>>
>>> There are some use cases (especially virtio-mem, but also virtio-balloon
>>> with free-page-hinting) where we end up having very sparse guest RAM. We
>>> don't want to have such "memory without meaning" in the migration stream
>>> nor restore it on the destination.
>>>
>> 
>> Is that what is currently defined by ramblock_page_is_discarded ->
>> virtio_mem_rdm_is_populated ?
>
> For virtio-mem, yes. For virtio-balloon we communicate that information 
> via qemu_guest_free_page_hint().
>
>> 
>>> Would that still be supported with the new format? For example, have a
>>> sparse VM savefile and remember which ranges actually contain reasonable
>>> data?
>> 
>> We do ignore zero pages, so I don't think it would be an issue to have
>> another criteria for ignoring pages. It seems if we do enable postcopy
>> load w/ fixed-ram that would be already handled in postcopy_request_page.
>
> Ok, good. Just to note that we do have migration of sparse RAM blocks 
> working and if fixed-ram would be incompatible we'd have to fence it.

Yep, thanks for the heads-up. I'll keep that in mind.
Peter Xu April 3, 2023, 7:26 p.m. UTC | #16
Hi, Claudio,

Thanks for the context.

On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> Hi, not sure if what is asked here is context in terms of the previous
> upstream discussions or our specific requirement we are trying to bring
> upstream.
>
> In terms of the specific requirement we are trying to bring upstream, we
> need to get libvirt+QEMU VM save and restore functionality to be able to
> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> event trigger happens, the VM needs to be quickly paused and saved to
> disk safely, including datasync, and another VM needs to be restored,
> also in ~5 secs.  For our specific requirement, the VM is never running
> when its data (mostly consisting of RAM) is saved.
>
> I understand that the need to handle also the "live" case comes from
> upstream discussions about solving the "general case", where someone
> might want to do this for "live" VMs, but if helpful I want to highlight
> that it is not part of the specific requirement we are trying to address,
> and for this specific case won't also in the future, as the whole point
> of the trigger is to replace the running VM with another VM, so it cannot
> be kept running.

From what I read so far, that scenario suits exactly what a live snapshot
would do with current QEMU - that should at least involve a snapshot of the
disks being used, or I can't see how it could be live.  So it looks like a
separate request.

> The reason we are using "migrate" here likely stems from the fact that
> existing libvirt code currently uses QMP migrate to implement the save
> and restore commands.  And in my personal view, I think that reusing the
> existing building blocks (migration, multifd) would be preferable, to
> avoid having to maintain two separate ways to do the same thing.  That
> said, it could be done in a different way, if the performance can keep
> up. Just thinking of reducing the overall effort and also maintenance
> surface.

I would vaguely guess the performance can not only keep up but be better
than what the current solution provides, due to the possibility of (1)
batch handling of contiguous guest pages, and (2) no dirty tracking
overhead at all.

For (2), it's not about wr-protect page faults or vmexits due to PML being
full (because vcpus will be stopped anyway..), it's about enabling dirty
tracking at all (which already carries overhead, especially when huge pages
are enabled, because huge pages get split in the EPT pgtables) and all the
bitmap operations QEMU does during live migration even when the VM is not
live.

IMHO reusing multifd may or may not be a good idea here, because it'll of
course also complicate the multifd code and hence make multifd harder to
maintain, and not in a good way, because as I mentioned I don't think this
use case can use much of what multifd provides.

I don't have a strong opinion on the impl (even though I do have a
preference..), but I think at least we should still check on two things:

  - Being crystal clear on the use case above, and double-check whether "VM
    stop" should be the default operation at the start of the new cmd - we
    shouldn't assume the user will be aware of doing this, nor should we
    assume the user is aware of the performance implications.

  - Making sure the image layout is well defined, so:

    - It'll be extensible in the future, and,

    - If someone would like to refactor it to not use the migration thread
      model anymore, the image format, hopefully, can be easy to keep
      untouched so it can be compatible with the current approach.
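
To make "well defined" a bit more concrete, something along these lines
(purely illustrative - not the layout this series currently defines): a
versioned header plus per-ramblock descriptors, so new fields can be added
later without breaking old readers:

    #include <stdint.h>

    #define SAVE_IMG_MAGIC   0x5346524dU   /* arbitrary */
    #define SAVE_IMG_VERSION 1

    struct save_img_header {
        uint32_t magic;
        uint32_t version;       /* bump on incompatible layout changes */
        uint64_t header_size;   /* old readers can skip fields they don't know */
        uint64_t nr_ramblocks;  /* number of descriptors that follow */
    };

    struct save_img_block_desc {
        char     idstr[256];    /* ramblock name */
        uint64_t used_length;
        uint64_t pages_offset;  /* where this block's pages start in the file */
        uint64_t bitmap_offset; /* which pages were actually written */
    };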

Just my two cents. I think Juan should have the best grasp on this.

Thanks,
Claudio Fontana April 4, 2023, 8 a.m. UTC | #17
Hi Peter,

On 4/3/23 21:26, Peter Xu wrote:
> Hi, Claudio,
> 
> Thanks for the context.
> 
> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>> Hi, not sure if what is asked here is context in terms of the previous
>> upstream discussions or our specific requirement we are trying to bring
>> upstream.
>>
>> In terms of the specific requirement we are trying to bring upstream, we
>> need to get libvirt+QEMU VM save and restore functionality to be able to
>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>> event trigger happens, the VM needs to be quickly paused and saved to
>> disk safely, including datasync, and another VM needs to be restored,
>> also in ~5 secs.  For our specific requirement, the VM is never running
>> when its data (mostly consisting of RAM) is saved.
>>
>> I understand that the need to handle also the "live" case comes from
>> upstream discussions about solving the "general case", where someone
>> might want to do this for "live" VMs, but if helpful I want to highlight
>> that it is not part of the specific requirement we are trying to address,
>> and for this specific case won't also in the future, as the whole point
>> of the trigger is to replace the running VM with another VM, so it cannot
>> be kept running.
> 
> From what I read so far, that scenario suites exactly what live snapshot
> would do with current QEMU - that at least should involve a snapshot on the
> disks being used or I can't see how that can be live.  So it looks like a
> separate request.
> 
>> The reason we are using "migrate" here likely stems from the fact that
>> existing libvirt code currently uses QMP migrate to implement the save
>> and restore commands.  And in my personal view, I think that reusing the
>> existing building blocks (migration, multifd) would be preferable, to
>> avoid having to maintain two separate ways to do the same thing.  That
>> said, it could be done in a different way, if the performance can keep
>> up. Just thinking of reducing the overall effort and also maintenance
>> surface.
> 
> I would vaguely guess the performance can not only keep up but better than
> what the current solution would provide, due to the possibility of (1)
> batch handling of continuous guest pages, and (2) completely no dirty
> tracking overhead.
> 
> For (2), it's not about wr-protect page faults or vmexits due to PML being
> full (because vcpus will be stopped anyway..), it's about enabling the
> dirty tracking (which already contains overhead, especially when huge pages
> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> operations QEMU does during live migration even if the VM is not live.

That is something we could profile for; I do not remember it being a really
important source of overhead in my previous profiling runs, but it may be
worthwhile to redo the profiling with Fabiano's patchset.

> 
> IMHO reusing multifd may or may not be a good idea here, because it'll of
> course also complicate multifd code, hence makes multifd harder to
> maintain, while not in a good way, because as I mentioned I don't think it
> can use much of what multifd provides.


The main advantage we get, in my view, is the automatic multithreading of the qemu_savevm_state_iterate code.

Reimplementing the same thing again has the potential to cause bitrot for this use case. Using multiple fds for the transfer is exactly what is needed here,
and in my understanding it is the exact reason multifd exists: to take advantage of high-bandwidth migration channels.

The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
so it seems a very natural extension of multifd to me.
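
As a toy illustration of why file fds fit so naturally here (plain POSIX,
nothing QEMU-specific, channel layout made up for the example): with fixed
offsets, several channel threads can write into the same fd with pwrite()
and never need to coordinate on a stream position, which is the property
the multifd channels would rely on:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SZ           4096
    #define PAGES_PER_CHANNEL 1024

    struct channel {
        int   fd;    /* shared output file */
        off_t base;  /* start of this channel's region in the file */
    };

    static void *channel_worker(void *opaque)
    {
        struct channel *ch = opaque;
        uint8_t page[PAGE_SZ];
        memset(page, 0xaa, sizeof(page));

        for (int i = 0; i < PAGES_PER_CHANNEL; i++) {
            /* Each page lands at a known offset, independent of the others. */
            pwrite(ch->fd, page, PAGE_SZ, ch->base + (off_t)i * PAGE_SZ);
        }
        return NULL;
    }

    int main(void)
    {
        int fd = open("mig", O_CREAT | O_WRONLY, 0600);
        pthread_t th[4];
        struct channel ch[4];

        if (fd < 0) {
            return 1;
        }
        for (int i = 0; i < 4; i++) {
            ch[i] = (struct channel){ fd, (off_t)i * PAGES_PER_CHANNEL * PAGE_SZ };
            pthread_create(&th[i], NULL, channel_worker, &ch[i]);
        }
        for (int i = 0; i < 4; i++) {
            pthread_join(th[i], NULL);
        }
        close(fd);
        return 0;
    }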

> 
> I don't have a strong opinion on the impl (even though I do have a
> preference..), but I think at least we should still check on two things:
> 
>   - Being crystal clear on the use case above, and double check whether "VM
>     stop" should be the default operation at the start of the new cmd - we
>     shouldn't assume the user will be aware of doing this, neither should
>     we assume the user is aware of the performance implications.


I'm not sure I can identify what you are asking for specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
Probably I missed something there.

> 
>   - Making sure the image layout is well defined, so:
> 
>     - It'll be extensible in the future, and,
> 
>     - If someone would like to refactor it to not use the migration thread
>       model anymore, the image format, hopefully, can be easy to keep
>       untouched so it can be compatible with the current approach.
> 
> Just my two cents. I think Juan should have the best grasp on this.
> 
> Thanks,
> 

Ciao,

Claudio
Peter Xu April 4, 2023, 2:53 p.m. UTC | #18
On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> Hi Peter,

Hi, Claudio,

> 
> On 4/3/23 21:26, Peter Xu wrote:
> > Hi, Claudio,
> > 
> > Thanks for the context.
> > 
> > On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >> Hi, not sure if what is asked here is context in terms of the previous
> >> upstream discussions or our specific requirement we are trying to bring
> >> upstream.
> >>
> >> In terms of the specific requirement we are trying to bring upstream, we
> >> need to get libvirt+QEMU VM save and restore functionality to be able to
> >> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >> event trigger happens, the VM needs to be quickly paused and saved to
> >> disk safely, including datasync, and another VM needs to be restored,
> >> also in ~5 secs.  For our specific requirement, the VM is never running
> >> when its data (mostly consisting of RAM) is saved.
> >>
> >> I understand that the need to handle also the "live" case comes from
> >> upstream discussions about solving the "general case", where someone
> >> might want to do this for "live" VMs, but if helpful I want to highlight
> >> that it is not part of the specific requirement we are trying to address,
> >> and for this specific case won't also in the future, as the whole point
> >> of the trigger is to replace the running VM with another VM, so it cannot
> >> be kept running.
> > 
> > From what I read so far, that scenario suites exactly what live snapshot
> > would do with current QEMU - that at least should involve a snapshot on the
> > disks being used or I can't see how that can be live.  So it looks like a
> > separate request.
> > 
> >> The reason we are using "migrate" here likely stems from the fact that
> >> existing libvirt code currently uses QMP migrate to implement the save
> >> and restore commands.  And in my personal view, I think that reusing the
> >> existing building blocks (migration, multifd) would be preferable, to
> >> avoid having to maintain two separate ways to do the same thing.  That
> >> said, it could be done in a different way, if the performance can keep
> >> up. Just thinking of reducing the overall effort and also maintenance
> >> surface.
> > 
> > I would vaguely guess the performance can not only keep up but better than
> > what the current solution would provide, due to the possibility of (1)
> > batch handling of continuous guest pages, and (2) completely no dirty
> > tracking overhead.
> > 
> > For (2), it's not about wr-protect page faults or vmexits due to PML being
> > full (because vcpus will be stopped anyway..), it's about enabling the
> > dirty tracking (which already contains overhead, especially when huge pages
> > are enabled, to split huge pages in EPT pgtables) and all the bitmap
> > operations QEMU does during live migration even if the VM is not live.
> 
> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> but maybe worthwhile redoing the profiling with Fabiano's patchset.

Yes, I don't know the detailed numbers either; they should depend on the guest
configuration (mem size, mem type, kernel version etc).  It could be less of a
concern compared to the time used elsewhere.  More on this below.

> 
> > 
> > IMHO reusing multifd may or may not be a good idea here, because it'll of
> > course also complicate multifd code, hence makes multifd harder to
> > maintain, while not in a good way, because as I mentioned I don't think it
> > can use much of what multifd provides.
> 
> 
> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> 
> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> 
> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> so it seems a very natural extension of multifd to me.

Yes - I haven't looked at the multifd patches at all, so I don't have a
solid clue on how much it'll affect multifd.  I'll leave that to Juan.

> 
> > 
> > I don't have a strong opinion on the impl (even though I do have a
> > preference..), but I think at least we should still check on two things:
> > 
> >   - Being crystal clear on the use case above, and double check whether "VM
> >     stop" should be the default operation at the start of the new cmd - we
> >     shouldn't assume the user will be aware of doing this, neither should
> >     we assume the user is aware of the performance implications.
> 
> 
> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> Probably I missed something there.

Yes, then IMHO as mentioned we should make "vm stop" part of the command
procedure if the VM was still running when the command was invoked.  Then we
can already optimize away the dirty logging from (2) above with the current
framework. E.g., we already optimized live snapshot to not enable dirty logging:

        if (!migrate_background_snapshot()) {
            memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
            migration_bitmap_sync_precopy(rs);
        }

Maybe that can also be done for fixed-ram migration, so no matter how much
overhead there will be, that can be avoided.

PS: I think similar optimizations can also be done in ram_save_complete() or
ram_state_pending_exact().. maybe we should move the check into
migration_bitmap_sync_precopy() so the whole sync can be skipped when possible.

Thanks,

> 
> > 
> >   - Making sure the image layout is well defined, so:
> > 
> >     - It'll be extensible in the future, and,
> > 
> >     - If someone would like to refactor it to not use the migration thread
> >       model anymore, the image format, hopefully, can be easy to keep
> >       untouched so it can be compatible with the current approach.
> > 
> > Just my two cents. I think Juan should have the best grasp on this.
> > 
> > Thanks,
> > 
> 
> Ciao,
> 
> Claudio
>
Claudio Fontana April 4, 2023, 3:10 p.m. UTC | #19
On 4/4/23 16:53, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>> Hi Peter,
> 
> Hi, Claudio,
> 
>>
>> On 4/3/23 21:26, Peter Xu wrote:
>>> Hi, Claudio,
>>>
>>> Thanks for the context.
>>>
>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>>>> Hi, not sure if what is asked here is context in terms of the previous
>>>> upstream discussions or our specific requirement we are trying to bring
>>>> upstream.
>>>>
>>>> In terms of the specific requirement we are trying to bring upstream, we
>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>>>> event trigger happens, the VM needs to be quickly paused and saved to
>>>> disk safely, including datasync, and another VM needs to be restored,
>>>> also in ~5 secs.  For our specific requirement, the VM is never running
>>>> when its data (mostly consisting of RAM) is saved.
>>>>
>>>> I understand that the need to handle also the "live" case comes from
>>>> upstream discussions about solving the "general case", where someone
>>>> might want to do this for "live" VMs, but if helpful I want to highlight
>>>> that it is not part of the specific requirement we are trying to address,
>>>> and for this specific case won't also in the future, as the whole point
>>>> of the trigger is to replace the running VM with another VM, so it cannot
>>>> be kept running.
>>>
>>> From what I read so far, that scenario suites exactly what live snapshot
>>> would do with current QEMU - that at least should involve a snapshot on the
>>> disks being used or I can't see how that can be live.  So it looks like a
>>> separate request.
>>>
>>>> The reason we are using "migrate" here likely stems from the fact that
>>>> existing libvirt code currently uses QMP migrate to implement the save
>>>> and restore commands.  And in my personal view, I think that reusing the
>>>> existing building blocks (migration, multifd) would be preferable, to
>>>> avoid having to maintain two separate ways to do the same thing.  That
>>>> said, it could be done in a different way, if the performance can keep
>>>> up. Just thinking of reducing the overall effort and also maintenance
>>>> surface.
>>>
>>> I would vaguely guess the performance can not only keep up but better than
>>> what the current solution would provide, due to the possibility of (1)
>>> batch handling of continuous guest pages, and (2) completely no dirty
>>> tracking overhead.
>>>
>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>>> full (because vcpus will be stopped anyway..), it's about enabling the
>>> dirty tracking (which already contains overhead, especially when huge pages
>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>>> operations QEMU does during live migration even if the VM is not live.
>>
>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> 
> Yes I don't know the detailed number either, it should depend on the guest
> configuration (mem size, mem type, kernel version etc).  It could be less a
> concern comparing to the time used elsewhere.  More on this on below.
> 
>>
>>>
>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>>> course also complicate multifd code, hence makes multifd harder to
>>> maintain, while not in a good way, because as I mentioned I don't think it
>>> can use much of what multifd provides.
>>
>>
>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>>
>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>>
>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>> so it seems a very natural extension of multifd to me.
> 
> Yes, since I haven't looked at the multifd patches at all so I don't have
> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> 
>>
>>>
>>> I don't have a strong opinion on the impl (even though I do have a
>>> preference..), but I think at least we should still check on two things:
>>>
>>>   - Being crystal clear on the use case above, and double check whether "VM
>>>     stop" should be the default operation at the start of the new cmd - we
>>>     shouldn't assume the user will be aware of doing this, neither should
>>>     we assume the user is aware of the performance implications.
>>
>>
>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>> Probably I missed something there.
> 
> Yes, then IMHO as mentioned we should make "vm stop" part of the command
> procedure if vm was still running when invoked.  Then we can already
> optimize dirty logging of above (2) with the current framework. E.g., we
> already optimized live snapshot to not enable dirty logging:
> 
>         if (!migrate_background_snapshot()) {
>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>             migration_bitmap_sync_precopy(rs);
>         }
> 
> Maybe that can also be done for fixed-ram migration, so no matter how much
> overhead there will be, that can be avoided.

Understood, agree.

Would it make sense to check for something like if (!runstate_is_running())
instead of checking for the specific multifd + fixed-ram feature?

I think from a high-level perspective there should not be any dirtying if the vcpus are not running, right?
This could even be a bit more future-proof, avoiding checks for many individual features, if they all happen to share the fact that the vcpus are not running.
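
Concretely, I just mean flipping the condition around the snippet you quoted
(runstate_is_running() already exists; whether that holds if the user can
start/stop the VM mid-migration is exactly the part I'm unsure about):

        if (runstate_is_running()) {
            memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
            migration_bitmap_sync_precopy(rs);
        }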

> 
> PS: I think similar optimizations can be done too in ram_save_complete() or
> ram_state_pending_exact().. maybe we should move the check into
> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.

makes sense, interesting.

I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of CPU as the dirty bitmap handling, even when "ignore-shared" is not used.

this feature was added by:

commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
Author: Yury Kotov <yury-kotov@yandex-team.ru>
Date:   Fri Feb 15 20:45:46 2019 +0300

    migration: Add an ability to ignore shared RAM blocks
    
    If ignore-shared capability is set then skip shared RAMBlocks during the
    RAM migration.
    Also, move qemu_ram_foreach_migratable_block (and rename) to the
    migration code, because it requires access to the migration capabilities.
    
    Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
    Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Probably not that important; I just mention it since we were thinking of possible small optimizations.
I would like to share the complete callgrind data from my previous runs, but I cannot find a way to export it in a readable state; I could export the graph as a PDF if that would be helpful.

Likely we'd need a new round of measurements with perf anyway...

Ciao,

Claudio

> 
> Thanks,
> 
>>
>>>
>>>   - Making sure the image layout is well defined, so:
>>>
>>>     - It'll be extensible in the future, and,
>>>
>>>     - If someone would like to refactor it to not use the migration thread
>>>       model anymore, the image format, hopefully, can be easy to keep
>>>       untouched so it can be compatible with the current approach.
>>>
>>> Just my two cents. I think Juan should have the best grasp on this.
>>>
>>> Thanks,
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>
Peter Xu April 4, 2023, 3:56 p.m. UTC | #20
On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
> On 4/4/23 16:53, Peter Xu wrote:
> > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> >> Hi Peter,
> > 
> > Hi, Claudio,
> > 
> >>
> >> On 4/3/23 21:26, Peter Xu wrote:
> >>> Hi, Claudio,
> >>>
> >>> Thanks for the context.
> >>>
> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >>>> Hi, not sure if what is asked here is context in terms of the previous
> >>>> upstream discussions or our specific requirement we are trying to bring
> >>>> upstream.
> >>>>
> >>>> In terms of the specific requirement we are trying to bring upstream, we
> >>>> need to get libvirt+QEMU VM save and restore functionality to be able to
> >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >>>> event trigger happens, the VM needs to be quickly paused and saved to
> >>>> disk safely, including datasync, and another VM needs to be restored,
> >>>> also in ~5 secs.  For our specific requirement, the VM is never running
> >>>> when its data (mostly consisting of RAM) is saved.
> >>>>
> >>>> I understand that the need to handle also the "live" case comes from
> >>>> upstream discussions about solving the "general case", where someone
> >>>> might want to do this for "live" VMs, but if helpful I want to highlight
> >>>> that it is not part of the specific requirement we are trying to address,
> >>>> and for this specific case won't also in the future, as the whole point
> >>>> of the trigger is to replace the running VM with another VM, so it cannot
> >>>> be kept running.
> >>>
> >>> From what I read so far, that scenario suites exactly what live snapshot
> >>> would do with current QEMU - that at least should involve a snapshot on the
> >>> disks being used or I can't see how that can be live.  So it looks like a
> >>> separate request.
> >>>
> >>>> The reason we are using "migrate" here likely stems from the fact that
> >>>> existing libvirt code currently uses QMP migrate to implement the save
> >>>> and restore commands.  And in my personal view, I think that reusing the
> >>>> existing building blocks (migration, multifd) would be preferable, to
> >>>> avoid having to maintain two separate ways to do the same thing.  That
> >>>> said, it could be done in a different way, if the performance can keep
> >>>> up. Just thinking of reducing the overall effort and also maintenance
> >>>> surface.
> >>>
> >>> I would vaguely guess the performance can not only keep up but better than
> >>> what the current solution would provide, due to the possibility of (1)
> >>> batch handling of continuous guest pages, and (2) completely no dirty
> >>> tracking overhead.
> >>>
> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being
> >>> full (because vcpus will be stopped anyway..), it's about enabling the
> >>> dirty tracking (which already contains overhead, especially when huge pages
> >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> >>> operations QEMU does during live migration even if the VM is not live.
> >>
> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> >> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> > 
> > Yes I don't know the detailed number either, it should depend on the guest
> > configuration (mem size, mem type, kernel version etc).  It could be less a
> > concern comparing to the time used elsewhere.  More on this on below.
> > 
> >>
> >>>
> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of
> >>> course also complicate multifd code, hence makes multifd harder to
> >>> maintain, while not in a good way, because as I mentioned I don't think it
> >>> can use much of what multifd provides.
> >>
> >>
> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> >>
> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> >>
> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> >> so it seems a very natural extension of multifd to me.
> > 
> > Yes, since I haven't looked at the multifd patches at all so I don't have
> > solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> > 
> >>
> >>>
> >>> I don't have a strong opinion on the impl (even though I do have a
> >>> preference..), but I think at least we should still check on two things:
> >>>
> >>>   - Being crystal clear on the use case above, and double check whether "VM
> >>>     stop" should be the default operation at the start of the new cmd - we
> >>>     shouldn't assume the user will be aware of doing this, neither should
> >>>     we assume the user is aware of the performance implications.
> >>
> >>
> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> >> Probably I missed something there.
> > 
> > Yes, then IMHO as mentioned we should make "vm stop" part of the command
> > procedure if vm was still running when invoked.  Then we can already
> > optimize dirty logging of above (2) with the current framework. E.g., we
> > already optimized live snapshot to not enable dirty logging:
> > 
> >         if (!migrate_background_snapshot()) {
> >             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
> >             migration_bitmap_sync_precopy(rs);
> >         }
> > 
> > Maybe that can also be done for fixed-ram migration, so no matter how much
> > overhead there will be, that can be avoided.
> 
> Understood, agree.
> 
> Would it make sense to check for something like if (!runstate_is_running())
> instead of checking for the specific multifd + fixed-ram feature?
> 
> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.

Hmm, I'm not sure.  I think we still allow the user to stop/start VMs during
migration?  If so, that check is probably not applicable.

And it won't cover live snapshot either - a live snapshot always runs with the
VM running, yet it doesn't need this kind of dirty tracking.  It does track
dirtying, but in a synchronous way to make it efficient (while KVM dirty
tracking is asynchronous, i.e. a vcpu won't be blocked when it dirties a page).

So here we can make it "if (migrate_needs_async_dirty_tracking())", and
have both live snapshot and fixed-ram migration covered in the helper to
opt out of dirty tracking.
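
Untested sketch of what I mean (migrate_background_snapshot() and the two
calls below exist today; migrate_fixed_ram() stands for whatever capability
helper this series ends up adding - I'm making that name up):

    /* True when we need KVM's asynchronous dirty tracking. */
    static bool migrate_needs_async_dirty_tracking(void)
    {
        /*
         * Background snapshot tracks writes synchronously via wr-protect,
         * and a fixed-ram save runs with the vcpus stopped, so neither
         * needs the async dirty log.
         */
        return !migrate_background_snapshot() && !migrate_fixed_ram();
    }

    /* ... and the call site quoted above becomes: */
    if (migrate_needs_async_dirty_tracking()) {
        memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
        migration_bitmap_sync_precopy(rs);
    }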

One thing worth keeping an eye on here is that if we go that way we need to
make sure the VM won't be started during the fixed-ram migration.  IOW, we can
cancel the fixed-ram migration (in this case, more suitable to be called
"vm suspend") if the user starts the VM during the process.

> 
> > 
> > PS: I think similar optimizations can be done too in ram_save_complete() or
> > ram_state_pending_exact().. maybe we should move the check into
> > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
> 
> makes sense, interesting.
> 
> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.

Do you mean we can skip dirty tracking when ramblock_is_ignored() is true for
a ramblock?  I think it's doable but it'll be slightly more involved, because
ignored/shared ramblocks can be used together with private/non-ignored
ramblocks, so at least it's not applicable globally.

> 
> this feature was added by:
> 
> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
> Author: Yury Kotov <yury-kotov@yandex-team.ru>
> Date:   Fri Feb 15 20:45:46 2019 +0300
> 
>     migration: Add an ability to ignore shared RAM blocks
>     
>     If ignore-shared capability is set then skip shared RAMBlocks during the
>     RAM migration.
>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>     migration code, because it requires access to the migration capabilities.
>     
>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Probably not that important, just to mention since we were thinking of possible small optimizations.
> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
> 
> Likely we'd need a new round of measurements with perf...

Yes, it would be good to know.  That said, I think it'll also be fine if
optimizations are done on top, as long as the change is compatible
with the interface being proposed.

Here, e.g., "stop the VM within the cmd" is part of the interface, so IMHO it
should be decided before this series gets merged.

Thanks.

> 
> Ciao,
> 
> Claudio
> 
> > 
> > Thanks,
> > 
> >>
> >>>
> >>>   - Making sure the image layout is well defined, so:
> >>>
> >>>     - It'll be extensible in the future, and,
> >>>
> >>>     - If someone would like to refactor it to not use the migration thread
> >>>       model anymore, the image format, hopefully, can be easy to keep
> >>>       untouched so it can be compatible with the current approach.
> >>>
> >>> Just my two cents. I think Juan should have the best grasp on this.
> >>>
> >>> Thanks,
> >>>
> >>
> >> Ciao,
> >>
> >> Claudio
> >>
> > 
>
Fabiano Rosas April 6, 2023, 4:46 p.m. UTC | #21
Peter Xu <peterx@redhat.com> writes:

> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
>> On 4/4/23 16:53, Peter Xu wrote:
>> > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>> >> Hi Peter,
>> > 
>> > Hi, Claudio,
>> > 
>> >>
>> >> On 4/3/23 21:26, Peter Xu wrote:
>> >>> Hi, Claudio,
>> >>>
>> >>> Thanks for the context.
>> >>>
>> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>> >>>> Hi, not sure if what is asked here is context in terms of the previous
>> >>>> upstream discussions or our specific requirement we are trying to bring
>> >>>> upstream.
>> >>>>
>> >>>> In terms of the specific requirement we are trying to bring upstream, we
>> >>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>> >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>> >>>> event trigger happens, the VM needs to be quickly paused and saved to
>> >>>> disk safely, including datasync, and another VM needs to be restored,
>> >>>> also in ~5 secs.  For our specific requirement, the VM is never running
>> >>>> when its data (mostly consisting of RAM) is saved.
>> >>>>
>> >>>> I understand that the need to handle also the "live" case comes from
>> >>>> upstream discussions about solving the "general case", where someone
>> >>>> might want to do this for "live" VMs, but if helpful I want to highlight
>> >>>> that it is not part of the specific requirement we are trying to address,
>> >>>> and for this specific case won't also in the future, as the whole point
>> >>>> of the trigger is to replace the running VM with another VM, so it cannot
>> >>>> be kept running.
>> >>>
>> >>> From what I read so far, that scenario suites exactly what live snapshot
>> >>> would do with current QEMU - that at least should involve a snapshot on the
>> >>> disks being used or I can't see how that can be live.  So it looks like a
>> >>> separate request.
>> >>>
>> >>>> The reason we are using "migrate" here likely stems from the fact that
>> >>>> existing libvirt code currently uses QMP migrate to implement the save
>> >>>> and restore commands.  And in my personal view, I think that reusing the
>> >>>> existing building blocks (migration, multifd) would be preferable, to
>> >>>> avoid having to maintain two separate ways to do the same thing.  That
>> >>>> said, it could be done in a different way, if the performance can keep
>> >>>> up. Just thinking of reducing the overall effort and also maintenance
>> >>>> surface.
>> >>>
>> >>> I would vaguely guess the performance can not only keep up but better than
>> >>> what the current solution would provide, due to the possibility of (1)
>> >>> batch handling of continuous guest pages, and (2) completely no dirty
>> >>> tracking overhead.
>> >>>
>> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>> >>> full (because vcpus will be stopped anyway..), it's about enabling the
>> >>> dirty tracking (which already contains overhead, especially when huge pages
>> >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>> >>> operations QEMU does during live migration even if the VM is not live.
>> >>
>> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>> >> but maybe worthwhile redoing the profiling with Fabiano's patchset.
>> > 
>> > Yes I don't know the detailed number either, it should depend on the guest
>> > configuration (mem size, mem type, kernel version etc).  It could be less a
>> > concern comparing to the time used elsewhere.  More on this on below.
>> > 
>> >>
>> >>>
>> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>> >>> course also complicate multifd code, hence makes multifd harder to
>> >>> maintain, while not in a good way, because as I mentioned I don't think it
>> >>> can use much of what multifd provides.
>> >>
>> >>
>> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>> >>
>> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>> >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>> >>
>> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>> >> so it seems a very natural extension of multifd to me.
>> > 
>> > Yes, since I haven't looked at the multifd patches at all so I don't have
>> > solid clue on how much it'll affect multifd.  I'll leave that to Juan.
>> > 
>> >>
>> >>>
>> >>> I don't have a strong opinion on the impl (even though I do have a
>> >>> preference..), but I think at least we should still check on two things:
>> >>>
>> >>>   - Being crystal clear on the use case above, and double check whether "VM
>> >>>     stop" should be the default operation at the start of the new cmd - we
>> >>>     shouldn't assume the user will be aware of doing this, neither should
>> >>>     we assume the user is aware of the performance implications.
>> >>
>> >>
>> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>> >> Probably I missed something there.
>> > 
>> > Yes, then IMHO as mentioned we should make "vm stop" part of the command
>> > procedure if vm was still running when invoked.  Then we can already
>> > optimize dirty logging of above (2) with the current framework. E.g., we
>> > already optimized live snapshot to not enable dirty logging:
>> > 
>> >         if (!migrate_background_snapshot()) {
>> >             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>> >             migration_bitmap_sync_precopy(rs);
>> >         }
>> > 
>> > Maybe that can also be done for fixed-ram migration, so no matter how much
>> > overhead there will be, that can be avoided.
>> 
>> Understood, agree.
>> 
>> Would it make sense to check for something like if (!runstate_is_running())
>> instead of checking for the specific multifd + fixed-ram feature?
>> 
>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
>
> Hmm I'm not sure.  I think we still allow use to stop/start VMs during
> migration?  If so, probably not applicable.
>
> And it won't cover live snapshot too - live snapshot always run with VM
> running, but it doesn't need to track dirty.  It actually needs to track
> dirty, but in a synchronous way to make it efficient (while kvm dirty
> tracking is asynchronous, aka, vcpu won't be blocked if dirtied).
>
> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
> having both live snapshot and fixed-ram migration covered in the helper to
> opt-out dirty tracking.
>
> One thing worth keeping an eye here is if we go that way we need to make
> sure VM won't be started during the fixed-ram migration.  IOW, we can
> cancel the fixed-ram migration (in this case, more suitable to be called
> "vm suspend") if the user starts the VM during the process.
>
>> 
>> > 
>> > PS: I think similar optimizations can be done too in ram_save_complete() or
>> > ram_state_pending_exact().. maybe we should move the check into
>> > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
>> 
>> makes sense, interesting.
>> 
>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
>
> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
> ramblock?  I think it's doable but it'll be slightly more involved, because
> ignored/shared ramblocks can be used together with private/non-ignored
> ramblocks, hence at least it's not applicable globally.
>
>> 
>> this feature was added by:
>> 
>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
>> Date:   Fri Feb 15 20:45:46 2019 +0300
>> 
>>     migration: Add an ability to ignore shared RAM blocks
>>     
>>     If ignore-shared capability is set then skip shared RAMBlocks during the
>>     RAM migration.
>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>>     migration code, because it requires access to the migration capabilities.
>>     
>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> 
>> Probably not that important, just to mention since we were thinking of possible small optimizations.
>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
>> 
>> Likely we'd need a new round of measurements with perf...
>
> Yes it would be good to know. Said that, I think it'll also be fine if
> optimizations are done on top, as long as the change will be compatible
> with the interface being proposed.
>
> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
> should be decided before this series got merged.
>

Ok, so in summary, the high-level requirement says we need to stop the
VM, and we've determined that stopping it before the migration is probably
what makes the most sense.

Keeping in mind that the design of fixed-ram already supports live
migration, I see three options for the interface so far:

1) Add a new command that does vm_stop + fixed-ram migrate;

2) Arbitrarily declare that fixed-ram is always non-live and hardcode
   that;

3) Add a new migration capability "live migration", ON by default and
   have the management layer set fixed-ram=on, live-migration=off.

I guess this also largely depends on what direction we're going with the
migration code in general. I.e. do we prefer a more isolated
implementation, or do we keep the new feature flexible for future use-cases?

I'll give people time to catch up and in the meantime work on adding the
VM stop and the safeguards around the user restarting it.

Thanks all for the input so far.
Claudio Fontana April 7, 2023, 10:36 a.m. UTC | #22
On 4/6/23 18:46, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
>>> On 4/4/23 16:53, Peter Xu wrote:
>>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>>>>> Hi Peter,
>>>>
>>>> Hi, Claudio,
>>>>
>>>>>
>>>>> On 4/3/23 21:26, Peter Xu wrote:
>>>>>> Hi, Claudio,
>>>>>>
>>>>>> Thanks for the context.
>>>>>>
>>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>>>>>>> Hi, not sure if what is asked here is context in terms of the previous
>>>>>>> upstream discussions or our specific requirement we are trying to bring
>>>>>>> upstream.
>>>>>>>
>>>>>>> In terms of the specific requirement we are trying to bring upstream, we
>>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
>>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
>>>>>>> event trigger happens, the VM needs to be quickly paused and saved to
>>>>>>> disk safely, including datasync, and another VM needs to be restored,
>>>>>>> also in ~5 secs.  For our specific requirement, the VM is never running
>>>>>>> when its data (mostly consisting of RAM) is saved.
>>>>>>>
>>>>>>> I understand that the need to handle also the "live" case comes from
>>>>>>> upstream discussions about solving the "general case", where someone
>>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight
>>>>>>> that it is not part of the specific requirement we are trying to address,
>>>>>>> and for this specific case won't also in the future, as the whole point
>>>>>>> of the trigger is to replace the running VM with another VM, so it cannot
>>>>>>> be kept running.
>>>>>>
>>>>>> From what I read so far, that scenario suites exactly what live snapshot
>>>>>> would do with current QEMU - that at least should involve a snapshot on the
>>>>>> disks being used or I can't see how that can be live.  So it looks like a
>>>>>> separate request.
>>>>>>
>>>>>>> The reason we are using "migrate" here likely stems from the fact that
>>>>>>> existing libvirt code currently uses QMP migrate to implement the save
>>>>>>> and restore commands.  And in my personal view, I think that reusing the
>>>>>>> existing building blocks (migration, multifd) would be preferable, to
>>>>>>> avoid having to maintain two separate ways to do the same thing.  That
>>>>>>> said, it could be done in a different way, if the performance can keep
>>>>>>> up. Just thinking of reducing the overall effort and also maintenance
>>>>>>> surface.
>>>>>>
>>>>>> I would vaguely guess the performance can not only keep up but better than
>>>>>> what the current solution would provide, due to the possibility of (1)
>>>>>> batch handling of continuous guest pages, and (2) completely no dirty
>>>>>> tracking overhead.
>>>>>>
>>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
>>>>>> full (because vcpus will be stopped anyway..), it's about enabling the
>>>>>> dirty tracking (which already contains overhead, especially when huge pages
>>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
>>>>>> operations QEMU does during live migration even if the VM is not live.
>>>>>
>>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
>>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
>>>>
>>>> Yes I don't know the detailed number either, it should depend on the guest
>>>> configuration (mem size, mem type, kernel version etc).  It could be less a
>>>> concern comparing to the time used elsewhere.  More on this on below.
>>>>
>>>>>
>>>>>>
>>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>>>>>> course also complicate multifd code, hence makes multifd harder to
>>>>>> maintain, while not in a good way, because as I mentioned I don't think it
>>>>>> can use much of what multifd provides.
>>>>>
>>>>>
>>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
>>>>>
>>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
>>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
>>>>>
>>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
>>>>> so it seems a very natural extension of multifd to me.
>>>>
>>>> Yes, since I haven't looked at the multifd patches at all so I don't have
>>>> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
>>>>
>>>>>
>>>>>>
>>>>>> I don't have a strong opinion on the impl (even though I do have a
>>>>>> preference..), but I think at least we should still check on two things:
>>>>>>
>>>>>>   - Being crystal clear on the use case above, and double check whether "VM
>>>>>>     stop" should be the default operation at the start of the new cmd - we
>>>>>>     shouldn't assume the user will be aware of doing this, neither should
>>>>>>     we assume the user is aware of the performance implications.
>>>>>
>>>>>
>>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
>>>>> Probably I missed something there.
>>>>
>>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command
>>>> procedure if vm was still running when invoked.  Then we can already
>>>> optimize dirty logging of above (2) with the current framework. E.g., we
>>>> already optimized live snapshot to not enable dirty logging:
>>>>
>>>>         if (!migrate_background_snapshot()) {
>>>>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>>>>             migration_bitmap_sync_precopy(rs);
>>>>         }
>>>>
>>>> Maybe that can also be done for fixed-ram migration, so no matter how much
>>>> overhead there will be, that can be avoided.
>>>
>>> Understood, agree.
>>>
>>> Would it make sense to check for something like if (!runstate_is_running())
>>> instead of checking for the specific multifd + fixed-ram feature?
>>>
>>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
>>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
>>
>> Hmm I'm not sure.  I think we still allow use to stop/start VMs during
>> migration?  If so, probably not applicable.
>>
>> And it won't cover live snapshot too - live snapshot always run with VM
>> running, but it doesn't need to track dirty.  It actually needs to track
>> dirty, but in a synchronous way to make it efficient (while kvm dirty
>> tracking is asynchronous, aka, vcpu won't be blocked if dirtied).
>>
>> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
>> having both live snapshot and fixed-ram migration covered in the helper to
>> opt-out dirty tracking.
>>
>> One thing worth keeping an eye here is if we go that way we need to make
>> sure VM won't be started during the fixed-ram migration.  IOW, we can
>> cancel the fixed-ram migration (in this case, more suitable to be called
>> "vm suspend") if the user starts the VM during the process.
>>
>>>
>>>>
>>>> PS: I think similar optimizations can be done too in ram_save_complete() or
>>>> ram_state_pending_exact().. maybe we should move the check into
>>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
>>>
>>> makes sense, interesting.
>>>
>>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
>>
>> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
>> ramblock?  I think it's doable but it'll be slightly more involved, because
>> ignored/shared ramblocks can be used together with private/non-ignored
>> ramblocks, hence at least it's not applicable globally.
>>
>>>
>>> this feature was added by:
>>>
>>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
>>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
>>> Date:   Fri Feb 15 20:45:46 2019 +0300
>>>
>>>     migration: Add an ability to ignore shared RAM blocks
>>>     
>>>     If ignore-shared capability is set then skip shared RAMBlocks during the
>>>     RAM migration.
>>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
>>>     migration code, because it requires access to the migration capabilities.
>>>     
>>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
>>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
>>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>> Probably not that important, just to mention since we were thinking of possible small optimizations.
>>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
>>>
>>> Likely we'd need a new round of measurements with perf...
>>
>> Yes it would be good to know. Said that, I think it'll also be fine if
>> optimizations are done on top, as long as the change will be compatible
>> with the interface being proposed.
>>
>> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
>> should be decided before this series got merged.
>>
> 
> Ok, so in summary, the high level requirement says we need to stop the
> VM and we've determined that stopping it before the migration is what
> probably makes more sense.
> 
> Keeping in mind that the design of fixed-ram already supports live
> migration, I see three options for the interface so far:

(just my opinion here, I might be wrong, and this is not directly a requirement I am presenting here)

Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that.

If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway,
I think it would be better to leave the decision of whether or not to stop the VM prior to the transfer to the user, or to the management tools (libvirt ...)

We care about the stop case, but since the proposal already supports live too, I see no real good reason to force the user to stop the VM, imposing our own use case when others might find use for "live".

If we want to detect the two cases separately at runtime in the future for additional performance gain, that is in my view a possibility for future work,
but we already know experimentally that the extra overhead of dirty bitmap tracking is not the real bottleneck, at least in our testing,
even with devices capable of transferring ~6 gigabytes per second.

But again, this assumes that the live case is compatible and does not make things overly complicated;
otherwise, looking at it purely from the perspective of these business requirements, we don't need it, and we could even scrap live.

> 
> 1) Add a new command that does vm_stop + fixed-ram migrate;
> 
> 2) Arbitrarily declare that fixed-ram is always non-live and hardcode
>    that;
> 
> 3) Add a new migration capability "live migration", ON by default and
>    have the management layer set fixed-ram=on, live-migration=off.

(just a minor point, for the case where this would apply): instead of an additional option, could we not just detect whether we are "live" by checking whether the guest is in a running state?
I suppose we don't allow starting/stopping guests while the migration is running..
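
Something along these lines (just a sketch, extending the snippet quoted above; whether this check alone is enough is exactly the open question):

    if (!migrate_background_snapshot() && runstate_is_running()) {
        memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
        migration_bitmap_sync_precopy(rs);
    }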


> 
> I guess this also largely depends on what direction we're going with the
> migration code in general. I.e. do we prefer a more isolated
> implementation or keep the new feature flexible for future use-cases?

right, looking for the migration experts and maintainers to chime in here :-)

> 
> I'll give people time to catch up and in the meantime work on adding the
> stop and the safeguards around the user re-starting.
> 
> Thanks all for the input so far.

Thanks, and again: all this is just my 2c, truly.

Ciao,

Claudio
Peter Xu April 11, 2023, 3:48 p.m. UTC | #23
On Fri, Apr 07, 2023 at 12:36:24PM +0200, Claudio Fontana wrote:
> On 4/6/23 18:46, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> >> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote:
> >>> On 4/4/23 16:53, Peter Xu wrote:
> >>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
> >>>>> Hi Peter,
> >>>>
> >>>> Hi, Claudio,
> >>>>
> >>>>>
> >>>>> On 4/3/23 21:26, Peter Xu wrote:
> >>>>>> Hi, Claudio,
> >>>>>>
> >>>>>> Thanks for the context.
> >>>>>>
> >>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
> >>>>>>> Hi, not sure if what is asked here is context in terms of the previous
> >>>>>>> upstream discussions or our specific requirement we are trying to bring
> >>>>>>> upstream.
> >>>>>>>
> >>>>>>> In terms of the specific requirement we are trying to bring upstream, we
> >>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to
> >>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds.  When an
> >>>>>>> event trigger happens, the VM needs to be quickly paused and saved to
> >>>>>>> disk safely, including datasync, and another VM needs to be restored,
> >>>>>>> also in ~5 secs.  For our specific requirement, the VM is never running
> >>>>>>> when its data (mostly consisting of RAM) is saved.
> >>>>>>>
> >>>>>>> I understand that the need to handle also the "live" case comes from
> >>>>>>> upstream discussions about solving the "general case", where someone
> >>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight
> >>>>>>> that it is not part of the specific requirement we are trying to address,
> >>>>>>> and for this specific case won't also in the future, as the whole point
> >>>>>>> of the trigger is to replace the running VM with another VM, so it cannot
> >>>>>>> be kept running.
> >>>>>>
> >>>>>> From what I read so far, that scenario suites exactly what live snapshot
> >>>>>> would do with current QEMU - that at least should involve a snapshot on the
> >>>>>> disks being used or I can't see how that can be live.  So it looks like a
> >>>>>> separate request.
> >>>>>>
> >>>>>>> The reason we are using "migrate" here likely stems from the fact that
> >>>>>>> existing libvirt code currently uses QMP migrate to implement the save
> >>>>>>> and restore commands.  And in my personal view, I think that reusing the
> >>>>>>> existing building blocks (migration, multifd) would be preferable, to
> >>>>>>> avoid having to maintain two separate ways to do the same thing.  That
> >>>>>>> said, it could be done in a different way, if the performance can keep
> >>>>>>> up. Just thinking of reducing the overall effort and also maintenance
> >>>>>>> surface.
> >>>>>>
> >>>>>> I would vaguely guess the performance can not only keep up but better than
> >>>>>> what the current solution would provide, due to the possibility of (1)
> >>>>>> batch handling of continuous guest pages, and (2) completely no dirty
> >>>>>> tracking overhead.
> >>>>>>
> >>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being
> >>>>>> full (because vcpus will be stopped anyway..), it's about enabling the
> >>>>>> dirty tracking (which already contains overhead, especially when huge pages
> >>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap
> >>>>>> operations QEMU does during live migration even if the VM is not live.
> >>>>>
> >>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs,
> >>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset.
> >>>>
> >>>> Yes I don't know the detailed number either, it should depend on the guest
> >>>> configuration (mem size, mem type, kernel version etc).  It could be less a
> >>>> concern comparing to the time used elsewhere.  More on this on below.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
> >>>>>> course also complicate multifd code, hence makes multifd harder to
> >>>>>> maintain, while not in a good way, because as I mentioned I don't think it
> >>>>>> can use much of what multifd provides.
> >>>>>
> >>>>>
> >>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view.
> >>>>>
> >>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here,
> >>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels.
> >>>>>
> >>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets,
> >>>>> so it seems a very natural extension of multifd to me.
> >>>>
> >>>> Yes, since I haven't looked at the multifd patches at all so I don't have
> >>>> solid clue on how much it'll affect multifd.  I'll leave that to Juan.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> I don't have a strong opinion on the impl (even though I do have a
> >>>>>> preference..), but I think at least we should still check on two things:
> >>>>>>
> >>>>>>   - Being crystal clear on the use case above, and double check whether "VM
> >>>>>>     stop" should be the default operation at the start of the new cmd - we
> >>>>>>     shouldn't assume the user will be aware of doing this, neither should
> >>>>>>     we assume the user is aware of the performance implications.
> >>>>>
> >>>>>
> >>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible.
> >>>>> Probably I missed something there.
> >>>>
> >>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command
> >>>> procedure if vm was still running when invoked.  Then we can already
> >>>> optimize dirty logging of above (2) with the current framework. E.g., we
> >>>> already optimized live snapshot to not enable dirty logging:
> >>>>
> >>>>         if (!migrate_background_snapshot()) {
> >>>>             memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
> >>>>             migration_bitmap_sync_precopy(rs);
> >>>>         }
> >>>>
> >>>> Maybe that can also be done for fixed-ram migration, so no matter how much
> >>>> overhead there will be, that can be avoided.
> >>>
> >>> Understood, agree.
> >>>
> >>> Would it make sense to check for something like if (!runstate_is_running())
> >>> instead of checking for the specific multifd + fixed-ram feature?
> >>>
> >>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right?
> >>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running.
> >>
> >> Hmm I'm not sure.  I think we still allow use to stop/start VMs during
> >> migration?  If so, probably not applicable.
> >>
> >> And it won't cover live snapshot too - live snapshot always run with VM
> >> running, but it doesn't need to track dirty.  It actually needs to track
> >> dirty, but in a synchronous way to make it efficient (while kvm dirty
> >> tracking is asynchronous, aka, vcpu won't be blocked if dirtied).
> >>
> >> So here we can make it "if (migrate_needs_async_dirty_tracking())", and
> >> having both live snapshot and fixed-ram migration covered in the helper to
> >> opt-out dirty tracking.
> >>
> >> One thing worth keeping an eye here is if we go that way we need to make
> >> sure VM won't be started during the fixed-ram migration.  IOW, we can
> >> cancel the fixed-ram migration (in this case, more suitable to be called
> >> "vm suspend") if the user starts the VM during the process.
> >>
> >>>
> >>>>
> >>>> PS: I think similar optimizations can be done too in ram_save_complete() or
> >>>> ram_state_pending_exact().. maybe we should move the check into
> >>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.
> >>>
> >>> makes sense, interesting.
> >>>
> >>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used.
> >>
> >> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a
> >> ramblock?  I think it's doable but it'll be slightly more involved, because
> >> ignored/shared ramblocks can be used together with private/non-ignored
> >> ramblocks, hence at least it's not applicable globally.
> >>
> >>>
> >>> this feature was added by:
> >>>
> >>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
> >>> Author: Yury Kotov <yury-kotov@yandex-team.ru>
> >>> Date:   Fri Feb 15 20:45:46 2019 +0300
> >>>
> >>>     migration: Add an ability to ignore shared RAM blocks
> >>>     
> >>>     If ignore-shared capability is set then skip shared RAMBlocks during the
> >>>     RAM migration.
> >>>     Also, move qemu_ram_foreach_migratable_block (and rename) to the
> >>>     migration code, because it requires access to the migration capabilities.
> >>>     
> >>>     Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru>
> >>>     Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru>
> >>>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>>     Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>>
> >>> Probably not that important, just to mention since we were thinking of possible small optimizations.
> >>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful.
> >>>
> >>> Likely we'd need a new round of measurements with perf...
> >>
> >> Yes it would be good to know. Said that, I think it'll also be fine if
> >> optimizations are done on top, as long as the change will be compatible
> >> with the interface being proposed.
> >>
> >> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it
> >> should be decided before this series got merged.
> >>
> > 
> > Ok, so in summary, the high level requirement says we need to stop the
> > VM and we've determined that stopping it before the migration is what
> > probably makes more sense.
> > 
> > Keeping in mind that the design of fixed-ram already supports live
> > migration, I see three options for the interface so far:
> 
> (just my opinion here, I might be wrong and is not directly a requirement I am presenting here)
> 
> Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that.
> 
> If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway,
> I think it would be better to leave the decision of whether to stop or not to stop the vm prior to transfer to the user, or to the management tools (libvirt ...)

We'll need someone to stand up and explain the use case.  IMHO we should not
assume something can happen and design the interface around that assumption,
especially if the assumption can impact the design.  In my own experience
that's the major source of over-engineering.

If we live migrate the VM and then stop it after migration completes, it means
we're "assuming" the ultimate disk image will match the VM image we're
going to create, but I doubt it.

Here the whole process actually contains a few steps:

     (1)                  (2)                     (3)               (4)
  VM running --> start live migration --> migration completes --> VM stop
                   (fixed-ram=on)

We have the VM image containing all the VM state (including device and
memory) at step (3), then we can optionally & quickly turn off the VM at
step (4).  IOW, the final disk image contains the state at step (4), not step
(3).  The question is: could something have changed on the disk, or could an
IO flush have happened, between steps (3) and (4)?

IOW, I think the use case so far can only be justified if it's VM suspend.
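
Concretely, a minimal sketch of what I mean (migrate_fixed_ram() here is just a
placeholder for the capability check; vm_stop() and RUN_STATE_PAUSED are
existing QEMU symbols):

    if (migrate_fixed_ram() && runstate_is_running()) {
        vm_stop(RUN_STATE_PAUSED);    /* i.e. move step (4) before step (2) */
    }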

> 
> We care about the stop case, but since the proposal already supports live too, there is no real good reason I think to force the user to stop the VM, forcing our own use case when others might find use for "live".
> 
> If we want to detect the two cases at runtime separately in the future for potential additional performance gain, that is a possibility in my view for future work,
> but we know already experimentally that the bits of extra overhead for the dirty bitmap tracking is not the real bottleneck at least in our testing,
> even with devices capable of transfering ~6 gigabytes per second.

Is that device assigned to the guest?  I'm very curious why that wouldn't
make a difference.

It could be that the device is reusing a small buffer of the guest, so even
if it dirties memory very fast the impact is confined to a small range.  Logically,
high dirty loads will definitely make a difference regardless of the dirty
tracking overhead (e.g., besides the tracking overhead, we'll also need to migrate
dirtied pages more than once, which we don't need to do with a live snapshot).

> 
> But again this is assuming that the live case is compatible and does not make things overly complicated,
> otherwise looking instead at the thing from purely these business requirements perspective we don't need it, and we could even scrap live.
> 
> > 
> > 1) Add a new command that does vm_stop + fixed-ram migrate;
> > 
> > 2) Arbitrarily declare that fixed-ram is always non-live and hardcode
> >    that;
> > 
> > 3) Add a new migration capability "live migration", ON by default and
> >    have the management layer set fixed-ram=on, live-migration=off.
> 
> (just minor point, for the case where this would apply): instead of an additional options, could we not just detect whether we are "live" or not by just checking whether the guest is in a running state?
> I suppose we don't allow to start/stop guests while the migration is running..
> 
> 
> > 
> > I guess this also largely depends on what direction we're going with the
> > migration code in general. I.e. do we prefer a more isolated
> > implementation or keep the new feature flexible for future use-cases?
> 
> right, looking for the migration experts and maintainers to chime in here :-)
> 
> > 
> > I'll give people time to catch up and in the meantime work on adding the
> > stop and the safeguards around the user re-starting.
> > 
> > Thanks all for the input so far.
> 
> Thanks and as again: all this is just my 2c truly.
> 
> Ciao,
> 
> Claudio
>
Daniel P. Berrangé April 18, 2023, 4:58 p.m. UTC | #24
On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > > Peter Xu <peterx@redhat.com> writes:
> > > > 
> > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > > >> >> 
> > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > > >> >>   10m -v`:
> > > > >> >> 
> > > > >> >> migration type  | MB/s | pages/s |  ms
> > > > >> >> ----------------+------+---------+------
> > > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > > >> >
> > > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > > >> > what does io_uring mean here?
> > > > >> >
> > > > >> 
> > > > >> This table is all non-live migration. This particular line is a snapshot
> > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > > >> is another way by which we write RAM into disk.
> > > > >
> > > > > I see, so if all non-live that explains, because I was curious what's the
> > > > > relationship between this feature and the live snapshot that QEMU also
> > > > > supports.
> > > > >
> > > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > > answer?  Maybe it's somewhere but I just overlooked..
> > > > >
> > > > 
> > > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > > of bandwidth limits?
> > > 
> > > IMHO it would be great if this can be investigated and reasons provided in
> > > the next cover letter.
> > > 
> > > > 
> > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > > this aspect (on why not building this on top of "savevm")?
> > > > >
> > > > 
> > > > I share the same perception. I have done initial experiments with
> > > > savevm, but I decided to carry on the work that was already started by
> > > > others because my understanding of the problem was yet incomplete.
> > > > 
> > > > One point that has been raised is that the fixed-ram format alone does
> > > > not bring that many performance improvements. So we'll need
> > > > multi-threading and direct-io on top of it. Re-using multifd
> > > > infrastructure seems like it could be a good idea.
> > > 
> > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > > 100% sure locally on where the page will go.
> > 
> > We shouldn't assume the VM is stopped though. When saving to the file
> > the VM may still be active. The fixed-ram format lets us re-write the
> > same memory location on disk multiple times in this case, thus avoiding
> > growth of the file size.
> 
> Before discussing on reusing multifd below, now I have a major confusing on
> the use case of the feature..
> 
> The question is whether we would like to stop the VM after fixed-ram
> migration completes.  I'm asking because:
> 
>   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
>      anyone help explain why we don't stop the VM first then migrate?
>      Because it avoids copying single pages multiple times, no fiddling
>      with dirty tracking at all - we just don't ever track anything.  In
>      short, we'll stop the VM anyway, then why not stop it slightly
>      earlier?
> 
>   2. If it will not stop, then it's "VM live snapshot" to me.  We have
>      that, aren't we?  That's more efficient because it'll wr-protect all
>      guest pages, any write triggers a CoW and we only copy the guest pages
>      once and for all.
> 
> Either way to go, there's no need to copy any page more than once.  Did I
> miss anything perhaps very important?
> 
> I would guess it's option (1) above, because it seems we don't snapshot the
> disk alongside.  But I am really not sure now..

It is both options above.

Libvirt has multiple APIs where it currently uses its migrate-to-file
approach

  * virDomainManagedSave()

    This saves VM state to a libvirt-managed file, stops the VM, and the
    file state is auto-restored on next request to start the VM, and the
    file deleted. The VM CPUs are stopped during both save + restore
    phase

  * virDomainSave/virDomainRestore

    The former saves VM state to a file specified by the mgmt app/user.
    A later call to virDomainRestore starts the VM using that saved
    state. The mgmt app / user can delete the file state, or re-use
    it many times as they desire. The VM CPUs are stopped during both
    save + restore phase

  * virDomainSnapshotXXX

    This family of APIs takes snapshots of the VM disks, optionally
    also including the full VM state to a separate file. The snapshots
    can later be restored. The VM CPUs remain running during the
    save phase, but are stopped during restore phase

All these APIs end up calling the same code inside libvirt that uses
the libvirt-iohelper, together with QEMU migrate:fd driver.
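
For context, a minimal sketch (standard libvirt C API; the domain name and file
path are made up, error handling omitted) of how a mgmt app drives the first
two save paths above:

    #include <libvirt/libvirt.h>

    static void save_examples(virConnectPtr conn)
    {
        virDomainPtr dom = virDomainLookupByName(conn, "guest1");

        /* Managed save: libvirt picks the file, CPUs are stopped, state is
         * restored automatically on the next request to start the domain. */
        virDomainManagedSave(dom, 0);

        /* Save/restore to a caller-specified file:
         *   virDomainSave(dom, "/var/lib/libvirt/save/guest1.sav");
         *   virDomainRestore(conn, "/var/lib/libvirt/save/guest1.sav");
         */

        virDomainFree(dom);
    }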

IIUC, SUSE's original motivation for the performance improvements was
with respect to the first case, virDomainManagedSave.  From the POV of actually
supporting this in libvirt though, we need to cover all the scenarios
there.  Thus we need this to work both when CPUs are running and when they are
stopped, and if we didn't use migrate in this case, then we would basically
just end up re-inventing migrate again, which IMHO is undesirable both from
libvirt's POV and QEMU's POV.

With regards,
Daniel
Peter Xu April 18, 2023, 7:26 p.m. UTC | #25
On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote:
> > > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote:
> > > > > Peter Xu <peterx@redhat.com> writes:
> > > > > 
> > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote:
> > > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem.
> > > > > >> >> 
> > > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest
> > > > > >> >>   running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t
> > > > > >> >>   10m -v`:
> > > > > >> >> 
> > > > > >> >> migration type  | MB/s | pages/s |  ms
> > > > > >> >> ----------------+------+---------+------
> > > > > >> >> savevm io_uring |  434 |  102294 | 71473
> > > > > >> >
> > > > > >> > So I assume this is the non-live migration scenario.  Could you explain
> > > > > >> > what does io_uring mean here?
> > > > > >> >
> > > > > >> 
> > > > > >> This table is all non-live migration. This particular line is a snapshot
> > > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it
> > > > > >> is another way by which we write RAM into disk.
> > > > > >
> > > > > > I see, so if all non-live that explains, because I was curious what's the
> > > > > > relationship between this feature and the live snapshot that QEMU also
> > > > > > supports.
> > > > > >
> > > > > > I also don't immediately see why savevm will be much slower, do you have an
> > > > > > answer?  Maybe it's somewhere but I just overlooked..
> > > > > >
> > > > > 
> > > > > I don't have a concrete answer. I could take a jab and maybe blame the
> > > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect
> > > > > of bandwidth limits?
> > > > 
> > > > IMHO it would be great if this can be investigated and reasons provided in
> > > > the next cover letter.
> > > > 
> > > > > 
> > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of
> > > > > > "we can stop the VM".  It smells slightly weird to build this on top of
> > > > > > "migrate" from that pov, rather than "savevm", though.  Any thoughts on
> > > > > > this aspect (on why not building this on top of "savevm")?
> > > > > >
> > > > > 
> > > > > I share the same perception. I have done initial experiments with
> > > > > savevm, but I decided to carry on the work that was already started by
> > > > > others because my understanding of the problem was yet incomplete.
> > > > > 
> > > > > One point that has been raised is that the fixed-ram format alone does
> > > > > not bring that many performance improvements. So we'll need
> > > > > multi-threading and direct-io on top of it. Re-using multifd
> > > > > infrastructure seems like it could be a good idea.
> > > > 
> > > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're
> > > > 100% sure locally on where the page will go.
> > > 
> > > We shouldn't assume the VM is stopped though. When saving to the file
> > > the VM may still be active. The fixed-ram format lets us re-write the
> > > same memory location on disk multiple times in this case, thus avoiding
> > > growth of the file size.
> > 
> > Before discussing on reusing multifd below, now I have a major confusing on
> > the use case of the feature..
> > 
> > The question is whether we would like to stop the VM after fixed-ram
> > migration completes.  I'm asking because:
> > 
> >   1. If it will stop, then it looks like a "VM suspend" to me. If so, could
> >      anyone help explain why we don't stop the VM first then migrate?
> >      Because it avoids copying single pages multiple times, no fiddling
> >      with dirty tracking at all - we just don't ever track anything.  In
> >      short, we'll stop the VM anyway, then why not stop it slightly
> >      earlier?
> > 
> >   2. If it will not stop, then it's "VM live snapshot" to me.  We have
> >      that, aren't we?  That's more efficient because it'll wr-protect all
> >      guest pages, any write triggers a CoW and we only copy the guest pages
> >      once and for all.
> > 
> > Either way to go, there's no need to copy any page more than once.  Did I
> > miss anything perhaps very important?
> > 
> > I would guess it's option (1) above, because it seems we don't snapshot the
> > disk alongside.  But I am really not sure now..
> 
> It is both options above.
> 
> Libvirt has multiple APIs where it currently uses its migrate-to-file
> approach
> 
>   * virDomainManagedSave()
> 
>     This saves VM state to an libvirt managed file, stops the VM, and the
>     file state is auto-restored on next request to start the VM, and the
>     file deleted. The VM CPUs are stopped during both save + restore
>     phase
> 
>   * virDomainSave/virDomainRestore
> 
>     The former saves VM state to a file specified by the mgmt app/user.
>     A later call to virDomaniRestore starts the VM using that saved
>     state. The mgmt app / user can delete the file state, or re-use
>     it many times as they desire. The VM CPUs are stopped during both
>     save + restore phase
> 
>   * virDomainSnapshotXXX
> 
>     This family of APIs takes snapshots of the VM disks, optionally
>     also including the full VM state to a separate file. The snapshots
>     can later be restored. The VM CPUs remain running during the
>     save phase, but are stopped during restore phase

For this one IMHO it'll be good if Libvirt can consider leveraging the new
background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
perhaps any reason why a generic migrate:fd approach is better?

> 
> All these APIs end up calling the same code inside libvirt that uses
> the libvirt-iohelper, together with QEMU migrate:fd driver.
> 
> IIUC, Suse's original motivation for the performance improvements was
> wrt to the first case of virDomainManagedSave. From the POV of actually
> supporting this in libvirt though, we need to cover all the scenarios
> there. Thus we need this to work both when CPUs are running and stopped,
> and if we didn't use migrate in this case, then we basically just end
> up re-inventing migrate again which IMHO is undesirable both from
> libvirt's POV and QEMU's POV.

Just to make sure we're on the same page - I always think it's fine to use
the QMP "migrate" command to do this.

Meanwhile, we can also reuse the migration framework if we think that's
still the good way to go (even if I am not 100% sure on this... I still
think _lots_ of the live migration framework is logic trying to
take care of a "live" VM; IOW, that logic will become pure overhead if
we reuse the live migration framework for vm suspend).

However, could you help elaborate more on why it must support live mode for
a virDomainManagedSave() request?  I assume this is the core of the goal.

IMHO virDomainManagedSave() is a good interface design, because it contains
the target goal of what it wants to do (according to above).  To ask in
another way, I'm curious whether virDomainManagedSave() will stop the VM
before triggering the QMP "migrate" to fd: If it doesn't, why not?  If it
does, then why can't we have that assumption for QEMU as well?

That assumption is IMHO important for QEMU because non-live VM migration
can avoid tons of overhead that a live migration will need.  I've mentioned
this in the other reply, even if we keep using the migration framework, we
can still optimize other things like dirty tracking.  We probably don't
even need any bitmap at all because we simply scan over all ramblocks.
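
A rough sketch of what I mean (the block list and save_page() are simplified
stand-ins, not QEMU's real RAMBlock interfaces):

    #include <stddef.h>

    /* Simplified stand-in for a RAM block. */
    struct ram_block {
        struct ram_block *next;
        char *host;           /* mapped guest RAM */
        size_t used_length;   /* bytes of RAM in this block */
    };

    static void save_all_ram_once(struct ram_block *blocks, size_t page_size,
                                  void (*save_page)(char *hva, size_t len))
    {
        for (struct ram_block *b = blocks; b; b = b->next) {
            for (size_t off = 0; off < b->used_length; off += page_size) {
                save_page(b->host + off, page_size);  /* each page exactly once */
            }
        }
    }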

OTOH, if QEMU supports live mode for a "vm suspend" in the initial design,
not only does it sound wrong at the interface level, it also means QEMU
will need to keep doing so forever, because we need to stay compatible with
the old interfaces even on new binaries.  That's why I keep suggesting we
should make "VM turned off" part of the cmd if that's what we're looking
for.

Thanks,
Daniel P. Berrangé April 19, 2023, 5:12 p.m. UTC | #26
On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > approach
> > 
> >   * virDomainManagedSave()
> > 
> >     This saves VM state to an libvirt managed file, stops the VM, and the
> >     file state is auto-restored on next request to start the VM, and the
> >     file deleted. The VM CPUs are stopped during both save + restore
> >     phase
> > 
> >   * virDomainSave/virDomainRestore
> > 
> >     The former saves VM state to a file specified by the mgmt app/user.
> >     A later call to virDomaniRestore starts the VM using that saved
> >     state. The mgmt app / user can delete the file state, or re-use
> >     it many times as they desire. The VM CPUs are stopped during both
> >     save + restore phase
> > 
> >   * virDomainSnapshotXXX
> > 
> >     This family of APIs takes snapshots of the VM disks, optionally
> >     also including the full VM state to a separate file. The snapshots
> >     can later be restored. The VM CPUs remain running during the
> >     save phase, but are stopped during restore phase
> 
> For this one IMHO it'll be good if Libvirt can consider leveraging the new
> background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> perhaps any reason why a generic migrate:fd approach is better?

I'm not sure I fully understand the implications of 'background-snapshot'?

Based on what the QAPI comment says, it sounds potentially interesting,
as conceptually it would be nicer to have the memory / state snapshot
represent the VM at the point where we started the snapshot operation,
rather than where we finished the snapshot operation.

It would not solve the performance problems that the work in this thread
was intended to address though.  With large VMs (100's of GB of RAM),
saving all the RAM state to disk takes a very long time, regardless of
whether the VM vCPUs are paused or running.

Currently when doing this libvirt has a "libvirt_iohelper" process
that we use so that we can do writes with O_DIRECT set. This avoids
thrashing the host OS's I/O buffers/cache, and thus negatively
impacting performance of anything else on the host doing I/O. This
can't take advantage of multifd though, and even if extended to do
so, it still imposes extra data copies during the save/restore paths.


So to speed up the above 3 libvirt APIs, we want QEMU to be able to
directly save/restore mem/vmstate to files, with parallelization and
O_DIRECT.
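
As a sketch of the I/O pattern we're after (illustrative only; guest pages are
already page-aligned in memory and fixed-ram gives aligned file offsets, so no
bounce buffer is needed):

    #define _GNU_SOURCE                /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t save_page_direct(const char *path, const void *guest_page,
                                    size_t page_size, off_t aligned_offset)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
        ssize_t ret;

        if (fd < 0) {
            return -1;
        }
        /* Write straight from guest RAM at an aligned offset, bypassing the
         * host page cache and any intermediate copy. */
        ret = pwrite(fd, guest_page, page_size, aligned_offset);
        close(fd);
        return ret;
    }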


> > All these APIs end up calling the same code inside libvirt that uses
> > the libvirt-iohelper, together with QEMU migrate:fd driver.
> > 
> > IIUC, Suse's original motivation for the performance improvements was
> > wrt to the first case of virDomainManagedSave. From the POV of actually
> > supporting this in libvirt though, we need to cover all the scenarios
> > there. Thus we need this to work both when CPUs are running and stopped,
> > and if we didn't use migrate in this case, then we basically just end
> > up re-inventing migrate again which IMHO is undesirable both from
> > libvirt's POV and QEMU's POV.
> 
> Just to make sure we're on the same page - I always think it fine to use
> the QMP "migrate" command to do this.
> 
> Meanwhile, we can also reuse the migration framework if we think that's
> still the good way to go (even if I am not 100% sure on this... I still
> think _lots_ of the live migration framework as plenty of logics trying to
> take care of a "live" VM, IOW, those logics will become pure overheads if
> we reuse the live migration framework for vm suspend).
> 
> However could you help elaborate more on why it must support live mode for
> a virDomainManagedSave() request?  As I assume this is the core of the goal.

No, we've no need for live mode for virDomainManagedSave. Live mode is
needed for virDomainSnapshot* APIs.

The point I'm making is that all three of the above libvirt APIs run exactly
the same migration code in libvirt. The only difference in the APIs is how
the operation gets triggered and whether the CPUs are running or not.

We want the improved performance of having parallel save/restore-to-disk
and use of O_DIRECT to be available to all 3 APIs. To me it doesn't make
sense to provide different impls for these APIs when they all have the
same end goal - it would be extra work on the QEMU side and the libvirt side
alike to use different solutions for each.

> IMHO virDomainManagedSave() is a good interface design, because it contains
> the target goal of what it wants to do (according to above).  To ask in
> another way, I'm curious whether virDomainManagedSave() will stop the VM
> before triggering the QMP "migrate" to fd: If it doesn't, why not?  If it
> does, then why we can't have that assumption also for QEMU?
> 
> That assumption is IMHO important for QEMU because non-live VM migration
> can avoid tons of overhead that a live migration will need.  I've mentioned
> this in the other reply, even if we keep using the migration framework, we
> can still optimize other things like dirty tracking.  We probably don't
> even need any bitmap at all because we simply scan over all ramblocks.
> 
> OTOH, if QEMU supports live mode for a "vm suspend" in the initial design,
> not only it doesn't sound right at all from interface level, it means QEMU
> will need to keep doing so forever because we need to be compatible with
> the old interfaces even on new binaries.  That's why I keep suggesting we
> should take "VM turned off" part of the cmd if that's what we're looking
> for.

With regards,
Daniel
Peter Xu April 19, 2023, 7:07 p.m. UTC | #27
On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > approach
> > > 
> > >   * virDomainManagedSave()
> > > 
> > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > >     file state is auto-restored on next request to start the VM, and the
> > >     file deleted. The VM CPUs are stopped during both save + restore
> > >     phase
> > > 
> > >   * virDomainSave/virDomainRestore
> > > 
> > >     The former saves VM state to a file specified by the mgmt app/user.
> > >     A later call to virDomaniRestore starts the VM using that saved
> > >     state. The mgmt app / user can delete the file state, or re-use
> > >     it many times as they desire. The VM CPUs are stopped during both
> > >     save + restore phase
> > > 
> > >   * virDomainSnapshotXXX
> > > 
> > >     This family of APIs takes snapshots of the VM disks, optionally
> > >     also including the full VM state to a separate file. The snapshots
> > >     can later be restored. The VM CPUs remain running during the
> > >     save phase, but are stopped during restore phase
> > 
> > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > perhaps any reason why a generic migrate:fd approach is better?
> 
> I'm not sure I fully understand the implications of 'background-snapshot' ?
> 
> Based on what the QAPI comment says, it sounds potentially interesting,
> as conceptually it would be nicer to have the memory / state snapshot
> represent the VM at the point where we started the snapshot operation,
> rather than where we finished the snapshot operation.
> 
> It would not solve the performance problems that the work in this thread
> was intended to address though.  With large VMs (100's of GB of RAM),
> saving all the RAM state to disk takes a very long time, regardless of
> whether the VM vCPUs are paused or running.

I think it solves the performance problem by copying each guest page only
once, even if the guest is running.

Different from almost all the other "migrate" use cases, background
snapshot does not use the generic dirty tracking at all (for KVM that's
get-dirty-log); instead it uses userfaultfd write-protection, so that when
taking the snapshot all the guest pages will be protected once.

Then, when a page is written, the guest cannot proceed until the snapshot
copy of that page has been made first.  After a guest page is unprotected,
any write to it runs at full speed, because the follow-up writes won't
matter for the snapshot.
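
A rough sketch of that mechanism (Linux-only, simplified, error handling
omitted; this is not QEMU's actual code):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Write-protect [addr, addr+len): the first write to any page faults,
     * the snapshot thread copies the page out and then removes the
     * protection for that page, so each page is copied at most once. */
    static int wp_protect_region(void *addr, size_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode = UFFDIO_REGISTER_MODE_WP,
        };
        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* all pages are now WP'd once */
        return uffd;  /* poll this fd for write faults, copy, then un-protect */
    }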

It guarantees the best efficiency for creating a snapshot with the VM running,
afaict.  I sincerely think Libvirt should have someone investigate
whether virDomainSnapshotXXX() can be implemented with this capability rather
than the default migration.

I actually thought the Libvirt support was there. I think it must be that
someone posted support for Libvirt but it didn't really land for some
reason.

> 
> Currently when doing this libvirt has a "libvirt_iohelper" process
> that we use so that we can do writes with O_DIRECT set. This avoids
> thrashing the host OS's  I/O buffers/cache, and thus negatively
> impacting performance of anything else on the host doing I/O. This
> can't take advantage of multifd though, and even if extended todo
> so, it still imposes extra data copies during the save/restore paths.
> 
> So to speed up the above 3 libvirt APIs, we want QEMU to be able to
> directly save/restore mem/vmstate to files, with parallization and
> O_DIRECT.

Here, IIUC, the question above can be really important: whether the existing
virDomainSnapshotXXX() can (and should) be implemented using
"background-snapshot", because that's the only one of the 3 use cases that
will need to support live migration.

If virDomainSnapshotXXX() can be implemented differently, I think it'll be
much easier to have both virDomainManagedSave() and virDomainSave() trigger
a migration command that stops the VM first, by whatever means.

It's probably fine if we still want to have CAP_FIXED_RAM as a new
capability describing the file property (so that libvirt will know the iohelper
is not needed anymore); it can support live migration even if it shouldn't
really be used that way.  But then we could probably have another CAP_SUSPEND
which gives QEMU a hint, so QEMU can be smart about this non-live migration.

It's just that AFAIU CAP_FIXED_RAM should just always be set together with
CAP_SUSPEND, because it must be a suspend-to-fixed-ram; otherwise one should
just use virDomainSnapshotXXX() (or say, live snapshot).

Thanks,
Daniel P. Berrangé April 20, 2023, 9:02 a.m. UTC | #28
On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > approach
> > > > 
> > > >   * virDomainManagedSave()
> > > > 
> > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > >     file state is auto-restored on next request to start the VM, and the
> > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > >     phase
> > > > 
> > > >   * virDomainSave/virDomainRestore
> > > > 
> > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > >     A later call to virDomaniRestore starts the VM using that saved
> > > >     state. The mgmt app / user can delete the file state, or re-use
> > > >     it many times as they desire. The VM CPUs are stopped during both
> > > >     save + restore phase
> > > > 
> > > >   * virDomainSnapshotXXX
> > > > 
> > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > >     also including the full VM state to a separate file. The snapshots
> > > >     can later be restored. The VM CPUs remain running during the
> > > >     save phase, but are stopped during restore phase
> > > 
> > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > perhaps any reason why a generic migrate:fd approach is better?
> > 
> > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > 
> > Based on what the QAPI comment says, it sounds potentially interesting,
> > as conceptually it would be nicer to have the memory / state snapshot
> > represent the VM at the point where we started the snapshot operation,
> > rather than where we finished the snapshot operation.
> > 
> > It would not solve the performance problems that the work in this thread
> > was intended to address though.  With large VMs (100's of GB of RAM),
> > saving all the RAM state to disk takes a very long time, regardless of
> > whether the VM vCPUs are paused or running.
> 
> I think it solves the performance problem by only copy each of the guest
> page once, even if the guest is running.

I think we're talking about different performance problems.

What you describe here is about ensuring the snapshot is of finite size
and completes in linear time, by ensuring each page is written only
once.

What I'm talking about is being able to parallelize the writing of all
RAM, so that when a single thread cannot saturate the storage, using multiple
threads will make the overall process faster, even when we're only
writing each page once.
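
As a sketch of what I mean (names and the slice split are illustrative): since
fixed-ram gives every page a fixed offset in the file, N threads can each
write their own slice of RAM independently:

    #include <pthread.h>
    #include <unistd.h>

    struct slice { int fd; char *ram; size_t len; off_t file_off; };

    /* Each thread writes its own slice at its fixed file offsets; no
     * coordination on a shared stream position is needed. */
    static void *write_slice(void *opaque)
    {
        struct slice *s = opaque;
        size_t done = 0;

        while (done < s->len) {
            ssize_t n = pwrite(s->fd, s->ram + done, s->len - done,
                               s->file_off + done);
            if (n <= 0) {
                break;  /* error handling trimmed */
            }
            done += n;
        }
        return NULL;
    }
    /* Usage: split RAM into N slices, pthread_create(&tid, NULL, write_slice,
     * &slice[i]) for each, then pthread_join() them all. */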

> Different from mostly all the rest of "migrate" use cases, background
> snapshot does not use the generic dirty tracking at all (for KVM that's
> get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> taking the snapshot all the guest pages will be protected once.

Oh, so that means this 'background-snapshot' feature only works on
Linux, and only when permissions allow it. The migration parameter
probably should be marked with 'CONFIG_LINUX' in the QAPI schema
to make it clear this is a non-portable feature.

> It guarantees the best efficiency of creating a snapshot with VM running,
> afaict.  I sincerely think Libvirt should have someone investigating and
> see whether virDomainSnapshotXXX() can be implemented by this cap rather
> than the default migration.

Since the background-snapshot feature is not universally available,
it will only ever be possible to use it as an optional enhancement
with virDomainSnapshotXXX; we'll need the portable impl to be the
default / fallback.

> > Currently when doing this libvirt has a "libvirt_iohelper" process
> > that we use so that we can do writes with O_DIRECT set. This avoids
> > thrashing the host OS's  I/O buffers/cache, and thus negatively
> > impacting performance of anything else on the host doing I/O. This
> > can't take advantage of multifd though, and even if extended todo
> > so, it still imposes extra data copies during the save/restore paths.
> > 
> > So to speed up the above 3 libvirt APIs, we want QEMU to be able to
> > directly save/restore mem/vmstate to files, with parallization and
> > O_DIRECT.
> 
> Here IIUC above question can be really important on whether existing
> virDomainSnapshotXXX() can (and should) use "background-snapshot" to
> implement, because that's the only one that will need to support migration
> live (out of 3 use cases).
> 
> If virDomainSnapshotXXX() can be implemented differently, I think it'll be
> much easier to have both virDomainManagedSave() and virDomainSave() trigger
> a migration command that will stop the VM first by whatever way.
> 
> It's probably fine if we still want to have CAP_FIXED_RAM as a new
> capability describing the file property (so that libvirt will know iohelper
> is not needed anymore), it can support live migrating even if it shouldn't
> really use it.  But then we could probably have another CAP_SUSPEND which
> gives QEMU a hint so QEMU can be smart on this non-live migration.
> 
> It's just that AFAIU CAP_FIXED_RAM should just always be set with
> CAP_SUSPEND, because it must be a SUSPEND to fixed ram or one should just
> use virDomainSnapshotXXX() (or say, live snapshot).

With regards,
Daniel
Peter Xu April 20, 2023, 7:19 p.m. UTC | #29
On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > approach
> > > > > 
> > > > >   * virDomainManagedSave()
> > > > > 
> > > > >     This saves VM state to an libvirt managed file, stops the VM, and the
> > > > >     file state is auto-restored on next request to start the VM, and the
> > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > >     phase
> > > > > 
> > > > >   * virDomainSave/virDomainRestore
> > > > > 
> > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > >     A later call to virDomaniRestore starts the VM using that saved
> > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > >     save + restore phase
> > > > > 
> > > > >   * virDomainSnapshotXXX
> > > > > 
> > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > >     also including the full VM state to a separate file. The snapshots
> > > > >     can later be restored. The VM CPUs remain running during the
> > > > >     save phase, but are stopped during restore phase
> > > > 
> > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > perhaps any reason why a generic migrate:fd approach is better?
> > > 
> > > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > > 
> > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > as conceptually it would be nicer to have the memory / state snapshot
> > > represent the VM at the point where we started the snapshot operation,
> > > rather than where we finished the snapshot operation.
> > > 
> > > It would not solve the performance problems that the work in this thread
> > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > saving all the RAM state to disk takes a very long time, regardless of
> > > whether the VM vCPUs are paused or running.
> > 
> > I think it solves the performance problem by only copy each of the guest
> > page once, even if the guest is running.
> 
> I think we're talking about different performance problems.
> 
> What you describe here is about ensuring the snapshot is of finite size
> and completes in linear time, by ensuring each page is written only
> once.
> 
> What I'm talking about is being able to parallelize the writing of all
> RAM, so if a single thread can saturate the storage, using multiple
> threads will make the overal process faster, even when we're only
> writing each page once.

It depends on how much we want it.  Here the live snapshot scenario could
probably leverage the same multi-threading framework as the vm suspend case,
because it can assume all the pages are static and only saved once.

But I agree it's at least not there yet.. so we can directly leverage
multifd at least for now.

> 
> > Different from mostly all the rest of "migrate" use cases, background
> > snapshot does not use the generic dirty tracking at all (for KVM that's
> > get-dirty-log), instead it uses userfaultfd wr-protects, so that when
> > taking the snapshot all the guest pages will be protected once.
> 
> Oh, so that means this 'background-snapshot' feature only works on
> Linux, and only when permissions allow it. The migration parameter
> probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> to make it clear this is a non-portable feature.

Indeed, I can have a follow-up patch for this.  But it'll be the same as
some other features, like postcopy (and all its sub-features including
postcopy-blocktime and postcopy-preempt)?

> 
> > It guarantees the best efficiency of creating a snapshot with VM running,
> > afaict.  I sincerely think Libvirt should have someone investigating and
> > see whether virDomainSnapshotXXX() can be implemented by this cap rather
> > than the default migration.
> 
> Since the background-snapshot feature is not universally available,
> it will only ever be possible to use it as an optional enhancement
> with virDomainSnapshotXXX, we'll need the portable impl to be the
> default / fallback.

I am actually curious how a live snapshot can be implemented correctly
without something like background snapshot.  I raised this question in
another reply here:

https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/

I was using fixed-ram and vm suspend as examples, but I assume it applies to
any live snapshot that is based on the current default migration scheme.

For a real live snapshot (not vm suspend), IIUC we have similar challenges.

The problem is that when the migration completes (snapshot taken) the VM is
still running with a live disk image.  Then how can we take a disk snapshot at
exactly the same moment the guest image is captured in the vm dump?  What
guarantees that there are no IO changes after the VM image is created but
before we take a snapshot of the disk image?

In short, it's a question of how libvirt can make sure the VM image and the
disk snapshot are taken at exactly the same point in time in the live case.

Thanks,
Daniel P. Berrangé April 21, 2023, 7:48 a.m. UTC | #30
On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote:
> On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > > approach
> > > > > > 
> > > > > >   * virDomainManagedSave()
> > > > > > 
> > > > > >     This saves VM state to a libvirt managed file, stops the VM, and the
> > > > > >     file state is auto-restored on next request to start the VM, and the
> > > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > > >     phase
> > > > > > 
> > > > > >   * virDomainSave/virDomainRestore
> > > > > > 
> > > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > > >     A later call to virDomainRestore starts the VM using that saved
> > > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > > >     save + restore phase
> > > > > > 
> > > > > >   * virDomainSnapshotXXX
> > > > > > 
> > > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > > >     also including the full VM state to a separate file. The snapshots
> > > > > >     can later be restored. The VM CPUs remain running during the
> > > > > >     save phase, but are stopped during restore phase
> > > > > 
> > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > > perhaps any reason why a generic migrate:fd approach is better?
> > > > 
> > > > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > > > 
> > > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > > as conceptually it would be nicer to have the memory / state snapshot
> > > > represent the VM at the point where we started the snapshot operation,
> > > > rather than where we finished the snapshot operation.
> > > > 
> > > > It would not solve the performance problems that the work in this thread
> > > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > > saving all the RAM state to disk takes a very long time, regardless of
> > > > whether the VM vCPUs are paused or running.
> > > 
> > > I think it solves the performance problem by copying each of the guest
> > > pages only once, even if the guest is running.
> > 
> > I think we're talking about different performance problems.
> > 
> > What you describe here is about ensuring the snapshot is of finite size
> > and completes in linear time, by ensuring each page is written only
> > once.
> > 
> > What I'm talking about is being able to parallelize the writing of all
> > RAM, so if a single thread can saturate the storage, using multiple
> > threads will make the overall process faster, even when we're only
> > writing each page once.
> 
> It depends on how much we want it.  Here the live snapshot scenario could
> probably leverage the same multi-threading framework as the vm suspend case,
> because it can assume all the pages are static and each is saved only once.
> 
> But I agree it's not there yet, so we can directly leverage multifd for now.
> 
> > 
> > > Unlike most of the other "migrate" use cases, background snapshot does
> > > not use the generic dirty tracking at all (for KVM that's get-dirty-log);
> > > instead it uses userfaultfd write-protection, so that when the snapshot
> > > is taken all the guest pages are write-protected once.
> > 
> > Oh, so that means this 'background-snapshot' feature only works on
> > Linux, and only when permissions allow it. The migration parameter
> > probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> > to make it clear this is a non-portable feature.
> 
> Indeed, I can send a follow-up patch for this.  But it'll be the same
> situation as some other features, like postcopy (and all its sub-features,
> including postcopy-blocktime and postcopy-preempt)?
> 
> > 
> > > It guarantees the best efficiency of creating a snapshot with the VM
> > > running, afaict.  I sincerely think Libvirt should have someone
> > > investigate whether virDomainSnapshotXXX() can be implemented by this
> > > cap rather than the default migration.
> > 
> > Since the background-snapshot feature is not universally available,
> > it will only ever be possible to use it as an optional enhancement
> > with virDomainSnapshotXXX, so we'll need the portable impl to be the
> > default / fallback.
> 
> I am actually curious how a live snapshot can be implemented correctly
> without something like background snapshot.  I raised this question in
> another reply here:
> 
> https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/
> 
> I was using fixed-ram and vm suspend as the example, but I assume it applies
> to any live snapshot that is based on the current default migration scheme.
> 
> For a real live snapshot (not vm suspend), IIUC we have similar challenges.
> 
> The problem is that when migration completes (snapshot taken) the VM is
> still running with a live disk image.  Then how can we take a disk snapshot
> at exactly the moment the guest image is mirrored in the vm dump?  What
> guarantees that there are no IO changes after the VM image is created but
> before we take a snapshot of the disk image?
> 
> In short, it's a question of how libvirt can make sure the VM image and the
> disk snapshot are taken at exactly the same point in time in the live case.

It is just a matter of where you have the synchronization point.

With background-snapshot, you have to snapshot the disks at the
start of the migrate operation. Without background-snapshot
you have to snapshot the disks at the end of the migrate
operation. The CPUs are paused at the end of the migration, so
when the CPUs pause, initiate the storage snapshot in the
background and then let the CPUs resume.
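
To make the ordering concrete, here is a rough QMP sketch of the
non-background-snapshot case, using the file: transport proposed in this
series.  The node name, file paths and the exact event sequence are
illustrative assumptions, not a prescribed recipe:

    -> { "execute": "migrate",
         "arguments": { "uri": "file:/some/path/vmstate.bin" } }
       ... migration runs; on completion the CPUs are left paused ...
    <- { "event": "STOP" }
    <- { "event": "MIGRATION", "data": { "status": "completed" } }
    -> { "execute": "blockdev-snapshot-sync",
         "arguments": { "node-name": "disk0",
                        "snapshot-file": "/some/path/overlay.qcow2",
                        "format": "qcow2" } }
    -> { "execute": "cont" }

With background-snapshot the blockdev-snapshot-sync step would instead be
issued up front, when the snapshot operation starts, since that is the
point in time the memory image corresponds to.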

With regards,
Daniel
Peter Xu April 21, 2023, 1:56 p.m. UTC | #31
On Fri, Apr 21, 2023 at 08:48:02AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote:
> > On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote:
> > > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote:
> > > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote:
> > > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file
> > > > > > > approach
> > > > > > > 
> > > > > > >   * virDomainManagedSave()
> > > > > > > 
> > > > > > >     This saves VM state to a libvirt managed file, stops the VM, and the
> > > > > > >     file state is auto-restored on next request to start the VM, and the
> > > > > > >     file deleted. The VM CPUs are stopped during both save + restore
> > > > > > >     phase
> > > > > > > 
> > > > > > >   * virDomainSave/virDomainRestore
> > > > > > > 
> > > > > > >     The former saves VM state to a file specified by the mgmt app/user.
> > > > > > >     A later call to virDomainRestore starts the VM using that saved
> > > > > > >     state. The mgmt app / user can delete the file state, or re-use
> > > > > > >     it many times as they desire. The VM CPUs are stopped during both
> > > > > > >     save + restore phase
> > > > > > > 
> > > > > > >   * virDomainSnapshotXXX
> > > > > > > 
> > > > > > >     This family of APIs takes snapshots of the VM disks, optionally
> > > > > > >     also including the full VM state to a separate file. The snapshots
> > > > > > >     can later be restored. The VM CPUs remain running during the
> > > > > > >     save phase, but are stopped during restore phase
> > > > > > 
> > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new
> > > > > > background-snapshot capability (QEMU 6.0+, so not very new..).  Or is there
> > > > > > perhaps any reason why a generic migrate:fd approach is better?
> > > > > 
> > > > > I'm not sure I fully understand the implications of 'background-snapshot' ?
> > > > > 
> > > > > Based on what the QAPI comment says, it sounds potentially interesting,
> > > > > as conceptually it would be nicer to have the memory / state snapshot
> > > > > represent the VM at the point where we started the snapshot operation,
> > > > > rather than where we finished the snapshot operation.
> > > > > 
> > > > > It would not solve the performance problems that the work in this thread
> > > > > was intended to address though.  With large VMs (100's of GB of RAM),
> > > > > saving all the RAM state to disk takes a very long time, regardless of
> > > > > whether the VM vCPUs are paused or running.
> > > > 
> > > > I think it solves the performance problem by copying each of the guest
> > > > pages only once, even if the guest is running.
> > > 
> > > I think we're talking about different performance problems.
> > > 
> > > What you describe here is about ensuring the snapshot is of finite size
> > > and completes in linear time, by ensuring each page is written only
> > > once.
> > > 
> > > What I'm talking about is being able to parallelize the writing of all
> > > RAM, so if a single thread can saturate the storage, using multiple
> > > threads will make the overall process faster, even when we're only
> > > writing each page once.
> > 
> > It depends on how much we want it.  Here the live snapshot scenario could
> > probably leverage the same multi-threading framework as the vm suspend case,
> > because it can assume all the pages are static and each is saved only once.
> > 
> > But I agree it's not there yet, so we can directly leverage multifd for now.
> > 
> > > 
> > > > Unlike most of the other "migrate" use cases, background snapshot does
> > > > not use the generic dirty tracking at all (for KVM that's get-dirty-log);
> > > > instead it uses userfaultfd write-protection, so that when the snapshot
> > > > is taken all the guest pages are write-protected once.
> > > 
> > > Oh, so that means this 'background-snapshot' feature only works on
> > > Linux, and only when permissions allow it. The migration parameter
> > > probably should be marked with 'CONFIG_LINUX' in the QAPI schema
> > > to make it clear this is a non-portable feature.
> > 
> > Indeed, I can send a follow-up patch for this.  But it'll be the same
> > situation as some other features, like postcopy (and all its sub-features,
> > including postcopy-blocktime and postcopy-preempt)?
> > 
> > > 
> > > > It guarantees the best efficiency of creating a snapshot with the VM
> > > > running, afaict.  I sincerely think Libvirt should have someone
> > > > investigate whether virDomainSnapshotXXX() can be implemented by this
> > > > cap rather than the default migration.
> > > 
> > > Since the background-snapshot feature is not universally available,
> > > it will only ever be possible to use it as an optional enhancement
> > > with virDomainSnapshotXXX, so we'll need the portable impl to be the
> > > default / fallback.
> > 
> > I am actually curious how a live snapshot can be implemented correctly
> > without something like background snapshot.  I raised this question in
> > another reply here:
> > 
> > https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/
> > 
> > I was using fixed-ram and vm suspend as the example, but I assume it applies
> > to any live snapshot that is based on the current default migration scheme.
> > 
> > For a real live snapshot (not vm suspend), IIUC we have similar challenges.
> > 
> > The problem is that when migration completes (snapshot taken) the VM is
> > still running with a live disk image.  Then how can we take a disk snapshot
> > at exactly the moment the guest image is mirrored in the vm dump?  What
> > guarantees that there are no IO changes after the VM image is created but
> > before we take a snapshot of the disk image?
> > 
> > In short, it's a question of how libvirt can make sure the VM image and the
> > disk snapshot are taken at exactly the same point in time in the live case.
> 
> It is just a matter of where you have the synchronization point.
> 
> With background-snapshot, you have to snapshot the disks at the
> start of the migrate operation. Without background-snapshot
> you have to snapshot the disks at the end of the migrate
> operation. The CPUs are paused at the end of the migration, so
> when the CPUs pause, initiate the storage snapshot in the
> background and then let the CPUs resume.

Ah, indeed.

Thanks.