
[RFC,v1,10/26] migration/ram: Introduce 'fixed-ram' migration stream capability

Message ID 20230330180336.2791-11-farosas@suse.de (mailing list archive)
State New, archived
Series migration: File based migration with multifd and fixed-ram

Commit Message

Fabiano Rosas March 30, 2023, 6:03 p.m. UTC
From: Nikolay Borisov <nborisov@suse.com>

Implement the 'fixed-ram' feature. The core of the feature is to ensure
that each ram page is written at a fixed offset in the resulting
migration stream. The reasons why we'd want such behavior are twofold:

 - When doing a 'fixed-ram' migration the resulting file will have a
   bounded size, since pages which are dirtied multiple times will
   always go to a fixed location in the file, rather than constantly
   being added to a sequential stream. This eliminates cases where a vm
   with, say, 1G of ram can result in a migration file that's 10s of
   GBs, provided that the workload constantly redirties memory.

 - It paves the way to implement DIO-enabled save/restore of the
   migration stream as the pages are ensured to be written at aligned
   offsets.

The feature requires changing the stream format. First, a bitmap is
introduced which tracks which pages have been written (i.e. are
dirtied) during migration; it is subsequently written to the
resulting file, again at a fixed location for every ramblock. Zero
pages are ignored as they'd be zero on the destination as well. With
the changed format, the data looks like the following:

|name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|

* pc - refers to the page_size/mr->addr members already written to the
stream (for postcopy and ignore-shared, respectively), so the newly
added members begin from "bitmap_size".

This layout is initialized during ram_save_setup, so instead of having a
sequential stream of pages that follow the ramblock headers, the dirty
pages for a ramblock follow its header. Since all pages have a fixed
location, RAM_SAVE_FLAG_EOS is no longer generated on every migration
iteration; there is effectively a single RAM_SAVE_FLAG_EOS right at
the end.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration.rst | 36 +++++++++++++++
 include/exec/ramblock.h  |  8 ++++
 migration/migration.c    | 51 +++++++++++++++++++++-
 migration/migration.h    |  1 +
 migration/ram.c          | 94 +++++++++++++++++++++++++++++++++-------
 migration/savevm.c       |  1 +
 qapi/migration.json      |  2 +-
 7 files changed, 176 insertions(+), 17 deletions(-)
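
As a rough illustration of the offset arithmetic this layout implies (the
1 MiB alignment and the 8-byte pages_offset field mirror the patch below;
the helper names are illustrative, not QEMU API, and the sketch assumes
4k target pages and a 64-bit unsigned long):

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12   /* assumes 4k target pages */
#define BITS_PER_LONG    (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define ROUND_UP(x, a)   (((x) + (a) - 1) / (a) * (a))

/* Size in bytes of the shadow bitmap written for one ramblock. */
static uint64_t shadow_bitmap_size(uint64_t used_length)
{
    uint64_t num_pages = used_length >> TARGET_PAGE_BITS;
    return BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
}

/*
 * File offset where the page area of a ramblock starts: the shadow
 * bitmap is written right after the 8-byte pages_offset field, and the
 * page area is then aligned to 1 MiB so the file can be moved between
 * filesystems with different O_DIRECT alignment restrictions.
 */
static uint64_t block_pages_offset(uint64_t bitmap_offset, uint64_t used_length)
{
    return ROUND_UP(bitmap_offset + shadow_bitmap_size(used_length), 0x100000);
}
```

For a 1 GiB ramblock this yields a 32 KiB bitmap and a page area starting
at the next 1 MiB boundary after it.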

Comments

Peter Xu March 30, 2023, 10:01 p.m. UTC | #1
On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
> 
> Implement 'fixed-ram' feature. The core of the feature is to ensure that
> each ram page of the migration stream has a specific offset in the
> resulting migration stream. The reason why we'd want such behavior are
> two fold:
> 
>  - When doing a 'fixed-ram' migration the resulting file will have a
>    bounded size, since pages which are dirtied multiple times will
>    always go to a fixed location in the file, rather than constantly
>    being added to a sequential stream. This eliminates cases where a vm
>    with, say, 1G of ram can result in a migration file that's 10s of
>    GBs, provided that the workload constantly redirties memory.
> 
>  - It paves the way to implement DIO-enabled save/restore of the
>    migration stream as the pages are ensured to be written at aligned
>    offsets.
> 
> The feature requires changing the stream format. First, a bitmap is
> introduced which tracks which pages have been written (i.e are
> dirtied) during migration and subsequently it's being written in the
> resulting file, again at a fixed location for every ramblock. Zero
> pages are ignored as they'd be zero in the destination migration as
> well. With the changed format data would look like the following:
> 
> |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|

What happens with huge pages?  Would page size matter here?

I would assume it's fine if it uses a constant (small) page size, assuming
that should match the granule at which QEMU tracks dirtying (which IIUC is
the host page size, not the guest's).

But I haven't given it further thought yet; maybe it would be
worthwhile in all cases to record page sizes here to be explicit, or the
meaning of the bitmap may not be clear (and then bitmap_size would be a
field just for sanity checking too).

If postcopy might be an option, we'd want the page size to be the host page
size because then looking up the bitmap will be straightforward, deciding
whether we should copy over page (UFFDIO_COPY) or fill in with zeros
(UFFDIO_ZEROPAGE).
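
To make the sanity-check idea concrete, a loader could verify that the
recorded bitmap_size is consistent with whatever page size it assumes
governs the bitmap (a hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG    (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * True if the bitmap_size field read from the stream matches a bitmap
 * with one bit per page of the given size over used_length bytes.
 * A mismatch would mean the saver used a different granularity.
 */
static int bitmap_size_matches(uint64_t bitmap_size, uint64_t used_length,
                               uint64_t page_size)
{
    uint64_t num_pages = (used_length + page_size - 1) / page_size;
    return bitmap_size == BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
}
```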

> 
> * pc - refers to the page_size/mr->addr members, so newly added members
> begin from "bitmap_size".

Could you elaborate more on what's the pc?

I also didn't see this *pc in the migration.rst update below.

> 
> This layout is initialized during ram_save_setup so instead of having a
> sequential stream of pages that follow the ramblock headers the dirty
> pages for a ramblock follow its header. Since all pages have a fixed
> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
> the end.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  docs/devel/migration.rst | 36 +++++++++++++++
>  include/exec/ramblock.h  |  8 ++++
>  migration/migration.c    | 51 +++++++++++++++++++++-
>  migration/migration.h    |  1 +
>  migration/ram.c          | 94 +++++++++++++++++++++++++++++++++-------
>  migration/savevm.c       |  1 +
>  qapi/migration.json      |  2 +-
>  7 files changed, 176 insertions(+), 17 deletions(-)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index 1080211f8e..84112d7f3f 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -568,6 +568,42 @@ Others (especially either older devices or system devices which for
>  some reason don't have a bus concept) make use of the ``instance id``
>  for otherwise identically named devices.
>  
> +Fixed-ram format
> +----------------
> +
> +When the ``fixed-ram`` capability is enabled, a slightly different
> +stream format is used for the RAM section. Instead of having a
> +sequential stream of pages that follow the RAMBlock headers, the dirty
> +pages for a RAMBlock follow its header. This ensures that each RAM
> +page has a fixed offset in the resulting migration stream.
> +
> +  - RAMBlock 1
> +
> +    - ID string length
> +    - ID string
> +    - Used size
> +    - Shadow bitmap size
> +    - Pages offset in migration stream*
> +
> +  - Shadow bitmap
> +  - Sequence of pages for RAMBlock 1 (* offset points here)
> +
> +  - RAMBlock 2
> +
> +    - ID string length
> +    - ID string
> +    - Used size
> +    - Shadow bitmap size
> +    - Pages offset in migration stream*
> +
> +  - Shadow bitmap
> +  - Sequence of pages for RAMBlock 2 (* offset points here)
> +
> +The ``fixed-ram`` capability can be enabled in both source and
> +destination with:
> +
> +    ``migrate_set_capability fixed-ram on``
> +
>  Return path
>  -----------
>  
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index adc03df59c..4360c772c2 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -43,6 +43,14 @@ struct RAMBlock {
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> +    /* shadow dirty bitmap used when migrating to a file */
> +    unsigned long *shadow_bmap;
> +    /*
> +     * offset in the file pages belonging to this ramblock are saved,
> +     * used only during migration to a file.
> +     */
> +    off_t bitmap_offset;
> +    uint64_t pages_offset;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 177fb0de0f..29630523e2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -168,7 +168,8 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>      MIGRATION_CAPABILITY_XBZRLE,
>      MIGRATION_CAPABILITY_X_COLO,
>      MIGRATION_CAPABILITY_VALIDATE_UUID,
> -    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
> +    MIGRATION_CAPABILITY_ZERO_COPY_SEND,
> +    MIGRATION_CAPABILITY_FIXED_RAM);
>  
>  /* When we add fault tolerance, we could have several
>     migrations at once.  For now we don't need to add
> @@ -1341,6 +1342,28 @@ static bool migrate_caps_check(bool *cap_list,
>      }
>  #endif
>  
> +    if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> +            error_setg(errp, "Directly mapped memory incompatible with multifd");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> +            error_setg(errp, "Directly mapped memory incompatible with xbzrle");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> +            error_setg(errp, "Directly mapped memory incompatible with compression");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> +            error_setg(errp, "Directly mapped memory incompatible with postcopy ram");
> +            return false;
> +        }
> +    }
> +
>      if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>          /* This check is reasonably expensive, so only when it's being
>           * set the first time, also it's only the destination that needs
> @@ -2736,6 +2759,11 @@ MultiFDCompression migrate_multifd_compression(void)
>      return s->parameters.multifd_compression;
>  }
>  
> +int migrate_fixed_ram(void)
> +{
> +    return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
> +}
> +
>  int migrate_multifd_zlib_level(void)
>  {
>      MigrationState *s;
> @@ -4324,6 +4352,20 @@ fail:
>      return NULL;
>  }
>  
> +static int migrate_check_fixed_ram(MigrationState *s, Error **errp)
> +{
> +    if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
> +        return 0;
> +    }
> +
> +    if (!qemu_file_is_seekable(s->to_dst_file)) {
> +        error_setg(errp, "Directly mapped memory requires a seekable transport");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  void migrate_fd_connect(MigrationState *s, Error *error_in)
>  {
>      Error *local_err = NULL;
> @@ -4390,6 +4432,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>          }
>      }
>  
> +    if (migrate_check_fixed_ram(s, &local_err) < 0) {

This check might be too late afaict, QMP cmd "migrate" could have already
succeeded.

Can we do an early check in / close to qmp_migrate()?  The idea is we fail
at the QMP migrate command there.

> +        migrate_fd_cleanup(s);
> +        migrate_fd_error(s, local_err);
> +        return;
> +    }
> +
>      if (resume) {
>          /* Wakeup the main migration thread to do the recovery */
>          migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
> @@ -4519,6 +4567,7 @@ static Property migration_properties[] = {
>      DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
>  
>      /* Migration capabilities */
> +    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
>      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>      DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
>      DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
> diff --git a/migration/migration.h b/migration/migration.h
> index 2da2f8a164..8cf3caecfe 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -416,6 +416,7 @@ bool migrate_zero_blocks(void);
>  bool migrate_dirty_bitmaps(void);
>  bool migrate_ignore_shared(void);
>  bool migrate_validate_uuid(void);
> +int migrate_fixed_ram(void);
>  
>  bool migrate_auto_converge(void);
>  bool migrate_use_multifd(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 96e8a19a58..56f0f782c8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1310,9 +1310,14 @@ static int save_zero_page_to_file(PageSearchStatus *pss,
>      int len = 0;
>  
>      if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
> -        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
> -        qemu_put_byte(file, 0);
> -        len += 1;
> +        if (migrate_fixed_ram()) {
> +            /* for zero pages we don't need to do anything */
> +            len = 1;

I think you wanted to increase the "duplicated" counter, but this will also
increase ram-transferred even though only 1 byte.

Perhaps just pass a pointer to keep the bytes, and return true/false to
increase the counter (to make everything accurate)?
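
Peter's suggestion could look roughly like this minimal sketch (the
signature and the header_len parameter are hypothetical; header_len
stands in for what save_page_header() would have returned):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the suggested interface: the return value says whether the
 * page was a zero page (so the caller can bump the "duplicated"
 * counter), while the bytes actually emitted on the stream are reported
 * separately so ram-transferred stays accurate.  In fixed-ram mode a
 * zero page emits nothing at all.
 */
static bool zero_page_check(const unsigned char *p, size_t page_size,
                            bool fixed_ram, size_t header_len,
                            size_t *bytes_written)
{
    *bytes_written = 0;
    for (size_t i = 0; i < page_size; i++) {
        if (p[i]) {
            return false;                /* not a zero page */
        }
    }
    if (!fixed_ram) {
        *bytes_written = header_len + 1; /* header plus the flag byte */
    }
    return true;
}
```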

> +        } else {
> +            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
> +            qemu_put_byte(file, 0);
> +            len += 1;
> +        }
>          ram_release_page(block->idstr, offset);
>      }
>      return len;
> @@ -1394,14 +1399,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>  {
>      QEMUFile *file = pss->pss_channel;
>  
> -    ram_transferred_add(save_page_header(pss, block,
> -                                         offset | RAM_SAVE_FLAG_PAGE));
> -    if (async) {
> -        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> -                              migrate_release_ram() &&
> -                              migration_in_postcopy());
> +    if (migrate_fixed_ram()) {
> +        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> +                           block->pages_offset + offset);
> +        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>      } else {
> -        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        ram_transferred_add(save_page_header(pss, block,
> +                                             offset | RAM_SAVE_FLAG_PAGE));
> +        if (async) {
> +            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> +                                  migrate_release_ram() &&
> +                                  migration_in_postcopy());
> +        } else {
> +            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> +        }
>      }
>      ram_transferred_add(TARGET_PAGE_SIZE);
>      stat64_add(&ram_atomic_counters.normal, 1);
> @@ -2731,6 +2742,8 @@ static void ram_save_cleanup(void *opaque)
>          block->clear_bmap = NULL;
>          g_free(block->bmap);
>          block->bmap = NULL;
> +        g_free(block->shadow_bmap);
> +        block->shadow_bmap = NULL;
>      }
>  
>      xbzrle_cleanup();
> @@ -3098,6 +3111,7 @@ static void ram_list_init_bitmaps(void)
>               */
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>              block->clear_bmap_shift = shift;
>              block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>          }
> @@ -3287,6 +3301,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>              if (migrate_ignore_shared()) {
>                  qemu_put_be64(f, block->mr->addr);
>              }
> +
> +            if (migrate_fixed_ram()) {
> +                long num_pages = block->used_length >> TARGET_PAGE_BITS;
> +                long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> +                /* Needed for external programs (think analyze-migration.py) */
> +                qemu_put_be32(f, bitmap_size);
> +
> +                /*
> +                 * The bitmap starts after pages_offset, so add 8 to
> +                 * account for the pages_offset size.
> +                 */
> +                block->bitmap_offset = qemu_get_offset(f) + 8;
> +
> +                /*
> +                 * Make pages_offset aligned to 1 MiB to account for
> +                 * migration file movement between filesystems with
> +                 * possibly different alignment restrictions when
> +                 * using O_DIRECT.
> +                 */
> +                block->pages_offset = ROUND_UP(block->bitmap_offset +
> +                                               bitmap_size, 0x100000);
> +                qemu_put_be64(f, block->pages_offset);
> +
> +                /* Now prepare offset for next ramblock */
> +                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
> +            }
>          }
>      }
>  
> @@ -3306,6 +3347,18 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +static void ram_save_shadow_bmap(QEMUFile *f)
> +{
> +    RAMBlock *block;
> +
> +    RAMBLOCK_FOREACH_MIGRATABLE(block) {
> +        long num_pages = block->used_length >> TARGET_PAGE_BITS;
> +        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
> +                           block->bitmap_offset);
> +    }
> +}
> +
>  /**
>   * ram_save_iterate: iterative stage for migration
>   *
> @@ -3413,9 +3466,15 @@ out:
>              return ret;
>          }
>  
> -        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> -        qemu_fflush(f);
> -        ram_transferred_add(8);
> +        /*
> +         * For fixed ram we don't want to pollute the migration stream with
> +         * EOS flags.
> +         */
> +        if (!migrate_fixed_ram()) {
> +            qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +            qemu_fflush(f);
> +            ram_transferred_add(8);
> +        }
>  
>          ret = qemu_file_get_error(f);
>      }
> @@ -3461,6 +3520,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>              pages = ram_find_and_save_block(rs);
>              /* no more blocks to sent */
>              if (pages == 0) {
> +                if (migrate_fixed_ram()) {
> +                    ram_save_shadow_bmap(f);
> +                }
>                  break;
>              }
>              if (pages < 0) {
> @@ -3483,8 +3545,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> -    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> -    qemu_fflush(f);
> +    if (!migrate_fixed_ram()) {
> +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +        qemu_fflush(f);
> +    }
>  
>      return 0;
>  }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 92102c1fe5..1f1bc19224 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -241,6 +241,7 @@ static bool should_validate_capability(int capability)
>      /* Validate only new capabilities to keep compatibility. */
>      switch (capability) {
>      case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
> +    case MIGRATION_CAPABILITY_FIXED_RAM:
>          return true;
>      default:
>          return false;
> diff --git a/qapi/migration.json b/qapi/migration.json
> index c84fa10e86..22eea58ce3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -485,7 +485,7 @@
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram',
> +           'compress', 'events', 'postcopy-ram', 'fixed-ram',
>             { 'name': 'x-colo', 'features': [ 'unstable' ] },
>             'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'multifd',
> -- 
> 2.35.3
>
Markus Armbruster March 31, 2023, 5:50 a.m. UTC | #2
Fabiano Rosas <farosas@suse.de> writes:

> From: Nikolay Borisov <nborisov@suse.com>
>
> Implement 'fixed-ram' feature. The core of the feature is to ensure that
> each ram page of the migration stream has a specific offset in the
> resulting migration stream. The reason why we'd want such behavior are
> two fold:
>
>  - When doing a 'fixed-ram' migration the resulting file will have a
>    bounded size, since pages which are dirtied multiple times will
>    always go to a fixed location in the file, rather than constantly
>    being added to a sequential stream. This eliminates cases where a vm
>    with, say, 1G of ram can result in a migration file that's 10s of
>    GBs, provided that the workload constantly redirties memory.
>
>  - It paves the way to implement DIO-enabled save/restore of the
>    migration stream as the pages are ensured to be written at aligned
>    offsets.
>
> The feature requires changing the stream format. First, a bitmap is
> introduced which tracks which pages have been written (i.e are
> dirtied) during migration and subsequently it's being written in the
> resulting file, again at a fixed location for every ramblock. Zero
> pages are ignored as they'd be zero in the destination migration as
> well. With the changed format data would look like the following:
>
> |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
>
> * pc - refers to the page_size/mr->addr members, so newly added members
> begin from "bitmap_size".
>
> This layout is initialized during ram_save_setup so instead of having a
> sequential stream of pages that follow the ramblock headers the dirty
> pages for a ramblock follow its header. Since all pages have a fixed
> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
> the end.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index c84fa10e86..22eea58ce3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -485,7 +485,7 @@
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram',
> +           'compress', 'events', 'postcopy-ram', 'fixed-ram',
>             { 'name': 'x-colo', 'features': [ 'unstable' ] },
>             'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'multifd',

Doc comment update is missing.
Daniel P. Berrangé March 31, 2023, 7:56 a.m. UTC | #3
On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > From: Nikolay Borisov <nborisov@suse.com>
> > 
> > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > each ram page of the migration stream has a specific offset in the
> > resulting migration stream. The reason why we'd want such behavior are
> > two fold:
> > 
> >  - When doing a 'fixed-ram' migration the resulting file will have a
> >    bounded size, since pages which are dirtied multiple times will
> >    always go to a fixed location in the file, rather than constantly
> >    being added to a sequential stream. This eliminates cases where a vm
> >    with, say, 1G of ram can result in a migration file that's 10s of
> >    GBs, provided that the workload constantly redirties memory.
> > 
> >  - It paves the way to implement DIO-enabled save/restore of the
> >    migration stream as the pages are ensured to be written at aligned
> >    offsets.
> > 
> > The feature requires changing the stream format. First, a bitmap is
> > introduced which tracks which pages have been written (i.e are
> > dirtied) during migration and subsequently it's being written in the
> > resulting file, again at a fixed location for every ramblock. Zero
> > pages are ignored as they'd be zero in the destination migration as
> > well. With the changed format data would look like the following:
> > 
> > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> 
> What happens with huge pages?  Would page size matter here?
> 
> I would assume it's fine it uses a constant (small) page size, assuming
> that should match with the granule that qemu tracks dirty (which IIUC is
> the host page size not guest's).
> 
> But I didn't yet pay any further thoughts on that, maybe it would be
> worthwhile in all cases to record page sizes here to be explicit or the
> meaning of bitmap may not be clear (and then the bitmap_size will be a
> field just for sanity check too).

I think recording the page sizes is an anti-feature in this case.

The migration format / state needs to reflect the guest ABI, but
we need to be free to have a different backend config behind that
on either side of the save/restore.

IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
small pages initially and after restore use 2 x 1 GB hugepages,
or vica-verca.

The important thing with the pages that are saved into the file
is that they are a 1:1 mapping guest RAM regions to file offsets.
IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
in the file.

If the src VM used 1 GB pages, we would be writing a full 2 GB
of data assuming both pages were dirty.

If the src VM used 4k pages, we would be writing some subset of
the 2 GB of data, and the rest would be unwritten.

Either way, when reading back the data we restore it into either
1 GB pages or 4k pages, because any places that were unwritten
originally will read back as zeros.
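
In other words, the file offset of a guest page is a pure function of its
position in the ramblock; the backing page size only changes the length of
the contiguous span each dirty page covers (illustrative helpers, not
QEMU code):

```c
#include <assert.h>
#include <stdint.h>

/* File offset of a page, from the block's pages_offset and the page's
 * byte offset inside the block -- independent of backend page size. */
static uint64_t page_file_offset(uint64_t pages_offset, uint64_t offset_in_block)
{
    return pages_offset + offset_in_block;
}

/* End of the span written for one dirty page of a given size. */
static uint64_t span_end(uint64_t pages_offset, uint64_t offset_in_block,
                         uint64_t page_size)
{
    return page_file_offset(pages_offset, offset_in_block) + page_size;
}
```

One dirty 1 GiB hugepage thus covers exactly the file span that 262144
dirty 4k pages would.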

> If postcopy might be an option, we'd want the page size to be the host page
> size because then looking up the bitmap will be straightforward, deciding
> whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> (UFFDIO_ZEROPAGE).

This format is only intended for the case where we are migrating to
a random-access medium, aka a file, because the fixed RAM mappings
to disk mean that we need to seek back to the original location to
re-write pages that get dirtied. It isn't suitable for a live
migration stream, and thus postcopy is inherently out of scope.

With regards,
Daniel
Peter Xu March 31, 2023, 2:39 p.m. UTC | #4
On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > From: Nikolay Borisov <nborisov@suse.com>
> > > 
> > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > each ram page of the migration stream has a specific offset in the
> > > resulting migration stream. The reason why we'd want such behavior are
> > > two fold:
> > > 
> > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > >    bounded size, since pages which are dirtied multiple times will
> > >    always go to a fixed location in the file, rather than constantly
> > >    being added to a sequential stream. This eliminates cases where a vm
> > >    with, say, 1G of ram can result in a migration file that's 10s of
> > >    GBs, provided that the workload constantly redirties memory.
> > > 
> > >  - It paves the way to implement DIO-enabled save/restore of the
> > >    migration stream as the pages are ensured to be written at aligned
> > >    offsets.
> > > 
> > > The feature requires changing the stream format. First, a bitmap is
> > > introduced which tracks which pages have been written (i.e are
> > > dirtied) during migration and subsequently it's being written in the
> > > resulting file, again at a fixed location for every ramblock. Zero
> > > pages are ignored as they'd be zero in the destination migration as
> > > well. With the changed format data would look like the following:
> > > 
> > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > 
> > What happens with huge pages?  Would page size matter here?
> > 
> > I would assume it's fine it uses a constant (small) page size, assuming
> > that should match with the granule that qemu tracks dirty (which IIUC is
> > the host page size not guest's).
> > 
> > But I didn't yet pay any further thoughts on that, maybe it would be
> > worthwhile in all cases to record page sizes here to be explicit or the
> > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > field just for sanity check too).
> 
> I think recording the page sizes is an anti-feature in this case.
> 
> The migration format / state needs to reflect the guest ABI, but
> we need to be free to have different backend config behind that
> either side of the save/restore.
> 
> IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> small pages initially and after restore use 2 x 1 GB hugepages,
> or vica-verca.
> 
> The important thing with the pages that are saved into the file
> is that they are a 1:1 mapping guest RAM regions to file offsets.
> IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> in the file.
> 
> If the src VM used 1 GB pages, we would be writing a full 2 GB
> of data assuming both pages were dirty.
> 
> If the src VM used 4k pages, we would be writing some subset of
> the 2 GB of data, and the rest would be unwritten.
> 
> Either way, when reading back the data we restore it into either
> 1 GB pages of 4k pages, beause any places there were unwritten
> orignally  will read back as zeros.

I think there's already the page size information, because there's a bitmap
embedded in the format, at least in the current proposal, and the bitmap can
only be defined with a page size provided in some form.

Here I agree the backend can change before/after a migration (live or
not).  Though the question is whether page size matters in the snapshot
layout rather than what the loaded QEMU instance will use as backend.

> 
> > If postcopy might be an option, we'd want the page size to be the host page
> > size because then looking up the bitmap will be straightforward, deciding
> > whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> > (UFFDIO_ZEROPAGE).
> 
> This format is only intended for the case where we are migrating to
> a random-access medium, aka a file, because the fixed RAM mappings
> to disk mean that we need to seek back to the original location to
> re-write pages that get dirtied. It isn't suitable for a live
> migration stream, and thus postcopy is inherantly out of scope.

Yes, I've commented also in the cover letter, but I can expand a bit.

I mean support postcopy only when loading, but not when saving.

Saving to file definitely cannot work with postcopy because there's no dest
qemu running.

Loading from file, OTOH, can work together with postcopy.

Right now AFAICT current approach is precopy loading the whole guest image
with the supported snapshot format (if I can call it just a snapshot).

What I want to say is we can consider supporting postcopy on loading: we
start an "empty" QEMU dest node, and when any page fault is triggered we
resolve it using userfault and look the page up in the snapshot file,
rather than sending a request back to the source.  I mention this because
there are two major benefits, which I touched on quickly in reply to the
cover letter but can also expand on here:

  - Firstly, the snapshot format stores pages at linear offsets, which
    means that when we know a page is missing we can look it up in the
    snapshot image in O(1) time.

  - Secondly, we don't need to send the page over the wire, nor do we
    need to send a request to the src qemu or anyone.  What we need here
    is simply to test the bit in the snapshot bitmap, then:

    - If it is copied, do UFFDIO_COPY to resolve page faults,
    - If it is not copied, do UFFDIO_ZEROPAGE (e.g., if not hugetlb,
      hugetlb can use a fake UFFDIO_COPY)

So this is a perfect testing ground for using postcopy in a very efficient
way against a file snapshot.
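
The lookup described above amounts to a constant-time bit test; a sketch
of the decision (the real code would then issue the corresponding ioctl
against the userfaultfd, which is omitted here):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

enum fault_action {
    ACTION_UFFDIO_COPY,     /* page body is present in the file */
    ACTION_UFFDIO_ZEROPAGE, /* page was zero/untouched at save time */
};

static int test_bit_sketch(const unsigned long *bmap, uint64_t nr)
{
    return (bmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/*
 * O(1) resolution of a postcopy fault against the snapshot: a set bit
 * means the page lives at pages_offset + page_index * page_size in the
 * file and should be copied in; a clear bit means filling with zeros
 * is enough.
 */
static enum fault_action resolve_fault(const unsigned long *bmap,
                                       uint64_t page_index)
{
    return test_bit_sketch(bmap, page_index) ? ACTION_UFFDIO_COPY
                                             : ACTION_UFFDIO_ZEROPAGE;
}
```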

Thanks,
Fabiano Rosas March 31, 2023, 3:05 p.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

>> 
>> * pc - refers to the page_size/mr->addr members, so newly added members
>> begin from "bitmap_size".
>
> Could you elaborate more on what's the pc?
>
> I also didn't see this *pc in below migration.rst update.
>

Yeah, you need to be looking at the code to figure that one out. That
was intended to reference some postcopy data that is (already) inserted
into the stream. Literally this:

    if (migrate_postcopy_ram() && block->page_size !=
                                  qemu_host_page_size) {
        qemu_put_be64(f, block->page_size);
    }
    if (migrate_ignore_shared()) {
        qemu_put_be64(f, block->mr->addr);
    }

It has nothing to do with this patch. I need to rewrite that part of the
commit message a bit.

>> 
>> This layout is initialized during ram_save_setup so instead of having a
>> sequential stream of pages that follow the ramblock headers the dirty
>> pages for a ramblock follow its header. Since all pages have a fixed
>> location RAM_SAVE_FLAG_EOS is no longer generated on every migration
>> iteration but there is effectively a single RAM_SAVE_FLAG_EOS right at
>> the end.
>> 
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>

...

>> @@ -4390,6 +4432,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>          }
>>      }
>>  
>> +    if (migrate_check_fixed_ram(s, &local_err) < 0) {
>
> This check might be too late afaict, QMP cmd "migrate" could have already
> succeeded.
>
> Can we do an early check in / close to qmp_migrate()?  The idea is we fail
> at the QMP migrate command there.
>

Yes, some of it depends on the QEMUFile being known, but I can at least
move part of the verification earlier.

>> +        migrate_fd_cleanup(s);
>> +        migrate_fd_error(s, local_err);
>> +        return;
>> +    }
>> +
>>      if (resume) {
>>          /* Wakeup the main migration thread to do the recovery */
>>          migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
>> @@ -4519,6 +4567,7 @@ static Property migration_properties[] = {
>>      DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
>>  
>>      /* Migration capabilities */
>> +    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
>>      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>>      DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
>>      DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 2da2f8a164..8cf3caecfe 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -416,6 +416,7 @@ bool migrate_zero_blocks(void);
>>  bool migrate_dirty_bitmaps(void);
>>  bool migrate_ignore_shared(void);
>>  bool migrate_validate_uuid(void);
>> +int migrate_fixed_ram(void);
>>  
>>  bool migrate_auto_converge(void);
>>  bool migrate_use_multifd(void);
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 96e8a19a58..56f0f782c8 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1310,9 +1310,14 @@ static int save_zero_page_to_file(PageSearchStatus *pss,
>>      int len = 0;
>>  
>>      if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>> -        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
>> -        qemu_put_byte(file, 0);
>> -        len += 1;
>> +        if (migrate_fixed_ram()) {
>> +            /* for zero pages we don't need to do anything */
>> +            len = 1;
>
> I think you wanted to increase the "duplicated" counter, but this will also
> increase ram-transferred even though only 1 byte.
>

Ah, well spotted, that is indeed incorrect.

> Perhaps just pass a pointer to keep the bytes, and return true/false to
> increase the counter (to make everything accurate)?
>

Ok

>> +        } else {
>> +            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
>> +            qemu_put_byte(file, 0);
>> +            len += 1;
>> +        }
>>          ram_release_page(block->idstr, offset);
>>      }
>>      return len;
Daniel P. Berrangé March 31, 2023, 3:34 p.m. UTC | #6
On Fri, Mar 31, 2023 at 10:39:23AM -0400, Peter Xu wrote:
> On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> > On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > > From: Nikolay Borisov <nborisov@suse.com>
> > > > 
> > > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > > each ram page of the migration stream has a specific offset in the
> > > > resulting migration stream. The reason why we'd want such behavior are
> > > > two fold:
> > > > 
> > > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > > >    bounded size, since pages which are dirtied multiple times will
> > > >    always go to a fixed location in the file, rather than constantly
> > > >    being added to a sequential stream. This eliminates cases where a vm
> > > >    with, say, 1G of ram can result in a migration file that's 10s of
> > > >    GBs, provided that the workload constantly redirties memory.
> > > > 
> > > >  - It paves the way to implement DIO-enabled save/restore of the
> > > >    migration stream as the pages are ensured to be written at aligned
> > > >    offsets.
> > > > 
> > > > The feature requires changing the stream format. First, a bitmap is
> > > > introduced which tracks which pages have been written (i.e are
> > > > dirtied) during migration and subsequently it's being written in the
> > > > resulting file, again at a fixed location for every ramblock. Zero
> > > > pages are ignored as they'd be zero in the destination migration as
> > > > well. With the changed format data would look like the following:
> > > > 
> > > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > > 
> > > What happens with huge pages?  Would page size matter here?
> > > 
> > > I would assume it's fine if it uses a constant (small) page size, assuming
> > > that should match with the granule that qemu tracks dirty (which IIUC is
> > > the host page size not guest's).
> > > 
> > > But I didn't yet pay any further thoughts on that, maybe it would be
> > > worthwhile in all cases to record page sizes here to be explicit, or the
> > > meaning of the bitmap may not be clear (and then bitmap_size would be a
> > > field just for sanity checking).
> > 
> > I think recording the page sizes is an anti-feature in this case.
> > 
> > The migration format / state needs to reflect the guest ABI, but
> > we need to be free to have a different backend config behind that on
> > either side of the save/restore.
> > 
> > IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> > small pages initially and after restore use 2 x 1 GB hugepages,
> > or vice versa.
> > 
> > The important thing with the pages that are saved into the file
> > is that they are a 1:1 mapping guest RAM regions to file offsets.
> > IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> > in the file.
> > 
> > If the src VM used 1 GB pages, we would be writing a full 2 GB
> > of data assuming both pages were dirty.
> > 
> > If the src VM used 4k pages, we would be writing some subset of
> > the 2 GB of data, and the rest would be unwritten.
> > 
> > Either way, when reading back the data we restore it into either
> > 1 GB pages or 4k pages, because any places that were unwritten
> > originally will read back as zeros.
> 
> I think there's already the page size information, because there's a bitmap
> embedded in the format at least in the current proposal, and the bitmap can
> only be defined with a page size provided in some form.
> 
> Here I agree the backend can change before/after a migration (live or
> not).  Though the question is whether page size matters in the snapshot
> layout rather than what the loaded QEMU instance will use as backend.

IIUC, the page size information merely sets a constraint on the granularity
of unwritten (sparse) regions in the file. If we didn't want to express
page size directly in the file format we would need explicit start/end
offsets for each written block. This is less convenient than just having
a bitmap, so I think it's OK to use the page size bitmap.
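To illustrate the equivalence being discussed: the page-granularity bitmap
and explicit start/end offsets encode the same information. A hypothetical
decoder (not QEMU code; `bmap_to_extents` is an invented name) that recovers
explicit extents from the bitmap might look like:

```c
#include <stdbool.h>
#include <stddef.h>

/* A written run of pages, expressed as an explicit extent */
struct extent {
    size_t first_page;
    size_t npages;
};

/* Test bit 'nr' in a long-word bitmap */
static bool bmap_test(const unsigned long *bmap, size_t nr)
{
    size_t bits = sizeof(unsigned long) * 8;
    return (bmap[nr / bits] >> (nr % bits)) & 1;
}

/*
 * Scan the bitmap and emit one extent per contiguous run of set bits.
 * Returns the total number of extents, writing at most 'max' of them.
 */
static size_t bmap_to_extents(const unsigned long *bmap, size_t npages,
                              struct extent *out, size_t max)
{
    size_t n = 0, i = 0;

    while (i < npages) {
        if (!bmap_test(bmap, i)) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < npages && bmap_test(bmap, i)) {
            i++;
        }
        if (n < max) {
            out[n] = (struct extent){ start, i - start };
        }
        n++;
    }
    return n;
}
```

The bitmap form has fixed size and trivial random access, which is why it is
the more convenient encoding for this file format.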

> > > If postcopy might be an option, we'd want the page size to be the host page
> > > size because then looking up the bitmap will be straightforward, deciding
> > > whether we should copy over page (UFFDIO_COPY) or fill in with zeros
> > > (UFFDIO_ZEROPAGE).
> > 
> > This format is only intended for the case where we are migrating to
> > a random-access medium, aka a file, because the fixed RAM mappings
> > to disk mean that we need to seek back to the original location to
> > re-write pages that get dirtied. It isn't suitable for a live
> > migration stream, and thus postcopy is inherently out of scope.
> 
> Yes, I've commented also in the cover letter, but I can expand a bit.
> 
> I mean support postcopy only when loading, but not when saving.
> 
> Saving to file definitely cannot work with postcopy because there's no dest
> qemu running.
> 
> Loading from file, OTOH, can work together with postcopy.

Ahh, I see what you mean.

> Right now, AFAICT, the current approach is precopy: loading the whole guest
> image from the supported snapshot format (if I can call it just a snapshot).
> 
> What I want to say is we can consider supporting postcopy on loading: we
> start an "empty" dest QEMU node and, when any page fault is triggered, we
> resolve it via userfault by looking the page up in the snapshot file rather
> than sending a request back to the source.  I mention that because there'll
> be two major benefits, which I touched on quickly in reply to the cover
> letter, but I can also expand here:
> 
>   - Firstly, the snapshot format ideally stores pages at linear offsets,
>     which means that when we know some page is missing we can look it up
>     in the snapshot image in O(1) time.
> 
>   - Secondly, the page doesn't need to go over the wire, nor do we need
>     to send a request to the src QEMU or anyone else.  What we need here
>     is simply to test the bit in the snapshot bitmap, then:
> 
>     - If it is copied, do UFFDIO_COPY to resolve page faults,
>     - If it is not copied, do UFFDIO_ZEROPAGE (e.g., if not hugetlb,
>       hugetlb can use a fake UFFDIO_COPY)
> 
> So this is a perfect testing ground for using postcopy in a very efficient
> way against a file snapshot.

Yes, that's a nice unexpected benefit of this fixed-ram file format.

With regards,
Daniel
Peter Xu March 31, 2023, 4:13 p.m. UTC | #7
On Fri, Mar 31, 2023 at 04:34:57PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 31, 2023 at 10:39:23AM -0400, Peter Xu wrote:
> > On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > > > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > > > From: Nikolay Borisov <nborisov@suse.com>
> > > > > 
> > > > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > > > each ram page of the migration stream has a specific offset in the
> > > > > resulting migration stream. The reason why we'd want such behavior are
> > > > > two fold:
> > > > > 
> > > > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > > > >    bounded size, since pages which are dirtied multiple times will
> > > > >    always go to a fixed location in the file, rather than constantly
> > > > >    being added to a sequential stream. This eliminates cases where a vm
> > > > >    with, say, 1G of ram can result in a migration file that's 10s of
> > > > >    GBs, provided that the workload constantly redirties memory.
> > > > > 
> > > > >  - It paves the way to implement DIO-enabled save/restore of the
> > > > >    migration stream as the pages are ensured to be written at aligned
> > > > >    offsets.
> > > > > 
> > > > > The feature requires changing the stream format. First, a bitmap is
> > > > > introduced which tracks which pages have been written (i.e are
> > > > > dirtied) during migration and subsequently it's being written in the
> > > > > resulting file, again at a fixed location for every ramblock. Zero
> > > > > pages are ignored as they'd be zero in the destination migration as
> > > > > well. With the changed format data would look like the following:
> > > > > 
> > > > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > > > 
> > > > What happens with huge pages?  Would page size matter here?
> > > > 
> > > > I would assume it's fine if it uses a constant (small) page size, assuming
> > > > that should match with the granule that qemu tracks dirty (which IIUC is
> > > > the host page size not guest's).
> > > > 
> > > > But I didn't yet pay any further thoughts on that, maybe it would be
> > > > worthwhile in all cases to record page sizes here to be explicit or the
> > > > meaning of bitmap may not be clear (and then the bitmap_size will be a
> > > > field just for sanity check too).
> > > 
> > > I think recording the page sizes is an anti-feature in this case.
> > > 
> > > The migration format / state needs to reflect the guest ABI, but
> > > we need to be free to have different backend config behind that
> > > either side of the save/restore.
> > > 
> > > IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> > > small pages initially and after restore use 2 x 1 GB hugepages,
> > > or vice versa.
> > > 
> > > The important thing with the pages that are saved into the file
> > > is that they are a 1:1 mapping guest RAM regions to file offsets.
> > > IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> > > in the file.
> > > 
> > > If the src VM used 1 GB pages, we would be writing a full 2 GB
> > > of data assuming both pages were dirty.
> > > 
> > > If the src VM used 4k pages, we would be writing some subset of
> > > the 2 GB of data, and the rest would be unwritten.
> > > 
> > > Either way, when reading back the data we restore it into either
> > > 1 GB pages or 4k pages, because any places that were unwritten
> > > originally will read back as zeros.
> > 
> > I think there's already the page size information, because there's a bitmap
> > embedded in the format at least in the current proposal, and the bitmap can
> > only be defined with a page size provided in some form.
> > 
> > Here I agree the backend can change before/after a migration (live or
> > not).  Though the question is whether page size matters in the snapshot
> > layout rather than what the loaded QEMU instance will use as backend.
> 
> IIUC, the page size information merely sets a constraint on the granularity
> of unwritten (sparse) regions in the file. If we didn't want to express
> page size directly in the file format we would need explicit start/end
> offsets for each written block. This is less convenient than just having
> a bitmap, so I think it's OK to use the page size bitmap.

I'm perfectly fine with having the bitmap.  The original question was about
whether we should store page_size in the same header along with the
bitmap.

Currently I think the page size can be implied either by the system
configuration (e.g. arch, CPU setup) or by the size of the bitmap.  So I'm
wondering whether it'll be cleaner to replace the bitmap size with page
size (hence one can calculate the bitmap size from the page size), or just
keep both of them for sanity.
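The dependency between the two fields can be made concrete. This sketch
mirrors the patch's own expression in ram_save_setup,
`BITS_TO_LONGS(num_pages) * sizeof(unsigned long)` (the function name here
is hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Same rounding QEMU's BITS_TO_LONGS performs */
#define BITS_PER_LONG_ (sizeof(unsigned long) * 8)
#define BITS_TO_LONGS_(n) (((n) + BITS_PER_LONG_ - 1) / BITS_PER_LONG_)

/*
 * Given a RAMBlock's used_length and a page size, the shadow bitmap
 * size in bytes is fully determined -- so storing page_size would let
 * a loader derive (and sanity-check) bitmap_size, or vice versa.
 */
static uint64_t shadow_bitmap_bytes(uint64_t used_length, uint64_t page_size)
{
    uint64_t num_pages = used_length / page_size;
    return BITS_TO_LONGS_(num_pages) * sizeof(unsigned long);
}
```

For example, a 2 MiB block at 4 KiB pages is 512 pages, hence a 64-byte
bitmap regardless of the host's word size.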

Besides, since we seem to be defining a new header format to be stored on
disks, maybe it'll be worthwhile to leave some space for future extensions
of the image?

So the image format can start with a version field (perhaps also with fields
explaining what it contains).  Then if someday we want to extend the image,
the new qemu binary will still be able to load the old image even if the
format changes.  Or vice versa: the old qemu binary would be able to
identify that it's loading a new image it doesn't really understand, and
properly notify the user rather than fail with weird loading errors.
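The versioning idea suggested above could look something like this. The
layout, magic value, and names are all hypothetical; nothing like this is
defined by the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary example values, purely illustrative */
#define FIXED_RAM_MAGIC   0x5145564dU
#define FIXED_RAM_VERSION 1U

/* A minimal versioned on-disk header for the image */
struct fixed_ram_header {
    uint32_t magic;     /* identifies the file as a fixed-ram image */
    uint32_t version;   /* bumped whenever the format changes */
    /* future extensions would append fields here */
};

/*
 * An old binary rejects images newer than it understands, giving the
 * user a clear error instead of a weird loading failure.
 */
static bool fixed_ram_header_valid(const struct fixed_ram_header *h)
{
    if (h->magic != FIXED_RAM_MAGIC) {
        return false;   /* not a fixed-ram image at all */
    }
    if (h->version > FIXED_RAM_VERSION) {
        return false;   /* image produced by a newer format revision */
    }
    return true;
}
```

A loader would read and validate this header before parsing any RAMBlock
entries.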
diff mbox series

Patch

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 1080211f8e..84112d7f3f 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -568,6 +568,42 @@  Others (especially either older devices or system devices which for
 some reason don't have a bus concept) make use of the ``instance id``
 for otherwise identically named devices.
 
+Fixed-ram format
+----------------
+
+When the ``fixed-ram`` capability is enabled, a slightly different
+stream format is used for the RAM section. Instead of having a
+sequential stream of pages that follow the RAMBlock headers, the dirty
+pages for a RAMBlock follow its header. This ensures that each RAM
+page has a fixed offset in the resulting migration stream.
+
+  - RAMBlock 1
+
+    - ID string length
+    - ID string
+    - Used size
+    - Shadow bitmap size
+    - Pages offset in migration stream*
+
+  - Shadow bitmap
+  - Sequence of pages for RAMBlock 1 (* offset points here)
+
+  - RAMBlock 2
+
+    - ID string length
+    - ID string
+    - Used size
+    - Shadow bitmap size
+    - Pages offset in migration stream*
+
+  - Shadow bitmap
+  - Sequence of pages for RAMBlock 2 (* offset points here)
+
+The ``fixed-ram`` capability can be enabled in both source and
+destination with:
+
+    ``migrate_set_capability fixed-ram on``
+
 Return path
 -----------
 
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index adc03df59c..4360c772c2 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -43,6 +43,14 @@  struct RAMBlock {
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
+    /* shadow dirty bitmap used when migrating to a file */
+    unsigned long *shadow_bmap;
+    /*
+     * offset in the file pages belonging to this ramblock are saved,
+     * used only during migration to a file.
+     */
+    off_t bitmap_offset;
+    uint64_t pages_offset;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
 
diff --git a/migration/migration.c b/migration/migration.c
index 177fb0de0f..29630523e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -168,7 +168,8 @@  INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
     MIGRATION_CAPABILITY_VALIDATE_UUID,
-    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
+    MIGRATION_CAPABILITY_ZERO_COPY_SEND,
+    MIGRATION_CAPABILITY_FIXED_RAM);
 
 /* When we add fault tolerance, we could have several
    migrations at once.  For now we don't need to add
@@ -1341,6 +1342,28 @@  static bool migrate_caps_check(bool *cap_list,
     }
 #endif
 
+    if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp, "Directly mapped memory incompatible with multifd");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
+            error_setg(errp, "Directly mapped memory incompatible with xbzrle");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp, "Directly mapped memory incompatible with compression");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp, "Directly mapped memory incompatible with postcopy ram");
+            return false;
+        }
+    }
+
     if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
         /* This check is reasonably expensive, so only when it's being
          * set the first time, also it's only the destination that needs
@@ -2736,6 +2759,11 @@  MultiFDCompression migrate_multifd_compression(void)
     return s->parameters.multifd_compression;
 }
 
+int migrate_fixed_ram(void)
+{
+    return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
+}
+
 int migrate_multifd_zlib_level(void)
 {
     MigrationState *s;
@@ -4324,6 +4352,20 @@  fail:
     return NULL;
 }
 
+static int migrate_check_fixed_ram(MigrationState *s, Error **errp)
+{
+    if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
+        return 0;
+    }
+
+    if (!qemu_file_is_seekable(s->to_dst_file)) {
+        error_setg(errp, "Directly mapped memory requires a seekable transport");
+        return -1;
+    }
+
+    return 0;
+}
+
 void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
@@ -4390,6 +4432,12 @@  void migrate_fd_connect(MigrationState *s, Error *error_in)
         }
     }
 
+    if (migrate_check_fixed_ram(s, &local_err) < 0) {
+        migrate_fd_cleanup(s);
+        migrate_fd_error(s, local_err);
+        return;
+    }
+
     if (resume) {
         /* Wakeup the main migration thread to do the recovery */
         migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
@@ -4519,6 +4567,7 @@  static Property migration_properties[] = {
     DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
 
     /* Migration capabilities */
+    DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
     DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
     DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
diff --git a/migration/migration.h b/migration/migration.h
index 2da2f8a164..8cf3caecfe 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -416,6 +416,7 @@  bool migrate_zero_blocks(void);
 bool migrate_dirty_bitmaps(void);
 bool migrate_ignore_shared(void);
 bool migrate_validate_uuid(void);
+int migrate_fixed_ram(void);
 
 bool migrate_auto_converge(void);
 bool migrate_use_multifd(void);
diff --git a/migration/ram.c b/migration/ram.c
index 96e8a19a58..56f0f782c8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1310,9 +1310,14 @@  static int save_zero_page_to_file(PageSearchStatus *pss,
     int len = 0;
 
     if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
-        len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
-        qemu_put_byte(file, 0);
-        len += 1;
+        if (migrate_fixed_ram()) {
+            /* for zero pages we don't need to do anything */
+            len = 1;
+        } else {
+            len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
+            qemu_put_byte(file, 0);
+            len += 1;
+        }
         ram_release_page(block->idstr, offset);
     }
     return len;
@@ -1394,14 +1399,20 @@  static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
 {
     QEMUFile *file = pss->pss_channel;
 
-    ram_transferred_add(save_page_header(pss, block,
-                                         offset | RAM_SAVE_FLAG_PAGE));
-    if (async) {
-        qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
-                              migrate_release_ram() &&
-                              migration_in_postcopy());
+    if (migrate_fixed_ram()) {
+        qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
+                           block->pages_offset + offset);
+        set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
     } else {
-        qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        ram_transferred_add(save_page_header(pss, block,
+                                             offset | RAM_SAVE_FLAG_PAGE));
+        if (async) {
+            qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
+                                  migrate_release_ram() &&
+                                  migration_in_postcopy());
+        } else {
+            qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+        }
     }
     ram_transferred_add(TARGET_PAGE_SIZE);
     stat64_add(&ram_atomic_counters.normal, 1);
@@ -2731,6 +2742,8 @@  static void ram_save_cleanup(void *opaque)
         block->clear_bmap = NULL;
         g_free(block->bmap);
         block->bmap = NULL;
+        g_free(block->shadow_bmap);
+        block->shadow_bmap = NULL;
     }
 
     xbzrle_cleanup();
@@ -3098,6 +3111,7 @@  static void ram_list_init_bitmaps(void)
              */
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
             block->clear_bmap_shift = shift;
             block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
         }
@@ -3287,6 +3301,33 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
             if (migrate_ignore_shared()) {
                 qemu_put_be64(f, block->mr->addr);
             }
+
+            if (migrate_fixed_ram()) {
+                long num_pages = block->used_length >> TARGET_PAGE_BITS;
+                long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+                /* Needed for external programs (think analyze-migration.py) */
+                qemu_put_be32(f, bitmap_size);
+
+                /*
+                 * The bitmap starts after pages_offset, so add 8 to
+                 * account for the pages_offset size.
+                 */
+                block->bitmap_offset = qemu_get_offset(f) + 8;
+
+                /*
+                 * Make pages_offset aligned to 1 MiB to account for
+                 * migration file movement between filesystems with
+                 * possibly different alignment restrictions when
+                 * using O_DIRECT.
+                 */
+                block->pages_offset = ROUND_UP(block->bitmap_offset +
+                                               bitmap_size, 0x100000);
+                qemu_put_be64(f, block->pages_offset);
+
+                /* Now prepare offset for next ramblock */
+                qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
+            }
         }
     }
 
@@ -3306,6 +3347,18 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static void ram_save_shadow_bmap(QEMUFile *f)
+{
+    RAMBlock *block;
+
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
+        long num_pages = block->used_length >> TARGET_PAGE_BITS;
+        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
+                           block->bitmap_offset);
+    }
+}
+
 /**
  * ram_save_iterate: iterative stage for migration
  *
@@ -3413,9 +3466,15 @@  out:
             return ret;
         }
 
-        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-        qemu_fflush(f);
-        ram_transferred_add(8);
+        /*
+         * For fixed ram we don't want to pollute the migration stream with
+         * EOS flags.
+         */
+        if (!migrate_fixed_ram()) {
+            qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+            qemu_fflush(f);
+            ram_transferred_add(8);
+        }
 
         ret = qemu_file_get_error(f);
     }
@@ -3461,6 +3520,9 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
             pages = ram_find_and_save_block(rs);
             /* no more blocks to sent */
             if (pages == 0) {
+                if (migrate_fixed_ram()) {
+                    ram_save_shadow_bmap(f);
+                }
                 break;
             }
             if (pages < 0) {
@@ -3483,8 +3545,10 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-    qemu_fflush(f);
+    if (!migrate_fixed_ram()) {
+        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+        qemu_fflush(f);
+    }
 
     return 0;
 }
diff --git a/migration/savevm.c b/migration/savevm.c
index 92102c1fe5..1f1bc19224 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -241,6 +241,7 @@  static bool should_validate_capability(int capability)
     /* Validate only new capabilities to keep compatibility. */
     switch (capability) {
     case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
+    case MIGRATION_CAPABILITY_FIXED_RAM:
         return true;
     default:
         return false;
diff --git a/qapi/migration.json b/qapi/migration.json
index c84fa10e86..22eea58ce3 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -485,7 +485,7 @@ 
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'postcopy-ram',
+           'compress', 'events', 'postcopy-ram', 'fixed-ram',
            { 'name': 'x-colo', 'features': [ 'unstable' ] },
            'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'multifd',