
[v20210701,15/40] tools: prepare to allocate saverestore arrays once

Message ID 20210701095635.15648-16-olaf@aepfle.de (mailing list archive)
State Superseded
Series leftover from 2020

Commit Message

Olaf Hering July 1, 2021, 9:56 a.m. UTC
The hotpath 'send_dirty_pages' is supposed to do just one thing: sending.
The other end 'handle_page_data' is supposed to do just receiving.

But instead both do other costly work, like memory allocations and copying data around.
Do the allocations once; the array sizes are a compile-time constant.
Avoid unneeded copying of data by receiving data directly into mapped guest memory.

This patch is just preparation; subsequent changes will populate the arrays.
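As an illustration of the second point (a sketch only, not the code added by this series; the helper name, parameters and SR_PAGE_SIZE constant are made up), the restore side could hand the already-mapped guest pages to readv() instead of reading into a temporary buffer and copying afterwards:

#include <sys/uio.h>

#define SR_PAGE_SIZE 4096UL  /* x86 page size assumed */

/*
 * Sketch: read one PAGE_DATA payload straight into mapped guest pages.
 * A real implementation must loop on short reads/EINTR and split
 * batches larger than IOV_MAX.
 */
static int read_pages_into_guest(int fd, void **mapped_pages,
                                 unsigned int count)
{
    struct iovec iov[count];
    unsigned int i;

    if ( !count )
        return 0;

    for ( i = 0; i < count; i++ )
    {
        iov[i].iov_base = mapped_pages[i];
        iov[i].iov_len  = SR_PAGE_SIZE;
    }

    return readv(fd, iov, count) ==
           (ssize_t)(count * SR_PAGE_SIZE) ? 0 : -1;
}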

Once all changes are applied, migration of a busy HVM domU changes as follows:

Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error

With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error

Note: the performance improvement depends on the network cards used,
the wirespeed and the host:
- No improvement is expected with a 1G link.
- Improvement can be seen as shown above on a 10G link.
- Just a slight improvement can be seen on a 100G link.

This change also populates sr_save_arrays with "batch_pfns" and
sr_restore_arrays with "pfns" to make sure malloc is always called
with a non-zero size.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

v02:
- rename xc_sr_save_arrays to sr_save_arrays
- rename xc_sr_restore_arrays to sr_restore_arrays
- merge handling of "batch_pfns" and "pfns" to make sure malloc is
  called with a non-zero size value (jgross)
---
 tools/libs/saverestore/common.h  | 12 +++++++++++-
 tools/libs/saverestore/restore.c | 14 ++++++++++----
 tools/libs/saverestore/save.c    | 27 +++++++++++++--------------
 3 files changed, 34 insertions(+), 19 deletions(-)

Comments

Andrew Cooper July 5, 2021, 10:44 a.m. UTC | #1
On 01/07/2021 10:56, Olaf Hering wrote:
> The hotpath 'send_dirty_pages' is supposed to do just one thing: sending.
> The other end 'handle_page_data' is supposed to do just receiving.
>
> But instead both do other costly work, like memory allocations and copying data around.
> Do the allocations once; the array sizes are a compile-time constant.
> Avoid unneeded copying of data by receiving data directly into mapped guest memory.

There is a reason why the logic was written that way, which was good at
the time.  It was so valgrind could spot bugs with the memory handling
in these functions (and it did indeed find many bugs during development).

These days, ASAN is perhaps the preferred tool, but both depend on
dynamic allocations to figure out the actual size of various objects.


I agree that the repeated alloc/free of same-sized memory regions on
each iteration is a waste.  However, if we are going to fix this by
using one-off allocations, then we want to compensate with logic such as
the call to VALGRIND_MAKE_MEM_UNDEFINED() in flush_batch(), and I think
we still need individual allocations to let the tools work properly.
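For illustration (a hedged sketch, not code from the patch or this thread; the helper name is hypothetical and it assumes the ctx->save.m layout introduced below), such compensating logic could re-poison the once-allocated arrays between batches, in the same spirit as the existing call in flush_batch():

#include <valgrind/memcheck.h>  /* VALGRIND_MAKE_MEM_UNDEFINED() */
#include "common.h"             /* struct xc_sr_context, MAX_BATCH_SIZE */

/* Hypothetical helper: mark the persistent save array as undefined
 * again before a new batch is built, so memcheck flags any read of
 * data left over from the previous batch.  An ASan build could use
 * ASAN_POISON_MEMORY_REGION()/ASAN_UNPOISON_MEMORY_REGION() from
 * <sanitizer/asan_interface.h> around batch reuse in a similar way. */
static void repoison_save_arrays(struct xc_sr_context *ctx)
{
    VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.m->batch_pfns,
                                MAX_BATCH_SIZE *
                                sizeof(*ctx->save.m->batch_pfns));
}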

>
> This patch is just preparation; subsequent changes will populate the arrays.
>
> Once all changes are applied, migration of a busy HVM domU changes as follows:
>
> Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
> 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
> 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
> 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
>
> With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
> 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
> 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
> 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
> 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
> 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
> 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error

These are good numbers, and clearly show that there is some value here,
but shouldn't they be in the series header?  They're not terribly
relevant to this patch specifically.

Also, while I can believe that the first sample is slower than the later
ones (in particular, during the first round, we've got to deal with the
non-RAM regions too and therefore spend more time making hypercalls),
I'm not sure I believe the final sample.  Given the byte/page count, the
substantially smaller elapsed time looks suspicious.

> Note: the performance improvement depends on the network cards used,
> the wirespeed and the host:
> - No improvement is expected with a 1G link.
> - Improvement can be seen as shown above on a 10G link.
> - Just a slight improvement can be seen on a 100G link.

Are these observations with an otherwise idle dom0?

Even if CPU time in dom0 wasn't the bottleneck with a 1G link, the
reduction in CPU time you observe at higher link speeds will still be
making a difference at 1G, and will probably be visible if you perform
multiple concurrent migrations.

~Andrew
Olaf Hering July 5, 2021, 11:27 a.m. UTC | #2
On Mon, 5 Jul 2021 11:44:30 +0100,
Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> On 01/07/2021 10:56, Olaf Hering wrote:
> I agree that the repeated alloc/free of same-sized memory regions on
> each iteration is a waste.  However, if we are going to fix this by
> using one-off allocations, then we want to compensate with logic such as
> the call to VALGRIND_MAKE_MEM_UNDEFINED() in flush_batch(), and I think
> we still need individual allocations to let the tools work properly.

If this is a concern, let's just do a few individual arrays.

> > This patch is just preparation; subsequent changes will populate the arrays.
> >
> > Once all changes are applied, migration of a busy HVM domU changes as follows:
> >
> > Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
> > 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
> > 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
> > 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
> >
> > With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
> > 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
> > 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
> > 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
> > 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
> > 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
> > 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error  
> 
> These are good numbers, and clearly show that there is some value here,
> but shouldn't they be in the series header?  They're not terribly
> relevant to this patch specifically.

The cover letter is unfortunately not under version control.
Perhaps there are ways with git notes; I have never used it.

> Also, while I can believe that the first sample is slower than the later
> ones (in particular, during the first round, we've got to deal with the
> non-RAM regions too and therefore spend more time making hypercalls),
> I'm not sure I believe the final sample.  Given the byte/page count, the
> substantially smaller elapsed time looks suspicious.

The first one is slower because it has to wait for the receiver to allocate pages.
But maybe as you said there are other aspects as well.
The last one is always way faster because apparently map/unmap is less costly with a stopped guest.
Right now the code may reach up to 15Gbit/s. The next step is to map the domU just once to reach wirespeed.

> Are these observations with an otherwise idle dom0?

Yes. Idle dom0 and a domU busy with touching its memory.

Unfortunately, I'm not able to prove the reported gain with the systems I have today.
I'm waiting for preparation of different hardware, right now I have only a pair of CoyotePass and WilsonCity.

I'm sure there were NUMA effects involved. Last year's libvirt was unable to properly pin vcpus. If I pin all the involved memory to node#0 there is some jitter in the logged numbers, but no obvious improvement. The first iteration is slightly faster, but that is it.

Meanwhile I think this commit message needs to be redone.

> Even if CPU time in dom0 wasn't the bottleneck with a 1G link, the
> reduction in CPU time you observe at higher link speeds will still be
> making a difference at 1G, and will probably be visible if you perform
> multiple concurrent migrations.

Yes, I will see what numbers I get with two or more migrations running in parallel.

Olaf
Andrew Cooper July 5, 2021, 1:01 p.m. UTC | #3
On 05/07/2021 12:27, Olaf Hering wrote:
> On Mon, 5 Jul 2021 11:44:30 +0100,
> Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>>> This patch is just preparation; subsequent changes will populate the arrays.
>>>
>>> Once all changes are applied, migration of a busy HVM domU changes as follows:
>>>
>>> Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
>>> 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
>>> 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
>>> 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
>>>
>>> With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
>>> 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
>>> 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
>>> 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
>>> 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
>>> 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
>>> 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error  
>> These are good numbers, and clearly show that there is some value here,
>> but shouldn't they be in the series header?  They're not terribly
>> relevant to this patch specifically.
> The cover letter is unfortunately not under version control.
> Perhaps there are ways with git notes; I have never used it.

In the end, we'll want some kind of note in the changelog, but that
wants to be a single line.  It's probably fine to say "Improve migration
performance.  25% better bandwidth when NIC link speed is the
bottleneck, due to optimising the data handling logic."

>> Also, while I can believe that the first sample is slower than the later
>> ones (in particular, during the first round, we've got to deal with the
>> non-RAM regions too and therefore spend more time making hypercalls),
>> I'm not sure I believe the final sample.  Given the byte/page count, the
>> substantially smaller elapsed time looks suspicious.
> The first one is slower because it has to wait for the receiver to allocate pages.
> But maybe as you said there are other aspects as well.
> The last one is always way faster because apparently map/unmap is less costly with a stopped guest.

That's suspicious.  If true, we've got some very wonky behaviour in the
hypervisor...

> Right now the code may reach up to 15Gbit/s. The next step is to map the domU just once to reach wirespeed.

We can in principle do that in 64bit toolstacks, for HVM guests.  But
not usefully until we've fixed the fact that Xen has no idea what the
guest physmap is supposed to look like.

At the moment, the current scheme is a little more resilient to bugs
caused by the guest attempting to balloon during the live phase.

Another area to improve, which can be started now, is to avoid bounce
buffering hypercall data.  Now that we have /dev/xen/hypercall which you
can mmap() regular kernel pages from, what we want is a simple memory
allocator which we can allocate permanent hypercall buffers from, rather
than the internals of every xc_*() hypercall wrapper bouncing the data
in (potentially) both directions.
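A minimal sketch of that idea (assumptions: Linux, and a /dev/xen/hypercall device behaving as libxencall's Linux backend uses it; the helper name and HC_PAGE_SIZE constant are made up), allocating hypercall-safe pages once and keeping them mapped for the lifetime of the migration:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define HC_PAGE_SIZE 4096UL

/* Map 'npages' hypercall-safe pages once; the fd is deliberately kept
 * open for the lifetime of the buffer (simplification - a real
 * allocator would track the fd and mapping and free them at teardown). */
static void *alloc_permanent_hypercall_buffer(size_t npages)
{
    int fd = open("/dev/xen/hypercall", O_RDWR | O_CLOEXEC);
    void *p;

    if ( fd < 0 )
        return NULL;

    p = mmap(NULL, npages * HC_PAGE_SIZE, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}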

>
>> Are these observations with an otherwise idle dom0?
> Yes. Idle dom0 and a domU busy with touching its memory.
>
> Unfortunately, I'm not able to prove the reported gain with the systems I have today.
> I'm waiting for preparation of different hardware, right now I have only a pair of CoyotePass and WilsonCity.
>
> I'm sure there were NUMA effects involved. Last year's libvirt was unable to properly pin vcpus. If I pin all the involved memory to node#0 there is some jitter in the logged numbers, but no obvious improvement. The first iteration is slightly faster, but that is it.

Oh - so the speedup might not be from reduced data handling?

Avoiding unnecessary data copies is clearly going to improve things,
even if it isn't 25%.

~Andrew
Olaf Hering July 5, 2021, 2:11 p.m. UTC | #4
On Mon, 5 Jul 2021 14:01:07 +0100,
Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> > The last one is always way faster because apparently map/unmap is less costly with a stopped guest.  
> That's suspicious.  If true, we've got some very wonky behaviour in the
> hypervisor...

At least the transfer rate in this last iteration is consistent.
Since the only difference I can see is the fact that the domU is suspended, I suspect the mapping.
I have not investigated where the time is spent; I should probably do that one day to better understand this specific difference.

> > Right now the code may reach up to 15Gbit/s. The next step is to map the domU just once to reach wirespeed.  
> 
> We can in principle do that in 64bit toolstacks, for HVM guests.  But
> not usefully until we've fixed the fact that Xen has no idea what the
> guest physmap is supposed to look like.

Why would Xen care?
My attempt last year with new save/restore code simply 'mapped' the memory on both sides; the 'unmap' was done in exit().

With this approach I got wirespeed in all iterations with a 10G link.
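For illustration (a sketch under assumptions, not the code of that experiment; the helper and its parameters are hypothetical), mapping a large window of the domU once via the stable foreign-memory API and keeping it for all iterations could look like:

#include <stdint.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

/* Map 'nr' guest frames in one call and keep the mapping for the whole
 * migration; err[i] reports per-page failures (e.g. ballooned-out
 * pages).  The mapping is released only at teardown with
 * xenforeignmemory_unmap(). */
static void *map_guest_window(xenforeignmemory_handle *fmem, uint32_t domid,
                              const xen_pfn_t *gfns, int *err, size_t nr)
{
    return xenforeignmemory_map(fmem, domid, PROT_READ, nr, gfns, err);
}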

> At the moment, the current scheme is a little more resilient to bugs
> caused by the guest attempting to balloon during the live phase.

I did not specifically test how a domU behaves when it claims and releases pages while being migrated.
I think this series would handle at least parts of that:
If a page appears or disappears, it will be recognized by getpageframeinfo.
If a page disappears between getpageframeinfo and MMAPBATCH, I expect an error.
This error is fatal right now; perhaps the code could catch this and move on.
If a page disappears after MMAPBATCH, it will be caught by later iterations.


> Another area to improve, which can be started now, is to avoid bounce
> buffering hypercall data.  Now that we have /dev/xen/hypercall which you
> can mmap() regular kernel pages from, what we want is a simple memory
> allocator which we can allocate permanent hypercall buffers from, rather
> than the internals of every xc_*() hypercall wrapper bouncing the data
> in (potentially) both directions.

That sounds like a good idea. Not sure how costly the current approach is.

> Oh - so the speedup might not be from reduced data handling?

At least not on the systems I have now.

Perhaps I should test how the numbers look with the NIC and the toolstack in node#0, and the domU in node#1.


Olaf
Olaf Hering July 13, 2021, 5:50 p.m. UTC | #5
On Mon, 5 Jul 2021 14:01:07 +0100,
Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> > Unfortunately, I'm not able to prove the reported gain with the systems I have today.
> > I'm waiting for preparation of different hardware, right now I have only a pair of CoyotePass and WilsonCity.
> >
> > I'm sure there were NUMA effects involved. Last year's libvirt was unable to properly pin vcpus. If I pin all the involved memory to node#0 there is some jitter in the logged numbers, but no obvious improvement. The first iteration is slightly faster, but that is it.  
> 
> Oh - so the speedup might not be from reduced data handling?
> 
> Avoiding unnecessary data copies is clearly going to improve things,
> even if it isn't 25%.


For HVM, the only notable improvement is in the initial iteration.

On average, with 4 migrations of a single domU from A to B and back from B to A, the transfer rate goes up from ~490MiB/s to ~677MiB/s. The initial transfer time, in seconds, for the 4194299 domU pages:

with plain staging:
36.800582009
32.145531727
31.827540709
33.009956041
34.951513466
33.416769973
32.128985762
33.201786076

with the series applied:
24.266428156
24.632898175
24.112660134
23.603475994
24.418323859
23.841875914
25.087779229
23.493812677
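
Averaged over these runs, the initial iteration takes about 33.4 sec with plain staging versus about 24.2 sec with the series applied, i.e. roughly 28% less time for the same 4194299 pages.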


Migration of a PV domU is much faster, but the transfer rate for each iteration varies, with or without the patches applied.


Olaf

Patch

diff --git a/tools/libs/saverestore/common.h b/tools/libs/saverestore/common.h
index 252076cf51..968bb8af13 100644
--- a/tools/libs/saverestore/common.h
+++ b/tools/libs/saverestore/common.h
@@ -223,6 +223,15 @@  static inline int update_blob(struct xc_sr_blob *blob,
     return 0;
 }
 
+struct sr_save_arrays {
+    xen_pfn_t batch_pfns[MAX_BATCH_SIZE];
+};
+
+struct sr_restore_arrays {
+    /* handle_page_data */
+    xen_pfn_t pfns[MAX_BATCH_SIZE];
+};
+
 struct xc_sr_context
 {
     xc_interface *xch;
@@ -255,11 +264,11 @@  struct xc_sr_context
 
             struct precopy_stats stats;
 
-            xen_pfn_t *batch_pfns;
             unsigned int nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
+            struct sr_save_arrays *m;
         } save;
 
         struct /* Restore data. */
@@ -311,6 +320,7 @@  struct xc_sr_context
 
             /* Sender has invoked verify mode on the stream. */
             bool verify;
+            struct sr_restore_arrays *m;
         } restore;
     };
 
diff --git a/tools/libs/saverestore/restore.c b/tools/libs/saverestore/restore.c
index 799170c7a1..c203ce503d 100644
--- a/tools/libs/saverestore/restore.c
+++ b/tools/libs/saverestore/restore.c
@@ -315,7 +315,7 @@  static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
     unsigned int i, pages_of_data = 0;
     int rc = -1;
 
-    xen_pfn_t *pfns = NULL, pfn;
+    xen_pfn_t *pfns = ctx->restore.m->pfns, pfn;
     uint32_t *types = NULL, type;
 
     /*
@@ -363,9 +363,8 @@  static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
         goto err;
     }
 
-    pfns = malloc(pages->count * sizeof(*pfns));
     types = malloc(pages->count * sizeof(*types));
-    if ( !pfns || !types )
+    if ( !types )
     {
         ERROR("Unable to allocate enough memory for %u pfns",
               pages->count);
@@ -412,7 +411,6 @@  static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
                            &pages->pfn[pages->count]);
  err:
     free(types);
-    free(pfns);
 
     return rc;
 }
@@ -739,6 +737,13 @@  static int setup(struct xc_sr_context *ctx)
     }
     ctx->restore.allocated_rec_num = DEFAULT_BUF_RECORDS;
 
+    ctx->restore.m = malloc(sizeof(*ctx->restore.m));
+    if ( !ctx->restore.m ) {
+        ERROR("Unable to allocate memory for arrays");
+        rc = -1;
+        goto err;
+    }
+
  err:
     return rc;
 }
@@ -757,6 +762,7 @@  static void cleanup(struct xc_sr_context *ctx)
         xc_hypercall_buffer_free_pages(
             xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->restore.p2m_size)));
 
+    free(ctx->restore.m);
     free(ctx->restore.buffered_records);
     free(ctx->restore.populated_pfns);
 
diff --git a/tools/libs/saverestore/save.c b/tools/libs/saverestore/save.c
index f8fbe7a742..e29b6e1d66 100644
--- a/tools/libs/saverestore/save.c
+++ b/tools/libs/saverestore/save.c
@@ -77,7 +77,7 @@  static int write_checkpoint_record(struct xc_sr_context *ctx)
 
 /*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
- * is constructed in ctx->save.batch_pfns.
+ * is constructed in ctx->save.m->batch_pfns.
  *
  * This function:
  * - gets the types for each pfn in the batch.
@@ -128,12 +128,12 @@  static int write_batch(struct xc_sr_context *ctx)
     for ( i = 0; i < nr_pfns; ++i )
     {
         types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
-                                                      ctx->save.batch_pfns[i]);
+                                                      ctx->save.m->batch_pfns[i]);
 
         /* Likely a ballooned page. */
         if ( mfns[i] == INVALID_MFN )
         {
-            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+            set_bit(ctx->save.m->batch_pfns[i], ctx->save.deferred_pages);
             ++ctx->save.nr_deferred_pages;
         }
     }
@@ -179,7 +179,7 @@  static int write_batch(struct xc_sr_context *ctx)
             if ( errors[p] )
             {
                 ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
-                      ctx->save.batch_pfns[i], mfns[p], errors[p]);
+                      ctx->save.m->batch_pfns[i], mfns[p], errors[p]);
                 goto err;
             }
 
@@ -193,7 +193,7 @@  static int write_batch(struct xc_sr_context *ctx)
             {
                 if ( rc == -1 && errno == EAGAIN )
                 {
-                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+                    set_bit(ctx->save.m->batch_pfns[i], ctx->save.deferred_pages);
                     ++ctx->save.nr_deferred_pages;
                     types[i] = XEN_DOMCTL_PFINFO_XTAB;
                     --nr_pages;
@@ -224,7 +224,7 @@  static int write_batch(struct xc_sr_context *ctx)
     rec.length += nr_pages * PAGE_SIZE;
 
     for ( i = 0; i < nr_pfns; ++i )
-        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.m->batch_pfns[i];
 
     iov[0].iov_base = &rec.type;
     iov[0].iov_len = sizeof(rec.type);
@@ -296,9 +296,9 @@  static int flush_batch(struct xc_sr_context *ctx)
 
     if ( !rc )
     {
-        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.m->batch_pfns,
                                     MAX_BATCH_SIZE *
-                                    sizeof(*ctx->save.batch_pfns));
+                                    sizeof(*ctx->save.m->batch_pfns));
     }
 
     return rc;
@@ -315,7 +315,7 @@  static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
         rc = flush_batch(ctx);
 
     if ( rc == 0 )
-        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
+        ctx->save.m->batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 
     return rc;
 }
@@ -849,13 +849,12 @@  static int setup(struct xc_sr_context *ctx)
 
     dirty_bitmap = xc_hypercall_buffer_alloc_pages(
         xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
-    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
-                                  sizeof(*ctx->save.batch_pfns));
     ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
+    ctx->save.m = malloc(sizeof(*ctx->save.m));
 
-    if ( !ctx->save.batch_pfns || !dirty_bitmap || !ctx->save.deferred_pages )
+    if ( !ctx->save.m || !dirty_bitmap || !ctx->save.deferred_pages )
     {
-        ERROR("Unable to allocate memory for dirty bitmaps, batch pfns and"
+        ERROR("Unable to allocate memory for dirty bitmaps and"
               " deferred pages");
         rc = -1;
         errno = ENOMEM;
@@ -884,7 +883,7 @@  static void cleanup(struct xc_sr_context *ctx)
     xc_hypercall_buffer_free_pages(xch, dirty_bitmap,
                                    NRPAGES(bitmap_size(ctx->save.p2m_size)));
     free(ctx->save.deferred_pages);
-    free(ctx->save.batch_pfns);
+    free(ctx->save.m);
 }
 
 /*