Message ID: 20210701095635.15648-16-olaf@aepfle.de (mailing list archive)
State: Superseded
Series: leftover from 2020
On 01/07/2021 10:56, Olaf Hering wrote:
> The hotpath 'send_dirty_pages' is supposed to do just one thing: sending.
> The other end 'handle_page_data' is supposed to do just receiving.
>
> But instead both do other costly work like memory allocations and data moving.
> Do the allocations once; the array sizes are a compile-time constant.
> Avoid unneeded copying of data by receiving data directly into mapped guest memory.

There is a reason why the logic was written that way, which was good at the
time.  It was so valgrind could spot bugs with the memory handling in these
functions (and it did indeed find many bugs during development).

These days ASAN is perhaps the preferred tool, but both depend on dynamic
allocations to figure out the actual size of various objects.

I agree that the repeated alloc/free of same-sized memory regions on each
iteration is a waste.  However, if we are going to fix this by using one-off
allocations, then we want to compensate with logic such as the call to
VALGRIND_MAKE_MEM_UNDEFINED() in flush_batch(), and I think we still need
individual allocations to let the tools work properly.

> This patch is just preparation; subsequent changes will populate the arrays.
>
> Once all changes are applied, migration of a busy HVM domU changes as follows:
>
> Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
> 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
> 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
> 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
> 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
>
> With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
> 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
> 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
> 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
> 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
> 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
> 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error

These are good numbers, and clearly show that there is some value here, but
shouldn't they be in the series header?  They're not terribly relevant to
this patch specifically.

Also, while I can believe that the first sample is slower than the later
ones (in particular, during the first round, we've got to deal with the
non-RAM regions too and therefore spend more time making hypercalls), I'm
not sure I believe the final sample.  Given the byte/page count, the
substantially smaller elapsed time looks suspicious.

> Note: the performance improvement depends on the network cards in use,
> the wirespeed and the host:
> - No improvement is expected with a 1G link.
> - Improvement can be seen as shown above on a 10G link.
> - Just a slight improvement can be seen on a 100G link.

Are these observations with an otherwise idle dom0?

Even if CPU time in dom0 wasn't the bottleneck with a 1G link, the reduction
in CPU time you observe at higher link speeds will still be making a
difference at 1G, and will probably be visible if you perform multiple
concurrent migrations.

~Andrew
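A minimal sketch of the compromise described above: keep the allocations one-off, but make them individually per array (so valgrind/ASAN still track per-object bounds), and re-mark the batch buffer undefined after each flush, as flush_batch() already does.  It reuses the existing struct xc_sr_context fields, MAX_BATCH_SIZE, bitmap_alloc() and VALGRIND_MAKE_MEM_UNDEFINED() from this code; the helper names are hypothetical.

#include <stdlib.h>
#include <valgrind/memcheck.h>          /* VALGRIND_MAKE_MEM_UNDEFINED() */

/* Hypothetical: one-off but *individual* allocations, so the tools keep
 * separate redzones and definedness tracking for each array. */
static int setup_save_buffers(struct xc_sr_context *ctx)
{
    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
                                  sizeof(*ctx->save.batch_pfns));
    ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);

    return (ctx->save.batch_pfns && ctx->save.deferred_pages) ? 0 : -1;
}

/* After a successful flush the old pfns are stale; telling memcheck keeps
 * "read of an unwritten slot" reportable across reuse of the buffer. */
static void mark_batch_stale(struct xc_sr_context *ctx)
{
    ctx->save.nr_batch_pfns = 0;
    VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
                                MAX_BATCH_SIZE *
                                sizeof(*ctx->save.batch_pfns));
}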
On Mon, 5 Jul 2021 11:44:30 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> On 01/07/2021 10:56, Olaf Hering wrote:
> I agree that the repeated alloc/free of same-sized memory regions on each
> iteration is a waste.  However, if we are going to fix this by using one-off
> allocations, then we want to compensate with logic such as the call to
> VALGRIND_MAKE_MEM_UNDEFINED() in flush_batch(), and I think we still need
> individual allocations to let the tools work properly.

If this is a concern, let's just use a few individual arrays.

> > This patch is just preparation; subsequent changes will populate the arrays.
> >
> > Once all changes are applied, migration of a busy HVM domU changes as follows:
> >
> > Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
> > 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
> > 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
> > 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
> > 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
> >
> > With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
> > 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
> > 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
> > 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
> > 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
> > 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
> > 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error
>
> These are good numbers, and clearly show that there is some value here, but
> shouldn't they be in the series header?  They're not terribly relevant to
> this patch specifically.

The cover letter is unfortunately not under version control.  Perhaps there
are ways with git notes, but I have never used it.

> Also, while I can believe that the first sample is slower than the later
> ones (in particular, during the first round, we've got to deal with the
> non-RAM regions too and therefore spend more time making hypercalls), I'm
> not sure I believe the final sample.  Given the byte/page count, the
> substantially smaller elapsed time looks suspicious.

The first one is slower because it has to wait for the receiver to allocate
pages.  But maybe, as you said, there are other aspects as well.

The last one is always way faster, because apparently map/unmap is less
costly with a stopped guest.

Right now the code may reach up to 15 Gbit/s.  The next step is to map the
domU just once to reach wirespeed.

> Are these observations with an otherwise idle dom0?

Yes.  Idle dom0 and a domU busy touching its memory.

Unfortunately, I'm not able to prove the reported gain with the systems I
have today.  I'm waiting for different hardware to be prepared; right now I
have only a pair of CoyotePass and WilsonCity.

I'm sure there were NUMA effects involved.  Last year's libvirt was unable
to properly pin vcpus.  If I pin all the involved memory to node#0 there is
some jitter in the logged numbers, but no obvious improvement.  The first
iteration is slightly faster, but that is it.  Meanwhile I think this commit
message needs to be redone.

> Even if CPU time in dom0 wasn't the bottleneck with a 1G link, the reduction
> in CPU time you observe at higher link speeds will still be making a
> difference at 1G, and will probably be visible if you perform multiple
> concurrent migrations.

Yes, I will see what numbers I get with two or more migrations running in
parallel.

Olaf
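A rough sketch of the "map once, receive directly" idea, using the public libxenforeignmemory and read(2) interfaces.  Keeping one long-lived mapping for the whole migration, and the helper names below, are assumptions for illustration, not what the posted series does.

#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <xenctrl.h>
#include <xenforeignmemory.h>

/* Hypothetical: map a window of guest pfns once and keep the mapping for
 * the whole migration, so each iteration only touches the data. */
static void *map_window(xenforeignmemory_handle *fmem, uint32_t domid,
                        const xen_pfn_t *pfns, size_t count, int *errs)
{
    return xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                count, pfns, errs);
}

/* Receive side: read page data from the stream straight into the mapped
 * guest memory instead of bouncing it through a malloc()ed buffer. */
static int recv_into_guest(int fd, void *mapping, size_t pages)
{
    char *p = mapping;
    size_t len = pages * XC_PAGE_SIZE;

    while ( len )
    {
        ssize_t r = read(fd, p, len);

        if ( r <= 0 )
            return -1;
        p += r;
        len -= r;
    }

    return 0;
}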
On 05/07/2021 12:27, Olaf Hering wrote:
> On Mon, 5 Jul 2021 11:44:30 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>>> This patch is just preparation; subsequent changes will populate the arrays.
>>>
>>> Once all changes are applied, migration of a busy HVM domU changes as follows:
>>>
>>> Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
>>> 2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
>>> 2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
>>> 2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
>>> 2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error
>>>
>>> With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
>>> 2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
>>> 2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
>>> 2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
>>> 2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
>>> 2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
>>> 2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error
>> These are good numbers, and clearly show that there is some value here, but
>> shouldn't they be in the series header?  They're not terribly relevant to
>> this patch specifically.
> The cover letter is unfortunately not under version control.
> Perhaps there are ways with git notes, but I have never used it.

In the end, we'll want some kind of note in the changelog, but that wants to
be a single line.  It's probably fine to say "Improve migration performance.
25% better bandwidth when NIC link speed is the bottleneck, due to
optimising the data handling logic."

>> Also, while I can believe that the first sample is slower than the later
>> ones (in particular, during the first round, we've got to deal with the
>> non-RAM regions too and therefore spend more time making hypercalls), I'm
>> not sure I believe the final sample.  Given the byte/page count, the
>> substantially smaller elapsed time looks suspicious.
> The first one is slower because it has to wait for the receiver to allocate pages.
> But maybe, as you said, there are other aspects as well.
> The last one is always way faster, because apparently map/unmap is less costly with a stopped guest.

That's suspicious.  If true, we've got some very wonky behaviour in the
hypervisor...

> Right now the code may reach up to 15 Gbit/s.  The next step is to map the domU just once to reach wirespeed.

We can in principle do that in 64-bit toolstacks, for HVM guests.  But not
usefully until we've fixed the fact that Xen has no idea what the guest
physmap is supposed to look like.  At the moment, the current scheme is a
little more resilient to bugs caused by the guest attempting to balloon
during the live phase.

Another area to improve, which can be started now, is to avoid bounce
buffering hypercall data.  Now that we have /dev/xen/hypercall which you can
mmap() regular kernel pages from, what we want is a simple memory allocator
which we can allocate permanent hypercall buffers from, rather than the
internals of every xc_*() hypercall wrapper bouncing the data in
(potentially) both directions.

>> Are these observations with an otherwise idle dom0?
> Yes.  Idle dom0 and a domU busy touching its memory.
>
> Unfortunately, I'm not able to prove the reported gain with the systems I have today.
> I'm waiting for different hardware to be prepared; right now I have only a pair of CoyotePass and WilsonCity.
>
> I'm sure there were NUMA effects involved.  Last year's libvirt was unable to properly pin vcpus.  If I pin all the involved memory to node#0 there is some jitter in the logged numbers, but no obvious improvement.  The first iteration is slightly faster, but that is it.

Oh - so the speedup might not be from reduced data handling?

Avoiding unnecessary data copies is clearly going to improve things, even if
it isn't 25%.

~Andrew
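A sketch of the allocator idea, assuming the Linux /dev/xen/hypercall device whose mmap()ed pages are safe to hand to the hypervisor; the pool structure and bump allocator are hypothetical, just to show the shape of "allocate permanent hypercall buffers once" versus bouncing in every wrapper.

#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical persistent hypercall-buffer pool: one mmap() at start-up,
 * then trivial bump allocation, instead of a bounce copy per xc_*() call. */
struct hcall_pool {
    int fd;
    uint8_t *base;
    size_t size, used;
};

static int hcall_pool_init(struct hcall_pool *pool, size_t size)
{
    pool->fd = open("/dev/xen/hypercall", O_RDWR | O_CLOEXEC);
    if ( pool->fd < 0 )
        return -1;

    pool->base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      pool->fd, 0);
    if ( pool->base == MAP_FAILED )
        return -1;

    pool->size = size;
    pool->used = 0;
    return 0;
}

static void *hcall_pool_alloc(struct hcall_pool *pool, size_t bytes)
{
    /* Keep each buffer cacheline aligned. */
    size_t aligned = (bytes + 63) & ~(size_t)63;
    void *p;

    if ( pool->used + aligned > pool->size )
        return NULL;

    p = pool->base + pool->used;
    pool->used += aligned;
    return p;
}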
On Mon, 5 Jul 2021 14:01:07 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> > The last one is always way faster, because apparently map/unmap is less costly with a stopped guest.
>
> That's suspicious.  If true, we've got some very wonky behaviour in the
> hypervisor...

At least the transfer rate in this last iteration is consistent.  Since the
only difference I can see is the fact that the domU is suspended, I suspect
the mapping.  I have not investigated where the time is spent; I should
probably do that one day to better understand this specific difference.

> > Right now the code may reach up to 15 Gbit/s.  The next step is to map the domU just once to reach wirespeed.
>
> We can in principle do that in 64-bit toolstacks, for HVM guests.  But not
> usefully until we've fixed the fact that Xen has no idea what the guest
> physmap is supposed to look like.

Why would Xen care?  My attempt last year with a new save/restore code did
just 'map' the memory on both sides.  The 'unmap' was done in exit().  With
this approach I got wirespeed in all iterations with a 10G link.

> At the moment, the current scheme is a little more resilient to bugs caused
> by the guest attempting to balloon during the live phase.

I did not specifically test how a domU behaves when it claims and releases
pages while being migrated.  I think this series would handle at least parts
of that: if a page appears or disappears, it will be recognized by
getpageframeinfo.  If a page disappears between getpageframeinfo and
MMAPBATCH, I expect an error.  This error is fatal right now; perhaps the
code could catch this and move on.  If a page disappears after MMAPBATCH, it
will be caught by later iterations.

> Another area to improve, which can be started now, is to avoid bounce
> buffering hypercall data.  Now that we have /dev/xen/hypercall which you can
> mmap() regular kernel pages from, what we want is a simple memory allocator
> which we can allocate permanent hypercall buffers from, rather than the
> internals of every xc_*() hypercall wrapper bouncing the data in
> (potentially) both directions.

That sounds like a good idea.  I am not sure how costly the current approach
is.

> Oh - so the speedup might not be from reduced data handling?

At least not on the systems I have now.  Perhaps I should test how the
numbers look with the NIC and the toolstack in node#0, and the domU in
node#1.

Olaf
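As an illustration of the "catch this and move on" idea, the mapping-error branch in write_batch() could defer the pfn instead of failing the whole stream.  This is only a fragment of that function, not part of the posted series, and the surrounding batch bookkeeping would have to skip the page as well.

            if ( errors[p] )
            {
                /* The pfn vanished (e.g. ballooned out) between the type
                 * query and MMAPBATCH: retry it in a later pass instead of
                 * treating the mapping failure as fatal. */
                set_bit(ctx->save.m->batch_pfns[i], ctx->save.deferred_pages);
                ++ctx->save.nr_deferred_pages;
                types[i] = XEN_DOMCTL_PFINFO_XTAB;
                --nr_pages;
            }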
On Mon, 5 Jul 2021 14:01:07 +0100, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> > Unfortunately, I'm not able to prove the reported gain with the systems I have today.
> > I'm waiting for different hardware to be prepared; right now I have only a pair of CoyotePass and WilsonCity.
> >
> > I'm sure there were NUMA effects involved.  Last year's libvirt was unable to properly pin vcpus.  If I pin all the involved memory to node#0 there is some jitter in the logged numbers, but no obvious improvement.  The first iteration is slightly faster, but that is it.
>
> Oh - so the speedup might not be from reduced data handling?
>
> Avoiding unnecessary data copies is clearly going to improve things, even if
> it isn't 25%.

For HVM the only notable improvement is the initial iteration.  On average,
with 4 migrations of a single domU from A to B and back from B to A, the
transfer rate goes up from ~490 MiB/s to ~677 MiB/s.

The initial transfer time for the 4194299 domU pages:

with plain staging:
36.800582009
32.145531727
31.827540709
33.009956041
34.951513466
33.416769973
32.128985762
33.201786076

with the series applied:
24.266428156
24.632898175
24.112660134
23.603475994
24.418323859
23.841875914
25.087779229
23.493812677

Migration of a PV domU is much faster, but the transfer rate for each
iteration varies, with or without the patches applied.

Olaf
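For reference, those averages follow directly from the listed initial-iteration timings; a quick back-of-the-envelope check:

    4194299 pages * 4 KiB                      ~= 16384 MiB
    16384 MiB / 33.4 s (mean, plain staging)   ~= 490 MiB/s
    16384 MiB / 24.2 s (mean, series applied)  ~= 677 MiB/s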
diff --git a/tools/libs/saverestore/common.h b/tools/libs/saverestore/common.h
index 252076cf51..968bb8af13 100644
--- a/tools/libs/saverestore/common.h
+++ b/tools/libs/saverestore/common.h
@@ -223,6 +223,15 @@ static inline int update_blob(struct xc_sr_blob *blob,
     return 0;
 }
 
+struct sr_save_arrays {
+    xen_pfn_t batch_pfns[MAX_BATCH_SIZE];
+};
+
+struct sr_restore_arrays {
+    /* handle_page_data */
+    xen_pfn_t pfns[MAX_BATCH_SIZE];
+};
+
 struct xc_sr_context
 {
     xc_interface *xch;
@@ -255,11 +264,11 @@ struct xc_sr_context
 
             struct precopy_stats stats;
 
-            xen_pfn_t *batch_pfns;
             unsigned int nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
+            struct sr_save_arrays *m;
         } save;
 
         struct /* Restore data. */
@@ -311,6 +320,7 @@ struct xc_sr_context
 
             /* Sender has invoked verify mode on the stream. */
             bool verify;
+            struct sr_restore_arrays *m;
         } restore;
     };
 
diff --git a/tools/libs/saverestore/restore.c b/tools/libs/saverestore/restore.c
index 799170c7a1..c203ce503d 100644
--- a/tools/libs/saverestore/restore.c
+++ b/tools/libs/saverestore/restore.c
@@ -315,7 +315,7 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
     unsigned int i, pages_of_data = 0;
     int rc = -1;
 
-    xen_pfn_t *pfns = NULL, pfn;
+    xen_pfn_t *pfns = ctx->restore.m->pfns, pfn;
     uint32_t *types = NULL, type;
 
     /*
@@ -363,9 +363,8 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
         goto err;
     }
 
-    pfns = malloc(pages->count * sizeof(*pfns));
     types = malloc(pages->count * sizeof(*types));
-    if ( !pfns || !types )
+    if ( !types )
     {
         ERROR("Unable to allocate enough memory for %u pfns",
               pages->count);
@@ -412,7 +411,6 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
                            &pages->pfn[pages->count]);
 
  err:
     free(types);
-    free(pfns);
 
     return rc;
 }
@@ -739,6 +737,13 @@ static int setup(struct xc_sr_context *ctx)
     }
     ctx->restore.allocated_rec_num = DEFAULT_BUF_RECORDS;
 
+    ctx->restore.m = malloc(sizeof(*ctx->restore.m));
+    if ( !ctx->restore.m ) {
+        ERROR("Unable to allocate memory for arrays");
+        rc = -1;
+        goto err;
+    }
+
  err:
     return rc;
 }
@@ -757,6 +762,7 @@ static void cleanup(struct xc_sr_context *ctx)
     xc_hypercall_buffer_free_pages(
         xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->restore.p2m_size)));
 
+    free(ctx->restore.m);
     free(ctx->restore.buffered_records);
     free(ctx->restore.populated_pfns);
 
diff --git a/tools/libs/saverestore/save.c b/tools/libs/saverestore/save.c
index f8fbe7a742..e29b6e1d66 100644
--- a/tools/libs/saverestore/save.c
+++ b/tools/libs/saverestore/save.c
@@ -77,7 +77,7 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 
 /*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
- * is constructed in ctx->save.batch_pfns.
+ * is constructed in ctx->save.m->batch_pfns.
  *
  * This function:
  * - gets the types for each pfn in the batch.
@@ -128,12 +128,12 @@ static int write_batch(struct xc_sr_context *ctx)
     for ( i = 0; i < nr_pfns; ++i )
     {
         types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
-                                                      ctx->save.batch_pfns[i]);
+                                                      ctx->save.m->batch_pfns[i]);
 
         /* Likely a ballooned page. */
         if ( mfns[i] == INVALID_MFN )
         {
-            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+            set_bit(ctx->save.m->batch_pfns[i], ctx->save.deferred_pages);
             ++ctx->save.nr_deferred_pages;
         }
     }
@@ -179,7 +179,7 @@ static int write_batch(struct xc_sr_context *ctx)
             if ( errors[p] )
             {
                 ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
-                      ctx->save.batch_pfns[i], mfns[p], errors[p]);
+                      ctx->save.m->batch_pfns[i], mfns[p], errors[p]);
                 goto err;
             }
 
@@ -193,7 +193,7 @@ static int write_batch(struct xc_sr_context *ctx)
         {
             if ( rc == -1 && errno == EAGAIN )
             {
-                set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
+                set_bit(ctx->save.m->batch_pfns[i], ctx->save.deferred_pages);
                 ++ctx->save.nr_deferred_pages;
                 types[i] = XEN_DOMCTL_PFINFO_XTAB;
                 --nr_pages;
@@ -224,7 +224,7 @@ static int write_batch(struct xc_sr_context *ctx)
     rec.length += nr_pages * PAGE_SIZE;
 
     for ( i = 0; i < nr_pfns; ++i )
-        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.batch_pfns[i];
+        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->save.m->batch_pfns[i];
 
     iov[0].iov_base = &rec.type;
     iov[0].iov_len = sizeof(rec.type);
@@ -296,9 +296,9 @@ static int flush_batch(struct xc_sr_context *ctx)
 
     if ( !rc )
     {
-        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.m->batch_pfns,
                                     MAX_BATCH_SIZE *
-                                    sizeof(*ctx->save.batch_pfns));
+                                    sizeof(*ctx->save.m->batch_pfns));
     }
 
     return rc;
@@ -315,7 +315,7 @@ static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
         rc = flush_batch(ctx);
 
     if ( rc == 0 )
-        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
+        ctx->save.m->batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 
     return rc;
 }
@@ -849,13 +849,12 @@ static int setup(struct xc_sr_context *ctx)
 
     dirty_bitmap = xc_hypercall_buffer_alloc_pages(
         xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
-    ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
-                                  sizeof(*ctx->save.batch_pfns));
     ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
+    ctx->save.m = malloc(sizeof(*ctx->save.m));
 
-    if ( !ctx->save.batch_pfns || !dirty_bitmap || !ctx->save.deferred_pages )
+    if ( !ctx->save.m || !dirty_bitmap || !ctx->save.deferred_pages )
    {
-        ERROR("Unable to allocate memory for dirty bitmaps, batch pfns and"
+        ERROR("Unable to allocate memory for dirty bitmaps and"
               " deferred pages");
         rc = -1;
         errno = ENOMEM;
@@ -884,7 +883,7 @@ static void cleanup(struct xc_sr_context *ctx)
     xc_hypercall_buffer_free_pages(xch, dirty_bitmap,
                                    NRPAGES(bitmap_size(ctx->save.p2m_size)));
     free(ctx->save.deferred_pages);
-    free(ctx->save.batch_pfns);
+    free(ctx->save.m);
 }
 
 /*
The hotpath 'send_dirty_pages' is supposed to do just one thing: sending.
The other end 'handle_page_data' is supposed to do just receiving.

But instead both do other costly work like memory allocations and data moving.
Do the allocations once; the array sizes are a compile-time constant.
Avoid unneeded copying of data by receiving data directly into mapped guest memory.

This patch is just preparation; subsequent changes will populate the arrays.

Once all changes are applied, migration of a busy HVM domU changes as follows:

Without this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_testing):
2020-10-29 10:23:10.711+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 55.324905335 sec, 203 MiB/sec: Internal error
2020-10-29 10:23:35.115+0000: xc: show_transfer_rate: 16829632 bytes + 2097552 pages in 24.401179720 sec, 335 MiB/sec: Internal error
2020-10-29 10:23:59.436+0000: xc: show_transfer_rate: 16829032 bytes + 2097478 pages in 24.319025928 sec, 336 MiB/sec: Internal error
2020-10-29 10:24:23.844+0000: xc: show_transfer_rate: 16829024 bytes + 2097477 pages in 24.406992500 sec, 335 MiB/sec: Internal error
2020-10-29 10:24:48.292+0000: xc: show_transfer_rate: 16828912 bytes + 2097463 pages in 24.446489027 sec, 335 MiB/sec: Internal error
2020-10-29 10:25:01.816+0000: xc: show_transfer_rate: 16836080 bytes + 2098356 pages in 13.447091818 sec, 609 MiB/sec: Internal error

With this series, from sr650 to sr950 (xen-4.15.20201027T173911.16a20963b3 xen_unstable):
2020-10-28 21:26:05.074+0000: xc: show_transfer_rate: 23663128 bytes + 2879563 pages in 52.564054368 sec, 213 MiB/sec: Internal error
2020-10-28 21:26:23.527+0000: xc: show_transfer_rate: 16830040 bytes + 2097603 pages in 18.450592015 sec, 444 MiB/sec: Internal error
2020-10-28 21:26:41.926+0000: xc: show_transfer_rate: 16830944 bytes + 2097717 pages in 18.397862306 sec, 445 MiB/sec: Internal error
2020-10-28 21:27:00.339+0000: xc: show_transfer_rate: 16829176 bytes + 2097498 pages in 18.411973339 sec, 445 MiB/sec: Internal error
2020-10-28 21:27:18.643+0000: xc: show_transfer_rate: 16828592 bytes + 2097425 pages in 18.303326695 sec, 447 MiB/sec: Internal error
2020-10-28 21:27:26.289+0000: xc: show_transfer_rate: 16835952 bytes + 2098342 pages in 7.579846749 sec, 1081 MiB/sec: Internal error

Note: the performance improvement depends on the network cards in use,
the wirespeed and the host:
- No improvement is expected with a 1G link.
- Improvement can be seen as shown above on a 10G link.
- Just a slight improvement can be seen on a 100G link.

This change also populates sr_save_arrays with "batch_pfns", and
sr_restore_arrays with "pfns", to make sure malloc is always called with a
non-zero size value.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

v02:
- rename xc_sr_save_arrays to sr_save_arrays
- rename xc_sr_restore_arrays to sr_restore_arrays
- merge handling of "batch_pfns" and "pfns" to make sure malloc is called
  with a non-zero size value (jgross)
---
 tools/libs/saverestore/common.h  | 12 +++++++++++-
 tools/libs/saverestore/restore.c | 14 ++++++++++----
 tools/libs/saverestore/save.c    | 27 +++++++++++++--------------
 3 files changed, 34 insertions(+), 19 deletions(-)