From patchwork Mon Mar 27 09:06:26 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Joshua Otto <jtotto@uwaterloo.ca>
X-Patchwork-Id: 9645739
From: Joshua Otto <jtotto@uwaterloo.ca>
To: xen-devel@lists.xenproject.org
Cc: wei.liu2@citrix.com, andrew.cooper3@citrix.com, ian.jackson@eu.citrix.com,
 czylin@uwaterloo.ca, Joshua Otto <jtotto@uwaterloo.ca>, imhy.yang@gmail.com,
 hjarmstr@uwaterloo.ca
Date: Mon, 27 Mar 2017 05:06:26 -0400
Message-Id: <1490605592-12189-15-git-send-email-jtotto@uwaterloo.ca>
In-Reply-To: <1490605592-12189-1-git-send-email-jtotto@uwaterloo.ca>
References: <1490605592-12189-1-git-send-email-jtotto@uwaterloo.ca>
X-Mailer: git-send-email 2.7.4
Subject: [Xen-devel] [PATCH RFC 14/20] libxc/migration: implement the sender side of postcopy live migration

Add a new 'postcopy' phase to the live migration algorithm, during which
unmigrated domain memory is paged over the network on demand _after_ the
guest has been resumed at the destination.

To do so:
- Add a new precopy policy option, XGS_POLICY_POSTCOPY, that policies can use
  to request a transition to the postcopy live migration phase rather than a
  stop-and-copy of the remaining dirty pages.
- Add support to xc_domain_save() for this policy option by breaking out of
  the precopy loop early, transmitting the final set of dirty pfns and all
  remaining domain state (including higher-layer state) except memory, and
  entering a postcopy loop during which the remaining page data is pushed in
  the background.  Remote requests for specific pages in response to faults
  in the domain are serviced with priority in this loop.

The new save callbacks required for this migration phase are stubbed in libxl
for now, to be replaced in a subsequent patch that adds libxl support for this
phase.  Support for this phase on the migration receiver side follows
immediately in the next patch.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h     |  82 +++++---
 tools/libxc/xc_sr_common.h         |   5 +-
 tools/libxc/xc_sr_save.c           | 421 ++++++++++++++++++++++++++++++++++---
 tools/libxc/xc_sr_save_x86_hvm.c   |  13 ++
 tools/libxc/xg_save_restore.h      |  16 +-
 tools/libxl/libxl_dom_save.c       |  11 +-
 tools/libxl/libxl_save_msgs_gen.pl |   6 +-
 7 files changed, 487 insertions(+), 67 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h index 30ffb6f..16441c9 100644 --- a/tools/libxc/include/xenguest.h +++ b/tools/libxc/include/xenguest.h @@ -63,41 +63,57 @@ struct save_callbacks { #define XGS_POLICY_CONTINUE_PRECOPY 0 /* Remain in the precopy phase. */ #define XGS_POLICY_STOP_AND_COPY 1 /* Immediately suspend and transmit the * remaining dirty pages.
*/ +#define XGS_POLICY_POSTCOPY 2 /* Suspend the guest and transition into + * the postcopy phase of the migration. */ int (*precopy_policy)(struct precopy_stats stats, void *data); - /* Called after the guest's dirty pages have been - * copied into an output buffer. - * Callback function resumes the guest & the device model, - * returns to xc_domain_save. - * xc_domain_save then flushes the output buffer, while the - * guest continues to run. - */ - int (*aftercopy)(void* data); - - /* Called after the memory checkpoint has been flushed - * out into the network. Typical actions performed in this - * callback include: - * (a) send the saved device model state (for HVM guests), - * (b) wait for checkpoint ack - * (c) release the network output buffer pertaining to the acked checkpoint. - * (c) sleep for the checkpoint interval. - * - * returns: - * 0: terminate checkpointing gracefully - * 1: take another checkpoint */ - int (*checkpoint)(void* data); - - /* - * Called after the checkpoint callback. - * - * returns: - * 0: terminate checkpointing gracefully - * 1: take another checkpoint - */ - int (*wait_checkpoint)(void* data); - - /* Enable qemu-dm logging dirty pages to xen */ - int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */ + /* Checkpointing and postcopy live migration are mutually exclusive. */ + union { + struct { + /* Called during a live migration's transition to the postcopy phase + * to yield control of the stream back to a higher layer so it can + * transmit records needed for resumption of the guest at the + * destination (e.g. device model state, xenstore context) */ + int (*postcopy_transition)(void *data); + }; + + struct { + /* Called after the guest's dirty pages have been + * copied into an output buffer. + * Callback function resumes the guest & the device model, + * returns to xc_domain_save. + * xc_domain_save then flushes the output buffer, while the + * guest continues to run. + */ + int (*aftercopy)(void* data); + + /* Called after the memory checkpoint has been flushed + * out into the network. Typical actions performed in this + * callback include: + * (a) send the saved device model state (for HVM guests), + * (b) wait for checkpoint ack + * (c) release the network output buffer pertaining to the acked + * checkpoint. + * (c) sleep for the checkpoint interval. + * + * returns: + * 0: terminate checkpointing gracefully + * 1: take another checkpoint */ + int (*checkpoint)(void* data); + + /* + * Called after the checkpoint callback. + * + * returns: + * 0: terminate checkpointing gracefully + * 1: take another checkpoint + */ + int (*wait_checkpoint)(void* data); + + /* Enable qemu-dm logging dirty pages to xen */ + int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */ + }; + }; /* to be provided as the last argument to each callback function */ void* data; diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h index b52355d..0043791 100644 --- a/tools/libxc/xc_sr_common.h +++ b/tools/libxc/xc_sr_common.h @@ -204,13 +204,16 @@ struct xc_sr_context int policy_decision; enum { - XC_SR_SAVE_BATCH_PRECOPY_PAGE + XC_SR_SAVE_BATCH_PRECOPY_PAGE, + XC_SR_SAVE_BATCH_POSTCOPY_PFN, + XC_SR_SAVE_BATCH_POSTCOPY_PAGE } batch_type; xen_pfn_t *batch_pfns; unsigned nr_batch_pfns; unsigned long *deferred_pages; unsigned long nr_deferred_pages; xc_hypercall_buffer_t dirty_bitmap_hbuf; + unsigned long nr_final_dirty_pages; } save; struct /* Restore data. 
*/ diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c index 6acc8d3..51d7016 100644 --- a/tools/libxc/xc_sr_save.c +++ b/tools/libxc/xc_sr_save.c @@ -3,21 +3,28 @@ #include "xc_sr_common.h" -#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE +#define MAX_BATCH_SIZE \ + max(max(MAX_PRECOPY_BATCH_SIZE, MAX_PFN_BATCH_SIZE), MAX_POSTCOPY_BATCH_SIZE) static const unsigned batch_sizes[] = { - [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = MAX_PRECOPY_BATCH_SIZE + [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = MAX_PRECOPY_BATCH_SIZE, + [XC_SR_SAVE_BATCH_POSTCOPY_PFN] = MAX_PFN_BATCH_SIZE, + [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = MAX_POSTCOPY_BATCH_SIZE }; static const bool batch_includes_contents[] = { - [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true + [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true, + [XC_SR_SAVE_BATCH_POSTCOPY_PFN] = false, + [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = true }; static const uint32_t batch_rec_types[] = { - [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = REC_TYPE_PAGE_DATA + [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = REC_TYPE_PAGE_DATA, + [XC_SR_SAVE_BATCH_POSTCOPY_PFN] = REC_TYPE_POSTCOPY_PFNS, + [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = REC_TYPE_POSTCOPY_PAGE_DATA }; /* @@ -76,6 +83,9 @@ static int write_headers(struct xc_sr_context *ctx, uint16_t guest_type) WRITE_TRIVIAL_RECORD_FN(end, REC_TYPE_END); WRITE_TRIVIAL_RECORD_FN(checkpoint, REC_TYPE_CHECKPOINT); +WRITE_TRIVIAL_RECORD_FN(postcopy_begin, REC_TYPE_POSTCOPY_BEGIN); +WRITE_TRIVIAL_RECORD_FN(postcopy_pfns_begin, REC_TYPE_POSTCOPY_PFNS_BEGIN); +WRITE_TRIVIAL_RECORD_FN(postcopy_transition, REC_TYPE_POSTCOPY_TRANSITION); /* * This function: @@ -394,6 +404,108 @@ static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn) } /* + * This function: + * - flushes the current batch of postcopy pfns into the migration stream + * - clears the dirty bits of all pfns with no migrateable backing data + * - counts the number of pfns that _do_ have migrateable backing data, adding + * it to nr_final_dirty_pfns + */ +static int flush_postcopy_pfns_batch(struct xc_sr_context *ctx) +{ + int rc = 0; + xen_pfn_t *pfns = ctx->save.batch_pfns, *mfns = NULL, *types = NULL; + unsigned i, nr_pfns = ctx->save.nr_batch_pfns; + + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, + &ctx->save.dirty_bitmap_hbuf); + + assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PFN); + + if ( batch_empty(ctx) ) + return rc; + + rc = get_batch_info(ctx, &mfns, &types); + if ( rc ) + return rc; + + /* Consider any pages not backed by a physical page of data to have been + * 'cleaned' at this point - there's no sense wasting room in a subsequent + * postcopy batch to duplicate the type information. 
*/ + for ( i = 0; i < nr_pfns; ++i ) + { + switch ( types[i] ) + { + case XEN_DOMCTL_PFINFO_BROKEN: + case XEN_DOMCTL_PFINFO_XALLOC: + case XEN_DOMCTL_PFINFO_XTAB: + clear_bit(pfns[i], dirty_bitmap); + continue; + } + + ++ctx->save.nr_final_dirty_pages; + } + + rc = write_batch(ctx, mfns, types); + free(mfns); + free(types); + + if ( !rc ) + { + VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns, + MAX_BATCH_SIZE * + sizeof(*ctx->save.batch_pfns)); + } + + return rc; +} + +/* + * This function: + * - writes a POSTCOPY_PFNS_BEGIN record into the stream + * - writes 0 or more POSTCOPY_PFNS records specifying the subset of domain + * memory that must be migrated during the upcoming postcopy phase of the + * migration + * - counts the number of pfns in this subset, storing it in + * nr_final_dirty_pages + */ +static int send_postcopy_pfns(struct xc_sr_context *ctx) +{ + xen_pfn_t p; + int rc; + + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, + &ctx->save.dirty_bitmap_hbuf); + + /* The true nr_final_dirty_pages is iteratively computed by + * flush_postcopy_pfns_batch(), which counts only pages actually backed by + * data we need to migrate. */ + ctx->save.nr_final_dirty_pages = 0; + + rc = write_postcopy_pfns_begin_record(ctx); + if ( rc ) + return rc; + + assert(batch_empty(ctx)); + ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PFN; + for ( p = 0; p < ctx->save.p2m_size; ++p ) + { + if ( !test_bit(p, dirty_bitmap) ) + continue; + + if ( batch_full(ctx) ) + { + rc = flush_postcopy_pfns_batch(ctx); + if ( rc ) + return rc; + } + + add_to_batch(ctx, p); + } + + return flush_postcopy_pfns_batch(ctx); +} + +/* * Pause/suspend the domain, and refresh ctx->dominfo if required. */ static int suspend_domain(struct xc_sr_context *ctx) @@ -731,15 +843,12 @@ static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx) } /* - * Suspend the domain and send dirty memory. - * This is the last iteration of the live migration and the - * heart of the checkpointed stream. + * Suspend the domain and determine the final set of dirty pages. */ -static int suspend_and_send_dirty(struct xc_sr_context *ctx) +static int suspend_and_check_dirty(struct xc_sr_context *ctx) { xc_interface *xch = ctx->xch; xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size }; - char *progress_str = NULL; int rc; DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, &ctx->save.dirty_bitmap_hbuf); @@ -759,16 +868,6 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx) goto out; } - if ( ctx->save.live ) - { - rc = update_progress_string(ctx, &progress_str, - ctx->save.stats.iteration); - if ( rc ) - goto out; - } - else - xc_set_progress_prefix(xch, "Checkpointed save"); - bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size); if ( !ctx->save.live && ctx->save.checkpointed == XC_MIG_STREAM_COLO ) @@ -781,20 +880,36 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx) } } - rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages, - /* precopy */ false); - if ( rc ) - goto out; + if ( !ctx->save.live || ctx->save.policy_decision != XGS_POLICY_POSTCOPY ) + { + /* If we aren't transitioning to a postcopy live migration, then rather + * than explicitly counting the number of final dirty pages, simply + * (somewhat crudely) estimate it as this sum to save time. If we _are_ + * about to begin postcopy then we don't bother, since our count must in + * that case be exact and we'll work it out later on. 
*/ + ctx->save.nr_final_dirty_pages = + stats.dirty_count + ctx->save.nr_deferred_pages; + } bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size); ctx->save.nr_deferred_pages = 0; out: - xc_set_progress_prefix(xch, NULL); - free(progress_str); return rc; } +static int suspend_and_send_dirty(struct xc_sr_context *ctx) +{ + int rc; + + rc = suspend_and_check_dirty(ctx); + if ( rc ) + return rc; + + return send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages, + /* precopy */ false); +} + static int verify_frames(struct xc_sr_context *ctx) { xc_interface *xch = ctx->xch; @@ -835,11 +950,13 @@ static int verify_frames(struct xc_sr_context *ctx) } /* - * Send all domain memory. This is the heart of the live migration loop. + * Send all domain memory, modulo postcopy pages. This is the heart of the live + * migration loop. */ static int send_domain_memory_live(struct xc_sr_context *ctx) { int rc; + xc_interface *xch = ctx->xch; rc = enable_logdirty(ctx); if ( rc ) @@ -849,10 +966,20 @@ static int send_domain_memory_live(struct xc_sr_context *ctx) if ( rc ) goto out; - rc = suspend_and_send_dirty(ctx); + rc = suspend_and_check_dirty(ctx); if ( rc ) goto out; + if ( ctx->save.policy_decision == XGS_POLICY_STOP_AND_COPY ) + { + xc_set_progress_prefix(xch, "Final precopy iteration"); + rc = send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages, + /* precopy */ false); + xc_set_progress_prefix(xch, NULL); + if ( rc ) + goto out; + } + if ( ctx->save.debug && ctx->save.checkpointed != XC_MIG_STREAM_NONE ) { rc = verify_frames(ctx); @@ -864,12 +991,209 @@ static int send_domain_memory_live(struct xc_sr_context *ctx) return rc; } +static int handle_postcopy_faults(struct xc_sr_context *ctx, + struct xc_sr_record *rec, + /* OUT */ unsigned long *nr_new_fault_pfns, + /* OUT */ xen_pfn_t *last_fault_pfn) +{ + int rc; + unsigned i; + xc_interface *xch = ctx->xch; + struct xc_sr_rec_pages_header *fault_pages = rec->data; + + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, + &ctx->save.dirty_bitmap_hbuf); + + assert(nr_new_fault_pfns); + *nr_new_fault_pfns = 0; + + rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_FAULT); + if ( rc ) + return rc; + + DBGPRINTF("Handling a batch of %"PRIu32" faults!", fault_pages->count); + + assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PAGE); + for ( i = 0; i < fault_pages->count; ++i ) + { + if ( test_and_clear_bit(fault_pages->pfn[i], dirty_bitmap) ) + { + if ( batch_full(ctx) ) + { + rc = flush_batch(ctx); + if ( rc ) + return rc; + } + + add_to_batch(ctx, fault_pages->pfn[i]); + ++(*nr_new_fault_pfns); + } + } + + /* _Don't_ flush yet - fill out the rest of the batch. */ + + assert(fault_pages->count); + *last_fault_pfn = fault_pages->pfn[fault_pages->count - 1]; + return 0; +} + +/* + * Now that the guest has resumed at the destination, send all of the remaining + * dirty pages. Periodically check for pages needed by the destination to make + * progress. 
+ */ +static int postcopy_domain_memory(struct xc_sr_context *ctx) +{ + int rc; + xc_interface *xch = ctx->xch; + int recv_fd = ctx->save.recv_fd; + int old_flags; + struct xc_sr_read_record_context rrctx; + struct xc_sr_record rec = { 0, 0, NULL }; + unsigned long nr_new_fault_pfns; + unsigned long pages_remaining = ctx->save.nr_final_dirty_pages; + xen_pfn_t last_fault_pfn, p; + bool received_postcopy_complete = false; + + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, + &ctx->save.dirty_bitmap_hbuf); + + read_record_init(&rrctx, ctx); + + /* First, configure the receive stream as non-blocking so we can + * periodically poll it for fault requests. */ + old_flags = fcntl(recv_fd, F_GETFL); + if ( old_flags == -1 ) + { + rc = old_flags; + goto err; + } + + assert(!(old_flags & O_NONBLOCK)); + + rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK); + if ( rc == -1 ) + { + goto err; + } + + xc_set_progress_prefix(xch, "Postcopy phase"); + + assert(batch_empty(ctx)); + ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PAGE; + + p = 0; + while ( pages_remaining ) + { + /* Between (small) batches, poll the receive stream for new + * POSTCOPY_FAULT messages. */ + for ( ; ; ) + { + rc = try_read_record(&rrctx, recv_fd, &rec); + if ( rc ) + { + if ( (errno == EAGAIN) || (errno == EWOULDBLOCK) ) + { + break; + } + + goto err; + } + else + { + /* Tear down and re-initialize the read record context for the + * next request record. */ + read_record_destroy(&rrctx); + read_record_init(&rrctx, ctx); + + if ( rec.type == REC_TYPE_POSTCOPY_COMPLETE ) + { + /* The restore side may ultimately not need all of the pages + * we think it does - for example, the guest may release + * some outstanding pages. If this occurs, we'll receive + * this record before we'd otherwise expect to. */ + received_postcopy_complete = true; + goto done; + } + + rc = handle_postcopy_faults(ctx, &rec, &nr_new_fault_pfns, + &last_fault_pfn); + if ( rc ) + goto err; + + free(rec.data); + rec.data = NULL; + + assert(pages_remaining >= nr_new_fault_pfns); + pages_remaining -= nr_new_fault_pfns; + + /* To take advantage of any locality present in the postcopy + * faults, continue the background copy process from the newest + * page in the fault batch. */ + p = (last_fault_pfn + 1) % ctx->save.p2m_size; + } + } + + /* Now that we've serviced all of the POSTCOPY_FAULT requests we know + * about for now, fill out the current batch with background pages. */ + for ( ; + pages_remaining && !batch_full(ctx); + p = (p + 1) % ctx->save.p2m_size ) + { + if ( test_and_clear_bit(p, dirty_bitmap) ) + { + add_to_batch(ctx, p); + --pages_remaining; + } + } + + rc = flush_batch(ctx); + if ( rc ) + goto err; + + xc_report_progress_step( + xch, ctx->save.nr_final_dirty_pages - pages_remaining, + ctx->save.nr_final_dirty_pages); + } + + done: + /* Revert the receive stream to the (blocking) state we found it in. */ + rc = fcntl(recv_fd, F_SETFL, old_flags); + if ( rc == -1 ) + goto err; + + if ( !received_postcopy_complete ) + { + /* Flush any outstanding POSTCOPY_FAULT requests from the migration + * stream by reading until a POSTCOPY_COMPLETE is received. */ + do + { + rc = read_record(ctx, recv_fd, &rec); + if ( rc ) + goto err; + } while ( rec.type != REC_TYPE_POSTCOPY_COMPLETE ); + } + + err: + xc_set_progress_prefix(xch, NULL); + free(rec.data); + read_record_destroy(&rrctx); + return rc; +} + /* * Checkpointed save. 
*/ static int send_domain_memory_checkpointed(struct xc_sr_context *ctx) { - return suspend_and_send_dirty(ctx); + int rc; + xc_interface *xch = ctx->xch; + + xc_set_progress_prefix(xch, "Checkpointed save"); + rc = suspend_and_send_dirty(ctx); + xc_set_progress_prefix(xch, NULL); + + return rc; } /* @@ -998,11 +1322,50 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type) goto err; } + /* End-of-checkpoint records are handled differently in the case of + * postcopy migration, so we need to alert the destination before + * sending them. */ + if ( ctx->save.live && + ctx->save.policy_decision == XGS_POLICY_POSTCOPY ) + { + rc = write_postcopy_begin_record(ctx); + if ( rc ) + goto err; + } + rc = ctx->save.ops.end_of_checkpoint(ctx); if ( rc ) goto err; - if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE ) + if ( ctx->save.live && + ctx->save.policy_decision == XGS_POLICY_POSTCOPY ) + { + xc_report_progress_single(xch, "Beginning postcopy transition"); + + rc = send_postcopy_pfns(ctx); + if ( rc ) + goto err; + + rc = write_postcopy_transition_record(ctx); + if ( rc ) + goto err; + + /* Yield control to libxl to finish the transition. Note that this + * callback returns _non-zero_ upon success. */ + rc = ctx->save.callbacks->postcopy_transition( + ctx->save.callbacks->data); + if ( !rc ) + { + rc = -1; + goto err; + } + + /* When libxl is done, we can begin the postcopy loop. */ + rc = postcopy_domain_memory(ctx); + if ( rc ) + goto err; + } + else if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE ) { /* * We have now completed the initial live portion of the checkpoint diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c index ea4b780..13df25b 100644 --- a/tools/libxc/xc_sr_save_x86_hvm.c +++ b/tools/libxc/xc_sr_save_x86_hvm.c @@ -92,6 +92,9 @@ static int write_hvm_params(struct xc_sr_context *ctx) unsigned int i; int rc; + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, + &ctx->save.dirty_bitmap_hbuf); + for ( i = 0; i < ARRAY_SIZE(params); i++ ) { uint32_t index = params[i]; @@ -106,6 +109,16 @@ static int write_hvm_params(struct xc_sr_context *ctx) if ( value != 0 ) { + if ( ctx->save.live && + ctx->save.policy_decision == XGS_POLICY_POSTCOPY && + ( index == HVM_PARAM_CONSOLE_PFN || + index == HVM_PARAM_STORE_PFN || + index == HVM_PARAM_IOREQ_PFN || + index == HVM_PARAM_BUFIOREQ_PFN || + index == HVM_PARAM_PAGING_RING_PFN ) && + test_and_clear_bit(value, dirty_bitmap) ) + --ctx->save.nr_final_dirty_pages; + entries[hdr.count].index = index; entries[hdr.count].value = value; hdr.count++; diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h index 40debf6..9f5b223 100644 --- a/tools/libxc/xg_save_restore.h +++ b/tools/libxc/xg_save_restore.h @@ -24,7 +24,21 @@ ** We process save/restore/migrate in batches of pages; the below ** determines how many pages we (at maximum) deal with in each batch. */ -#define MAX_PRECOPY_BATCH_SIZE 1024 /* up to 1024 pages (4MB) at a time */ +#define MAX_PRECOPY_BATCH_SIZE ((size_t)1024U) /* up to 1024 pages (4MB) */ + +/* +** We process the migration postcopy transition in batches of pfns to ensure +** that we stay within the record size bound. Because these records contain +** only pfns (and _not_ their contents), we can accomodate many more of them +** in a batch. +*/ +#define MAX_PFN_BATCH_SIZE ((4U << 20) / sizeof(uint64_t)) /* up to 512k pfns */ + +/* +** The postcopy background copy uses a smaller batch size to ensure it can +** quickly respond to remote faults. 
+*/ +#define MAX_POSTCOPY_BATCH_SIZE ((size_t)64U) /* When pinning page tables at the end of restore, we also use batching. */ #define MAX_PIN_BATCH 1024 diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c index 10d5012..4ef9ca5 100644 --- a/tools/libxl/libxl_dom_save.c +++ b/tools/libxl/libxl_dom_save.c @@ -349,6 +349,12 @@ static int libxl__save_live_migration_simple_precopy_policy( return XGS_POLICY_CONTINUE_PRECOPY; } +static void libxl__save_live_migration_postcopy_transition_callback(void *user) +{ + /* XXX we're not yet ready to deal with this */ + assert(0); +} + /*----- main code for saving, in order of execution -----*/ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss) @@ -419,8 +425,11 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss) dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS; } - if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE) + if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE) { callbacks->suspend = libxl__domain_suspend_callback; + callbacks->postcopy_transition = + libxl__save_live_migration_postcopy_transition_callback; + } callbacks->precopy_policy = libxl__save_live_migration_simple_precopy_policy; callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty; diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl index 50c97b4..5647b97 100755 --- a/tools/libxl/libxl_save_msgs_gen.pl +++ b/tools/libxl/libxl_save_msgs_gen.pl @@ -33,7 +33,8 @@ our @msgs = ( 'xen_pfn_t', 'console_gfn'] ], [ 9, 'srW', "complete", [qw(int retval int errnoval)] ], - [ 10, 'scxW', "precopy_policy", ['struct precopy_stats', 'stats'] ] + [ 10, 'scxW', "precopy_policy", ['struct precopy_stats', 'stats'] ], + [ 11, 'scxA', "postcopy_transition", [] ] ); #---------------------------------------- @@ -225,6 +226,7 @@ foreach my $sr (qw(save restore)) { f_decl("${setcallbacks}_${sr}", 'helper', 'void', "(struct ${sr}_callbacks *cbs, unsigned cbflags)"); + f_more("${setcallbacks}_${sr}", " memset(cbs, 0, sizeof(*cbs));\n"); f_more("${receiveds}_${sr}", <(" if ($c_cb) cbflags |= $c_v;\n", $enumcallbacks); - $f_more_sr->(" $c_cb = (cbflags & $c_v) ? ${encode}_${name} : 0;\n", + $f_more_sr->(" if (cbflags & $c_v) $c_cb = ${encode}_${name};\n", $setcallbacks); } $f_more_sr->(" return 1;\n }\n\n");
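
To illustrate how the new XGS_POLICY_POSTCOPY option is meant to be consumed:
a caller supplies a precopy_policy callback (see the xenguest.h hunk at the
top of this patch), and returning XGS_POLICY_POSTCOPY from it asks
xc_domain_save() to suspend the guest and take the postcopy path added above
rather than performing a final stop-and-copy.  The fragment below is a minimal
sketch of such a policy, not code from this patch; it assumes that struct
precopy_stats (introduced earlier in this series) exposes an 'iteration'
counter and a 'dirty_count' estimate with -1 meaning unknown, so check the
real definition before reusing it.

/* Minimal sketch of a policy that falls back to postcopy - NOT part of this
 * patch.  The XGS_POLICY_* values and the callback signature come from
 * xenguest.h; the precopy_stats field names and the tuning knobs below are
 * assumptions made for illustration only. */
#include <xenctrl.h>
#include <xenguest.h>

#define SKETCH_MAX_ITERATIONS  5        /* hypothetical tuning knobs */
#define SKETCH_DIRTY_THRESHOLD 50       /* pages */

static int sketch_precopy_policy(struct precopy_stats stats, void *user)
{
    (void)user;

    /* Few enough dirty pages left: a short stop-and-copy is cheaper than
     * paying postcopy fault latency. */
    if ( stats.dirty_count >= 0 &&
         stats.dirty_count <= SKETCH_DIRTY_THRESHOLD )
        return XGS_POLICY_STOP_AND_COPY;

    /* Still making progress through the dirty log: keep iterating. */
    if ( stats.iteration < SKETCH_MAX_ITERATIONS )
        return XGS_POLICY_CONTINUE_PRECOPY;

    /* Not converging: resume the guest at the destination and service the
     * remaining pages on demand during the postcopy phase instead. */
    return XGS_POLICY_POSTCOPY;
}

A caller would install such a policy through save_callbacks.precopy_policy,
together with the new postcopy_transition callback from the union above,
before invoking xc_domain_save().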
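
Separately, the core of postcopy_domain_memory() above is the pattern of
switching the receiver's back-channel (ctx->save.recv_fd) to non-blocking so
that POSTCOPY_FAULT requests can be polled between small background batches
without ever stalling the push.  The snippet below isolates just that POSIX
pattern as a simplified, self-contained illustration; the real code reads
whole migration records via try_read_record() rather than raw bytes, and the
function name here is invented for the example.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustration only (not from the patch): read whatever request bytes are
 * currently queued on 'fd' without blocking, then return so the caller can
 * keep pushing background pages.  Returns bytes read, or -1 on error. */
static ssize_t poll_requests_nonblocking(int fd, void *buf, size_t len)
{
    int old_flags = fcntl(fd, F_GETFL);
    ssize_t total = 0, rd;

    if ( old_flags == -1 ||
         fcntl(fd, F_SETFL, old_flags | O_NONBLOCK) == -1 )
        return -1;

    while ( (size_t)total < len )
    {
        rd = read(fd, (char *)buf + total, len - total);
        if ( rd > 0 )
        {
            total += rd;
            continue;
        }

        if ( rd < 0 && (errno == EAGAIN || errno == EWOULDBLOCK) )
            break;              /* nothing more queued right now */

        total = -1;             /* EOF or a real error */
        break;
    }

    /* Restore the original (blocking) flags, as the patch does once the
     * postcopy phase completes. */
    if ( fcntl(fd, F_SETFL, old_flags) == -1 )
        return -1;

    return total;
}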