
[RFC,08/20] libxl/migration: add precopy tuning parameters

Message ID 1490605592-12189-9-git-send-email-jtotto@uwaterloo.ca (mailing list archive)
State New, archived

Commit Message

Joshua Otto March 27, 2017, 9:06 a.m. UTC
In the context of the live migration algorithm, the precopy iteration
count refers to the number of page-copying iterations performed prior to
the suspension of the guest and transmission of the final set of dirty
pages.  Similarly, the precopy dirty threshold refers to the dirty page
count below which we judge it more profitable to proceed to
stop-and-copy rather than continue with the precopy.  These would be
helpful tuning parameters to work with when migrating particularly busy
guests, as they enable an administrator to reap the available benefits
of the precopy algorithm (the transmission of guest pages _not_ in the
writable working set can be completed without guest downtime) while
reducing the total amount of time required for the migration (as
iterations of the precopy loop that will certainly be redundant can be
skipped in favour of an earlier suspension).
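
Concretely, the policy implemented by this patch (see the diff below)
reduces to the following check at the end of each copy round, with both
limits now supplied by the caller rather than hard-coded (names
abbreviated from the actual code):

    if (stats.dirty_count >= 0 &&
        stats.dirty_count <= precopy_dirty_threshold)
        return XGS_POLICY_STOP_AND_COPY;    /* few enough dirty pages left */

    if (stats.iteration >= precopy_iterations)
        return XGS_POLICY_STOP_AND_COPY;    /* iteration budget exhausted */

    return XGS_POLICY_CONTINUE_PRECOPY;     /* keep copying */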

To expose these tuning parameters to users:
- introduce a new libxl API function, libxl_domain_live_migrate(),
  taking the same parameters as libxl_domain_suspend() _and_
  precopy_iterations and precopy_dirty_threshold parameters, and
  consider these parameters in the precopy policy

  (though a pair of new parameters on their own might not warrant an
  entirely new API function, it is added in anticipation of a number of
  additional migration-only parameters that would be cumbersome on the
  whole to tack on to the existing suspend API)

- switch xl migrate to the new libxl_domain_live_migrate() and add new
  --precopy-iterations and --precopy-threshold parameters to pass
  through
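
For illustration, a client of the new API could drive a migration
roughly as follows (a sketch only: it assumes an initialised libxl_ctx
and an already-connected migration fd, omits error handling, and the
particular values are arbitrary):

    /* At most 3 precopy rounds; suspend early once fewer than 200 pages
     * remain dirty at the end of a round. */
    int rc = libxl_domain_live_migrate(ctx, domid, send_fd,
                                       0,   /* flags: LIBXL_SUSPEND_LIVE is implied */
                                       3,   /* precopy_iterations */
                                       200, /* precopy_dirty_threshold */
                                       NULL /* ao_how: complete synchronously */);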

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h          | 10 ++++++++++
 tools/libxl/libxl_dom_save.c | 20 +++++++++++---------
 tools/libxl/libxl_domain.c   | 27 +++++++++++++++++++++++++--
 tools/libxl/libxl_internal.h |  2 ++
 tools/xl/xl_cmdtable.c       | 22 +++++++++++++---------
 tools/xl/xl_migrate.c        | 31 +++++++++++++++++++++++++++----
 6 files changed, 88 insertions(+), 24 deletions(-)

Comments

Andrew Cooper March 29, 2017, 9:08 p.m. UTC | #1
On 27/03/17 10:06, Joshua Otto wrote:
> In the context of the live migration algorithm, the precopy iteration
> count refers to the number of page-copying iterations performed prior to
> the suspension of the guest and transmission of the final set of dirty
> pages.  Similarly, the precopy dirty threshold refers to the dirty page
> count below which we judge it more profitable to proceed to
> stop-and-copy rather than continue with the precopy.  These would be
> helpful tuning parameters to work with when migrating particularly busy
> guests, as they enable an administrator to reap the available benefits
> of the precopy algorithm (the transmission of guest pages _not_ in the
> writable working set can be completed without guest downtime) while
> reducing the total amount of time required for the migration (as
> iterations of the precopy loop that will certainly be redundant can be
> skipped in favour of an earlier suspension).
>
> To expose these tuning parameters to users:
> - introduce a new libxl API function, libxl_domain_live_migrate(),
>   taking the same parameters as libxl_domain_suspend() _and_
>   precopy_iterations and precopy_dirty_threshold parameters, and
>   consider these parameters in the precopy policy
>
>   (though a pair of new parameters on their own might not warrant an
>   entirely new API function, it is added in anticipation of a number of
>   additional migration-only parameters that would be cumbersome on the
>   whole to tack on to the existing suspend API)
>
> - switch xl migrate to the new libxl_domain_live_migrate() and add new
>   --precopy-iterations and --precopy-threshold parameters to pass
>   through
>
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>

I will have to defer to the tools maintainers on this, but I purposefully
didn't expose these knobs to users when rewriting live migration,
because they cannot be meaningfully chosen by anyone outside of a
testing scenario.  (That is not to say they aren't useful for testing
purposes, but I didn't upstream my version of this patch.)

I spent quite a while wondering how best to expose these tunables in a
way that end users could sensibly use them, and the best I came up with
was this:

First, run the guest under logdirty for a period of time to establish
the working set, and how steady it is.  From this, you have a baseline
for the target threshold, and a plausible way of estimating the
downtime.  (Better yet, as XenCenter, XenServer's Windows GUI, has proved
time and time again, users love graphs!  Even if they don't necessarily
understand them.)

From this baseline, the condition you need to care about is the rate
of convergence.  On a steady VM, you should converge asymptotically to
the measured threshold, although with 5 or fewer iterations the
asymptotic behaviour doesn't appear cleanly.  (Of course, the larger
the VM, the more iterations there are, and the more likely you are to
spot this.)

Users will care either about the migration completing successfully, or
about avoiding interruption of the workload.  In the majority of cases
they will care about both, but every user will consider one of these two
more important than the other.  As a result, there need to be some
options to cover "if $X happens, do I continue or abort?".

The case where the VM becomes busier is harder, however.  For the
users who care about not interrupting the workload, there will be a
point above which they'd prefer to abort the migration rather than
continue it.  For the users who want the migration to complete, they'd
prefer to pause the VM and take a downtime hit rather than abort.

Therefore, you really need two thresholds: one above which you always
abort, and one at which you would normally choose to pause.  The
decision as to what to do depends on where you are between these
thresholds when the dirty state converges.  (Of course, if the VM
suddenly becomes more idle, it is sensible to continue beyond the lower
threshold, as that will reduce the downtime.)  The absolute number of
iterations, on the other hand, doesn't actually matter from a user's
point of view, so it isn't a useful control to have.
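
Very roughly, a policy of that shape might look like the sketch below
(purely illustrative: the tuning struct, the threshold values and the
XGS_POLICY_ABORT return code are hypothetical, not part of this
series):

    static int two_threshold_policy(struct precopy_stats stats, void *user)
    {
        struct tuning { long pause_below, abort_above; } *t = user;

        if (stats.dirty_count < 0)
            return XGS_POLICY_CONTINUE_PRECOPY;  /* no estimate yet */

        if (stats.dirty_count <= t->pause_below)
            return XGS_POLICY_STOP_AND_COPY;     /* downtime now acceptable */

        if (stats.dirty_count >= t->abort_above)
            return XGS_POLICY_ABORT;             /* hypothetical: give up */

        return XGS_POLICY_CONTINUE_PRECOPY;      /* in between: keep copying */
    }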

Another thing to be careful with is the measure of convergence with
respect to guest busyness, and other factors influencing the absolute
iteration time, such as congestion of the network between the two
hosts.  I haven't yet come up with a sensible way of reconciling this
with the above, in a way which can be expressed as a useful set of controls.


The plan, following migration v2, was always to come back to this and
see about doing something better than the current hard-coded parameters,
but I am still working on fixing migration in other areas (not having
VMs crash when moving, because they observe important differences in the
hardware).

How does your postcopy proposal influence/change the above logic?

~Andrew
Joshua Otto March 30, 2017, 6:03 a.m. UTC | #2
On Wed, Mar 29, 2017 at 10:08:02PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > In the context of the live migration algorithm, the precopy iteration
> > count refers to the number of page-copying iterations performed prior to
> > the suspension of the guest and transmission of the final set of dirty
> > pages.  Similarly, the precopy dirty threshold refers to the dirty page
> > count below which we judge it more profitable to proceed to
> > stop-and-copy rather than continue with the precopy.  These would be
> > helpful tuning parameters to work with when migrating particularly busy
> > guests, as they enable an administrator to reap the available benefits
> > of the precopy algorithm (the transmission of guest pages _not_ in the
> > writable working set can be completed without guest downtime) while
> > reducing the total amount of time required for the migration (as
> > iterations of the precopy loop that will certainly be redundant can be
> > skipped in favour of an earlier suspension).
> >
> > To expose these tuning parameters to users:
> > - introduce a new libxl API function, libxl_domain_live_migrate(),
> >   taking the same parameters as libxl_domain_suspend() _and_
> >   precopy_iterations and precopy_dirty_threshold parameters, and
> >   consider these parameters in the precopy policy
> >
> >   (though a pair of new parameters on their own might not warrant an
> >   entirely new API function, it is added in anticipation of a number of
> >   additional migration-only parameters that would be cumbersome on the
> >   whole to tack on to the existing suspend API)
> >
> > - switch xl migrate to the new libxl_domain_live_migrate() and add new
> >   --precopy-iterations and --precopy-threshold parameters to pass
> >   through
> >
> > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> 
> This will have to defer to the tools maintainers, but I purposefully
> didn't expose these knobs to users when rewriting live migration,
> because they cannot be meaningfully chosen by anyone outside of a
> testing scenario.  (That is not to say they aren't useful for testing
> purposes, but I didn't upstream my version of this patch.)

Ahhh, I wondered why those parameters to xc_domain_save() were present
but ignored.  That's reasonable.

I guess the way I had imagined an administrator using them would be in a
non-production/test environment: by running workloads representative of
their production application there, they could experiment with different
--precopy-iterations and --precopy-threshold values (with just a
high-level understanding of what they control) and choose the ones that
produce the best outcome for later use in production.
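
With the xl options added by this patch, that kind of experiment would
look something like the following (values chosen purely for
illustration):

    # Allow at most 3 precopy rounds; suspend early if fewer than 200
    # pages remain dirty at the end of a round.
    xl migrate --precopy-iterations 3 --precopy-threshold 200 <domain> <host>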

> I spent quite a while wondering how best to expose these tunables in a
> way that end users could sensibly use them, and the best I came up with
> was this:
> 
> First, run the guest under logdirty for a period of time to establish
> the working set, and how steady it is.  From this, you have a baseline
> for the target threshold, and a plausible way of estimating the
> downtime.  (Better yet, as XenCenter, XenServers windows GUI, has proved
> time and time again, users love graphs!  Even if they don't necessarily
> understand them.)
> 
> From this baseline, the conditions you need to care about are the rate
> of convergence.  On a steady VM, you should converge asymptotically to
> the measured threshold, although on 5 or fewer iterations, the
> asymptotic properties don't appear cleanly.  (Of course, the larger the
> VM, the more iterations, and the more likely to spot this.)
> 
> Users will either care about the migration completing successfully, or
> avoiding interrupting the workload.  The majority case would be both,
> but every user will have one of these two options which is more
> important than the other.  As a result, there need to be some options to
> cover "if $X happens, do I continue or abort".
> 
> The case where the VM becomes more busy is harder however.  For the
> users which care about not interrupting the workload, there will be a
> point above which they'd prefer to abort the migration rather than
> continue it.  For the users which want the migration to complete, they'd
> prefer to pause the VM and take a downtime hit, rather than aborting.
> 
> Therefore, you really need two thresholds; the one above which you
> always abort, the one where you would normally choose to pause.  The
> decision as to what to do depends on where you are between these
> thresholds when the dirty state converges.  (Of course, if the VM
> suddenly becomes more idle, it is sensible to continue beyond the lower
> threshold, as it will reduce the downtime.)  The absolute number of
> iterations on the other hand doesn't actually matter from a users point
> of view, so isn't a useful control to have.
> 
> Another thing to be careful with is the measure of convergence with
> respect to guest busyness, and other factors influencing the absolute
> iteration time, such as congestion of the network between the two
> hosts.  I haven't yet come up with a sensible way of reconciling this
> with the above, in a way which can be expressed as a useful set of controls.
> 
> 
> The plan, following migration v2, was always to come back to this and
> see about doing something better than the current hard coded parameters,
> but I am still working on fixing migration in other areas (not having
> VMs crash when moving, because they observe important differences in the
> hardware).

I think a good strategy would be to solicit three parameters from the
user:
- the precopy duration they're willing to tolerate
- the downtime duration they're willing to tolerate
- the bandwidth of the link between the hosts (we could try and estimate
  it for them but I'd rather just make them run iperf)

Then, after applying this patch, alter the policy so that precopy simply
runs for the duration that the user is willing to wait.  After that,
using the bandwidth estimate, compute the approximate downtime required
to transfer the final set of dirty pages.  If this is less than what the
user indicated is acceptable, proceed with the stop-and-copy; otherwise,
abort.
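
Roughly, such a policy might look like this (illustrative only: the
tuning struct, the elapsed-time bookkeeping and the XGS_POLICY_ABORT
code are assumptions on my part, not part of this series):

    static int duration_policy(struct precopy_stats stats, void *user)
    {
        struct tuning {
            double max_precopy_s;   /* precopy duration the user will tolerate */
            double max_downtime_s;  /* downtime the user will tolerate */
            double bytes_per_s;     /* measured link bandwidth, e.g. from iperf */
            double elapsed_s;       /* wall-clock time spent in precopy so far */
        } *t = user;
        double est_downtime_s;

        if (t->elapsed_s < t->max_precopy_s || stats.dirty_count < 0)
            return XGS_POLICY_CONTINUE_PRECOPY;

        /* Estimate the stop-and-copy time for the remaining dirty pages
         * (XC_PAGE_SIZE is from xenctrl.h). */
        est_downtime_s = (double)stats.dirty_count * XC_PAGE_SIZE / t->bytes_per_s;

        return est_downtime_s <= t->max_downtime_s ? XGS_POLICY_STOP_AND_COPY
                                                   : XGS_POLICY_ABORT;
    }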

This still requires the user to figure out for themselves how long their
workload can really wait, but hopefully they already had some idea
before deciding to attempt live migration in the first place.

> How does your postcopy proposal influence/change the above logic?

Well, the 'downtime' phase of the migration becomes a very short, fixed
interval, regardless of guest busyness, so you can't ask the user 'how
much downtime can you tolerate?'  Instead, the question becomes the
murkier 'how much memory performance degradation can your guest
tolerate?'  That is, is the postcopy phase going to be essentially
downtime, or can useful work get done between faults?  (For example,
guests that are I/O-bound would do much better with postcopy than they
would with a long stop-and-copy.)

To answer that question, they're back to the approach I outlined at the
beginning - they'd have to experiment in a test environment and observe
their workload's response to the alternatives to make an informed
choice.

Cheers,

Josh
Wei Liu April 12, 2017, 3:37 p.m. UTC | #3
On Thu, Mar 30, 2017 at 02:03:29AM -0400, Joshua Otto wrote:
> On Wed, Mar 29, 2017 at 10:08:02PM +0100, Andrew Cooper wrote:
> > On 27/03/17 10:06, Joshua Otto wrote:
> > > In the context of the live migration algorithm, the precopy iteration
> > > count refers to the number of page-copying iterations performed prior to
> > > the suspension of the guest and transmission of the final set of dirty
> > > pages.  Similarly, the precopy dirty threshold refers to the dirty page
> > > count below which we judge it more profitable to proceed to
> > > stop-and-copy rather than continue with the precopy.  These would be
> > > helpful tuning parameters to work with when migrating particularly busy
> > > guests, as they enable an administrator to reap the available benefits
> > > of the precopy algorithm (the transmission of guest pages _not_ in the
> > > writable working set can be completed without guest downtime) while
> > > reducing the total amount of time required for the migration (as
> > > iterations of the precopy loop that will certainly be redundant can be
> > > skipped in favour of an earlier suspension).
> > >
> > > To expose these tuning parameters to users:
> > > - introduce a new libxl API function, libxl_domain_live_migrate(),
> > >   taking the same parameters as libxl_domain_suspend() _and_
> > >   precopy_iterations and precopy_dirty_threshold parameters, and
> > >   consider these parameters in the precopy policy
> > >
> > >   (though a pair of new parameters on their own might not warrant an
> > >   entirely new API function, it is added in anticipation of a number of
> > >   additional migration-only parameters that would be cumbersome on the
> > >   whole to tack on to the existing suspend API)
> > >
> > > - switch xl migrate to the new libxl_domain_live_migrate() and add new
> > >   --precopy-iterations and --precopy-threshold parameters to pass
> > >   through
> > >
> > > Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> > 
> > This will have to defer to the tools maintainers, but I purposefully
> > didn't expose these knobs to users when rewriting live migration,
> > because they cannot be meaningfully chosen by anyone outside of a
> > testing scenario.  (That is not to say they aren't useful for testing
> > purposes, but I didn't upstream my version of this patch.)
> 
> Ahhh, I wondered why those parameters to xc_domain_save() were present
> but ignored.  That's reasonable.
> 
> I guess the way I had imagined an administrator using them would be in a
> non-production/test environment - if they could run workloads
> representative of their production application in this environment, they
> could experiment with different --precopy-iterations and
> --precopy-threshold values (having just a high-level understanding of
> what they control) and choose the ones that result in the best outcome
> for later use in production.
> 

Running in a test environment isn't always an option -- think about
public cloud providers who don't have control over the VMs or the
workload.

> > I spent quite a while wondering how best to expose these tunables in a
> > way that end users could sensibly use them, and the best I came up with
> > was this:
> > 
> > First, run the guest under logdirty for a period of time to establish
> > the working set, and how steady it is.  From this, you have a baseline
> > for the target threshold, and a plausible way of estimating the
> > downtime.  (Better yet, as XenCenter, XenServers windows GUI, has proved
> > time and time again, users love graphs!  Even if they don't necessarily
> > understand them.)
> > 
> > From this baseline, the conditions you need to care about are the rate
> > of convergence.  On a steady VM, you should converge asymptotically to
> > the measured threshold, although on 5 or fewer iterations, the
> > asymptotic properties don't appear cleanly.  (Of course, the larger the
> > VM, the more iterations, and the more likely to spot this.)
> > 
> > Users will either care about the migration completing successfully, or
> > avoiding interrupting the workload.  The majority case would be both,
> > but every user will have one of these two options which is more
> > important than the other.  As a result, there need to be some options to
> > cover "if $X happens, do I continue or abort".
> > 
> > The case where the VM becomes more busy is harder however.  For the
> > users which care about not interrupting the workload, there will be a
> > point above which they'd prefer to abort the migration rather than
> > continue it.  For the users which want the migration to complete, they'd
> > prefer to pause the VM and take a downtime hit, rather than aborting.
> > 
> > Therefore, you really need two thresholds; the one above which you
> > always abort, the one where you would normally choose to pause.  The
> > decision as to what to do depends on where you are between these
> > thresholds when the dirty state converges.  (Of course, if the VM
> > suddenly becomes more idle, it is sensible to continue beyond the lower
> > threshold, as it will reduce the downtime.)  The absolute number of
> > iterations on the other hand doesn't actually matter from a users point
> > of view, so isn't a useful control to have.
> > 
> > Another thing to be careful with is the measure of convergence with
> > respect to guest busyness, and other factors influencing the absolute
> > iteration time, such as congestion of the network between the two
> > hosts.  I haven't yet come up with a sensible way of reconciling this
> > with the above, in a way which can be expressed as a useful set of controls.
> > 

My thought as well.

> > 
> > The plan, following migration v2, was always to come back to this and
> > see about doing something better than the current hard coded parameters,
> > but I am still working on fixing migration in other areas (not having
> > VMs crash when moving, because they observe important differences in the
> > hardware).
> 
> I think a good strategy would be to solicit three parameters from the
> user:
> - the precopy duration they're willing to tolerate
> - the downtime duration they're willing to tolerate
> - the bandwidth of the link between the hosts (we could try and estimate
>   it for them but I'd rather just make them run iperf)
> 
> Then, after applying this patch, alter the policy so that precopy simply
> runs for the duration that the user is willing to wait.  After that,
> using the bandwidth estimate, compute the approximate downtime required
> to transfer the final set of dirty-pages.  If this is less than what the
> user indicated is acceptable, proceed with the stop-and-copy - otherwise
> abort.
> 
> This still requires the user to figure out for themselves how long their
> workload can really wait, but hopefully they already had some idea
> before deciding to attempt live migration in the first place.
> 

I am not entirely sure what to make of this. I'm not convinced using
durations would cover all cases, but I can't come up with a
counterexample that doesn't sound contrived.

Given this series is already complex enough, I think we should set this
aside for another day.

How hard would it be to _not_ include all the knobs in this series?

Wei.
Joshua Otto April 27, 2017, 10:51 p.m. UTC | #4
On Wed, Apr 12, 2017 at 04:37:16PM +0100, Wei Liu wrote:
> On Thu, Mar 30, 2017 at 02:03:29AM -0400, Joshua Otto wrote:
> > I guess the way I had imagined an administrator using them would be in a
> > non-production/test environment - if they could run workloads
> > representative of their production application in this environment, they
> > could experiment with different --precopy-iterations and
> > --precopy-threshold values (having just a high-level understanding of
> > what they control) and choose the ones that result in the best outcome
> > for later use in production.
> > 
> 
> Running in a test environment isn't always an option -- think about
> public cloud providers who don't have control over the VMs or the
> workload.

Sure, it definitely won't always be an option, but sometimes it might.
The question is whether or not the benefit in the cases where it can be
used justifies the added complexity to the interface.  I think so, but
that's just my intuition.

> > > 
> > > The plan, following migration v2, was always to come back to this and
> > > see about doing something better than the current hard coded parameters,
> > > but I am still working on fixing migration in other areas (not having
> > > VMs crash when moving, because they observe important differences in the
> > > hardware).
> > 
> > I think a good strategy would be to solicit three parameters from the
> > user:
> > - the precopy duration they're willing to tolerate
> > - the downtime duration they're willing to tolerate
> > - the bandwidth of the link between the hosts (we could try and estimate
> >   it for them but I'd rather just make them run iperf)
> > 
> > Then, after applying this patch, alter the policy so that precopy simply
> > runs for the duration that the user is willing to wait.  After that,
> > using the bandwidth estimate, compute the approximate downtime required
> > to transfer the final set of dirty-pages.  If this is less than what the
> > user indicated is acceptable, proceed with the stop-and-copy - otherwise
> > abort.
> > 
> > This still requires the user to figure out for themselves how long their
> > workload can really wait, but hopefully they already had some idea
> > before deciding to attempt live migration in the first place.
> > 
> 
> I am not entirely sure what to make of this. I'm not convinced using
> durations would cover all cases, but I can't come up with a counter
> example that doesn't sound contrived.
> 
> Given this series is already complex enough, I think we should set this
> aside for another day.
> 
> How hard would it be to _not_ include all the knobs in this series?

Fair enough.  It wouldn't be much trouble, so I'll drop it for now.

As a general comment on the patch series for anyone following: I've just
finished with the last of my academic commitments and now have time to
pick this back up.  I'll follow up in the next few weeks with the
suggested revisions, the design document and the quantitative
performance evaluation.

Thanks!

Josh

Patch

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 833f866..84ac96a 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1375,6 +1375,16 @@  int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd,
 #define LIBXL_SUSPEND_DEBUG 1
 #define LIBXL_SUSPEND_LIVE 2
 
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd,
+                              int flags, /* LIBXL_SUSPEND_* */
+                              unsigned int precopy_iterations,
+                              unsigned int precopy_dirty_threshold,
+                              const libxl_asyncop_how *ao_how)
+                              LIBXL_EXTERNAL_CALLERS_ONLY;
+
+#define LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT 5
+#define LIBXL_LM_DIRTY_THRESHOLD_DEFAULT 50
+
 /* @param suspend_cancel [from xenctrl.h:xc_domain_resume( @param fast )]
  *   If this parameter is true, use co-operative resume. The guest
  *   must support this.
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 6d28cce..10d5012 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -332,19 +332,21 @@  int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
  * This is the live migration precopy policy - it's called periodically during
  * the precopy phase of live migrations, and is responsible for deciding when
  * the precopy phase should terminate and what should be done next.
- *
- * The policy implemented here behaves identically to the policy previously
- * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
- * the live migration when there are either fewer than 50 dirty pages, or more
- * than 5 precopy rounds have completed.
  */
 static int libxl__save_live_migration_simple_precopy_policy(
     struct precopy_stats stats, void *user)
 {
-    return ((stats.dirty_count >= 0 && stats.dirty_count < 50) ||
-            stats.iteration >= 5)
-        ? XGS_POLICY_STOP_AND_COPY
-        : XGS_POLICY_CONTINUE_PRECOPY;
+    libxl__save_helper_state *shs = user;
+    libxl__domain_save_state *dss = shs->caller_state;
+
+    if (stats.dirty_count >= 0 &&
+        stats.dirty_count <= dss->precopy_dirty_threshold)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    if (stats.iteration >= dss->precopy_iterations)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
 /*----- main code for saving, in order of execution -----*/
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index 08eccd0..b1cf643 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -486,8 +486,10 @@  static void domain_suspend_cb(libxl__egc *egc,
 
 }
 
-int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
-                         const libxl_asyncop_how *ao_how)
+static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                             unsigned int precopy_iterations,
+                             unsigned int precopy_dirty_threshold,
+                             const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
     int rc;
@@ -510,6 +512,8 @@  int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
     dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_NONE;
+    dss->precopy_iterations = precopy_iterations;
+    dss->precopy_dirty_threshold = precopy_dirty_threshold;
 
     rc = libxl__fd_flags_modify_save(gc, dss->fd,
                                      ~(O_NONBLOCK|O_NDELAY), 0,
@@ -523,6 +527,25 @@  int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     return AO_CREATE_FAIL(rc);
 }
 
+int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                         const libxl_asyncop_how *ao_how)
+{
+    return do_domain_suspend(ctx, domid, fd, flags,
+                             LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
+                             LIBXL_LM_DIRTY_THRESHOLD_DEFAULT, ao_how);
+}
+
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                              unsigned int precopy_iterations,
+                              unsigned int precopy_dirty_threshold,
+                              const libxl_asyncop_how *ao_how)
+{
+    flags |= LIBXL_SUSPEND_LIVE;
+
+    return do_domain_suspend(ctx, domid, fd, flags, precopy_iterations,
+                             precopy_dirty_threshold, ao_how);
+}
+
 int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid)
 {
     int ret;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index f1d8f9a..45d607a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3292,6 +3292,8 @@  struct libxl__domain_save_state {
     int live;
     int debug;
     int checkpointed_stream;
+    unsigned int precopy_iterations;
+    unsigned int precopy_dirty_threshold;
     const libxl_domain_remus_info *remus;
     /* private */
     int rc;
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 7d97811..6df66fb 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -157,15 +157,19 @@  struct cmd_spec cmd_table[] = {
       &main_migrate, 0, 1,
       "Migrate a domain to another host",
       "[options] <Domain> <host>",
-      "-h              Print this help.\n"
-      "-C <config>     Send <config> instead of config file from creation.\n"
-      "-s <sshcommand> Use <sshcommand> instead of ssh.  String will be passed\n"
-      "                to sh. If empty, run <host> instead of ssh <host> xl\n"
-      "                migrate-receive [-d -e]\n"
-      "-e              Do not wait in the background (on <host>) for the death\n"
-      "                of the domain.\n"
-      "--debug         Print huge (!) amount of debug during the migration process.\n"
-      "-p              Do not unpause domain after migrating it."
+      "-h                   Print this help.\n"
+      "-C <config>          Send <config> instead of config file from creation.\n"
+      "-s <sshcommand>      Use <sshcommand> instead of ssh.  String will be passed\n"
+      "                     to sh. If empty, run <host> instead of ssh <host> xl\n"
+      "                     migrate-receive [-d -e]\n"
+      "-e                   Do not wait in the background (on <host>) for the death\n"
+      "                     of the domain.\n"
+      "--debug              Print huge (!) amount of debug during the migration process.\n"
+      "-p                   Do not unpause domain after migrating it.\n"
+      "--precopy-iterations Perform at most this many iterations of the precopy\n"
+      "                     memory migration loop before suspending the domain.\n"
+      "--precopy-threshold  If fewer than this many pages are dirty at the end of a\n"
+      "                     copy round, exit the precopy loop and suspend the domain."
     },
     { "restore",
       &main_restore, 0, 1,
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 1f0e87d..1bb3fb4 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -177,7 +177,9 @@  static void migrate_do_preamble(int send_fd, int recv_fd, pid_t child,
 }
 
 static void migrate_domain(uint32_t domid, const char *rune, int debug,
-                           const char *override_config_file)
+                           const char *override_config_file,
+                           unsigned int precopy_iterations,
+                           unsigned int precopy_dirty_threshold)
 {
     pid_t child = -1;
     int rc;
@@ -205,7 +207,9 @@  static void migrate_domain(uint32_t domid, const char *rune, int debug,
 
     if (debug)
         flags |= LIBXL_SUSPEND_DEBUG;
-    rc = libxl_domain_suspend(ctx, domid, send_fd, flags, NULL);
+    rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
+                                   precopy_iterations, precopy_dirty_threshold,
+                                   NULL);
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
@@ -537,13 +541,17 @@  int main_migrate(int argc, char **argv)
     char *rune = NULL;
     char *host;
     int opt, daemonize = 1, monitor = 1, debug = 0, pause_after_migration = 0;
+    int precopy_iterations = LIBXL_LM_PRECOPY_ITERATIONS_DEFAULT,
+        precopy_dirty_threshold = LIBXL_LM_DIRTY_THRESHOLD_DEFAULT;
     static struct option opts[] = {
         {"debug", 0, 0, 0x100},
         {"live", 0, 0, 0x200},
+        {"precopy-iterations", 1, 0, 'i'},
+        {"precopy-threshold", 1, 0, 'd'},
         COMMON_LONG_OPTS
     };
 
-    SWITCH_FOREACH_OPT(opt, "FC:s:ep", opts, "migrate", 2) {
+    SWITCH_FOREACH_OPT(opt, "FC:s:epi:d:", opts, "migrate", 2) {
     case 'C':
         config_filename = optarg;
         break;
@@ -560,6 +568,20 @@  int main_migrate(int argc, char **argv)
     case 'p':
         pause_after_migration = 1;
         break;
+    case 'i':
+        precopy_iterations = atoi(optarg);
+        if (precopy_iterations < 0) {
+            fprintf(stderr, "negative precopy iterations not supported\n");
+            return EXIT_FAILURE;
+        }
+        break;
+    case 'd':
+        precopy_dirty_threshold = atoi(optarg);
+        if (precopy_dirty_threshold < 0) {
+            fprintf(stderr, "negative dirty threshold not supported\n");
+            return EXIT_FAILURE;
+        }
+        break;
     case 0x100: /* --debug */
         debug = 1;
         break;
@@ -596,7 +618,8 @@  int main_migrate(int argc, char **argv)
                   pause_after_migration ? " -p" : "");
     }
 
-    migrate_domain(domid, rune, debug, config_filename);
+    migrate_domain(domid, rune, debug, config_filename, precopy_iterations,
+                   precopy_dirty_threshold);
     return EXIT_SUCCESS;
 }