
[6/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration

Message ID BL2PR08MB4814C8EBEC9E7A82E01EC39F0630@BL2PR08MB481.namprd08.prod.outlook.com (mailing list archive)
State New, archived

Commit Message

Cao, Lei April 26, 2016, 7:26 p.m. UTC
Updates to KVM API documentation

---
 Documentation/virtual/kvm/api.txt | 170 ++++++++++++++++++++++++++++
 1 file changed, 170 insertions(+)

Comments

Radim Krčmář April 28, 2016, 6:08 p.m. UTC | #1
2016-04-26 19:26+0000, Cao, Lei:
> Updates to KVM API documentation
> ---

I have five broad questions about design of the interface:

* Why are IOCTLs marked with "Called once when entering live
  migration/checkpoint mode" separate from KVM_INIT_MT?
* Is there a reason to call KVM_ENABLE_MT often?
* How significant is the benefit of MT_FETCH_WAIT?
* When would you disable MT_FETCH_REARM?
* What drawbacks would an interface without explicit checkpointing cycles have?

> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> @@ -3120,6 +3120,176 @@ struct kvm_reinject_control {

A bit to the code itself:

> +4.99 KVM_INIT_MT
> +
> +Capability: basic

"basic" ioctls were present since the first version of KVM.
We can't change the past, so please add a new capability.

> +4.102 KVM_MT_SUBLIST_FETCH
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct mt_sublist_fetch_info (in/out)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_MT_SUBLIST_FETCH */
> +struct mt_gfn_list {
> +        __s32   count;
> +        __u32   max_dirty;
> +        __u64   *gfnlist;

gfn (= gpa >> PAGE_SHIFT) is not enough to specify a page for userspace,
because KVM has a concept of address spaces and pages from multiple
slots can be mapped into the same gfn (e.g. x86 SMRAM).

Providing a memslot/offset pair seems best.  (I'd start by addressing
Kai's comment on [3/6] about binding gfnlist to memslots.)
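
Something like this pairing, for illustration (hypothetical struct, not
something in the series):

/* hypothetical, just to illustrate the memslot/offset pairing */
struct mt_dirty_page {
        __u32   slot;           /* memslot id (disambiguates address spaces) */
        __u32   pad;
        __u64   offset;         /* page offset within that memslot */
};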

Thanks.
Cao, Lei April 29, 2016, 6:47 p.m. UTC | #2
On 4/28/2016 2:08 PM, Radim Krčmář wrote:
> 2016-04-26 19:26+0000, Cao, Lei:
>> Updates to KVM API documentation
>> ---
> 
> I have five broad questions about design of the interface:
> 
> * Why are IOCTLs marked with "Called once when entering live
>   migration/checkpoint mode" separate from KVM_INIT_MT?

KVM_MT_DIRTY_TRIGGER can be folded into KVM_INIT_MT. I'll change that.

> * Is there a reason to call KVM_ENABLE_MT often?

KVM_ENABLE_MT can be called multiple times during a protected
VM's lifecycle in a checkpointing system. A protected VM has two
instances, primary and secondary. Memory tracking is only enabled on
the primary. When we do a polite failover, memory tracking is
disabled on the old primary and enabled on the new primary. Memory
tracking is also disabled when the secondary goes away, in which case
the checkpoint cycle stops and there is no need for memory tracking. When
the secondary comes back, memory tracking is re-enabled, the two
instances sync up, and the checkpoint cycle restarts.
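
The toggle itself is just something like this (rough sketch; assumes the
mt_enable definitions from this series end up exported via <linux/kvm.h>):

#include <sys/ioctl.h>
#include <linux/kvm.h>  /* assumes the mt_* definitions from this series */

static int mt_set_tracking(int vm_fd, int on)
{
        struct mt_enable en = { .flags = on ? 1 : 0 };  /* 1 -> on, 0 -> off */

        /* enable on the (new) primary, disable on demotion or when the
         * secondary goes away and the checkpoint cycle stops */
        return ioctl(vm_fd, KVM_ENABLE_MT, &en);
}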

> * How significant is the benefit of MT_FETCH_WAIT?

This allows the user thread that harvests dirty pages to park instead
of busy-waiting when there are no or very few dirty pages.

> * When would you disable MT_FETCH_REARM?

In a checkpointing system, dirty pages are harvested after the VM is
paused. Userspace can choose to rearm the write traps all at once with
KVM_REARM_DIRTY_PAGES after all the dirty pages have been fetched, in
which case the traps don't need to be rearmed during each fetch.
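
For example (rough sketch only: send_to_secondary() is a hypothetical
helper, the in/out handling of count and the loop termination are
simplified assumptions, and error handling is omitted):

#include <sys/ioctl.h>
#include <linux/kvm.h>  /* assumes the mt_* definitions from this series */

extern void send_to_secondary(const __u64 *gfns, __u32 count); /* hypothetical */

/* Harvest with the traps left unarmed, then rearm everything in one go.
 * Assumes vcpus are already paused and KVM_MT_VM_QUIESCED has been issued. */
static void harvest_after_pause(int vm_fd, __u64 *buf, __u32 max)
{
        struct mt_sublist_fetch_info fi = {
                .gfn_info = { .count = max, .gfnlist = buf },
                .flags = 0,     /* no MT_FETCH_REARM; MT_FETCH_WAIT not needed */
        };

        while (ioctl(vm_fd, KVM_MT_SUBLIST_FETCH, &fi) == 0 &&
               fi.gfn_info.count > 0) {
                send_to_secondary(buf, fi.gfn_info.count);
                fi.gfn_info.count = max;        /* reset the in/out count */
        }

        /* rearm the write traps for every page dirtied this cycle at once */
        ioctl(vm_fd, KVM_REARM_DIRTY_PAGES, 0);
}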

> * What drawbacks would an interface without explicit checkpointing cycles have?

The checkpointing cycle has to be implemented in userspace to use this
interface.

> 
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> @@ -3120,6 +3120,176 @@ struct kvm_reinject_control {
> 
> A bit to the code itself:
> 
>> +4.99 KVM_INIT_MT
>> +
>> +Capability: basic
> 
> "basic" ioctls were present since the first version of KVM.
> We can't change the past, so please add a new capability.
> 

Will do.

>> +4.102 KVM_MT_SUBLIST_FETCH
>> +
>> +Capability: basic
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct mt_sublist_fetch_info (in/out)
>> +Returns: 0 on success, -1 on error
>> +
>> +/* for KVM_MT_SUBLIST_FETCH */
>> +struct mt_gfn_list {
>> +        __s32   count;
>> +        __u32   max_dirty;
>> +        __u64   *gfnlist;
> 
> gfn (= gpa >> PAGE_SHIFT) is not enough to specify a page for userspace,
> because KVM has a concept of address spaces and pages from multiple
> slots can be mapped into the same gfn (e.g. x86 SMRAM).
> 
> Providing a memslot/offset pair seems best.  (I'd start by addressing
> Kai's comment on [3/6] about binding gfnlist to memslots.)
> 
> Thanks.
> 

I'll respond to your latest comments about memslot/offset.

Thanks!
Radim Krčmář May 2, 2016, 4:23 p.m. UTC | #3
2016-04-29 18:47+0000, Cao, Lei:
> On 4/28/2016 2:08 PM, Radim Krčmář wrote:
>> 2016-04-26 19:26+0000, Cao, Lei:
>> * Is there a reason to call KVM_ENABLE_MT often?
> 
> KVM_ENABLE_MT can be called multiple times during a protected
> VM's lifecycle in a checkpointing system. A protected VM has two
> instances, primary and secondary. Memory tracking is only enabled on
> the primary. When we do a polite failover, memory tracking is
> disabled on the old primary and enabled on the new primary. Memory
> tracking is also disabled when the secondary goes away, in which case
> checkpoint cycle stops and there is no need for memory tracking. When
> the secondary comes back, memory tracking is re-enabled and the two
> instances sync up and checkpoint cycle starts.

Makes sense.

>> * How significant is the benefit of MT_FETCH_WAIT?
> 
> This allows the user thread that harvests dirty pages to park instead
> of busy-waiting when there are no or very few dirty pages.

True, mandatory polling could be ugly.

>> * When would you disable MT_FETCH_REARM?
> 
> In a checkpointing system, dirty pages are harvested after the VM is
> paused. Userspace can choose to rearm the write traps all at once after
> all the dirty pages have been fetched using KVM_REARM_DIRTY_PAGES, in
> which case the traps don't need to be armed during each fetch.

Ah, it makes a difference when you don't plan to run the VM again.

I guess all three of them are worth it.
(Might change my mind when I gain better understanding.)

>> * What drawbacks would an interface without explicit checkpointing cycles have?
> 
> Checkpointing cycle has to be implemented in userspace to use this
> interface. 

But isn't the explicit cycle necessary only in userspace?
The dirty list could be implemented as a circular buffer, so KVM
wouldn't need an explicit notification about the new cycle -- userspace
would just drain all dirty pages and unpause vcpus.
(Quiesced can be a stateless one-time kick of waiters instead.)
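
Roughly (hypothetical layout, just to illustrate; not something from
this series):

struct mt_dirty_ring {
        __u32   head;           /* producer index, advanced by KVM       */
        __u32   tail;           /* consumer index, advanced by userspace */
        __u32   size;           /* number of slots, power of two         */
        __u64   gfns[];         /* dirty gfns, indexed modulo 'size'     */
};

Userspace would drain gfns[tail..head), advance tail and unpause vcpus;
KVM keeps appending at head, so no per-cycle ioctl is needed.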

Thanks.
Kai Huang May 3, 2016, 7:10 a.m. UTC | #4
Hi,

On 4/27/2016 7:26 AM, Cao, Lei wrote:
> Updates to KVM API documentation
>
> ---
>  Documentation/virtual/kvm/api.txt | 170 ++++++++++++++++++++++++++++
>  1 file changed, 170 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 4d0542c..3f5367a 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3120,6 +3120,176 @@ struct kvm_reinject_control {
>  pit_reinject = 0 (!reinject mode) is recommended, unless running an old
>  operating system that uses the PIT for timing (e.g. Linux 2.4.x).
>
> +4.99 KVM_INIT_MT
> +
> +Capability: basic
> +Architectures: x86

Shall we make the new IOCTLs available for all archs? In my
understanding, your memory tracking mechanism doesn't depend on any
specific arch. :)

Thanks,
-Kai

> +Type: vm ioctl
> +Parameters: struct mt_setup (in)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_INIT_MT */
> +#define KVM_MT_VERSION                  1
> +struct mt_setup {
> +        __u32 version;
> +
> +        /* which operation to perform? */
> +#define KVM_MT_OP_INIT           1
> +#define KVM_MT_OP_CLEANUP        2
> +        __u32 op;
> +
> +        /*
> +         * Features.
> +         * 1. Avoid logging duplicate entries
> +         */
> +#define KVM_MT_OPTION_NO_DUPS           (1 << 2)
> +
> +        __u32 flags;
> +
> +        /* max number of dirty pages per checkpoint cycle */
> +        __u32 max_dirty;
> +};
> +
> +This instructs the memory tracking (MT) subsystem to initialize or
> +clean up memory tracking data structures. Userspace specifies the
> +memory tracking version to make sure it and KVM are on the same
> +page. For initialization, userspace specifies the maximum number
> +of dirty pages allowed per checkpoint cycle. It can also tell KVM
> +to avoid logging duplicate pages via 'flags', in which case KVM
> +will create a bitmap to track dirty pages.
> +
> +Called once during initialization.
> +
> +4.100 KVM_ENABLE_MT
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct mt_enable (in)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_ENABLE_MT */
> +struct mt_enable {
> +       __u32 flags;            /* 1 -> on, 0 -> off */
> +};
> +
> +This instructs the MT subsystem to start/stop logging dirty
> +VM pages. On hosts that support fault based memory tracking, KVM
> +write-protects all VM pages to start dirty logging. On hosts that
> +support PML, KVM clears the dirty bits for all VM pages to start
> +dirty logging, and sets the dirty bits to stop dirty logging.
> +
> +Called once when entering/exiting live migration/checkpoint mode.
> +
> +4.101 KVM_PREPARE_MT_CP
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct mt_prepare_cp (in)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_PREPARE_MT_CP */
> +struct mt_prepare_cp {
> +        __s64   cpid;
> +};
> +
> +This instructs the MT subsystem that a new checkpoint cycle is
> +about to start and provides the cycle ID. The MT subsystem resets
> +all the relevant variables, assuming all dirty pages have been
> +fetched.
> +
> +Called once per checkpoint cycle.
> +
> +4.102 KVM_MT_SUBLIST_FETCH
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct mt_sublist_fetch_info (in/out)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_MT_SUBLIST_FETCH */
> +struct mt_gfn_list {
> +        __s32   count;
> +        __u32   max_dirty;
> +        __u64   *gfnlist;
> +};
> +
> +struct mt_sublist_fetch_info {
> +        struct mt_gfn_list  gfn_info;
> +
> +        /*
> +         * flags bit defs:
> +         */
> +
> +        /* caller sleeps until dirty count is reached */
> +#define MT_FETCH_WAIT           (1 << 0)
> +        /* dirty tracking is re-armed for each page in returned list */
> +#define MT_FETCH_REARM          (1 << 1)
> +
> +        __u32 flags;
> +};
> +
> +This fetches a subset of the current dirty pages. The userspace thread
> +specifies the maximum number of dirty pages it wants to fetch via
> +(struct mt_gfn_list).count. It also tells the MT subsystem whether it
> +is going to wait until the specified maximum number is reached. The
> +userspace thread can instruct the MT subsystem to re-arm the dirty
> +trap for each page that is fetched. The dirty pages are returned to
> +userspace in (struct mt_gfn_list).gfnlist, and (struct mt_gfn_list).count
> +indicates the number of dirty pages that are returned.
> +
> +Called multiple times by multiple threads per checkpoint cycle.
> +
> +4.103 KVM_REARM_DIRTY_PAGES
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters:
> +Returns: 0 on success, -1 on error
> +
> +This instructs the MT subsystem to rearm the dirty traps for all
> +the pages that were dirtied during the last checkpoint cycle.
> +
> +Called once per checkpoint cycle. The call is not necessary if dirty
> +traps are rearmed when dirty pages are being fetched.
> +
> +4.104 KVM_MT_VM_QUIESCED
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters:
> +Returns: 0 on success, -1 on error
> +
> +This instructs the MT subsystem that the VM has been quiesced and no
> +more pages will be dirtied this checkpoint cycle. The MT subsystem
> +will wake up userspace threads that are waiting for new dirty pages
> +to fetch, if any.
> +
> +Called once per checkpoint cycle.
> +
> +4.105 KVM_MT_DIRTY_TRIGGER
> +
> +Capability: basic
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct mt_dirty_trigger (in)
> +Returns: 0 on success, -1 on error
> +
> +/* for KVM_MT_DIRTY_TRIGGER */
> +struct mt_dirty_trigger {
> +        /* force vcpus to exit when trigger is reached */
> +        __u32 dirty_trigger;
> +};
> +
> +This sets the VM exit trigger point based on dirty page count.
> +
> +Called once when entering live migration/checkpoint mode.
> +
>  5. The kvm_run structure
>  ------------------------
>
>
Cao, Lei May 3, 2016, 1:34 p.m. UTC | #5
On 5/2/2016 12:23 PM, Radim Krčmář wrote:
> 2016-04-29 18:47+0000, Cao, Lei:
>> On 4/28/2016 2:08 PM, Radim Krčmář wrote:
>>> 2016-04-26 19:26+0000, Cao, Lei:
>>> * Is there a reason to call KVM_ENABLE_MT often?
>>
>> KVM_ENABLE_MT can be called multiple times during a protected
>> VM's lifecycle in a checkpointing system. A protected VM has two
>> instances, primary and secondary. Memory tracking is only enabled on
>> the primary. When we do a polite failover, memory tracking is
>> disabled on the old primary and enabled on the new primary. Memory
>> tracking is also disabled when the secondary goes away, in which case
>> checkpoint cycle stops and there is no need for memory tracking. When
>> the secondary comes back, memory tracking is re-enabled and the two
>> instances sync up and checkpoint cycle starts.
> 
> Makes sense.
> 
>>> * How significant is the benefit of MT_FETCH_WAIT?
>>
>> This allows the user thread that harvests dirty pages to park instead
>> of busy-waiting when there are no or very few dirty pages.
> 
> True, mandatory polling could be ugly.
> 
>>> * When would you disable MT_FETCH_REARM?
>>
>> In a checkpointing system, dirty pages are harvested after the VM is
>> paused. Userspace can choose to rearm the write traps all at once after
>> all the dirty pages have been fetched using KVM_REARM_DIRTY_PAGES, in
>> which case the traps don't need to be armed during each fetch.
> 
> Ah, it makes a difference when you don't plan to run the VM again.
> 
> I guess all three of them are worth it.
> (Might change my mind when I gain better understanding.)
> 
>>> * What drawbacks would an interface without explicit checkpointing cycles have?
>>
>> Checkpointing cycle has to be implemented in userspace to use this
>> interface. 
> 
> But isn't the explicit cycle necessary only in userspace?
> The dirty list could be implemented as a circular buffer, so KVM
> wouldn't need an explicit notification about the new cycle -- userspace
> would just drain all dirty pages and unpause vcpus.
> (Quiesced can be a stateless one-time kick of waiters instead.)
> 
> Thanks.
> 

Good point. I might be able to do away with explicit cycles. I'll
see what else I can do to simplify the interface. 

Thanks!

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 4d0542c..3f5367a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3120,6 +3120,176 @@  struct kvm_reinject_control {
 pit_reinject = 0 (!reinject mode) is recommended, unless running an old
 operating system that uses the PIT for timing (e.g. Linux 2.4.x).
 
+4.99 KVM_INIT_MT
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct mt_setup (in)
+Returns: 0 on success, -1 on error
+
+/* for KVM_INIT_MT */
+#define KVM_MT_VERSION                  1
+struct mt_setup {
+        __u32 version;
+
+        /* which operation to perform? */
+#define KVM_MT_OP_INIT           1
+#define KVM_MT_OP_CLEANUP        2
+        __u32 op;
+
+        /*
+         * Features.
+         * 1. Avoid logging duplicate entries
+         */
+#define KVM_MT_OPTION_NO_DUPS           (1 << 2)
+
+        __u32 flags;
+
+        /* max number of dirty pages per checkpoint cycle */
+        __u32 max_dirty;
+};
+
+This instructs the memory tracking (MT) subsystem to initialize or
+clean up memory tracking data structures. Userspace specifies the
+memory tracking version to make sure it and KVM are on the same
+page. For initialization, userspace specifies the maximum number
+of dirty pages allowed per checkpoint cycle. It can also tell KVM
+to avoid logging duplicate pages via 'flags', in which case KVM
+will create a bitmap to track dirty pages.
+
+Called once during initialization.
+
+4.100 KVM_ENABLE_MT
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct mt_enable (in)
+Returns: 0 on success, -1 on error
+
+/* for KVM_ENABLE_MT */
+struct mt_enable {
+       __u32 flags;            /* 1 -> on, 0 -> off */
+};
+
+This instructs the MT subsystem to start/stop logging dirty
+VM pages. On hosts that support fault based memory tracking, KVM
+write-protects all VM pages to start dirty logging. On hosts that
+support PML, KVM clears the dirty bits for all VM pages to start
+dirty logging, and sets the dirty bits to stop dirty logging.
+
+Called once when entering/exiting live migration/checkpoint mode.
+
+4.101 KVM_PREPARE_MT_CP
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct mt_prepare_cp (in)
+Returns: 0 on success, -1 on error
+
+/* for KVM_PREPARE_MT_CP */
+struct mt_prepare_cp {
+        __s64   cpid;
+};
+
+This instructs the MT subsystem that a new checkpoint cycle is
+about to start and provides the cycle ID. The MT subsystem resets
+all the relevant variables, assuming all dirty pages have been
+fetched.
+
+Called once per checkpoint cycle.
+
+4.102 KVM_MT_SUBLIST_FETCH
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct mt_sublist_fetch_info (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_MT_SUBLIST_FETCH */
+struct mt_gfn_list {
+        __s32   count;
+        __u32   max_dirty;
+        __u64   *gfnlist;
+};
+
+struct mt_sublist_fetch_info {
+        struct mt_gfn_list  gfn_info;
+
+        /*
+         * flags bit defs:
+         */
+
+        /* caller sleeps until dirty count is reached */
+#define MT_FETCH_WAIT           (1 << 0)
+        /* dirty tracking is re-armed for each page in returned list */
+#define MT_FETCH_REARM          (1 << 1)
+
+        __u32 flags;
+};
+
+This fetches a subset of the current dirty pages. The userspace thread
+specifies the maximum number of dirty pages it wants to fetch via
+(struct mt_gfn_list).count. It also tells the MT subsystem whether it
+is going to wait until the specified maximum number is reached. The
+userspace thread can instruct the MT subsystem to re-arm the dirty
+trap for each page that is fetched. The dirty pages are returned to
+userspace in (struct mt_gfn_list).gfnlist, and (struct mt_gfn_list).count
+indicates the number of dirty pages that are returned.
+
+Called multiple times by multiple threads per checkpoint cycle.
+
+4.103 KVM_REARM_DIRTY_PAGES
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: none
+Returns: 0 on success, -1 on error
+
+This instructs the MT subsystem to rearm the dirty traps for all
+the pages that were dirtied during the last checkpoint cycle.
+
+Called once per checkpoint cycle. The call is not necessary if dirty
+traps are rearmed when dirty pages are being fetched.
+
+4.104 KVM_MT_VM_QUIESCED
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: none
+Returns: 0 on success, -1 on error
+
+This instructs the MT subsystem that the VM has been quiesced and no
+more pages will be dirtied this checkpoint cycle. The MT subsystem
+will wake up userspace threads that are waiting for new dirty pages
+to fetch, if any.
+
+Called once per checkpoint cycle.
+
+4.105 KVM_MT_DIRTY_TRIGGER
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct mt_dirty_trigger (in)
+Returns: 0 on success, -1 on error
+
+/* for KVM_MT_DIRTY_TRIGGER */
+struct mt_dirty_trigger {
+        /* force vcpus to exit when trigger is reached */
+        __u32 dirty_trigger;
+};
+
+This sets the VM exit trigger point based on dirty page count.
+
+Called once when entering live migration/checkpoint mode.
+
 5. The kvm_run structure
 ------------------------