[0/6] KVM: Dirty Quota-Based VM Live Migration Auto-Converge

Message ID: 20211114145721.209219-1-shivam.kumar1@nutanix.com

Message

Shivam Kumar Nov. 14, 2021, 2:57 p.m. UTC
This patchset is the KVM-side implementation of a new dirty-quota-based
throttling algorithm that selectively throttles vCPUs based on their
individual contribution to overall memory dirtying and dynamically adapts
the throttle to the available network bandwidth.

Overview
----------
----------

To throttle memory dirtying, we propose to set a limit on the number of
pages a vCPU can dirty in fixed-size, very small time intervals. This
limit depends on the network throughput calculated over the last few
intervals, so that vCPUs are throttled based on the available network
bandwidth. We refer to this limit as the "dirty quota" of a vCPU and to
the fixed-size intervals as "dirty quota intervals".
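
As a rough illustration (this is not code from the patchset; the interval
length, the bandwidth-averaging input and all names below are assumptions
for the sketch), the userspace side could derive a per-vCPU quota roughly
as follows:

#include <stdint.h>

/*
 * Hypothetical userspace-side sketch: how many pages the network can
 * drain in one dirty quota interval, split equally among vCPUs.
 */
static uint64_t dirty_quota_pages(uint64_t avg_bw_bytes_per_sec,
				  uint64_t interval_us,
				  uint64_t page_size,
				  unsigned int nr_vcpus)
{
	/* Pages the network can transfer in one interval. */
	uint64_t total = (avg_bw_bytes_per_sec * interval_us) /
			 (1000000ULL * page_size);

	/* Equal split; see the skewed-workload discussion below. */
	return nr_vcpus ? total / nr_vcpus : total;
}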

One possible approach to distributing the overall dirty quota for an
interval is to split it equally among all the vCPUs. An equal split makes
little sense when the workload is skewed across vCPUs. To handle such
skewed cases, we propose that if a vCPU does not consume its quota in a
given dirty quota interval, the unused portion is added to a common pool.
This common pool (or "common quota") can then be consumed on a
first-come-first-served basis by any vCPU in the upcoming dirty quota
intervals.
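
A minimal sketch of the common-pool idea (illustrative only, not code from
this series; the atomic type and names are assumptions):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Leftover quota donated by vCPUs at the end of an interval. */
static _Atomic uint64_t common_quota;

/* Called for a vCPU whose interval ended with unused quota. */
static void donate_unused_quota(uint64_t unused_pages)
{
	atomic_fetch_add(&common_quota, unused_pages);
}

/* Called by a vCPU that exhausted its own quota: first come, first served. */
static bool borrow_from_common_quota(uint64_t pages)
{
	uint64_t avail = atomic_load(&common_quota);

	while (avail >= pages) {
		if (atomic_compare_exchange_weak(&common_quota, &avail,
						 avail - pages))
			return true;	/* borrowed successfully */
	}
	return false;			/* pool exhausted, stay throttled */
}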

Design
----------
----------

Initialization

vCPUDirtyQuotaContext keeps the dirty quota context for each vCPU. It keeps
the number of pages the vCPU has dirtied (dirty_counter) in the ongoing
dirty quota interval and the maximum number of pages it is allowed to
dirty (dirty_quota) in that interval.

struct vCPUDirtyQuotaContext {
	u64 dirty_counter;
	u64 dirty_quota;
};

The flag dirty_quota_migration_enabled determines whether dirty quota-based
throttling is enabled for an ongoing migration or not.


Handling page dirtying

When the guest tries to dirty a page, it leads to a vmexit as each page is
write-protected. In the vmexit path, we increment the dirty_counter for the
corresponding vCPU. Then, we check if the vCPU has exceeded its quota. If
yes, we exit to userspace with a new exit reason KVM_EXIT_DIRTY_QUOTA_FULL.
This "quota full" event is further handled on the userspace side. 


Please find the KVM Forum presentation on dirty quota-based throttling
here: https://www.youtube.com/watch?v=ZBkkJf78zFA


Shivam Kumar (6):
  Define data structures for dirty quota migration.
  Init dirty quota flag and allocate memory for vCPUdqctx.
  Add KVM_CAP_DIRTY_QUOTA_MIGRATION and handle vCPU page faults.
  Increment dirty counter for vmexit due to page write fault.
  Exit to userspace when dirty quota is full.
  Free vCPUdqctx memory on vCPU destroy.

 Documentation/virt/kvm/api.rst        | 39 +++++++++++++++++++
 arch/x86/include/uapi/asm/kvm.h       |  1 +
 arch/x86/kvm/Makefile                 |  3 +-
 arch/x86/kvm/x86.c                    |  9 +++++
 include/linux/dirty_quota_migration.h | 52 +++++++++++++++++++++++++
 include/linux/kvm_host.h              |  3 ++
 include/uapi/linux/kvm.h              | 11 ++++++
 virt/kvm/dirty_quota_migration.c      | 31 +++++++++++++++
 virt/kvm/kvm_main.c                   | 56 ++++++++++++++++++++++++++-
 9 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/dirty_quota_migration.h
 create mode 100644 virt/kvm/dirty_quota_migration.c

Comments

Sean Christopherson Nov. 18, 2021, 5:46 p.m. UTC | #1
On Sun, Nov 14, 2021, Shivam Kumar wrote:
> One possible approach to distributing the overall dirty quota for an
> interval is to split it equally among all the vCPUs. An equal split makes
> little sense when the workload is skewed across vCPUs. To handle such
> skewed cases, we propose that if a vCPU does not consume its quota in a
> given dirty quota interval, the unused portion is added to a common pool.
> This common pool (or "common quota") can then be consumed on a
> first-come-first-served basis by any vCPU in the upcoming dirty quota
> intervals.

Why not simply use a per-VM quota in combination with a percpu_counter to avoid bouncing
the dirty counter?
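
For reference, a rough sketch of that direction using the existing
percpu_counter API (the per-VM structure and names here are made up, not a
proposal for the actual layout):

#include <linux/percpu_counter.h>

/* Hypothetical per-VM state: one shared counter, one per-VM quota. */
struct kvm_dirty_quota {
	struct percpu_counter dirty_count;	/* cheap per-CPU increments */
	s64 dirty_quota;
};

/* Dirty path: charge one page without bouncing a shared cacheline. */
static inline bool kvm_dirty_quota_exceeded(struct kvm_dirty_quota *dq)
{
	percpu_counter_add(&dq->dirty_count, 1);
	/* Folds the per-CPU deltas only when the comparison needs them. */
	return percpu_counter_compare(&dq->dirty_count, dq->dirty_quota) > 0;
}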

> Design
> ----------
> ----------
> 
> Initialization
> 

Feedback that applies to all patches:

> vCPUDirtyQuotaContext keeps the dirty quota context for each vCPU. It keeps

CamelCase is very frowned upon; please use whatever_case_this_is_called.

The SOB chains are wrong.  The person physically posting the patches needs to have
their SOB last, as they are the person who last handled the patches.

  Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
  Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
  Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
  Signed-off-by: Shaju Abraham <shaju.abraham@nutanix.com>
  Signed-off-by: Manish Mishra <manish.mishra@nutanix.com>

These need a Co-developed-by.  The only other scenario is that you and Anurag
wrote the patches, then handed them off to Shaju, who sent them to Manish, who
sent them back to you for posting.  I highly doubt that's the case, and if so,
I would hope you've done due diligence to ensure what you handed off is the same
as what you posted, i.e. the SOB chains for Shaju and Manish can be omitted.
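
For instance, assuming all four listed people did co-develop the patches, a
tag block consistent with Documentation/process/submitting-patches.rst would
look like:

  Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
  Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
  Co-developed-by: Shaju Abraham <shaju.abraham@nutanix.com>
  Signed-off-by: Shaju Abraham <shaju.abraham@nutanix.com>
  Co-developed-by: Manish Mishra <manish.mishra@nutanix.com>
  Signed-off-by: Manish Mishra <manish.mishra@nutanix.com>
  Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>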

In general, please read through most of the stuff in Documentation/process.

> the number of pages the vCPU has dirtied (dirty_counter) in the ongoing
> dirty quota interval and the maximum number of pages it is allowed to
> dirty (dirty_quota) in that interval.
> 
> struct vCPUDirtyQuotaContext {
> 	u64 dirty_counter;
> 	u64 dirty_quota;
> };
> 
> The flag dirty_quota_migration_enabled determines whether dirty quota-based
> throttling is enabled for an ongoing migration or not.
> 
> 
> Handling page dirtying
> 
> When the guest tries to dirty a page, it leads to a vmexit as each page is
> write-protected. In the vmexit path, we increment the dirty_counter for the
> corresponding vCPU. Then, we check if the vCPU has exceeded its quota. If
> yes, we exit to userspace with a new exit reason KVM_EXIT_DIRTY_QUOTA_FULL.
> This "quota full" event is further handled on the userspace side. 
> 
> 
> Please find the KVM Forum presentation on dirty quota-based throttling
> here: https://www.youtube.com/watch?v=ZBkkJf78zFA
> 
> 
> Shivam Kumar (6):
>   Define data structures for dirty quota migration.
>   Init dirty quota flag and allocate memory for vCPUdqctx.
>   Add KVM_CAP_DIRTY_QUOTA_MIGRATION and handle vCPU page faults.
>   Increment dirty counter for vmexit due to page write fault.
>   Exit to userspace when dirty quota is full.
>   Free vCPUdqctx memory on vCPU destroy.

Freeing memory in a later patch is not an option.  The purpose of splitting is
to aid bisection and make the patches more reviewable, not to break bisection and
confuse reviewers.  In general, there are too many patches and things are split in
weird ways, making this hard to review.  This can probably be smushed into two
patches: 1) implement the guts, 2) expose to userspace and document.

>  Documentation/virt/kvm/api.rst        | 39 +++++++++++++++++++
>  arch/x86/include/uapi/asm/kvm.h       |  1 +
>  arch/x86/kvm/Makefile                 |  3 +-
>  arch/x86/kvm/x86.c                    |  9 +++++
>  include/linux/dirty_quota_migration.h | 52 +++++++++++++++++++++++++
>  include/linux/kvm_host.h              |  3 ++
>  include/uapi/linux/kvm.h              | 11 ++++++
>  virt/kvm/dirty_quota_migration.c      | 31 +++++++++++++++

I do not see any reason to add two new files for 84 lines, which I'm pretty sure
we can trim down significantly in any case.  Paolo has suggested creating files
for the mm side of generic kvm; the helpers can go wherever that lands.

>  virt/kvm/kvm_main.c                   | 56 ++++++++++++++++++++++++++-
>  9 files changed, 203 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/dirty_quota_migration.h
>  create mode 100644 virt/kvm/dirty_quota_migration.c

As for the design, allocating a separate page for 16 bytes is wasteful and adds
complexity that I don't think is strictly necessary.  Assuming the quota isn't
simply a per-VM thing....

Rather than have both the count and the quota writable by userspace, what about
having KVM_CAP_DIRTY_QUOTA_MIGRATION (renamed to just KVM_CAP_DIRTY_QUOTA, because
dirty logging can technically be used for things other than migration) define a
default, per-VM dirty quota, that is snapshotted by each vCPU on creation.  The
ioctl() would need to be rejected if vCPUs have been created, but it already needs
something along those lines because currently it has a TOCTOU race and can also
race with vCPU readers.

Anyways, vCPUs snapshot a default quota on creation, and then use struct kvm_run to
update the quota upon return from userspace after KVM_EXIT_DIRTY_QUOTA_FULL instead
of giving userspace free rein to change the quota at will.  There are a variety
of ways to leverage kvm_run; the simplest I can think of would be to define the ABI
such that calling KVM_RUN with "exit_reason == KVM_EXIT_DIRTY_QUOTA_FULL" would
trigger an update.  That would do the right thing even if userspace _doesn't_ update
the count/quota, as KVM would simply copy back the original quota/count and exit back
to userspace.

E.g.

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 78f0719cc2a3..d4a7d1b7019e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -487,6 +487,11 @@ struct kvm_run {
                        unsigned long args[6];
                        unsigned long ret[2];
                } riscv_sbi;
+               /* KVM_EXIT_DIRTY_QUOTA_FULL */
+               struct {
+                       __u64 dirty_count;
+                       __u64 dirty_quota;
+               };
                /* Fix the size of the union. */
                char padding[256];
        };

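Userspace handling under that ABI could then look roughly like this (the
exit reason and the dirty_count/dirty_quota fields follow the sketch above
and do not exist in current headers; this is an assumption, not a final
ABI):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int run_vcpu(int vcpu_fd, struct kvm_run *run, __u64 next_quota)
{
	for (;;) {
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
			return -1;

		if (run->exit_reason != KVM_EXIT_DIRTY_QUOTA_FULL)
			return 0;	/* let the caller handle other exits */

		/* Optionally sleep here to throttle, then grant a new quota. */
		run->dirty_count = 0;
		run->dirty_quota = next_quota;
		/*
		 * Calling KVM_RUN again with exit_reason still set to
		 * KVM_EXIT_DIRTY_QUOTA_FULL is what triggers the update.
		 */
	}
}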

Side topic, it might make sense to have the counter be a stat; the per-vCPU dirty
rate could be useful info even if userspace isn't using quotas.