From patchwork Mon Jan 28 09:33:55 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Wanpeng Li <kernellwp@gmail.com>
X-Patchwork-Id: 10783463
Return-Path: <kvm-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 13B5313B5
	for <patchwork-kvm@patchwork.kernel.org>;
 Mon, 28 Jan 2019 09:34:08 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 02056284C7
	for <patchwork-kvm@patchwork.kernel.org>;
 Mon, 28 Jan 2019 09:34:07 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id EA49E284D1; Mon, 28 Jan 2019 09:34:06 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI
	autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 58EA1284C5
	for <patchwork-kvm@patchwork.kernel.org>;
 Mon, 28 Jan 2019 09:34:06 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726752AbfA1JeB (ORCPT
        <rfc822;patchwork-kvm@patchwork.kernel.org>);
        Mon, 28 Jan 2019 04:34:01 -0500
Received: from mail-pl1-f195.google.com ([209.85.214.195]:35277 "EHLO
        mail-pl1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726369AbfA1JeB (ORCPT <rfc822;kvm@vger.kernel.org>);
        Mon, 28 Jan 2019 04:34:01 -0500
Received: by mail-pl1-f195.google.com with SMTP id p8so7495559plo.2;
        Mon, 28 Jan 2019 01:34:00 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:mime-version
         :content-transfer-encoding;
        bh=mJ5Cg8m5AuiE6k8X70rqtyhMp/Uyh1+t2vNRSuissbo=;
        b=Xi7zWxHwv6cnMpacp6pvZLhUh+qK1vcLRE/JGPvZqU5Mcx5LC7pKRJx0z8WyMHb/Ni
         1tbyccdU77PpePH4xN+huZm1x5HdMCI++biC5fy75SMBlPQke/H/i5Y/YVI5SjpQpVaZ
         xzYusnQG8NrAxkMs551/a093SQaay6u0eNFqVQyCoFG6CmmI/kvifUiIK2OPR9m6wfEO
         7LY3i59odNt38SxmmewoADXvmVBMt5e97keq/36hg+Ic6qpxUFBvYR4L5vW2wXM6WBWo
         TqEkOGAxNMC+aUW/6cVjiVmfyfH9luQUzgDSd6Vtlxopi3UFkIu4Ma47qThqEsWFXeuO
         +CgA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version
         :content-transfer-encoding;
        bh=mJ5Cg8m5AuiE6k8X70rqtyhMp/Uyh1+t2vNRSuissbo=;
        b=pAZ2BtOP/NsF1w0Ve1Kh0s25WLjnoQuQGN9hoypT/JTy/d6dlqaCiH1U+ah8YKVCQ2
         8ZTelMZo2ZR/681REACatorAlKihIdOdrks6vZiBE7mRgY/92d1bCNKzsj4tX4S00xBX
         j1GtO4wa3zmP78PmdI6Z3wE5/5ZKJD8Q+AZHHrxQ9DtFPY08Z1cl6hVn2glnFZtxzvsy
         bMR1b/oWn1gJS6LhRYsn+PZRA37pU0OExZt34usR9HXuwY7Bl5mKJVfGMMsS59e9bH/q
         4L3tKeP98S7vAf2yAcgjG4OkknSoznGS3Jc94qd2Ae3Nui5ahTWh/2g0TXeHoy+1SgCX
         mx7w==
X-Gm-Message-State: AJcUukc3QJS5Hnb2XFsaCUOCYiJo7I3Ps4qBtjsr5zeCFHSO7nmJmgKM
        ZDh14+LxWRzpW64bbKaeDJ/vEovy
X-Google-Smtp-Source: 
 ALg8bN5OgfP3qzQAS9q/zJnfGhB0X3ocjS+9CkFLC5neOi5O9x8uhxk8zm9HuHlClGB1OsC0BfCtmg==
X-Received: by 2002:a17:902:bf44:: with SMTP id
 u4mr21298557pls.5.1548668039959;
        Mon, 28 Jan 2019 01:33:59 -0800 (PST)
Received: from localhost.localdomain ([203.205.141.123])
        by smtp.googlemail.com with ESMTPSA id
 y6sm37665727pfl.187.2019.01.28.01.33.58
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
        Mon, 28 Jan 2019 01:33:59 -0800 (PST)
From: Wanpeng Li <kernellwp@gmail.com>
X-Google-Original-From: Wanpeng Li <wanpengli@tencent.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
 =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= <rkrcmar@redhat.com>
Subject: [PATCH RESEND] KVM: MMU: Introduce single thread to zap collapsible
 sptes
Date: Mon, 28 Jan 2019 17:33:55 +0800
Message-Id: <1548668035-17639-1-git-send-email-wanpengli@tencent.com>
X-Mailer: git-send-email 2.7.4
MIME-Version: 1.0
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Wanpeng Li <wanpengli@tencent.com>

Last year guys from huawei reported that the call of memory_global_dirty_log_start/stop() 
takes 13s for 4T memory and cause guest freeze too long which increases the unacceptable 
migration downtime. [1] [2]

Guangrong pointed out:

| collapsible_sptes zaps 4k mappings to make memory-read happy, it is not
| required by the semanteme of KVM_SET_USER_MEMORY_REGION and it is not
| urgent for vCPU's running, it could be done in a separate thread and use
| lock-break technology.

Several TB memory guest is common now after NVDIMM is deployed in cloud environment.
This patch utilizes worker thread to zap collapsible sptes in order to lazy collapse 
small sptes into large sptes during roll-back after live migration fails.

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05249.html
[2] https://www.mail-archive.com/qemu-devel@nongnu.org/msg449994.html

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
Note:
I ever consider to add a list of memslots to be zapped, delete from the list
and add in kvm_mmu_zap_collapsible_sptes(). However, i observe a lot of races 
here, the memslot can disappear/modify underneath before the worker thread
start to zap even if i introduce lock to protect the list. This patch delays 
the worker thread by 60s(to assume memory_global_dirty_log_stop can absolutely 
complete) to coalesce all the zap requirements after live migration fails.

 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/mmu.c              | 37 ++++++++++++++++++++++++++++++++-----
 arch/x86/kvm/x86.c              |  4 ++++
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fbda5a9..dde32f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -892,6 +892,8 @@ struct kvm_arch {
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
+	struct delayed_work kvm_mmu_zap_collapsible_sptes_work;
+	bool zap_in_progress;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
@@ -1247,6 +1249,7 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
+void zap_collapsible_sptes_fn(struct work_struct *work);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
 bool pdptrs_changed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7c03c0f..fe87dd3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5679,14 +5679,41 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+void zap_collapsible_sptes_fn(struct work_struct *work)
+{
+	struct kvm_memory_slot *memslot;
+	struct kvm_memslots *slots;
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
+					   kvm_mmu_zap_collapsible_sptes_work);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
+	int i;
+
+	mutex_lock(&kvm->slots_lock);
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		spin_lock(&kvm->mmu_lock);
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
+				kvm_mmu_zap_collapsible_spte, true);
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+		spin_unlock(&kvm->mmu_lock);
+	}
+	kvm->arch.zap_in_progress = false;
+	mutex_unlock(&kvm->slots_lock);
+}
+
+#define KVM_MMU_ZAP_DELAYED (60 * HZ)
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
-	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
-	spin_lock(&kvm->mmu_lock);
-	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
-			 kvm_mmu_zap_collapsible_spte, true);
-	spin_unlock(&kvm->mmu_lock);
+	if (!kvm->arch.zap_in_progress) {
+		kvm->arch.zap_in_progress = true;
+		schedule_delayed_work(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
+			KVM_MMU_ZAP_DELAYED);
+	}
 }
 
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d029377..c2af289 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9019,6 +9019,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
+	INIT_DELAYED_WORK(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
+			zap_collapsible_sptes_fn);
+	kvm->arch.zap_in_progress = false;
 
 	kvm_hv_init_vm(kvm);
 	kvm_page_track_init(kvm);
@@ -9064,6 +9067,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 {
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
+	cancel_delayed_work_sync(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work);
 	kvm_free_pit(kvm);
 }