From patchwork Tue Feb 18 20:26:01 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Fernand Sieber <sieberf@amazon.com>
X-Patchwork-Id: 13980828
Received: from smtp-fw-52005.amazon.com (smtp-fw-52005.amazon.com
 [52.119.213.156])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4ABB72116F4;
	Tue, 18 Feb 2025 20:27:02 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=52.119.213.156
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1739910424; cv=none;
 b=W3msEgYBiosMU5i358JPdy/9wxyQa00U3CAInWTOKGW7We0keGM7Uxb+guxDB0K5SHgkXuWCHggW2VrZUmHDVCZoGGsZwaRxxz39bH53aRpEjFZl1Qar4L3ScNMbefqMVftASoIFFvDxg07nNNBhHG8RACJaumAbzmjXV+9zr5M=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1739910424; c=relaxed/simple;
	bh=1z+s/5oLUgkpVX+TZ6VZExymfp2lWHBOshwhP3rfD88=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=bmc8ktXUKy3P7wnLG6pez3Mk5OEzA8YxrkL6ANP10uBtL/zVVwYeWmnF/G11pGHTJkvL1/e/+gfZDbqJeImMZH+Ykpn6p1Wk02/4ESm9EFtaK36rFkUEz+0UJGKcbe+MaQCBnDvBbztBQfteoAU+95D+2P9PXrXBQggJF2OhBjk=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com;
 spf=pass smtp.mailfrom=amazon.com;
 dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b=qMo8T24K; arc=none smtp.client-ip=52.119.213.156
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b="qMo8T24K"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
  t=1739910423; x=1771446423;
  h=from:to:subject:date:message-id:in-reply-to:references:
   mime-version:content-transfer-encoding;
  bh=VRi00QC9itLgrUbe0p8KBclI5A1SKRWOXY7c+YjuKak=;
  b=qMo8T24K60PtM1d/PXXQskCAj5qQ3TeJItBi/pbVi+z7RjH1123+LbYH
   Qh5osn5vdhMCM/nmZiYDULSLn0D+nHYjernLHSxcOIDRBow7XneSC63x5
   VS5z8Sip4umxqP092cJWjkkxHD0EiAIYyRrYXRy2TOer+9Newl/IklreP
   A=;
X-IronPort-AV: E=Sophos;i="6.13,296,1732579200";
   d="scan'208";a="719883548"
Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO
 smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.43.8.6])
  by smtp-border-fw-52005.iad7.amazon.com with
 ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2025 20:27:02 +0000
Received: from EX19MTAEUB001.ant.amazon.com [10.0.17.79:16632]
 by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.0.236:2525]
 with esmtp (Farcaster)
 id 6628eb86-21a6-4254-8fbe-883f5b39d6fa;
 Tue, 18 Feb 2025 20:27:00 +0000 (UTC)
X-Farcaster-Flow-ID: 6628eb86-21a6-4254-8fbe-883f5b39d6fa
Received: from EX19D003EUB001.ant.amazon.com (10.252.51.97) by
 EX19MTAEUB001.ant.amazon.com (10.252.51.26) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.39;
 Tue, 18 Feb 2025 20:27:00 +0000
Received: from u5934974a1cdd59.ant.amazon.com (10.146.13.227) by
 EX19D003EUB001.ant.amazon.com (10.252.51.97) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1544.14;
 Tue, 18 Feb 2025 20:26:55 +0000
From: Fernand Sieber <sieberf@amazon.com>
To: <sieberf@amazon.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra
	<peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, "Paolo
 Bonzini" <pbonzini@redhat.com>, <x86@kernel.org>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <nh-open-source@amazon.com>
Subject: [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat
Date: Tue, 18 Feb 2025 22:26:01 +0200
Message-ID: <20250218202618.567363-2-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
Precedence: bulk
X-Mailing-List: kvm@vger.kernel.org
List-Id: <kvm.vger.kernel.org>
List-Subscribe: <mailto:kvm+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:kvm+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-ClientProxiedBy: EX19D031UWA002.ant.amazon.com (10.13.139.96) To
 EX19D003EUB001.ant.amazon.com (10.252.51.97)

The hypervisor may need visibility into guest CPU activity for various
purposes, such as reporting it to monitoring systems that track the amount
of work done on behalf of a guest.

With guest hlt, pause and mwait passthrough, gtime is not useful since the
guest never tells the hypervisor that it has halted execution. So the reported
guest time is always 100% even when the guest is completely halted.

Add a new concept of guest halted time that allows the hypervisor to keep
track of the number of halted cycles a CPU spends in guest mode.

The value is reported in /proc/<pid>/stat and defaults to zero for
architectures that do not support it.
---
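Illustration only, not part of the patch: a minimal userspace sketch of how
a monitoring agent might consume the new value, assuming gtime_halted is
emitted as the last field of /proc/<pid>/stat as done in this patch. The
helper names stat_last_field() and guest_active_ticks() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse the last space-separated field of a
 * /proc/<pid>/stat line. Assumes gtime_halted is the final field, as in
 * this patch. Returns 0 on success, -1 if the line cannot be parsed.
 */
static int stat_last_field(const char *line, unsigned long long *out)
{
	/* comm may contain spaces and ')', so only look past the last ')' */
	const char *p = strrchr(line, ')');
	const char *last;
	char *end;

	if (!p)
		return -1;
	last = strrchr(p, ' ');
	if (!last)
		return -1;
	*out = strtoull(last + 1, &end, 10);
	if (end == last + 1)
		return -1;
	return 0;
}

/* Guest time spent actually executing, in the same clock ticks as gtime. */
static unsigned long long guest_active_ticks(unsigned long long gtime,
					     unsigned long long gtime_halted)
{
	/* Clamp: the two values are sampled independently. */
	return gtime > gtime_halted ? gtime - gtime_halted : 0;
}
```

Subtracting gtime_halted from gtime gives an estimate of the time the guest
was really executing, which is the quantity gtime alone cannot provide when
hlt/mwait/pause are passed through.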
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/array.c                    | 7 ++++++-
 include/linux/sched.h              | 1 +
 include/linux/sched/signal.h       | 1 +
 kernel/exit.c                      | 1 +
 kernel/fork.c                      | 2 +-
 6 files changed, 11 insertions(+), 2 deletions(-)

--
2.43.0




Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 09f0aed5a08b..bbb230420fa4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -386,6 +386,7 @@ It's slow but very precise.
   env_end       address below which program environment is placed
   exit_code     the thread's exit_code in the form reported by the waitpid
 		system call
+  gtime_halted  guest time of the task spent halted, in jiffies
   ============= ===============================================================

 The /proc/PID/maps file contains the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..0788ef0fa710 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	struct mm_struct *mm;
 	unsigned long long start_time;
 	unsigned long cmin_flt, cmaj_flt, min_flt, maj_flt;
-	u64 cutime, cstime, cgtime, utime, stime, gtime;
+	u64 cutime, cstime, cgtime, utime, stime, gtime, gtime_halted;
 	unsigned long rsslim = 0;
 	unsigned long flags;
 	int exit_code = task->exit_code;
@@ -556,12 +556,14 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			min_flt = sig->min_flt;
 			maj_flt = sig->maj_flt;
 			gtime = sig->gtime;
+			gtime_halted = sig->gtime_halted;

 			rcu_read_lock();
 			__for_each_thread(sig, t) {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
 				gtime += task_gtime(t);
+				gtime_halted += t->gtime_halted;
 			}
 			rcu_read_unlock();
 		}
@@ -575,6 +577,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
 		gtime = task_gtime(task);
+		gtime_halted = task->gtime_halted;
 	}

 	/* scale priority and nice values from timeslices to -20..20 */
@@ -662,6 +665,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_puts(m, " 0");

+	seq_put_decimal_ull(m, " ", nsec_to_clock_t(gtime_halted));
+
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..5f6745357e20 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
 	u64				stimescaled;
 #endif
 	u64				gtime;
+	u64				gtime_halted;
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	struct vtime			vtime;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d5d03d919df8..633082f7c7b8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -187,6 +187,7 @@ struct signal_struct {
 	seqlock_t stats_lock;
 	u64 utime, stime, cutime, cstime;
 	u64 gtime;
+	u64 gtime_halted;
 	u64 cgtime;
 	struct prev_cputime prev_cputime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..ba6efc6900d0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -188,6 +188,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
+	sig->gtime_halted += tsk->gtime_halted;
 	sig->min_flt += tsk->min_flt;
 	sig->maj_flt += tsk->maj_flt;
 	sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..e3453084bb5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2296,7 +2296,7 @@ __latent_entropy struct task_struct *copy_process(

 	init_sigpending(&p->pending);

-	p->utime = p->stime = p->gtime = 0;
+	p->utime = p->stime = p->gtime = p->gtime_halted = 0;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
 	p->utimescaled = p->stimescaled = 0;
 #endif

From patchwork Tue Feb 18 20:26:02 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Fernand Sieber <sieberf@amazon.com>
X-Patchwork-Id: 13980829
Received: from smtp-fw-6001.amazon.com (smtp-fw-6001.amazon.com
 [52.95.48.154])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 262091DE2B9;
	Tue, 18 Feb 2025 20:27:09 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=52.95.48.154
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1739910431; cv=none;
 b=AUqIIcuBJnOwDQjU6zgsufLp188jaIDc3qanC8+jb+TfGJ738Ujqnp0B30AMqzO9jUUa65OIBGPvMudd250lo7M0xRZTU5LsFMITyKeDBk/m/9e9XsG4IgwQYHvCugU1ItiEYLoGXyYRaMWQ+qfNjqyR+qIfn4bukDwTJlEGvZc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1739910431; c=relaxed/simple;
	bh=3de8aP+eKUaiXgy4IvoYG4PKo51UnJ/CXdIDrpzv854=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=ke9zcEsta5xL6yF6VuDW5aAYiW5CW/tWy7OCDCxf54uExz7/sk4TRXvqeguClym9+QxhfYterFgqOs/eBQsKcrgD+Oea+XeEQAjPYjj8SrT0WR5Bx/Mk/hdEPGiyNu1bY75N//KHkayl9ga97xiGUk1tAiL7E5HsZfgRmOP6QaY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com;
 spf=pass smtp.mailfrom=amazon.com;
 dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b=nqxbP0nA; arc=none smtp.client-ip=52.95.48.154
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b="nqxbP0nA"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
  t=1739910430; x=1771446430;
  h=from:to:subject:date:message-id:in-reply-to:references:
   mime-version:content-transfer-encoding;
  bh=UA7rd2+7E7E1EZQdxivdBtDuy5hSlsFld3c3zvaH5Do=;
  b=nqxbP0nA7FtVbrtesoyI42pvcZGFu4xZV5g66t54PlHrWz9D0t4pE0LA
   7KyiPlPN+HcnpXqSy1JxEANWz+Lclxag27vwWp6rVDjSmH021Lz3QBUwu
   PtCD5BxzKqQBeXh7qZ4vGaNGp9qmOO9LBgDr6SJu3fCorx/EfkC47sYu/
   I=;
X-IronPort-AV: E=Sophos;i="6.13,296,1732579200";
   d="scan'208";a="463706639"
Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO
 smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.2])
  by smtp-border-fw-6001.iad6.amazon.com with
 ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2025 20:27:07 +0000
Received: from EX19MTAEUC002.ant.amazon.com [10.0.10.100:15387]
 by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.2.102:2525]
 with esmtp (Farcaster)
 id 0fd69873-b657-487e-899c-6f013894b1e5;
 Tue, 18 Feb 2025 20:27:06 +0000 (UTC)
X-Farcaster-Flow-ID: 0fd69873-b657-487e-899c-6f013894b1e5
Received: from EX19D003EUB001.ant.amazon.com (10.252.51.97) by
 EX19MTAEUC002.ant.amazon.com (10.252.51.245) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.39;
 Tue, 18 Feb 2025 20:27:05 +0000
Received: from u5934974a1cdd59.ant.amazon.com (10.146.13.227) by
 EX19D003EUB001.ant.amazon.com (10.252.51.97) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1544.14;
 Tue, 18 Feb 2025 20:27:00 +0000
From: Fernand Sieber <sieberf@amazon.com>
To: <sieberf@amazon.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra
	<peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, "Paolo
 Bonzini" <pbonzini@redhat.com>, <x86@kernel.org>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <nh-open-source@amazon.com>
Subject: [RFC PATCH 2/3] kvm/x86: Add support for gtime halted
Date: Tue, 18 Feb 2025 22:26:02 +0200
Message-ID: <20250218202618.567363-3-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
Precedence: bulk
X-Mailing-List: kvm@vger.kernel.org
List-Id: <kvm.vger.kernel.org>
List-Subscribe: <mailto:kvm+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:kvm+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-ClientProxiedBy: EX19D031UWA002.ant.amazon.com (10.13.139.96) To
 EX19D003EUB001.ant.amazon.com (10.252.51.97)

The previous commit introduced the concept of guest halted time to allow
the hypervisor to track real guest CPU activity (halted cycles) with
mwait/hlt/pause passthrough enabled.

This commit implements it for the x86 architecture. We track the number of
cycles actually executed by the guest by reading MSR_IA32_MPERF twice,
once before vcpu enter and once after vcpu exit. The two reads happen
immediately before and after guest_timing_enter/exit_irqoff, which are the
architecture-independent points for gtime accounting. The difference between
the reads corresponds to the number of unhalted cycles. We get the number
of halted cycles by subtracting the unhalted cycles from the TSC delta over
the same interval, and tolerate slight approximations.
---
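Illustration only, not part of the patch: a standalone sketch of the
arithmetic described above. It assumes, as the patch does, that MPERF
advances at the TSC rate while the CPU is not halted. The names
halted_cycles() and cycles_to_ns() and the tsc_khz parameter are
hypothetical; the kernel converts with a precomputed multiplier and shift
rather than the division used here.

```c
#include <stdint.h>

/*
 * Halted cycles over one guest run, from TSC and MPERF samples taken
 * before entry and after exit.
 */
static uint64_t halted_cycles(uint64_t tsc_start, uint64_t tsc_end,
			      uint64_t mperf_start, uint64_t mperf_end)
{
	uint64_t total = tsc_end - tsc_start;
	uint64_t unhalted = mperf_end - mperf_start;

	/* The counters are not sampled atomically; clamp, do not underflow. */
	return total > unhalted ? total - unhalted : 0;
}

/* Convert cycles to nanoseconds for a counter running at tsc_khz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
	/* ns = cycles * 10^6 / kHz, split so the product fits in 64 bits */
	return (cycles / tsc_khz) * 1000000ULL +
	       (cycles % tsc_khz) * 1000000ULL / tsc_khz;
}
```

The clamp mirrors the likely(cycles > unhalted_cycles) check in
vcpu_enter_guest(): when sampling skew makes the unhalted delta exceed the
TSC delta, nothing is accounted.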
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/tsc.c      | 13 +++++++++++++
 arch/x86/kvm/x86.c         | 26 ++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

--
2.43.0





diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..00ad09e7268e 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -37,6 +37,7 @@ extern void mark_tsc_async_resets(char *reason);
 extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
+extern unsigned long long cycles2ns(unsigned long long cycles);

 extern int tsc_clocksource_reliable;
 #ifdef CONFIG_X86_TSC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 34dec0b72ea8..80bb12357148 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -144,6 +144,19 @@ static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }

+unsigned long long cycles2ns(unsigned long long cyc)
+{
+	struct cyc2ns_data data;
+	unsigned long long ns;
+
+	cyc2ns_read_begin(&data);
+	ns = mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
+	cyc2ns_read_end();
+
+	return ns;
+}
+EXPORT_SYMBOL(cycles2ns);
+
 static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
 {
 	unsigned long long ns_now;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02159c967d29..46975b0a63a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10688,6 +10688,19 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 	kvm_x86_call(set_apic_access_page_addr)(vcpu);
 }

+static bool needs_halted_accounting(struct kvm_vcpu *vcpu)
+{
+	return (vcpu->kvm->arch.mwait_in_guest ||
+			vcpu->kvm->arch.hlt_in_guest ||
+			vcpu->kvm->arch.pause_in_guest) &&
+		boot_cpu_has(X86_FEATURE_APERFMPERF);
+}
+
+static unsigned long long get_unhalted_cycles(void)
+{
+	return __rdmsr(MSR_IA32_MPERF);
+}
+
 /*
  * Called within kvm->srcu read side.
  * Returns 1 to let vcpu_run() continue the guest execution loop without
@@ -10697,6 +10710,8 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
+	unsigned long long cycles, cycles_start = 0;
+	unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
 	bool req_int_win =
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
@@ -10968,6 +10983,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}

+	if (needs_halted_accounting(vcpu)) {
+		cycles_start = get_cycles();
+		unhalted_cycles_start = get_unhalted_cycles();
+	}
 	guest_timing_enter_irqoff();

 	for (;;) {
@@ -11060,6 +11079,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * acceptable for all known use cases.
 	 */
 	guest_timing_exit_irqoff();
+	if (needs_halted_accounting(vcpu)) {
+		cycles = get_cycles() - cycles_start;
+		unhalted_cycles = get_unhalted_cycles() -
+			unhalted_cycles_start;
+		if (likely(cycles > unhalted_cycles))
+			current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+	}

 	local_irq_enable();
 	preempt_enable();

From patchwork Tue Feb 18 20:26:03 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Fernand Sieber <sieberf@amazon.com>
X-Patchwork-Id: 13980830
Received: from smtp-fw-9106.amazon.com (smtp-fw-9106.amazon.com
 [207.171.188.206])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 10F42215169;
	Tue, 18 Feb 2025 20:27:54 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=207.171.188.206
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1739910479; cv=none;
 b=oWBJMoxYrx5hkt3HpXl8Unw9vnZIrKfHj6lgoXCCdMyc7YNC6e9WM/kIwI8Bq3kxrgh+W+CqQMwGKhNg5mPfnJFo2K405VvwkVp6y+1VgMtKM1qaS171El4aTi6HO3mNLU0IN8ADInuOvvp+G/VA7b2Bn+2hs4wRn+T9Lf+r4qM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1739910479; c=relaxed/simple;
	bh=JTSEYbYS4Z/Ik62jE8h8jo3yLesydkKBCqS7gHREHyw=;
	h=From:To:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=s0pbn6yV/l4j7SDy4vCOtm/ZkQkRrGldy21/HlZPbyUOiFfsos6VuHG4CZjWbtBbBvlV3j/6xVVVjUmV3/0JFq6Uu4oLgI3zcOyEv0wez1TiL6eWEDExvV/GvU3JYIpU41yd8GytfqBkaSYwaz80K0E+rW3Nc0cD3MWNbC5pt/Y=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com;
 spf=pass smtp.mailfrom=amazon.com;
 dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b=f/bu+yi7; arc=none smtp.client-ip=207.171.188.206
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=amazon.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com
 header.b="f/bu+yi7"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
  t=1739910476; x=1771446476;
  h=from:to:subject:date:message-id:in-reply-to:references:
   mime-version:content-transfer-encoding;
  bh=R71KixzSQhV/O62VTjLAK1KH6JMWl+uEPn9+qpQPkQU=;
  b=f/bu+yi7C+DU/XjFmM9TX7x1Ssaglr5HzXVrqsWLIW1qnKF/6TWkUrC1
   yiiYG3rBPH7QrxvPHvKWx8J1yHJ/gccWRAJ/Sa7GzErBtaRjOvWLeAiq5
   yok120g9wT3tuGIzEWSeo1iPk1fh3mrYHbgPi6qte8yhgJR2Wxvz+t11U
   0=;
X-IronPort-AV: E=Sophos;i="6.13,296,1732579200";
   d="scan'208";a="799874462"
Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO
 smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.210])
  by smtp-border-fw-9106.sea19.amazon.com with
 ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2025 20:27:50 +0000
Received: from EX19MTAEUC001.ant.amazon.com [10.0.43.254:62137]
 by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.11.108:2525]
 with esmtp (Farcaster)
 id 52050e94-2ff3-4b18-b080-d3b5f5682f76;
 Tue, 18 Feb 2025 20:27:48 +0000 (UTC)
X-Farcaster-Flow-ID: 52050e94-2ff3-4b18-b080-d3b5f5682f76
Received: from EX19D003EUB001.ant.amazon.com (10.252.51.97) by
 EX19MTAEUC001.ant.amazon.com (10.252.51.155) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.39;
 Tue, 18 Feb 2025 20:27:46 +0000
Received: from u5934974a1cdd59.ant.amazon.com (10.146.13.227) by
 EX19D003EUB001.ant.amazon.com (10.252.51.97) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1544.14;
 Tue, 18 Feb 2025 20:27:41 +0000
From: Fernand Sieber <sieberf@amazon.com>
To: <sieberf@amazon.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra
	<peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, "Paolo
 Bonzini" <pbonzini@redhat.com>, <x86@kernel.org>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <nh-open-source@amazon.com>
Subject: [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware
Date: Tue, 18 Feb 2025 22:26:03 +0200
Message-ID: <20250218202618.567363-4-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
Precedence: bulk
X-Mailing-List: kvm@vger.kernel.org
List-Id: <kvm.vger.kernel.org>
List-Subscribe: <mailto:kvm+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:kvm+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-ClientProxiedBy: EX19D035UWB001.ant.amazon.com (10.13.138.33) To
 EX19D003EUB001.ant.amazon.com (10.252.51.97)

With guest hlt/mwait/pause passthrough, the scheduler has no visibility into
real vCPU activity, as it sees all vCPUs as 100% active. As such, load
balancing cannot make informed decisions on where it is preferable to
colocate tasks when necessary. I.e., as far as the load balancer is
concerned, a halted vCPU and an idle polling vCPU look exactly the same, so
it may decide that either should be preempted when in reality it would be
preferable to preempt the idle one.

This commit enlightens the scheduler to real guest activity in this
situation. Leveraging gtime halted, it adds a hook for KVM to communicate
to the scheduler the duration that a vCPU spends halted. This is then used in
PELT accounting to discount it from real activity. This results in better
placement and an overall reduction in steal time.

This initial implementation assumes that non-idle CPUs are ticking, as it
hooks the halted time into the PELT decaying load accounting. As such, it
doesn't work well if PELT is updated infrequently with large chunks of
halted time. This is not a fundamental limitation, but more complex
accounting is needed to generalize the use case to nohz full.
---
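Illustration only, not part of the patch: a simplified model of how each
PELT update window is split into an active part and a halted part. The
struct and function names are hypothetical stand-ins for the per-entity
state. The sketch omits PELT's geometric decay and the decay of residual
halted time that the patch applies.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-entity state touched by the patch. */
struct halt_state {
	uint64_t last_update;		/* time of the previous update, ns */
	uint64_t pending_halted;	/* halted time not yet applied, ns */
};

struct delta_split {
	uint64_t active;	/* accounted with the entity's real run state */
	uint64_t halted;	/* accounted as if the entity was not running */
};

static struct delta_split split_delta(struct halt_state *st, uint64_t now)
{
	struct delta_split s;
	uint64_t delta = now - st->last_update;

	/* Never discount more than the window being accounted. */
	s.halted = st->pending_halted < delta ? st->pending_halted : delta;
	s.active = delta - s.halted;

	/* Whatever did not fit stays pending for the next update. */
	st->pending_halted -= s.halted;
	st->last_update = now;
	return s;
}
```

With frequent updates (ticking CPUs) the pending halted time is normally
smaller than the window and is consumed in full. With infrequent updates it
can exceed the window, which is the case the commit message flags as not
handled well.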
 arch/x86/kvm/x86.c    |  8 ++++++--
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 25 +++++++++++++++++++++++++
 kernel/sched/pelt.c   | 42 +++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h  |  2 ++
 6 files changed, 73 insertions(+), 9 deletions(-)

--
2.43.0





diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 46975b0a63a5..156cf05b325f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10712,6 +10712,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	int r;
 	unsigned long long cycles, cycles_start = 0;
 	unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
+	unsigned long long halted_cycles_ns = 0;
 	bool req_int_win =
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
@@ -11083,8 +11084,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		cycles = get_cycles() - cycles_start;
 		unhalted_cycles = get_unhalted_cycles() -
 			unhalted_cycles_start;
-		if (likely(cycles > unhalted_cycles))
-			current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+		if (likely(cycles > unhalted_cycles)) {
+			halted_cycles_ns = cycles2ns(cycles - unhalted_cycles);
+			current->gtime_halted += halted_cycles_ns;
+			sched_account_gtime_halted(current, halted_cycles_ns);
+		}
 	}

 	local_irq_enable();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f6745357e20..5409fac152c9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -367,6 +367,8 @@ struct vtime {
 	u64			gtime;
 };

+extern void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted);
+
 /*
  * Utilization clamp constraints.
  * @UCLAMP_MIN:	Minimum utilization
@@ -588,6 +590,8 @@ struct sched_entity {
 	 */
 	struct sched_avg		avg;
 #endif
+
+	u64				gtime_halted;
 };

 struct sched_rt_entity {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac69..1f3ced2b2636 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4487,6 +4487,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.gtime_halted		= 0;
 	INIT_LIST_HEAD(&p->se.group_node);

 	/* A delayed task cannot be in clone(). */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5ff52711d459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13705,4 +13705,29 @@ __init void init_sched_fair_class(void)
 #endif
 #endif /* SMP */

+
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
 }
+#else
+/*
+ * The implementation hooking into PELT requires regular updates of
+ * gtime_halted. This is guaranteed unless we run on CONFIG_NO_HZ_FULL.
+ */
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
+	struct sched_entity *se = &p->se;
+
+	if (unlikely(!gtime_halted))
+		return;
+
+	for_each_sched_entity(se) {
+		se->gtime_halted += gtime_halted;
+		se->cfs_rq->gtime_halted += gtime_halted;
+	}
+}
+#endif
+EXPORT_SYMBOL(sched_account_gtime_halted);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 7a8534a2deff..9f96b7c46c00 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -305,10 +305,23 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)

 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
-				cfs_rq->curr == se)) {
+	int ret = 0;
+	u64 delta = now - se->avg.last_update_time;
+	u64 gtime_halted = min(delta, se->gtime_halted);

-		___update_load_avg(&se->avg, se_weight(se));
+	ret = ___update_load_sum(now - gtime_halted, &se->avg, !!se->on_rq, se_runnable(se),
+			cfs_rq->curr == se);
+
+	if (gtime_halted) {
+		ret |= ___update_load_sum(now, &se->avg, 0, 0, 0);
+		se->gtime_halted -= gtime_halted;
+
+		/* decay residual halted time */
+		if (ret && se->gtime_halted)
+			se->gtime_halted = decay_load(se->gtime_halted, delta / 1024);
+	}
+
+	if (ret) {
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -319,10 +332,25 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se

 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
-	if (___update_load_sum(now, &cfs_rq->avg,
-				scale_load_down(cfs_rq->load.weight),
-				cfs_rq->h_nr_runnable,
-				cfs_rq->curr != NULL)) {
+	int ret = 0;
+	u64 delta = now - cfs_rq->avg.last_update_time;
+	u64 gtime_halted = min(delta, cfs_rq->gtime_halted);
+
+	ret = ___update_load_sum(now - gtime_halted, &cfs_rq->avg,
+			scale_load_down(cfs_rq->load.weight),
+			cfs_rq->h_nr_runnable,
+			cfs_rq->curr != NULL);
+
+	if (gtime_halted) {
+		ret |= ___update_load_sum(now, &cfs_rq->avg, 0, 0, 0);
+		cfs_rq->gtime_halted -= gtime_halted;
+
+		/* decay any residual halted time */
+		if (ret && cfs_rq->gtime_halted)
+			cfs_rq->gtime_halted = decay_load(cfs_rq->gtime_halted, delta / 1024);
+	}
+
+	if (ret) {

 		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b93c8c3dc05a..79b1166265bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -744,6 +744,8 @@ struct cfs_rq {
 	struct list_head	throttled_csd_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+
+	u64			gtime_halted;
 };

 #ifdef CONFIG_SCHED_CLASS_EXT