From patchwork Tue Feb 18 20:26:01 2025
X-Patchwork-Submitter: Fernand Sieber
X-Patchwork-Id: 13980828
From: Fernand Sieber
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Paolo Bonzini
Subject: [RFC PATCH 1/3] fs/proc: Add gtime halted to proc//stat
Date: Tue, 18 Feb 2025 22:26:01 +0200
Message-ID: <20250218202618.567363-2-sieberf@amazon.com>
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: kvm@vger.kernel.org

The hypervisor may need visibility into guest CPU activity for various
purposes, such as reporting it to monitoring systems that track the
amount of work done on behalf of a guest.

With guest hlt, pause and mwait passthrough, gtime is not useful because
the guest never tells the hypervisor that it has halted execution. The
reported guest time is therefore always 100%, even when the guest is
completely halted.

Add a new concept of guest halted time that allows the hypervisor to keep
track of the number of halted cycles a CPU spends in guest mode. The
value is reported in proc//stat and defaults to zero for architectures
that do not support it.
---
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/array.c                    | 7 ++++++-
 include/linux/sched.h              | 1 +
 include/linux/sched/signal.h       | 1 +
 kernel/exit.c                      | 1 +
 kernel/fork.c                      | 2 +-
 6 files changed, 11 insertions(+), 2 deletions(-)

-- 
2.43.0

Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 09f0aed5a08b..bbb230420fa4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -386,6 +386,7 @@ It's slow but very precise.
   env_end       address below which program environment is placed
   exit_code     the thread's exit_code in the form reported by the waitpid
                 system call
+  gtime_halted  guest time when the cpu is halted of the task in jiffies
   ============= ===============================================================
 
 The /proc/PID/maps file contains the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..0788ef0fa710 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	struct mm_struct *mm;
 	unsigned long long start_time;
 	unsigned long cmin_flt, cmaj_flt, min_flt, maj_flt;
-	u64 cutime, cstime, cgtime, utime, stime, gtime;
+	u64 cutime, cstime, cgtime, utime, stime, gtime, gtime_halted;
 	unsigned long rsslim = 0;
 	unsigned long flags;
 	int exit_code = task->exit_code;
@@ -556,12 +556,14 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			min_flt = sig->min_flt;
 			maj_flt = sig->maj_flt;
 			gtime = sig->gtime;
+			gtime_halted = sig->gtime_halted;
 
 			rcu_read_lock();
 			__for_each_thread(sig, t) {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
 				gtime += task_gtime(t);
+				gtime_halted += t->gtime_halted;
 			}
 			rcu_read_unlock();
 		}
@@ -575,6 +577,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
 		gtime = task_gtime(task);
+		gtime_halted = task->gtime_halted;
 	}
 
 	/* scale priority and nice values from timeslices to -20..20 */
@@ -662,6 +665,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_puts(m, " 0");
 
+	seq_put_decimal_ull(m, " ", nsec_to_clock_t(gtime_halted));
+
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..5f6745357e20 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
 	u64				stimescaled;
 #endif
 	u64				gtime;
+	u64				gtime_halted;
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	struct vtime			vtime;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d5d03d919df8..633082f7c7b8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -187,6 +187,7 @@ struct signal_struct {
 	seqlock_t stats_lock;
 	u64 utime, stime, cutime, cstime;
 	u64 gtime;
+	u64 gtime_halted;
 	u64 cgtime;
 	struct prev_cputime prev_cputime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..ba6efc6900d0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -188,6 +188,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
+	sig->gtime_halted += tsk->gtime_halted;
 	sig->min_flt += tsk->min_flt;
 	sig->maj_flt += tsk->maj_flt;
 	sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..e3453084bb5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2296,7 +2296,7 @@ __latent_entropy struct task_struct *copy_process(
 
 	init_sigpending(&p->pending);
 
-	p->utime = p->stime = p->gtime = 0;
+	p->utime = p->stime = p->gtime = p->gtime_halted = 0;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
 	p->utimescaled = p->stimescaled = 0;
 #endif

From patchwork Tue Feb 18 20:26:02 2025
X-Patchwork-Submitter: Fernand Sieber
X-Patchwork-Id: 13980829
From: Fernand Sieber
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Paolo Bonzini
Subject: [RFC PATCH 2/3] kvm/x86: Add support for gtime halted
Date: Tue, 18 Feb 2025 22:26:02 +0200
Message-ID: <20250218202618.567363-3-sieberf@amazon.com>
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: kvm@vger.kernel.org

The previous commit introduced the concept of guest time halted to allow
the hypervisor to track real guest CPU activity (halted cycles) with
mwait/hlt/pause passthrough enabled.

This commit implements it for the x86 architecture. We track the number
of cycles actually executed by the guest by taking two reads of
MSR_IA32_MPERF, one before vCPU entry and the other after vCPU exit.
These two reads happen immediately before and after
guest_timing_enter/exit_irqoff, which are the architecture-independent
points for gtime accounting.
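
As a rough user-space illustration of this measurement (a sketch, not the
kernel code): the halted portion of a guest run is the TSC delta minus the
MPERF delta, converted to nanoseconds with a multiply-and-shift in the
style of mul_u64_u32_shr(). The names struct cycle_sample, cycles_to_ns()
and halted_ns() are hypothetical, the mul/shift pair is assumed to be
supplied by the caller, and unsigned __int128 is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of both counters, taken around one guest run. */
struct cycle_sample {
	uint64_t tsc;   /* all elapsed cycles (TSC) */
	uint64_t mperf; /* unhalted cycles only (MPERF) */
};

/* Fixed-point cycles -> ns conversion: (cyc * mul) >> shift. */
static uint64_t cycles_to_ns(uint64_t cyc, uint32_t mul, unsigned int shift)
{
	assert(shift < 64);
	return (uint64_t)(((unsigned __int128)cyc * mul) >> shift);
}

/* Halted time in ns for the window [start, end]; 0 if nothing was halted. */
static uint64_t halted_ns(struct cycle_sample start, struct cycle_sample end,
			  uint32_t mul, unsigned int shift)
{
	uint64_t cycles = end.tsc - start.tsc;
	uint64_t unhalted = end.mperf - start.mperf;

	/* Tolerate small skew between the two counter reads. */
	if (cycles <= unhalted)
		return 0;

	return cycles_to_ns(cycles - unhalted, mul, shift);
}
```

For example, with 0.5 ns per cycle (mul = 512, shift = 10), a window of
4000 TSC cycles of which 1000 were unhalted gives 3000 halted cycles,
i.e. 1500 ns.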
The difference between the reads corresponds to the number of unhalted
cycles. The number of halted cycles is the TSC delta over the same
interval minus the unhalted cycles; slight approximations are tolerated.
---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/tsc.c      | 13 +++++++++++++
 arch/x86/kvm/x86.c         | 26 ++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

-- 
2.43.0

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..00ad09e7268e 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -37,6 +37,7 @@ extern void mark_tsc_async_resets(char *reason);
 extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
+extern unsigned long long cycles2ns(unsigned long long cycles);
 
 extern int tsc_clocksource_reliable;
 #ifdef CONFIG_X86_TSC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 34dec0b72ea8..80bb12357148 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -144,6 +144,19 @@ static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }
 
+unsigned long long cycles2ns(unsigned long long cyc)
+{
+	struct cyc2ns_data data;
+	unsigned long long ns;
+
+	cyc2ns_read_begin(&data);
+	ns = mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
+	cyc2ns_read_end();
+
+	return ns;
+}
+EXPORT_SYMBOL(cycles2ns);
+
 static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
 {
 	unsigned long long ns_now;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02159c967d29..46975b0a63a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10688,6 +10688,19 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 	kvm_x86_call(set_apic_access_page_addr)(vcpu);
 }
 
+static bool needs_halted_accounting(struct kvm_vcpu *vcpu)
+{
+	return (vcpu->kvm->arch.mwait_in_guest ||
+		vcpu->kvm->arch.hlt_in_guest ||
+		vcpu->kvm->arch.pause_in_guest) &&
+		boot_cpu_has(X86_FEATURE_APERFMPERF);
+}
+
+static long long get_unhalted_cycles(void)
+{
+	return __rdmsr(MSR_IA32_MPERF);
+}
+
 /*
  * Called within kvm->srcu read side.
  * Returns 1 to let vcpu_run() continue the guest execution loop without
@@ -10697,6 +10710,8 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
 	int r;
+	unsigned long long cycles, cycles_start = 0;
+	unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
 	bool req_int_win =
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
@@ -10968,6 +10983,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	if (needs_halted_accounting(vcpu)) {
+		cycles_start = get_cycles();
+		unhalted_cycles_start = get_unhalted_cycles();
+	}
 	guest_timing_enter_irqoff();
 
 	for (;;) {
@@ -11060,6 +11079,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * acceptable for all known use cases.
 	 */
 	guest_timing_exit_irqoff();
+	if (needs_halted_accounting(vcpu)) {
+		cycles = get_cycles() - cycles_start;
+		unhalted_cycles = get_unhalted_cycles() -
+			unhalted_cycles_start;
+		if (likely(cycles > unhalted_cycles))
+			current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+	}
 
 	local_irq_enable();
 	preempt_enable();

From patchwork Tue Feb 18 20:26:03 2025
X-Patchwork-Submitter: Fernand Sieber
X-Patchwork-Id: 13980830
From: Fernand Sieber
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Paolo Bonzini
Subject: [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware
Date: Tue, 18 Feb 2025 22:26:03 +0200
Message-ID: <20250218202618.567363-4-sieberf@amazon.com>
In-Reply-To: <20250218202618.567363-1-sieberf@amazon.com>
References: <20250218202618.567363-1-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: kvm@vger.kernel.org

With guest hlt/mwait/pause passthrough, the scheduler has no visibility
into real vCPU activity, as it sees all vCPUs as 100% active. As such,
load balancing cannot make informed decisions about where it is
preferable to collocate tasks when necessary. That is, as far as the load
balancer is concerned, a halted vCPU and an idle polling vCPU look
exactly the same, so it may decide that either should be preempted when
in reality it would be preferable to preempt the idle one.

This commit enlightens the scheduler to real guest activity in this
situation. Leveraging gtime halted, it adds a hook for KVM to communicate
to the scheduler the duration that a vCPU spends halted. This is then
used in PELT accounting to discount the halted time from real activity.
This results in better placement and an overall reduction in steal time.

This initial implementation assumes that non-idle CPUs are ticking, as it
hooks the halted time into the PELT decaying load accounting. As such, it
does not work well if PELT is updated infrequently with large chunks of
halted time. This is not a fundamental limitation, but more complex
accounting is needed to generalize the use case to nohz full.
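
A simplified user-space model of the discounting described above may
help. It is deliberately not PELT: there is no geometric decay, and
struct toy_avg, toy_update() and toy_update_halted() are hypothetical
names. It only shows the window split, assuming the halted time reported
since the last update is clamped to the elapsed time, the first part of
the window is accounted as running, and the clamped halted part as idle.
Unlike the patch, which decays any residual halted time, this sketch
simply returns the leftover to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a PELT signal. */
struct toy_avg {
	uint64_t last_update_time; /* ns */
	uint64_t running_ns;       /* time credited as running */
	uint64_t idle_ns;          /* time credited as not running */
};

/* Credit [last_update_time, now) to exactly one of the two buckets. */
static void toy_update(struct toy_avg *avg, uint64_t now, int running)
{
	uint64_t delta;

	assert(now >= avg->last_update_time);
	delta = now - avg->last_update_time;
	if (running)
		avg->running_ns += delta;
	else
		avg->idle_ns += delta;
	avg->last_update_time = now;
}

/*
 * Split the window so that the trailing min(delta, pending_halted) ns
 * count as idle and the rest as running. Returns the halted time that
 * did not fit into this window.
 */
static uint64_t toy_update_halted(struct toy_avg *avg, uint64_t now,
				  uint64_t pending_halted)
{
	uint64_t delta, halted;

	assert(now >= avg->last_update_time);
	delta = now - avg->last_update_time;
	halted = pending_halted < delta ? pending_halted : delta;

	toy_update(avg, now - halted, 1);
	if (halted)
		toy_update(avg, now, 0);

	return pending_halted - halted;
}
```

Starting from an all-zero struct toy_avg, an update at now = 1000 with
300 ns of pending halted time credits 700 ns as running and 300 ns as
idle. A following update at now = 1500 with 800 ns pending credits the
whole 500 ns window as idle and leaves 300 ns over.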
---
 arch/x86/kvm/x86.c    |  8 ++++++--
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 25 +++++++++++++++++++++++++
 kernel/sched/pelt.c   | 42 +++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h  |  2 ++
 6 files changed, 73 insertions(+), 9 deletions(-)

-- 
2.43.0

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 46975b0a63a5..156cf05b325f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10712,6 +10712,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	int r;
 	unsigned long long cycles, cycles_start = 0;
 	unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
+	unsigned long long halted_cycles_ns = 0;
 	bool req_int_win =
 		dm_request_for_irq_injection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
@@ -11083,8 +11084,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		cycles = get_cycles() - cycles_start;
 		unhalted_cycles = get_unhalted_cycles() -
 			unhalted_cycles_start;
-		if (likely(cycles > unhalted_cycles))
-			current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+		if (likely(cycles > unhalted_cycles)) {
+			halted_cycles_ns = cycles2ns(cycles - unhalted_cycles);
+			current->gtime_halted += halted_cycles_ns;
+			sched_account_gtime_halted(current, halted_cycles_ns);
+		}
 	}
 
 	local_irq_enable();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f6745357e20..5409fac152c9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -367,6 +367,8 @@ struct vtime {
 	u64			gtime;
 };
 
+extern void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted);
+
 /*
  * Utilization clamp constraints.
  * @UCLAMP_MIN:	Minimum utilization
@@ -588,6 +590,8 @@ struct sched_entity {
 	 */
 	struct sched_avg		avg;
 #endif
+
+	u64				gtime_halted;
 };
 
 struct sched_rt_entity {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac69..1f3ced2b2636 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4487,6 +4487,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.gtime_halted		= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	/* A delayed task cannot be in clone(). */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5ff52711d459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13705,4 +13705,29 @@ __init void init_sched_fair_class(void)
 #endif
 #endif /* SMP */
 
+
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
 }
+#else
+/*
+ * The implementation hooking into PELT requires regular updates of
+ * gtime_halted. This is guaranteed unless we run on CONFIG_NO_HZ_FULL.
+ */
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
+	struct sched_entity *se = &p->se;
+
+	if (unlikely(!gtime_halted))
+		return;
+
+	for_each_sched_entity(se) {
+		se->gtime_halted += gtime_halted;
+		se->cfs_rq->gtime_halted += gtime_halted;
+	}
+}
+#endif
+EXPORT_SYMBOL(sched_account_gtime_halted);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 7a8534a2deff..9f96b7c46c00 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -305,10 +305,23 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
-				cfs_rq->curr == se)) {
+	int ret = 0;
+	u64 delta = now - se->avg.last_update_time;
+	u64 gtime_halted = min(delta, se->gtime_halted);
 
-		___update_load_avg(&se->avg, se_weight(se));
+	ret = ___update_load_sum(now - gtime_halted, &se->avg, !!se->on_rq, se_runnable(se),
+				cfs_rq->curr == se);
+
+	if (gtime_halted) {
+		ret |= ___update_load_sum(now, &se->avg, 0, 0, 0);
+		se->gtime_halted -= gtime_halted;
+
+		/* decay residual halted time */
+		if (ret && se->gtime_halted)
+			se->gtime_halted = decay_load(se->gtime_halted, delta / 1024);
+	}
+
+	if (ret) {
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -319,10 +332,25 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
-	if (___update_load_sum(now, &cfs_rq->avg,
-				scale_load_down(cfs_rq->load.weight),
-				cfs_rq->h_nr_runnable,
-				cfs_rq->curr != NULL)) {
+	int ret = 0;
+	u64 delta = now - cfs_rq->avg.last_update_time;
+	u64 gtime_halted = min(delta, cfs_rq->gtime_halted);
+
+	ret = ___update_load_sum(now - gtime_halted, &cfs_rq->avg,
+				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->h_nr_runnable,
+				cfs_rq->curr != NULL);
+
+	if (gtime_halted) {
+		ret |= ___update_load_sum(now, &cfs_rq->avg, 0, 0, 0);
+		cfs_rq->gtime_halted -= gtime_halted;
+
+		/* decay any residual halted time */
+		if (ret && cfs_rq->gtime_halted)
+			cfs_rq->gtime_halted = decay_load(cfs_rq->gtime_halted, delta / 1024);
+	}
+
+	if (ret) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b93c8c3dc05a..79b1166265bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -744,6 +744,8 @@ struct cfs_rq {
 	struct list_head	throttled_csd_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+
+	u64 gtime_halted;
 };
 
 #ifdef CONFIG_SCHED_CLASS_EXT
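
For reference, a monitoring agent could pick up the new value from user
space roughly as follows. This is a sketch under assumptions:
gtime_halted is taken to be the last field of the stat line, as in patch
1 of this RFC where it is appended after exit_code, and
stat_last_field() is a hypothetical helper. On a kernel without the
series the last field is a different value. The kernel side emits the
value through nsec_to_clock_t(), so it is in clock ticks; dividing by
sysconf(_SC_CLK_TCK) gives seconds.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the last space-separated field of a stat line as an unsigned
 * integer. Returns 0 on success and -1 on a malformed line.
 */
static int stat_last_field(const char *line, unsigned long long *out)
{
	const char *start, *end, *p;
	unsigned long long val;
	char *stop;

	assert(line != NULL && out != NULL);

	/* comm may contain spaces and ')', so search from the right. */
	start = strrchr(line, ')');
	if (!start)
		return -1;

	/* Trim the trailing newline and any trailing spaces. */
	end = line + strlen(line);
	while (end > start + 1 && (end[-1] == '\n' || end[-1] == ' '))
		end--;

	/* Walk back to the beginning of the last field. */
	p = end;
	while (p > start + 1 && p[-1] != ' ')
		p--;
	if (p == end)
		return -1;

	val = strtoull(p, &stop, 10);
	if (stop != end)
		return -1;

	*out = val;
	return 0;
}
```

A caller would read the line from the task's stat file with fgets() and
pass it in unchanged; the search from the last ')' keeps the parse robust
against process names that contain spaces or parentheses.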