From patchwork Mon Jun 5 15:47:13 2023
X-Patchwork-Submitter: Arjan van de Ven
X-Patchwork-Id: 13267720
From: arjan@linux.intel.com
To: linux-pm@vger.kernel.org
Cc: artem.bityutskiy@linux.intel.com, rafael@kernel.org, Arjan van de Ven
Subject: [PATCH 1/4] intel_idle: refactor state->enter manipulation into its own function
Date: Mon, 5 Jun 2023 15:47:13 +0000
Message-Id: <20230605154716.840930-2-arjan@linux.intel.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230605154716.840930-1-arjan@linux.intel.com>
References: <20230605154716.840930-1-arjan@linux.intel.com>
X-Mailing-List: linux-pm@vger.kernel.org

From: Arjan van de Ven

Since the 6.4 kernel, the logic for updating a state's enter method based
on "environmental conditions" (command line options, CPU side-channel
workarounds, etc.) has become fairly complex.

This patch refactors that logic into a separate, small, self-contained
function (with no behavior changes) for improved readability and to make
future changes to this logic easier to write and to understand.
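Some background for readers who do not live in drivers/idle: every entry in
a cpuidle state table carries an enter() callback, and the cpuidle framework
invokes that callback whenever the governor selects the state. Everything
this series touches is the logic deciding which callback gets wired into
that pointer. The following standalone sketch (simplified types and made-up
callback names, not the kernel's real definitions) models that dispatch:

#include <stdio.h>

/* Simplified stand-in for the kernel's struct cpuidle_state. */
struct cpuidle_state {
	const char *name;
	unsigned int flags;
	int (*enter)(int index);	/* called by the framework to enter the state */
};

#define FLAG_IRQ_ENABLE 0x1	/* toy flag, mirroring CPUIDLE_FLAG_IRQ_ENABLE */

/* Hypothetical enter methods, standing in for intel_idle / intel_idle_irq. */
static int enter_plain(int index) { printf("mwait, irqs stay off\n"); return index; }
static int enter_irq(int index)   { printf("mwait, irqs enabled first\n"); return index; }

/* The kind of decision this patch moves into its own helper. */
static void state_update_enter_method(struct cpuidle_state *state)
{
	if (state->flags & FLAG_IRQ_ENABLE)
		state->enter = enter_irq;
	else
		state->enter = enter_plain;
}

int main(void)
{
	struct cpuidle_state c1 = { .name = "C1", .flags = FLAG_IRQ_ENABLE };

	state_update_enter_method(&c1);
	return c1.enter(0);
}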
Signed-off-by: Arjan van de Ven
---
 drivers/idle/intel_idle.c | 50 ++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index aa2d19db2b1d..c351b21c0875 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1839,6 +1839,32 @@ static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
 	return true;
 }
 
+static void state_update_enter_method(struct cpuidle_state *state, int cstate)
+{
+	if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
+		/*
+		 * Combining XSTATE with IBRS or IRQ_ENABLE flags
+		 * is not currently supported by this driver.
+		 */
+		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IBRS);
+		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
+		state->enter = intel_idle_xstate;
+	} else if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
+		   state->flags & CPUIDLE_FLAG_IBRS) {
+		/*
+		 * IBRS mitigation requires that C-states are entered
+		 * with interrupts disabled.
+		 */
+		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
+		state->enter = intel_idle_ibrs;
+	} else if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
+		state->enter = intel_idle_irq;
+	} else if (force_irq_on) {
+		pr_info("forced intel_idle_irq for state %d\n", cstate);
+		state->enter = intel_idle_irq;
+	}
+}
+
 static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 {
 	int cstate;
@@ -1894,28 +1920,8 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 		drv->states[drv->state_count] = cpuidle_state_table[cstate];
 		state = &drv->states[drv->state_count];
 
-		if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
-			/*
-			 * Combining with XSTATE with IBRS or IRQ_ENABLE flags
-			 * is not currently supported but this driver.
-			 */
-			WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IBRS);
-			WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
-			state->enter = intel_idle_xstate;
-		} else if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
-			   state->flags & CPUIDLE_FLAG_IBRS) {
-			/*
-			 * IBRS mitigation requires that C-states are entered
-			 * with interrupts disabled.
-			 */
-			WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
-			state->enter = intel_idle_ibrs;
-		} else if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
-			state->enter = intel_idle_irq;
-		} else if (force_irq_on) {
-			pr_info("forced intel_idle_irq for state %d\n", cstate);
-			state->enter = intel_idle_irq;
-		}
+		state_update_enter_method(state, cstate);
+
 		if ((disabled_states_mask & BIT(drv->state_count)) ||
 		    ((icpu->use_acpi || force_use_acpi) &&

From patchwork Mon Jun 5 15:47:14 2023
X-Patchwork-Submitter: Arjan van de Ven
X-Patchwork-Id: 13267721
From: arjan@linux.intel.com
To: linux-pm@vger.kernel.org
Cc: artem.bityutskiy@linux.intel.com, rafael@kernel.org, Arjan van de Ven
Subject: [PATCH 2/4] intel_idle: clean up the (new) state_update_enter_method function
Date: Mon, 5 Jun 2023 15:47:14 +0000
Message-Id: <20230605154716.840930-3-arjan@linux.intel.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230605154716.840930-1-arjan@linux.intel.com>
References: <20230605154716.840930-1-arjan@linux.intel.com>
X-Mailing-List: linux-pm@vger.kernel.org

From: Arjan van de Ven

Now that the logic of state_update_enter_method() lives in its own
function, the long if .. else if .. else if .. else if chain can be
simplified by returning from the function at each decision point. This does
not change functionality, but it makes the logic much easier to read and to
modify later.
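The transformation here is the classic guard-clause rewrite. A generic,
standalone illustration (toy condition and method names, not the driver's
actual flags) of the before and after shapes:

#include <stdio.h>

/* Before: a chained if/else-if where every new case deepens the coupling. */
static const char *pick_chained(int a, int b, int c)
{
	const char *method;

	if (a)
		method = "method_a";
	else if (b)
		method = "method_b";
	else if (c)
		method = "method_c";
	else
		method = "method_default";
	return method;
}

/*
 * After: early returns make each case self-contained and easy to extend,
 * which is what patch 3 in this series relies on when it prepends new cases.
 */
static const char *pick_early_return(int a, int b, int c)
{
	if (a)
		return "method_a";
	if (b)
		return "method_b";
	if (c)
		return "method_c";
	return "method_default";
}

int main(void)
{
	printf("%s == %s\n", pick_chained(0, 1, 0), pick_early_return(0, 1, 0));
	return 0;
}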
Signed-off-by: Arjan van de Ven
---
 drivers/idle/intel_idle.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c351b21c0875..256c2d42e350 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1849,7 +1849,10 @@ static void state_update_enter_method(struct cpuidle_state *state, int cstate)
 		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IBRS);
 		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
 		state->enter = intel_idle_xstate;
-	} else if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
+		return;
+	}
+
+	if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
 		   state->flags & CPUIDLE_FLAG_IBRS) {
 		/*
 		 * IBRS mitigation requires that C-states are entered
@@ -1857,9 +1860,15 @@ static void state_update_enter_method(struct cpuidle_state *state, int cstate)
 		 */
 		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
 		state->enter = intel_idle_ibrs;
-	} else if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
+		return;
+	}
+
+	if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
 		state->enter = intel_idle_irq;
-	} else if (force_irq_on) {
+		return;
+	}
+
+	if (force_irq_on) {
 		pr_info("forced intel_idle_irq for state %d\n", cstate);
 		state->enter = intel_idle_irq;
 	}

From patchwork Mon Jun 5 15:47:15 2023
X-Patchwork-Submitter: Arjan van de Ven
X-Patchwork-Id: 13267722
From: arjan@linux.intel.com
To: linux-pm@vger.kernel.org
Cc: artem.bityutskiy@linux.intel.com, rafael@kernel.org, Arjan van de Ven
Subject: [PATCH 3/4] intel_idle: Add support for using intel_idle in a VM guest using just hlt
Date: Mon, 5 Jun 2023 15:47:15 +0000
Message-Id: <20230605154716.840930-4-arjan@linux.intel.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230605154716.840930-1-arjan@linux.intel.com>
References: <20230605154716.840930-1-arjan@linux.intel.com>
X-Mailing-List: linux-pm@vger.kernel.org

From: Arjan van de Ven

In a typical VM guest, the mwait instruction is not available, leaving only
the 'hlt' instruction (which causes a VMEXIT to the host). So for this
common case, intel_idle detects the lack of mwait and fails to initialize,
after which another idle method steps in and always uses hlt.

Other (less common) cases exist; the table below shows the before/after for
each:

+------------+--------------------------+-------------------------+
| Hypervisor | Idle method before patch | Idle method after patch |
| exposes    |                          |                         |
+============+==========================+=========================+
| nothing    | default_idle fallback    | intel_idle VM table     |
| (common)   | (straight "hlt")         |                         |
+------------+--------------------------+-------------------------+
| mwait      | intel_idle mwait table   | intel_idle mwait table  |
+------------+--------------------------+-------------------------+
| ACPI       | ACPI C1 state ("hlt")    | intel_idle VM table     |
+------------+--------------------------+-------------------------+

By adding this capability to the intel_idle driver, we can do better than
the fallback or ACPI table methods. While the current change only
reproduces the existing behavior, later patches in this series will add new
capabilities such as optimized TLB flushing.

To do this, a simplified version of the driver initialization function is
created for VM guests; it is called when the CPU is recognized but mwait is
not supported and we are running in a VM guest.

One thing to note is that the max latency (and break-even point) of this C1
state is higher than that of a typical bare-metal C1 state: because hlt
causes a VMEXIT, the cost of VMEXIT + hypervisor overhead + VMENTER is
typically on the order of up to 5 microseconds, even if the hypervisor does
not actually go into a hardware power-saving state.
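The new path is taken only when the CPU model is recognized, the hypervisor
bit is set, and MWAIT is absent. The two CPUID bits the driver keys off
(CPUID.1:ECX bit 31 for "running under a hypervisor" and CPUID.1:ECX bit 3
for MONITOR/MWAIT) can also be inspected from userspace with a small
standalone program; this sketch uses GCC's <cpuid.h> and is x86-only:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
		fprintf(stderr, "CPUID leaf 1 not supported\n");
		return 1;
	}

	/* CPUID.1:ECX bit 3 = MONITOR/MWAIT, bit 31 = hypervisor present. */
	int has_mwait = !!(ecx & (1u << 3));
	int in_guest  = !!(ecx & (1u << 31));

	printf("mwait: %s, hypervisor: %s\n",
	       has_mwait ? "yes" : "no", in_guest ? "yes" : "no");

	if (in_guest && !has_mwait)
		printf("this is the configuration the VM guest path targets\n");

	return 0;
}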
Signed-off-by: Arjan van de Ven
---
 drivers/idle/intel_idle.c | 125 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 1 deletion(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 256c2d42e350..d2518cf36ab4 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -199,6 +199,43 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
 	return __intel_idle(dev, drv, index);
 }
 
+static __always_inline int __intel_idle_hlt(struct cpuidle_device *dev,
+					    struct cpuidle_driver *drv, int index)
+{
+	raw_safe_halt();
+	raw_local_irq_disable();
+	return index;
+}
+
+/**
+ * intel_idle_hlt - Ask the processor to enter the given idle state using hlt.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Use the HLT instruction to notify the processor that the CPU represented by
+ * @dev is idle and it can try to enter the idle state corresponding to @index.
+ *
+ * Must be called under local_irq_disable().
+ */
+static __cpuidle int intel_idle_hlt(struct cpuidle_device *dev,
+				    struct cpuidle_driver *drv, int index)
+{
+	return __intel_idle_hlt(dev, drv, index);
+}
+
+static __cpuidle int intel_idle_hlt_irq_on(struct cpuidle_device *dev,
+					   struct cpuidle_driver *drv, int index)
+{
+	int ret;
+
+	raw_local_irq_enable();
+	ret = __intel_idle_hlt(dev, drv, index);
+	raw_local_irq_disable();
+
+	return ret;
+}
+
 /**
  * intel_idle_s2idle - Ask the processor to enter the given idle state.
  * @dev: cpuidle device of the target CPU.
@@ -1242,6 +1279,18 @@ static struct cpuidle_state snr_cstates[] __initdata = {
 		.enter = NULL }
 };
 
+static struct cpuidle_state vmguest_cstates[] __initdata = {
+	{
+		.name = "C1",
+		.desc = "HLT",
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
+		.exit_latency = 5,
+		.target_residency = 10,
+		.enter = &intel_idle_hlt, },
+	{
+		.enter = NULL }
+};
+
 static const struct idle_cpu idle_cpu_nehalem __initconst = {
 	.state_table = nehalem_cstates,
 	.auto_demotion_disable_flags = NHM_C1_AUTO_DEMOTE | NHM_C3_AUTO_DEMOTE,
@@ -1841,6 +1890,16 @@ static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
 
 static void state_update_enter_method(struct cpuidle_state *state, int cstate)
 {
+	if (state->enter == intel_idle_hlt) {
+		if (force_irq_on) {
+			pr_info("forced intel_idle_irq for state %d\n", cstate);
+			state->enter = intel_idle_hlt_irq_on;
+		}
+		return;
+	}
+	if (state->enter == intel_idle_hlt_irq_on)
+		return; /* no update scenarios */
+
 	if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
 		/*
 		 * Combining XSTATE with IBRS or IRQ_ENABLE flags
@@ -1874,6 +1933,29 @@ static void state_update_enter_method(struct cpuidle_state *state, int cstate)
 	}
 }
 
+/*
+ * For mwait-based states, we want to verify the cpuid data to see if the
+ * state is actually supported by this specific CPU.
+ * For non-mwait-based states, this check should be skipped.
+ */
+static bool should_verify_mwait(struct cpuidle_state *state)
+{
+	if (state->enter == intel_idle_irq)
+		return true;
+	if (state->enter == intel_idle)
+		return true;
+	if (state->enter == intel_idle_ibrs)
+		return true;
+	if (state->enter == intel_idle_xstate)
+		return true;
+	if (state->enter == intel_idle_hlt)
+		return false;
+	if (state->enter == intel_idle_hlt_irq_on)
+		return false;
+
+	return true;
+}
+
 static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 {
 	int cstate;
@@ -1922,7 +2004,7 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 		}
 
 		mwait_hint = flg2MWAIT(cpuidle_state_table[cstate].flags);
-		if (!intel_idle_verify_cstate(mwait_hint))
+		if (should_verify_mwait(&cpuidle_state_table[cstate]) && !intel_idle_verify_cstate(mwait_hint))
 			continue;
 
 		/* Structure copy. */
@@ -2056,6 +2138,45 @@ static void __init intel_idle_cpuidle_devices_uninit(void)
 		cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i));
 }
 
+static int __init intel_idle_vminit(const struct x86_cpu_id *id)
+{
+	int retval;
+
+	cpuidle_state_table = vmguest_cstates;
+
+	icpu = (const struct idle_cpu *)id->driver_data;
+
+	pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n",
+		 boot_cpu_data.x86_model);
+
+	intel_idle_cpuidle_devices = alloc_percpu(struct cpuidle_device);
+	if (!intel_idle_cpuidle_devices)
+		return -ENOMEM;
+
+	intel_idle_cpuidle_driver_init(&intel_idle_driver);
+
+	retval = cpuidle_register_driver(&intel_idle_driver);
+	if (retval) {
+		struct cpuidle_driver *drv = cpuidle_get_driver();
+		printk(KERN_DEBUG pr_fmt("intel_idle yielding to %s\n"),
+		       drv ? drv->name : "none");
+		goto init_driver_fail;
+	}
+
+	retval = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "idle/intel:online",
+				   intel_idle_cpu_online, NULL);
+	if (retval < 0)
+		goto hp_setup_fail;
+
+	return 0;
+hp_setup_fail:
+	intel_idle_cpuidle_devices_uninit();
+	cpuidle_unregister_driver(&intel_idle_driver);
+init_driver_fail:
+	free_percpu(intel_idle_cpuidle_devices);
+	return retval;
+}
+
 static int __init intel_idle_init(void)
 {
 	const struct x86_cpu_id *id;
@@ -2074,6 +2195,8 @@ static int __init intel_idle_init(void)
 	id = x86_match_cpu(intel_idle_ids);
 	if (id) {
 		if (!boot_cpu_has(X86_FEATURE_MWAIT)) {
+			if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+				return intel_idle_vminit(id);
 			pr_debug("Please enable MWAIT in BIOS SETUP\n");
 			return -ENODEV;
 		}

From patchwork Mon Jun 5 15:47:16 2023
X-Patchwork-Submitter: Arjan van de Ven
X-Patchwork-Id: 13267723
From: arjan@linux.intel.com
To: linux-pm@vger.kernel.org
Cc: artem.bityutskiy@linux.intel.com, rafael@kernel.org, Arjan van de Ven
Subject: [PATCH 4/4] intel_idle: Add a "Long HLT" C1 state for the VM guest mode
Date: Mon, 5 Jun 2023 15:47:16 +0000
Message-Id: <20230605154716.840930-5-arjan@linux.intel.com>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20230605154716.840930-1-arjan@linux.intel.com>
References: <20230605154716.840930-1-arjan@linux.intel.com>
X-Mailing-List: linux-pm@vger.kernel.org

From: Arjan van de Ven

intel_idle will, for the bare-metal case, usually have one or more deep
power states that have the CPUIDLE_FLAG_TLB_FLUSHED flag set. When a state
with this flag is selected by the cpuidle framework, the TLBs are also
flushed as part of entering the state. The benefit is that the kernel does
not need to wake the CPU out of this deep power state just to flush the
TLBs... for which the latency can be very high due to the exit latency of
deep power states.

In a VM guest this benefit of avoiding the wakeup does not currently exist,
while the problem (long exit latency) is even more severe: Linux needs to
wake up a vCPU (causing the host to either come out of a deep C-state, or
the VMM to deschedule something else in order to schedule the vCPU), which
can take a very long time and adds a lot of latency to TLB flush operations
(including munmap and others).

To solve this, add a "Long HLT" C-state to the state table for the VM guest
case that has the CPUIDLE_FLAG_TLB_FLUSHED flag set. The result is that for
long idle periods (where the VMM is likely to do things that cause large
latency) the cpuidle framework will flush the TLBs (and avoid the wakeups),
while for short/quick idle durations the existing behavior is retained.

There is still only "hlt" available in the guest, but for long idle periods
the host can go to a deeper state (say C6).

There is a reasonable debate to be had about what to use for the
exit_latency and break-even point of this "Long HLT" state. The good news
is that intel_idle has these values available for the underlying CPU, even
when mwait is not exposed. The solution thus is to use the latency and
break-even point of the deepest state from the bare-metal CPU table, under
the assumption that this is a reasonable estimate of the latency the VMM
will introduce.
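The matching step (matchup_vm_state_with_baremetal() in the diff below)
boils down to: scan the bare-metal state table for this CPU and lift the
largest exit_latency, together with its target_residency, into the guest's
TLB-flushing state. A standalone model of that selection, with a toy table
and illustrative numbers rather than real hardware values:

#include <stdio.h>

/* Minimal stand-in for the two fields the matching logic reads and writes. */
struct state {
	const char *name;
	unsigned int exit_latency;	/* microseconds */
	unsigned int target_residency;	/* microseconds */
};

int main(void)
{
	/* Hypothetical bare-metal table; the numbers are illustrative only. */
	struct state baremetal[] = {
		{ "C1", 2, 2 }, { "C1E", 10, 20 }, { "C6", 133, 600 },
	};
	/* The guest "Long HLT" state starts out with placeholder values. */
	struct state long_hlt = { "C1L", 5, 200 };

	for (size_t i = 0; i < sizeof(baremetal) / sizeof(baremetal[0]); i++) {
		if (baremetal[i].exit_latency > long_hlt.exit_latency) {
			long_hlt.exit_latency = baremetal[i].exit_latency;
			long_hlt.target_residency = baremetal[i].target_residency;
		}
	}

	/* With the toy table this prints: C1L: exit_latency=133 target_residency=600 */
	printf("%s: exit_latency=%u target_residency=%u\n",
	       long_hlt.name, long_hlt.exit_latency, long_hlt.target_residency);
	return 0;
}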
Signed-off-by: Arjan van de Ven
---
 drivers/idle/intel_idle.c | 54 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index d2518cf36ab4..0bf5e9f5bed8 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1287,6 +1287,13 @@ static struct cpuidle_state vmguest_cstates[] __initdata = {
 		.exit_latency = 5,
 		.target_residency = 10,
 		.enter = &intel_idle_hlt, },
+	{
+		.name = "C1L",
+		.desc = "Long HLT",
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 5,
+		.target_residency = 200,
+		.enter = &intel_idle_hlt, },
 	{
 		.enter = NULL }
 };
@@ -2138,6 +2145,44 @@ static void __init intel_idle_cpuidle_devices_uninit(void)
 		cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i));
 }
 
+/*
+ * Match up the latency and break-even point of the bare-metal (cpu based)
+ * states with the deepest VM-available state.
+ *
+ * We only want to do this for the deepest state, the ones that have
+ * the TLB_FLUSHED flag set.
+ *
+ * All our short idle states are dominated by vmexit/vmenter latencies,
+ * not the underlying hardware latencies, so we keep our values for these.
+ */
+static void matchup_vm_state_with_baremetal(void)
+{
+	int cstate;
+
+	for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
+		int matching_cstate;
+
+		if (intel_idle_max_cstate_reached(cstate))
+			break;
+
+		if (!cpuidle_state_table[cstate].enter &&
+		    !cpuidle_state_table[cstate].enter_s2idle)
+			break;
+
+		if (!(cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_TLB_FLUSHED))
+			continue;
+
+		for (matching_cstate = 0; matching_cstate < CPUIDLE_STATE_MAX; ++matching_cstate) {
+			if (icpu->state_table[matching_cstate].exit_latency > cpuidle_state_table[cstate].exit_latency) {
+				cpuidle_state_table[cstate].exit_latency = icpu->state_table[matching_cstate].exit_latency;
+				cpuidle_state_table[cstate].target_residency = icpu->state_table[matching_cstate].target_residency;
+			}
+		}
+	}
+}
+
 static int __init intel_idle_vminit(const struct x86_cpu_id *id)
 {
 	int retval;
@@ -2153,6 +2198,15 @@ static int __init intel_idle_vminit(const struct x86_cpu_id *id)
 	if (!intel_idle_cpuidle_devices)
 		return -ENOMEM;
 
+	/*
+	 * We don't know exactly what the host will do when we go idle, but as
+	 * a worst-case estimate we can assume that the exit latency of the
+	 * deepest host state will be hit for our deep (long-duration) guest
+	 * idle state.
+	 * The same logic applies to the break-even point for the long-duration
+	 * guest idle state, so let's copy these two properties from the table
+	 * we found for the host CPU type.
+	 */
+	matchup_vm_state_with_baremetal();
+
 	intel_idle_cpuidle_driver_init(&intel_idle_driver);
 
 	retval = cpuidle_register_driver(&intel_idle_driver);