APEI: GHES: Have GHES honor the panic= setting

Message ID	20250113125224.GFZ4UMiNtWIJvgpveU@fat_crate.local (mailing list archive)
State	Queued
Delegated to:	Rafael Wysocki
Series	APEI: GHES: Have GHES honor the panic= setting

Headers

Date: Mon, 13 Jan 2025 13:52:24 +0100
From: Borislav Petkov <bp@alien8.de>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>,
	Feng Tang <feng.tang@linux.alibaba.com>, rafael@kernel.org,
	Len Brown <lenb@kernel.org>, James Morse <james.morse@arm.com>,
	Tony Luck <tony.luck@intel.com>, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, Ira Weiny <ira.weiny@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Andi Kleen <ak@linux.intel.com>
Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
Message-ID: <20250113125224.GFZ4UMiNtWIJvgpveU@fat_crate.local>
References: <Z3KGopUvilZLwsBK@U-2FWC9VHC-2323.local>
 <20241230121009.GDZ3KNoe0-hUwQDLG7@fat_crate.local>
 <Z3KaSxr2sjCC8FpJ@U-2FWC9VHC-2323.local>
 <20241230132403.GEZ3Ke8zm7HxSv84pA@fat_crate.local>
 <Z3OS4LCCxfVN32uH@U-2FWC9VHC-2323.local>
 <20241231092358.GAZ3O4LroNtlnztneC@fat_crate.local>
 <Z3PEXxFTGXW2j2F3@U-2FWC9VHC-2323.local>
 <20241231111314.GDZ3PRyq_tiU002p5d@fat_crate.local>
 <87ikqydja9.fsf@DESKTOP-5N7EMDA>
 <20250102083509.GAZ3ZPvcUhl9v6Kbp_@fat_crate.local>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20250102083509.GAZ3ZPvcUhl9v6Kbp_@fat_crate.local>

Commit Message

Borislav Petkov Jan. 13, 2025, 12:52 p.m. UTC

The GHES driver overrides the panic= setting by force-rebooting the
system after a fatal hw error has been reported. The intent is that
such an error would be reported earlier.

However, this is not optimal when a hard-to-debug issue takes a long
time to reproduce: when it finally happens, the box gets rebooted after
30 seconds, which destroys the whole hw context of when the error
happened.

So rip out the default GHES panic timeout and honor the global one.

In the panic disabled (panic=0) case, the error will still be logged to
dmesg for later inspection. If a reboot after a fatal hw error is really
required, that can be controlled the usual way: use panic= on the
cmdline or set CONFIG_PANIC_TIMEOUT in the kernel .config.

Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
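
The behavior change described in the commit message can be sketched as a small userspace model. This is an illustration only, not kernel code: the function names and the OLD_GHES_PANIC_TIMEOUT macro are hypothetical, and the value 30 comes from the ghes_panic_timeout default that the patch removes. The sign convention assumed here is the documented panic= one (positive: seconds before rebooting, zero: wait forever, negative: reboot immediately).

```c
#include <assert.h>

/*
 * Userspace model only. These helpers are hypothetical and do not exist in
 * drivers/acpi/apei/ghes.c; 30 is the old ghes_panic_timeout default that
 * the patch removes.
 */
#define OLD_GHES_PANIC_TIMEOUT 30

/* Before the patch: a disabled (0) global timeout was replaced with 30. */
int ghes_timeout_before(int panic_timeout)
{
	return panic_timeout ? panic_timeout : OLD_GHES_PANIC_TIMEOUT;
}

/* After the patch: the global panic= value is used unchanged. */
int ghes_timeout_after(int panic_timeout)
{
	return panic_timeout;
}

/* Describe a timeout value using the documented panic= semantics. */
const char *timeout_meaning(int timeout)
{
	if (timeout > 0)
		return "reboot after that many seconds";
	if (timeout < 0)
		return "reboot immediately";
	return "no reboot, wait forever";
}

/* Check the cases the commit message talks about. */
void ghes_model_selfcheck(void)
{
	/* panic=0: previously forced to 30 s, now left disabled. */
	assert(ghes_timeout_before(0) == OLD_GHES_PANIC_TIMEOUT);
	assert(ghes_timeout_after(0) == 0);

	/* An explicit panic=10 was honored before and is still honored. */
	assert(ghes_timeout_before(10) == 10);
	assert(ghes_timeout_after(10) == 10);
}
```

In this model the two versions differ only for a timeout of 0, which matches the commit message: a box booted with panic=0 now stays up after a fatal hardware error instead of rebooting after 30 seconds.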

Comments

Ira Weiny Jan. 13, 2025, 8:08 p.m. UTC | #1

Borislav Petkov wrote:
> The GHES driver overrides the panic= setting by force-rebooting the
> system after a fatal hw error has been reported. The intent being that
> such an error would be reported earlier.
> 
> However, this is not optimal when a hard-to-debug issue requires long
> time to reproduce and when that happens, the box will get rebooted after
> 30 seconds and thus destroy the whole hw context of when the error
> happened.
> 
> So rip out the default GHES panic timeout and honor the global one.
> 
> In the panic disabled (panic=0) case, the error will still be logged to
> dmesg for later inspection and if panic after a hw error is really
> required, then that can be controlled the usual way - use panic= on the
> cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
> 
> Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Rafael J. Wysocki Jan. 14, 2025, 5:29 p.m. UTC | #2

On Mon, Jan 13, 2025 at 9:08 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Borislav Petkov wrote:
> > The GHES driver overrides the panic= setting by force-rebooting the
> > system after a fatal hw error has been reported. The intent being that
> > such an error would be reported earlier.
> >
> > However, this is not optimal when a hard-to-debug issue requires long
> > time to reproduce and when that happens, the box will get rebooted after
> > 30 seconds and thus destroy the whole hw context of when the error
> > happened.
> >
> > So rip out the default GHES panic timeout and honor the global one.
> >
> > In the panic disabled (panic=0) case, the error will still be logged to
> > dmesg for later inspection and if panic after a hw error is really
> > required, then that can be controlled the usual way - use panic= on the
> > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
> >
> > Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> > Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Applied as 6.14 material, thanks!

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..b72772494655 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@  static struct gen_pool *ghes_estatus_pool;
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
-static int ghes_panic_timeout __read_mostly = 30;
-
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -983,14 +981,16 @@  static void __ghes_panic(struct ghes *ghes,
 			 struct acpi_hest_generic_status *estatus,
 			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
 {
+	const char *msg = GHES_PFX "Fatal hardware error";
+
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
 
 	ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
 
-	/* reboot to log the error! */
 	if (!panic_timeout)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+		pr_emerg("%s but panic disabled\n", msg);
+
+	panic(msg);
 }
 
 static int ghes_proc(struct ghes *ghes)
