From patchwork Thu Sep 30 09:44:51 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mauro Carvalho Chehab
X-Patchwork-Id: 12527797
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 11BADC433F5
	for ; Thu, 30 Sep 2021 09:45:04 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id DCD5B61980
	for ; Thu, 30 Sep 2021 09:45:03 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1349523AbhI3Jqo (ORCPT ); Thu, 30 Sep 2021 05:46:44 -0400
Received: from mail.kernel.org ([198.145.29.99]:46514 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1349504AbhI3Jqo (ORCPT ); Thu, 30 Sep 2021 05:46:44 -0400
Received: by mail.kernel.org (Postfix) with ESMTPSA id 8FB4861528;
	Thu, 30 Sep 2021 09:45:01 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1632995101;
	bh=YEbjO2PVjYe4aLR6UTcioxMYW4qUoWftUtALRs+AOmA=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=hNJI2e9Oj5eL8k+VwX7deiWVQ1GfGFlpZ9TETSqTZPdXViAurzkqRU88G/MUYrhCH
	 xJeOf24kjY8n0fwh/xQ0FcuR/PSXiHh/F9FnxHhnE+nfNloIszFRGERge0XpNmFqDC
	 bZLCSDKzcBL2YWIKgJ+VZjapFmicmbqtCARbci97JK2Wmi+m1ZTn/qcghNlRLvNCAr
	 eS08pCP1GOsKpp7AtUXU+nWhhal0IhT/NK3nJ2OQ365gOAb6VJ7Jjk1CG3HTX1ROZm
	 XDCKKuH6FxR0DbnOhRLZGFwptIAtYGLWOJWPGa5XUzgdZhhW9WGU/sGet+OYoZNZ2i
	 q1OLZoknw/UuQ==
Received: by mail.kernel.org with local (Exim 4.94.2)
	(envelope-from )
	id 1mVscd-002ATG-O9; Thu, 30 Sep 2021 11:44:59 +0200
From: Mauro Carvalho Chehab
To: Linux Doc Mailing List,
	Greg Kroah-Hartman
Cc: Mauro Carvalho Chehab,
	"H. Peter Anvin",
	"Jonathan Corbet",
	Borislav Petkov,
	Ingo Molnar,
	Thomas Gleixner,
	Tony Luck,
	linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	x86@kernel.org,
	Andi Kleen
Subject: [PATCH v2 4/7] ABI: sysfs-mce: add a new ABI file
Date: Thu, 30 Sep 2021 11:44:51 +0200
Message-Id: <801a26985e32589eb78ba4b728d3e19fdea18f04.1632994837.git.mchehab+huawei@kernel.org>
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Sender: Mauro Carvalho Chehab
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

Reduce the gap of missing ABIs for Intel servers with MCE
by adding a new ABI file.

The contents of this file come from:
	Documentation/x86/x86_64/machinecheck.rst

Cc: Andi Kleen
Signed-off-by: Mauro Carvalho Chehab
Reviewed-by: Andi Kleen
---

See [PATCH v2 0/7] at: https://lore.kernel.org/all/cover.1632994837.git.mchehab+huawei@kernel.org/

 Documentation/ABI/testing/sysfs-mce       | 107 ++++++++++++++++++++++
 Documentation/x86/x86_64/machinecheck.rst |  56 +----------
 MAINTAINERS                               |   2 +
 3 files changed, 111 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-mce

diff --git a/Documentation/ABI/testing/sysfs-mce b/Documentation/ABI/testing/sysfs-mce
new file mode 100644
index 000000000000..686fbfa02cdc
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-mce
@@ -0,0 +1,107 @@
+What:		/sys/devices/system/machinecheck/machinecheckX/
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		(X = CPU number)
+
+		Machine checks report internal hardware error conditions
+		detected by the CPU. Uncorrected errors typically cause a
+		machine check (often with panic), corrected ones cause a
+		machine check log entry.
+
+		For more details about the x86 machine check architecture
+		see the Intel and AMD architecture manuals from their
+		developer websites.
+
+		For more details about the architecture
+		see http://one.firstfloor.org/~andi/mce.pdf
+
+		Each CPU has its own directory.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/bank
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		(Y bank number)
+
+		64bit Hex bitmask enabling/disabling specific subevents for
+		bank Y.
+
+		When a bit in the bitmask is zero then the respective
+		subevent will not be reported.
+
+		By default all events are enabled.
+
+		Note that the BIOS maintains another mask to disable specific
+		events per bank. This is not visible here.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/check_interval
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		The entries appear for each CPU, but they are truly shared
+		between all CPUs.
+
+		How often to poll for corrected machine check errors, in
+		seconds (Note output is hexadecimal). Default 5 minutes.
+		When the poller finds MCEs it triggers an exponential speedup
+		(poll more often) on the polling interval. When the poller
+		stops finding MCEs, it triggers an exponential backoff
+		(poll less often) on the polling interval. The check_interval
+		variable is both the initial and maximum polling interval.
+		0 means no polling for corrected machine check errors
+		(but some corrected errors might still be reported
+		in other ways)
+
+What:		/sys/devices/system/machinecheck/machinecheckX/tolerant
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		The entries appear for each CPU, but they are truly shared
+		between all CPUs.
+
+		Tolerance level. When a machine check exception occurs for a
+		non corrected machine check the kernel can take different
+		actions.
+
+		Since machine check exceptions can happen any time it is
+		sometimes risky for the kernel to kill a process because it
+		defies normal kernel locking rules. The tolerance level
+		configures how hard the kernel tries to recover even at some
+		risk of deadlock. Higher tolerant values trade potentially
+		better uptime with the risk of a crash or even corruption
+		(for tolerant >= 3).
+
+		==  ===========================================================
+		 0  always panic on uncorrected errors, log corrected errors
+		 1  panic or SIGBUS on uncorrected errors, log corrected errors
+		 2  SIGBUS or log uncorrected errors, log corrected errors
+		 3  never panic or SIGBUS, log all errors (for testing only)
+		==  ===========================================================
+
+		Default: 1
+
+		Note this only makes a difference if the CPU allows recovery
+		from a machine check exception. Current x86 CPUs generally
+		do not.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/trigger
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		The entries appear for each CPU, but they are truly shared
+		between all CPUs.
+
+		Program to run when a machine check event is detected.
+		This is an alternative to running mcelog regularly from cron
+		and allows detecting events faster.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/monarch_timeout
+Contact:	Andi Kleen
+Date:		Feb, 2007
+Description:
+		How long to wait for the other CPUs to machine check too on an
+		exception. 0 to disable waiting for other CPUs.
+
+		Unit: us
+
diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
index b402e04bee60..cea12ee97200 100644
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ b/Documentation/x86/x86_64/machinecheck.rst
@@ -21,60 +21,8 @@ from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
 Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
 (N = CPU number).
 
-The directory contains some configurable entries:
-
-bankNctl
-	(N bank number)
-
-	64bit Hex bitmask enabling/disabling specific subevents for bank N
-	When a bit in the bitmask is zero then the respective
-	subevent will not be reported.
-	By default all events are enabled.
-	Note that BIOS maintain another mask to disable specific events
-	per bank. This is not visible here
-
-The following entries appear for each CPU, but they are truly shared
-between all CPUs.
-
-check_interval
-	How often to poll for corrected machine check errors, in seconds
-	(Note output is hexadecimal). Default 5 minutes. When the poller
-	finds MCEs it triggers an exponential speedup (poll more often) on
-	the polling interval. When the poller stops finding MCEs, it
-	triggers an exponential backoff (poll less often) on the polling
-	interval. The check_interval variable is both the initial and
-	maximum polling interval. 0 means no polling for corrected machine
-	check errors (but some corrected errors might be still reported
-	in other ways)
-
-tolerant
-	Tolerance level. When a machine check exception occurs for a non
-	corrected machine check the kernel can take different actions.
-	Since machine check exceptions can happen any time it is sometimes
-	risky for the kernel to kill a process because it defies
-	normal kernel locking rules. The tolerance level configures
-	how hard the kernel tries to recover even at some risk of
-	deadlock. Higher tolerant values trade potentially better uptime
-	with the risk of a crash or even corruption (for tolerant >= 3).
-
-	0: always panic on uncorrected errors, log corrected errors
-	1: panic or SIGBUS on uncorrected errors, log corrected errors
-	2: SIGBUS or log uncorrected errors, log corrected errors
-	3: never panic or SIGBUS, log all errors (for testing only)
-
-	Default: 1
-
-	Note this only makes a difference if the CPU allows recovery
-	from a machine check exception. Current x86 CPUs generally do not.
-
-trigger
-	Program to run when a machine check event is detected.
-	This is an alternative to running mcelog regularly from cron
-	and allows to detect events faster.
-monarch_timeout
-	How long to wait for the other CPUs to machine check too on a
-	exception. 0 to disable waiting for other CPUs.
-	Unit: us
+The directory contains some configurable entries. See
+Documentation/ABI/testing/sysfs-mce for more details.
 
 TBD document entries for AMD threshold interrupt configuration
 
diff --git a/MAINTAINERS b/MAINTAINERS
index e9fd362ef4d6..360311ea0b43 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20457,6 +20457,8 @@ M:	Tony Luck
 M:	Borislav Petkov
 L:	linux-edac@vger.kernel.org
 S:	Maintained
+F:	Documentation/ABI/testing/sysfs-mce
+F:	Documentation/x86/x86_64/machinecheck.rst
 F:	arch/x86/kernel/cpu/mce/*
 
 X86 MICROCODE UPDATE SUPPORT

From patchwork Thu Sep 30 09:44:52 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mauro Carvalho Chehab
X-Patchwork-Id: 12527801
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 91393C41535
	for ; Thu, 30 Sep 2021 09:45:05 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 77AA161A02
	for ; Thu, 30 Sep 2021 09:45:05 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1349427AbhI3Jqq (ORCPT ); Thu, 30 Sep 2021 05:46:46 -0400
Received: from mail.kernel.org ([198.145.29.99]:46530 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1349514AbhI3Jqo (ORCPT ); Thu, 30 Sep 2021 05:46:44 -0400
Received: by mail.kernel.org (Postfix) with ESMTPSA id AB83A615E5;
	Thu, 30 Sep 2021 09:45:01 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1632995101;
	bh=oDk3Xte6LUuXxMEKmNTpij3KGwX2DRHuVkgkafiYT9s=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=uByD8y42+qnvmla5BX2u4BZw1rk3Gpgp9zdF4BDRMwrkgfyOfafr4ucktROB64rp4
	 /LJqRSrWnE0ozJiLltEeemydgAdM4CXCy95w35lb0g/QP+f0t7+NhyqoqpS1BZ15+I
	 oCaFNbNA3Dx4e2UrhLWs5P3Tcvy1ckP1E2+IvT9y/vKlOPAbiB6A50ubx2BZSfXHvj
	 9Rrh6aGIBtK9KbgSma0ggwVsrLCNfZcKX5Q3JGndhCo1pUPA8vW53hYe4Gj8E6aPZq
	 C2zUz/n45vX+rL2JWEjTb6dx9m85JxqQeQx5b8vn61LRPRv88gtEVtnE0oM9UsVwI+
	 +6bxMQ52vM2Xg==
Received: by mail.kernel.org with local (Exim 4.94.2)
	(envelope-from )
	id 1mVscd-002ATK-PE; Thu, 30 Sep 2021 11:44:59 +0200
From: Mauro Carvalho Chehab
To: Linux Doc Mailing List,
	Greg Kroah-Hartman
Cc: Mauro Carvalho Chehab,
	"Jonathan Corbet",
	Borislav Petkov,
	Tony Luck,
	linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 5/7] ABI: sysfs-mce: add 3 missing files
Date: Thu, 30 Sep 2021 11:44:52 +0200
Message-Id:
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Sender: Mauro Carvalho Chehab
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

Changeset 62fdac5913f7 ("x86, mce: Add boot options for corrected errors")
added three more MCE files that are also currently exposed via sysfs.

Document them.

Signed-off-by: Mauro Carvalho Chehab
---

See [PATCH v2 0/7] at: https://lore.kernel.org/all/cover.1632994837.git.mchehab+huawei@kernel.org/

 Documentation/ABI/testing/sysfs-mce | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-mce b/Documentation/ABI/testing/sysfs-mce
index 686fbfa02cdc..c8cd989034b4 100644
--- a/Documentation/ABI/testing/sysfs-mce
+++ b/Documentation/ABI/testing/sysfs-mce
@@ -105,3 +105,25 @@ Description:
 
 		Unit: us
 
+What:		/sys/devices/system/machinecheck/machinecheckX/ignore_ce
+Contact:	Hidetoshi Seto
+Date:		Jun 2009
+Description:
+		Disables polling and CMCI for corrected errors.
+		All corrected events are not cleared and kept in bank MSRs.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/dont_log_ce
+Contact:	Hidetoshi Seto
+Date:		Jun 2009
+Description:
+		Disables logging for corrected errors.
+		All reported corrected errors will be cleared silently.
+
+		This option will be useful if you never care about corrected
+		errors.
+
+What:		/sys/devices/system/machinecheck/machinecheckX/cmci_disabled
+Contact:	Hidetoshi Seto
+Date:		Jun 2009
+Description:
+		Disables the CMCI feature.
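
The documented entries report check_interval in hexadecimal seconds and the per-bank control as a 64-bit hex bitmask in which a zero bit means the subevent is not reported. A minimal Python sketch of how a reader might decode both values is given below. It is only an illustration: the helper names parse_check_interval and disabled_subevents are hypothetical, and the choice of machinecheck0 is an arbitrary assumption based on the "What:" paths above.

```python
from pathlib import Path
from typing import List

# Assumed location, taken from the "What:" entries; CPU 0 is an arbitrary choice.
MCE_DIR = Path("/sys/devices/system/machinecheck/machinecheck0")


def parse_check_interval(text: str) -> int:
    """Convert the hexadecimal check_interval text into seconds."""
    return int(text.strip(), 16)


def disabled_subevents(mask_text: str, width: int = 64) -> List[int]:
    """Return the bit positions that are zero in a bank mask (not reported)."""
    mask = int(mask_text.strip(), 16)
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    path = MCE_DIR / "check_interval"
    try:
        seconds = parse_check_interval(path.read_text())
    except OSError:
        # The directory only exists on systems exposing the machinecheck sysfs ABI.
        print("check_interval is not available on this system")
    else:
        print(f"corrected errors are polled every {seconds} s (0 = no polling)")
```

With the documented default of 5 minutes, the file would be expected to read 12c, which parse_check_interval converts to 300.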