From patchwork Mon Jun 4 15:04:40 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alexandru Gagniuc
X-Patchwork-Id: 10446797
From: mrnuke
To: linux-acpi@vger.kernel.org, tony.luck@intel.com
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, Alexandru Gagniuc, Borislav Petkov, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", x86@kernel.org, "Rafael J.
 Wysocki", Len Brown, Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar, Will Deacon, James Morse, "Jonathan (Zhixiong) Zhang", Dongjiu Geng, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org, devel@acpica.org
Subject: [PATCH v8 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
Date: Mon, 4 Jun 2018 10:04:40 -0500
Message-Id: <20180604150443.1265-4-mrnuke@austins740xd.dell>
X-Mailer: git-send-email 2.14.4
In-Reply-To: <20180604150443.1265-1-mrnuke@austins740xd.dell>
References: <20180604150443.1265-1-mrnuke@austins740xd.dell>
Sender: linux-acpi-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-acpi@vger.kernel.org

From: Alexandru Gagniuc

As previously noted, the policy of calling panic() on any "Fatal" GHES
error is not suitable for several classes of errors, most notably
errors that have already been contained. The correct policy is to match
the behavior of native error handling, i.e. what happens when the error
is not reported through GHES. In special cases this may not be
possible, because we first have to exit the NMI, which requires the
special considerations described below.

PCIe AER errors are contained and reported at the root port. On
DPC-capable hardware, containment can be done by all downstream ports.
DPC also has the added advantage of preventing future errors. Since
these errors stop at the root port, we can do all the work we need to
exit NMI and reach the error handler.

This patch stops crashing the system unconditionally on such errors and
invokes the AER handler instead. When AER is not enabled, or the
firmware does not provide sufficient information to identify the source
of the error, the original panic() behavior is maintained.

Signed-off-by: Alexandru Gagniuc
---
 drivers/acpi/apei/ghes.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1b22e18168f5..f7126f6d8d52 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,7 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  * These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_FATAL does not make it to this handler
+ * GHES_SEV_FATAL -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -837,6 +837,45 @@ static inline void ghes_sea_remove(struct ghes *ghes) { }
 static struct llist_head ghes_estatus_llist;
 static struct irq_work ghes_proc_irq_work;
 
+/* PCIe AER errors are safe if AER section contains enough info. */
+static int ghes_pcie_has_safe_handler(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return true;
+
+	return false;
+}
+
+/*
+ * Do we have an error handler that we can safely reach? We're concerned with
+ * being able to notify an error handler by crossing the NMI/IRQ boundary,
+ * being able to schedule_work, and so forth.
+ */
+static int ghes_has_fatal_handler(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	bool safe = true;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			safe = ghes_pcie_has_safe_handler(gdata);
+
+		if (!safe)
+			break;
+	}
+
+	return safe;
+}
+
 /*
  * NMI may be triggered on any CPU, so ghes_in_nmi is used for
  * having only one concurrent reader.
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 	}
 
 	sev = ghes_cper_severity(ghes->estatus->error_severity);
-	if (sev >= GHES_SEV_FATAL) {
+	if ((sev >= GHES_SEV_FATAL) && !ghes_has_fatal_handler(ghes)) {
 		oops_begin();
 		ghes_print_queued_estatus();
 		__ghes_panic(ghes);
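
For readers who want to experiment with the decision logic outside the kernel, the following is a minimal userspace sketch of the policy the patch implements, not the kernel code itself. Everything prefixed with `sketch_` or `SKETCH_` is a hypothetical stand-in: the struct replaces `struct acpi_hest_generic_data` and `struct cper_sec_pcie`, the `aer_enabled` parameter replaces `IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER)`, and the bit values are chosen for illustration only. As in the patch, sections that are not PCIe sections do not change the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the kernel defines the real bits in its CPER headers. */
#define SKETCH_PCIE_VALID_DEVICE_ID 0x0008u
#define SKETCH_PCIE_VALID_AER_INFO  0x0080u

/* Hypothetical severity scale; only the ordering matters here. */
enum sketch_sev {
	SKETCH_SEV_NO,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_FATAL,
};

/* Hypothetical, flattened view of one error status section. */
struct sketch_section {
	bool is_pcie;             /* section type matched the PCIe GUID */
	uint32_t validation_bits; /* which fields the firmware filled in */
};

/* A PCIe section is safe only if AER is enabled and both fields are valid. */
static bool sketch_pcie_has_safe_handler(const struct sketch_section *sec,
					 bool aer_enabled)
{
	const uint32_t needed = SKETCH_PCIE_VALID_DEVICE_ID |
				SKETCH_PCIE_VALID_AER_INFO;

	return aer_enabled && (sec->validation_bits & needed) == needed;
}

/* Walk all sections; any unsafe PCIe section makes the whole status unsafe. */
static bool sketch_has_fatal_handler(const struct sketch_section *secs,
				     size_t count, bool aer_enabled)
{
	size_t i;

	assert(secs != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (secs[i].is_pcie &&
		    !sketch_pcie_has_safe_handler(&secs[i], aer_enabled))
			return false;
	}

	return true;
}

/* Mirrors the new condition in the NMI handler: panic only without a handler. */
static bool sketch_should_panic(enum sketch_sev sev,
				const struct sketch_section *secs,
				size_t count, bool aer_enabled)
{
	return sev >= SKETCH_SEV_FATAL &&
	       !sketch_has_fatal_handler(secs, count, aer_enabled);
}
```

Calling `sketch_should_panic()` with a fatal severity and a PCIe section that carries both validation bits returns false when `aer_enabled` is true, and true when either bit is missing or `aer_enabled` is false, which corresponds to the fallback to the original panic() behavior described in the commit message.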