@@ -114,11 +114,13 @@ bool platform_tdx_enabled(void);
int tdx_cpu_enable(void);
int tdx_enable(void);
void tdx_reset_memory(void);
+bool tdx_is_private_mem(unsigned long phys);
#else
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline void tdx_reset_memory(void) { }
+static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLY__ */
@@ -52,6 +52,7 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/reboot.h>
+#include <asm/tdx.h>
#include "internal.h"
@@ -228,11 +229,34 @@ static void wait_for_panic(void)
panic("Panicing machine check CPU died");
}
+static const char *mce_memory_info(struct mce *m)
+{
+	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+		return NULL;
+
+	/*
+	 * Certain initial generations of TDX-capable CPUs have an
+	 * erratum.  A kernel non-temporal partial write to TDX private
+	 * memory poisons that memory, and a subsequent read of that
+	 * memory triggers #MC.
+	 *
+	 * However, such a #MC caused by software cannot be distinguished
+	 * from a real hardware #MC.  Just print an additional message
+	 * to note that such a #MC may be a result of the CPU erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return NULL;
+
+	return !tdx_is_private_mem(m->addr) ? NULL :
+		"TDX private memory error. Possible kernel bug.";
+}
+
static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
{
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ const char *memmsg;
/*
* Allow instrumentation around external facilities usage. Not that it
@@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
}
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
+	/*
+	 * On confidential computing platforms such as TDX, an MCE
+	 * may occur due to incorrect access to confidential
+	 * memory.  Print additional information for such an error.
+	 */
+ memmsg = mce_memory_info(final);
+ if (memmsg)
+ pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
@@ -1348,6 +1348,109 @@ void tdx_reset_memory(void)
tdmrs_reset_pamt_all(&tdx_tdmr_list);
}
+static bool is_pamt_page(unsigned long phys)
+{
+	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+	int i;
+
+	/*
+	 * This function is called from the #MC handler, and in theory
+	 * it could run in parallel with the TDX module initialization
+	 * on other logical cpus.  But it's not OK to hold a mutex here
+	 * so just blindly check the module status to make sure the
+	 * PAMTs/TDMRs are stable to access.
+	 *
+	 * This may return an inaccurate result in rare cases, e.g. when
+	 * a #MC happens on a PAMT page during module initialization, but
+	 * this is fine as the #MC handler doesn't need a 100% accurate
+	 * result.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return false;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+		if (phys >= base && phys < (base + size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not.  Called from the #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish a #MC between a software bug and a real
+ * hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
+{
+	struct tdx_module_args args = {
+		.rcx = phys & PAGE_MASK,
+	};
+	u64 sret;
+
+	if (!platform_tdx_enabled())
+		return false;
+
+	/* Get page type from the TDX module */
+	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	/*
+	 * Handle the case that the CPU isn't in VMX operation.
+	 *
+	 * KVM guarantees no VM is running (thus no TDX guest)
+	 * when any online CPU isn't in VMX operation.
+	 * This means there will be no TDX guest private memory
+	 * and Secure-EPT pages.  However the TDX module may have
+	 * been initialized and the memory page could be a PAMT page.
+	 */
+	if (sret == TDX_SEAMCALL_UD)
+		return is_pamt_page(phys);
+
+	/*
+	 * Any other failure means:
+	 *
+	 * 1) TDX module not loaded; or
+	 * 2) Memory page isn't managed by the TDX module.
+	 *
+	 * In either case, the memory page cannot be a TDX
+	 * private page.
+	 */
+	if (sret)
+		return false;
+
+	/*
+	 * SEAMCALL was successful -- read page type (via RCX):
+	 *
+	 *  - PT_NDA:	Page is not used by the TDX module
+	 *  - PT_RSVD:	Reserved for Non-TDX use
+	 *  - Others:	Page is used by the TDX module
+	 *
+	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
+	 * private memory.
+	 *
+	 * Note: Even if the page type is PT_NDA, the memory page could
+	 * still be associated with a TDX private KeyID if the kernel
+	 * hasn't explicitly used MOVDIR64B to clear the page.  Assume
+	 * KVM always does that after reclaiming any private page from
+	 * TDX guests.
+	 */
+	switch (args.rcx) {
+	case PT_NDA:
+		return false;
+	case PT_RSVD:
+		return is_pamt_page(phys);
+	default:
+		return true;
+	}
+}
+
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
u32 *nr_tdx_keyids)
{
@@ -16,6 +16,7 @@
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_SYS_KEY_CONFIG 31
#define TDH_SYS_INFO 32
#define TDH_SYS_INIT 33
@@ -23,6 +24,10 @@
#define TDH_SYS_TDMR_INIT 36
#define TDH_SYS_CONFIG 45
+/* TDX page types */
+#define PT_NDA 0x0
+#define PT_RSVD 0x1
+
struct cmr_info {
u64 base;
u64 size;