diff mbox series

[v15,22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum

Message ID 9e80873fac878aa5d697cbcd4d456d01e1009d1f.1699527082.git.kai.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX host kernel support | expand

Commit Message

Huang, Kai Nov. 9, 2023, 11:55 a.m. UTC
The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

== Background ==

Virtually all kernel memory accesses happen in full cachelines.  In
practice, writing a "byte" of memory usually reads a 64-byte cacheline
of memory, modifies it, then writes the whole line back.  Those
operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
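
A minimal sketch of such a partial write (illustration only, not from
this patch):

/*
 * MOVNTI performs a non-temporal store that bypasses the cache and
 * lands at the memory controller as a sub-cacheline ("partial")
 * write.  On an affected CPU, aiming this at TDX private memory
 * silently poisons the containing cacheline.
 */
static void partial_write(u32 *p, u32 val)
{
	asm volatile("movnti %1, %0" : "=m" (*p) : "r" (val));
	asm volatile("sfence" ::: "memory");	/* order the NT store */
}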

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present these
as a literal "Hardware error" when they were, in fact, a
software-triggered issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not; it just dumps, for instance, the message below:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

The log says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it would be better for the log to say "software bug around TDX
private memory" instead of "Hardware Error".  But in reality real
hardware memory errors can happen too, and sadly such a software-triggered
#MC cannot be distinguished from a real hardware error.  Also, the error
message is parsed by the userspace tool 'mcelog', so changing the output
may break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message, making the log look like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determining whether the memory
page is TDX private memory.  There is no existing infrastructure to do
that, so add an interface to query the TDX module to fill this gap.
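
Condensed from the hunks in this patch (the full code appears in the
review below), the resulting message selection looks roughly like:

static const char *mce_memory_info(struct mce *m)
{
	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
		return NULL;

	/* Only CPUs with the partial-write erratum are affected */
	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
		return NULL;

	/* Ask the TDX module whether the page is TDX private memory */
	return !tdx_is_private_mem(m->addr) ? NULL :
		"TDX private memory error. Possible kernel bug.";
}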

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs: first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption detection techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.
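
For illustration, a hypothetical sketch of that case.  Nothing is
corrupted, so slab redzones stay intact, yet on an affected CPU every
store below is a poisoning partial write:

static void nt_rewrite_in_place(u64 *buf, size_t qwords)
{
	size_t i;

	/* Rewrite the buffer with its own contents via NT stores */
	for (i = 0; i < qwords; i++)
		asm volatile("movnti %1, %0" : "=m" (buf[i]) : "r" (buf[i]));
	asm volatile("sfence" ::: "memory");
}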

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---

v14 -> v15:
 - No change

v13 -> v14:
 - No change

v12 -> v13:
 - Added Kirill and Yuan's tag.

v11 -> v12:
 - Simplified #MC message (Dave/Kirill)
 - Slightly improved some comments.

v10 -> v11:
 - New patch


---
 arch/x86/include/asm/tdx.h     |   2 +
 arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++
 arch/x86/virt/vmx/tdx/tdx.c    | 103 +++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    |   5 ++
 4 files changed, 143 insertions(+)

Comments

Tony Luck Nov. 30, 2023, 6:01 p.m. UTC | #1
On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> Instead of modifying the above error log, improve it by printing an
> additional TDX-related message, making the log look like:
> 
>   ...
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

This seems a reasonable addition.

>  arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++

Reviewed-by: Tony Luck <tony.luck@intel.com>

[I only reviewed the hooks into mce/core.c.  I don't feel qualified
to dig through the TDX bits that determine this is a TD private page]

-Tony
Dave Hansen Dec. 1, 2023, 8:35 p.m. UTC | #2
On 11/9/23 03:55, Kai Huang wrote:
> +static bool is_pamt_page(unsigned long phys)
> +{
> +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> +	int i;
> +
> +	/*
> +	 * This function is called from #MC handler, and theoretically
> +	 * it could run in parallel with the TDX module initialization
> +	 * on other logical cpus.  But it's not OK to hold mutex here
> +	 * so just blindly check module status to make sure PAMTs/TDMRs
> +	 * are stable to access.
> +	 *
> +	 * This may return inaccurate result in rare cases, e.g., when
> +	 * #MC happens on a PAMT page during module initialization, but
> +	 * this is fine as #MC handler doesn't need a 100% accurate
> +	 * result.
> +	 */

It doesn't need perfect accuracy.  But how do we know it's not going to
go, for instance, chase a bad pointer?

> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return false;

As an example, what prevents this CPU from observing
tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
being assembled?

> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		unsigned long base, size;
> +
> +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> +
> +		if (phys >= base && phys < (base + size))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Return whether the memory page at the given physical address is TDX
> + * private memory or not.  Called from #MC handler do_machine_check().
> + *
> + * Note this function may not return an accurate result in rare cases.
> + * This is fine as the #MC handler doesn't need a 100% accurate result,
> + * because it cannot distinguish #MC between software bug and real
> + * hardware error anyway.
> + */
> +bool tdx_is_private_mem(unsigned long phys)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = phys & PAGE_MASK,
> +	};
> +	u64 sret;
> +
> +	if (!platform_tdx_enabled())
> +		return false;
> +
> +	/* Get page type from the TDX module */
> +	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> +	/*
> +	 * Handle the case that CPU isn't in VMX operation.
> +	 *
> +	 * KVM guarantees no VM is running (thus no TDX guest)
> +	 * when there's any online CPU isn't in VMX operation.
> +	 * This means there will be no TDX guest private memory
> +	 * and Secure-EPT pages.  However the TDX module may have
> +	 * been initialized and the memory page could be PAMT.
> +	 */
> +	if (sret == TDX_SEAMCALL_UD)
> +		return is_pamt_page(phys);

Either this comment is wonky or the module initialization is buggy.

config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
zero checks or special handling in there for whether the CPU has done
VMXON.  So, by the time we've started initializing the TDX module
(including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?

So how can we have a working PAMT here when this CPU can't do SEAMCALLs?

I don't think we should even bother with this complexity.  I think we
can just fix the whole thing by saying that unless you can make a
non-init SEAMCALL, we just assume the memory can't be private.

The transition to being able to make non-init SEAMCALLs is also #MC safe
*and* it's at a point when the tdmr_list is stable.

Can anyone shoot any holes in that? :)
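
A minimal sketch of that simplification, reusing names from the patch
(the page-type decode helper at the end is hypothetical):

bool tdx_is_private_mem(unsigned long phys)
{
	struct tdx_module_args args = {
		.rcx = phys & PAGE_MASK,
	};

	if (!platform_tdx_enabled())
		return false;

	/*
	 * Treat any SEAMCALL failure, including TDX_SEAMCALL_UD when
	 * the CPU isn't in VMX operation, as "not private".  No PAMT
	 * special case, no is_pamt_page().
	 */
	if (__seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args))
		return false;

	return tdx_page_type_is_private(&args);	/* hypothetical decode */
}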
Huang, Kai Dec. 3, 2023, 11:44 a.m. UTC | #3
On Fri, 2023-12-01 at 12:35 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> > +static bool is_pamt_page(unsigned long phys)
> > +{
> > +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> > +	int i;
> > +
> > +	/*
> > +	 * This function is called from #MC handler, and theoretically
> > +	 * it could run in parallel with the TDX module initialization
> > +	 * on other logical cpus.  But it's not OK to hold mutex here
> > +	 * so just blindly check module status to make sure PAMTs/TDMRs
> > +	 * are stable to access.
> > +	 *
> > +	 * This may return inaccurate result in rare cases, e.g., when
> > +	 * #MC happens on a PAMT page during module initialization, but
> > +	 * this is fine as #MC handler doesn't need a 100% accurate
> > +	 * result.
> > +	 */
> 
> It doesn't need perfect accuracy.  But how do we know it's not going to
> go, for instance, chase a bad pointer?
> 
> > +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> > +		return false;
> 
> As an example, what prevents this CPU from observing
> tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
> being assembled?

There are two types of memory-order serializing operations between
assembling the TDMR/PAMT structure and setting tdx_module_status to
TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.

WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX
module, which involves GDT/LDT/control register/MSR switches, so it is
also a serializing operation.

But perhaps we can explicitly add an smp_wmb() between assembling the
TDMR/PAMT structure and setting tdx_module_status, if that's better.
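
Something like this sketch (the build step is elided; the names
otherwise follow the patch):

/* Initialization side, at the end of module initialization: */
/* ... build TDMRs and allocate/initialize PAMTs ... */
smp_wmb();				/* publish tdx_tdmr_list ... */
tdx_module_status = TDX_MODULE_INITIALIZED;

/* #MC side, at the top of is_pamt_page(): */
if (tdx_module_status != TDX_MODULE_INITIALIZED)
	return false;
smp_rmb();				/* ... before the status is read */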

> 
> > +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> > +		unsigned long base, size;
> > +
> > +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> > +
> > +		if (phys >= base && phys < (base + size))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * Return whether the memory page at the given physical address is TDX
> > + * private memory or not.  Called from #MC handler do_machine_check().
> > + *
> > + * Note this function may not return an accurate result in rare cases.
> > + * This is fine as the #MC handler doesn't need a 100% accurate result,
> > + * because it cannot distinguish #MC between software bug and real
> > + * hardware error anyway.
> > + */
> > +bool tdx_is_private_mem(unsigned long phys)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = phys & PAGE_MASK,
> > +	};
> > +	u64 sret;
> > +
> > +	if (!platform_tdx_enabled())
> > +		return false;
> > +
> > +	/* Get page type from the TDX module */
> > +	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> > +	/*
> > +	 * Handle the case that CPU isn't in VMX operation.
> > +	 *
> > +	 * KVM guarantees no VM is running (thus no TDX guest)
> > +	 * when there's any online CPU isn't in VMX operation.
> > +	 * This means there will be no TDX guest private memory
> > +	 * and Secure-EPT pages.  However the TDX module may have
> > +	 * been initialized and the memory page could be PAMT.
> > +	 */
> > +	if (sret == TDX_SEAMCALL_UD)
> > +		return is_pamt_page(phys);
> 
> Either this comment is wonky or the module initialization is buggy.
> 
> config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
> zero checks or special handling in there for whether the CPU has done
> VMXON.  So, by the time we've started initializing the TDX module
> (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
> 
> So how can we have a working PAMT here when this CPU can't do SEAMCALLs?

The corner case is that KVM can enable VMX on all cpus, initialize the TDX
module, and then disable VMX on all cpus.  One example is that KVM can be
unloaded after it initializes the TDX module.

In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)

However, if a SEAMCALL cannot be made (because the CPU is out of VMX
operation), then the module initialization has either fully completed or
never been attempted, so both tdx_module_status and the tdx_tdmr_list are
stable to access.

> 
> I don't think we should even bother with this complexity.  I think we
> can just fix the whole thing by saying that unless you can make a
> non-init SEAMCALL, we just assume the memory can't be private.
> 
> The transition to being able to make non-init SEAMCALLs is also #MC safe
> *and* it's at a point when the tdmr_list is stable.
> 
> Can anyone shoot any holes in that? :)
Dave Hansen Dec. 4, 2023, 5:07 p.m. UTC | #4
On 12/3/23 03:44, Huang, Kai wrote:
...
>> It doesn't need perfect accuracy.  But how do we know it's not going to
>> go, for instance, chase a bad pointer?
>>
>>> +   if (tdx_module_status != TDX_MODULE_INITIALIZED)
>>> +           return false;
>>
>> As an example, what prevents this CPU from observing
>> tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
>> being assembled?
> 
> There are two types of memory order serializing operations between assembling
> the TDMR/PAMT structure and setting the tdx_module_status to
> TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.
> 
> WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX module,
> which involves GDT/LDT/control registers/MSRs switch so it is also a serializing
> operation.
> 
> But perhaps we can explicitly add a smp_wmb() between assembling TDMR/PAMT
> structure and setting tdx_module_status if that's better.

... and there's zero documentation of this dependency because ... ?

I suspect it's because it was never looked at until Tony made a comment
about it and we started looking at it.  In other words, it worked by
coincidence.

>>> +   for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
>>> +           unsigned long base, size;
>>> +
>>> +           tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
>>> +
>>> +           if (phys >= base && phys < (base + size))
>>> +                   return true;
>>> +   }
>>> +
>>> +   return false;
>>> +}
>>> +
>>> +/*
>>> + * Return whether the memory page at the given physical address is TDX
>>> + * private memory or not.  Called from #MC handler do_machine_check().
>>> + *
>>> + * Note this function may not return an accurate result in rare cases.
>>> + * This is fine as the #MC handler doesn't need a 100% accurate result,
>>> + * because it cannot distinguish #MC between software bug and real
>>> + * hardware error anyway.
>>> + */
>>> +bool tdx_is_private_mem(unsigned long phys)
>>> +{
>>> +   struct tdx_module_args args = {
>>> +           .rcx = phys & PAGE_MASK,
>>> +   };
>>> +   u64 sret;
>>> +
>>> +   if (!platform_tdx_enabled())
>>> +           return false;
>>> +
>>> +   /* Get page type from the TDX module */
>>> +   sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
>>> +   /*
>>> +    * Handle the case that CPU isn't in VMX operation.
>>> +    *
>>> +    * KVM guarantees no VM is running (thus no TDX guest)
>>> +    * when there's any online CPU isn't in VMX operation.
>>> +    * This means there will be no TDX guest private memory
>>> +    * and Secure-EPT pages.  However the TDX module may have
>>> +    * been initialized and the memory page could be PAMT.
>>> +    */
>>> +   if (sret == TDX_SEAMCALL_UD)
>>> +           return is_pamt_page(phys);
>>
>> Either this comment is wonky or the module initialization is buggy.
>>
>> config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
>> zero checks or special handling in there for whether the CPU has done
>> VMXON.  So, by the time we've started initializing the TDX module
>> (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
>>
>> So how can we have a working PAMT here when this CPU can't do SEAMCALLs?
> 
> The corner case is that KVM can enable VMX on all cpus, initialize the TDX
> module, and then disable VMX on all cpus.  One example is that KVM can be
> unloaded after it initializes the TDX module.
> 
> In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)
> 
> However, if a SEAMCALL cannot be made (because the CPU is out of VMX
> operation), then the module initialization has either fully completed or
> never been attempted, so both tdx_module_status and the tdx_tdmr_list are
> stable to access.

None of this even matters.  Let's remind ourselves how unbelievably
unlikely this is:

1. You're on an affected system that has the erratum
2. The KVM module gets unloaded, runs vmxoff
3. A kernel bug using a very rare partial write corrupts the PAMT
4. A second bug reads the PAMT consuming poison, #MC is generated
5. Enter #MC handler, SEAMCALL fails
6. #MC handler just reports a plain hardware error

The only thing even remotely wrong with this situation is that the
report won't pin the #MC on TDX.  Play stupid games (removing modules),
win stupid prizes (worse error message).

Can we dynamically mark a module as unsafe to remove?  If so, I'd
happily just say that we should make kvm_intel.ko unsafe to remove when
TDX is supported and move on with life.
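
There is an existing mechanism that gets close, sketched below: a module
can pin itself by taking a reference it never drops (the call site here
is a guess):

/* Hypothetical: pin kvm_intel.ko once the TDX module is initialized,
 * so module unload (and its VMXOFF) can never happen afterwards. */
if (!tdx_enable())
	__module_get(THIS_MODULE);	/* never dropped: module is pinned */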

tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
removed is a fool's errand.
Huang, Kai Dec. 4, 2023, 9 p.m. UTC | #5
On Mon, 2023-12-04 at 09:07 -0800, Dave Hansen wrote:
> On 12/3/23 03:44, Huang, Kai wrote:
> ...
> > > It doesn't need perfect accuracy.  But how do we know it's not going to
> > > go, for instance, chase a bad pointer?
> > > 
> > > > +   if (tdx_module_status != TDX_MODULE_INITIALIZED)
> > > > +           return false;
> > > 
> > > As an example, what prevents this CPU from observing
> > > tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
> > > being assembled?
> > 
> > There are two types of memory order serializing operations between assembling
> > the TDMR/PAMT structure and setting the tdx_module_status to
> > TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.
> > 
> > WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX module,
> > which involves GDT/LDT/control registers/MSRs switch so it is also a serializing
> > operation.
> > 
> > But perhaps we can explicitly add a smp_wmb() between assembling TDMR/PAMT
> > structure and setting tdx_module_status if that's better.
> 
> ... and there's zero documentation of this dependency because ... ?
> 
> I suspect it's because it was never looked at until Tony made a comment
> about it and we started looking at it.  In other words, it worked by
> coincidence.

I should have put a comment around here.  My bad.

Kirill also helped to look at the code.

> 
> > > > +   for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> > > > +           unsigned long base, size;
> > > > +
> > > > +           tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> > > > +
> > > > +           if (phys >= base && phys < (base + size))
> > > > +                   return true;
> > > > +   }
> > > > +
> > > > +   return false;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Return whether the memory page at the given physical address is TDX
> > > > + * private memory or not.  Called from #MC handler do_machine_check().
> > > > + *
> > > > + * Note this function may not return an accurate result in rare cases.
> > > > + * This is fine as the #MC handler doesn't need a 100% accurate result,
> > > > + * because it cannot distinguish #MC between software bug and real
> > > > + * hardware error anyway.
> > > > + */
> > > > +bool tdx_is_private_mem(unsigned long phys)
> > > > +{
> > > > +   struct tdx_module_args args = {
> > > > +           .rcx = phys & PAGE_MASK,
> > > > +   };
> > > > +   u64 sret;
> > > > +
> > > > +   if (!platform_tdx_enabled())
> > > > +           return false;
> > > > +
> > > > +   /* Get page type from the TDX module */
> > > > +   sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> > > > +   /*
> > > > +    * Handle the case that CPU isn't in VMX operation.
> > > > +    *
> > > > +    * KVM guarantees no VM is running (thus no TDX guest)
> > > > +    * when there's any online CPU isn't in VMX operation.
> > > > +    * This means there will be no TDX guest private memory
> > > > +    * and Secure-EPT pages.  However the TDX module may have
> > > > +    * been initialized and the memory page could be PAMT.
> > > > +    */
> > > > +   if (sret == TDX_SEAMCALL_UD)
> > > > +           return is_pamt_page(phys);
> > > 
> > > Either this comment is wonky or the module initialization is buggy.
> > > 
> > > config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
> > > zero checks or special handling in there for whether the CPU has done
> > > VMXON.  So, by the time we've started initializing the TDX module
> > > (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
> > > 
> > > So how can we have a working PAMT here when this CPU can't do SEAMCALLs?
> > 
> > The corner case is that KVM can enable VMX on all cpus, initialize the TDX
> > module, and then disable VMX on all cpus.  One example is that KVM can be
> > unloaded after it initializes the TDX module.
> > 
> > In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)
> > 
> > However, if a SEAMCALL cannot be made (because the CPU is out of VMX
> > operation), then the module initialization has either fully completed or
> > never been attempted, so both tdx_module_status and the tdx_tdmr_list are
> > stable to access.
> 
> None of this even matters.  Let's remind ourselves how unbelievably
> unlikely this is:
> 
> 1. You're on an affected system that has the erratum
> 2. The KVM module gets unloaded, runs vmxoff
> 3. A kernel bug using a very rare partial write corrupts the PAMT
> 4. A second bug reads the PAMT consuming poison, #MC is generated
> 5. Enter #MC handler, SEAMCALL fails
> 6. #MC handler just reports a plain hardware error

Yes, I totally agree it is very unlikely to happen.

> 
> The only thing even remotely wrong with this situation is that the
> report won't pin the #MC on TDX.  Play stupid games (removing modules),
> win stupid prizes (worse error message).
> 
> Can we dynamically mark a module as unsafe to remove?  If so, I'd
> happily just say that we should make kvm_intel.ko unsafe to remove when
> TDX is supported and move on with life.
> 
> tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
> removed is a fool's errand.

Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
"KVM can be unloaded" as an example to explain the PAMT can be working when VMX
is off.
Dave Hansen Dec. 4, 2023, 10:04 p.m. UTC | #6
On 12/4/23 13:00, Huang, Kai wrote:
>> tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
>> removed is a fool's errand.
> Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
> last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
> "KVM can be unloaded" as an example to explain the PAMT can be working when VMX
> is off.

Can't we just fix this by having KVM do an "extra" hardware_enable_all()
before initializing the TDX module?  It's not wrong to say that TDX is a
KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
down the TDX module.  Then there's no PAMT to worry about.

The shutdown would be something like:

	1. TDX module shutdown
	2. Deallocate/Convert PAMT
	3. vmxoff

Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
to be missed.
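
In sketch form (all three helpers below are hypothetical; only the
ordering matters):

static void tdx_shutdown(void)
{
	shutdown_tdx_module();	/* 1: module stops accepting SEAMCALLs */
	free_pamts();		/* 2: PAMT pages return to normal use */
	vmxoff_all_cpus();	/* 3: only now is VMXOFF safe */
}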
Huang, Kai Dec. 4, 2023, 11:24 p.m. UTC | #7
On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> On 12/4/23 13:00, Huang, Kai wrote:
> > > tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
> > > removed is a fool's errand.
> > Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
> > last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
> > "KVM can be unloaded" as an example to explain the PAMT can be working when VMX
> > is off.
> 
> Can't we just fix this by having KVM do an "extra" hardware_enable_all()
> before initializing the TDX module?  
> 

Yes KVM needs to do hardware_enable_all() anyway before initializing the TDX
module.  

I believe you mean we can keep VMX enabled after initializing the TDX module,
i.e., not calling hardware_disable_all() after that, so that kvm_usage_count
will remain non-zero even when the last VM is destroyed?

The current behaviour, where KVM only enables VMX when there's an active
VM, exists because the kernel wants to allow loading and running a
third-party VMX module (yes, VirtualBox) while the KVM module is loaded.
Only one of them can actually use the VMX hardware, but they can both be
loaded.

In ancient times KVM used to enable VMX immediately when it was loaded, but
later it was changed to only enable VMX when there's an active VM, for the
above reason.

See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

> It's not wrong to say that TDX is a
> > KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
> down the TDX module.  Then there's no PAMT to worry about.
> 
> The shutdown would be something like:
> 
> 	1. TDX module shutdown
> 	2. Deallocate/Convert PAMT
> 	3. vmxoff
> 
> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> to be missed.

The limitation is that once the TDX module is shut down, it cannot be
initialized again unless it is updated at runtime.

In the long term, if we go with this design then there might be other
problems when other kernel components are using TDX.  For example, the VT-d
driver will need to be changed to support TDX-IO, and it will need to enable
the TDX module much earlier than KVM to do some initialization.  It might
need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
super familiar with TDX-IO, but it looks like we might have a problem here
if we go with such a design.
Dave Hansen Dec. 4, 2023, 11:39 p.m. UTC | #8
On 12/4/23 15:24, Huang, Kai wrote:
> On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
...
> In ancient times KVM used to enable VMX immediately when it was loaded, but
> later it was changed to only enable VMX when there's an active VM, for the
> above reason.
> 
> See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

Fine.  This doesn't need to change ... until you load TDX.  Once you
initialize the TDX module, no more out-of-tree VMMs for you.

That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
dynamic TDX module initialization is a good idea.

>> It's not wrong to say that TDX is a
>> KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
>> down the TDX module.  Then there's no PAMT to worry about.
>>
>> The shutdown would be something like:
>>
>>       1. TDX module shutdown
>>       2. Deallocate/Convert PAMT
>>       3. vmxoff
>>
>> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
>> to be missed.
> 
> The limitation is that once the TDX module is shut down, it cannot be
> initialized again unless it is updated at runtime.
> 
> In the long term, if we go with this design then there might be other
> problems when other kernel components are using TDX.  For example, the VT-d
> driver will need to be changed to support TDX-IO, and it will need to enable
> the TDX module much earlier than KVM to do some initialization.  It might
> need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> super familiar with TDX-IO, but it looks like we might have a problem here
> if we go with such a design.

The burden for who does vmxon will simply need to change from KVM itself
to some common code that KVM depends on.  Probably not dissimilar to
those nutty (sorry folks, just calling it as I see 'em) multi-KVM module
patches that are floating around.
Huang, Kai Dec. 4, 2023, 11:41 p.m. UTC | #9
On Mon, 2023-12-04 at 23:24 +0000, Huang, Kai wrote:
> In the long term, if we go with this design then there might be other
> problems when other kernel components are using TDX.  For example, the VT-d
> driver will need to be changed to support TDX-IO, and it will need to enable
> the TDX module much earlier than KVM to do some initialization.  It might
> need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> super familiar with TDX-IO, but it looks like we might have a problem here
> if we go with such a design.

Perhaps I shouldn't use a future feature as an argument; e.g., with multiple
TDX users we are likely to have a refcount to see whether we can truly shut
down TDX.

And VMX on/off will also need to be moved out of KVM for that work.

But the point is it's better not to assume how these kernel components will
use VMX on/off.  E.g., one may just choose to simply turn on VMX, do a
SEAMCALL, and then turn off VMX immediately, while the TDX module stays
alive all the time.

Keeping VMX on will suppress INIT; I guess that's another reason we prefer
turning VMX on only when needed.

/*      
 * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during
 * reboot.  VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
 * GIF=0, i.e. if the crash occurred between CLGI and STGI.
 */
void cpu_emergency_disable_virtualization(void)
{
	...
}
Huang, Kai Dec. 4, 2023, 11:56 p.m. UTC | #10
On Mon, 2023-12-04 at 15:39 -0800, Dave Hansen wrote:
> On 12/4/23 15:24, Huang, Kai wrote:
> > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> ...
> > In ancient times KVM used to enable VMX immediately when it was loaded, but
> > later it was changed to only enable VMX when there's an active VM, for the
> > above reason.
> > 
> > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").
> 
> Fine.  This doesn't need to change ... until you load TDX.  Once you
> initialize the TDX module, no more out-of-tree VMMs for you.
> 
> That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
> dynamic TDX module initialization is a good idea.

I don't have objection to this.

> 
> > > It's not wrong to say that TDX is a
> > > KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
> > > down the TDX module.  Then there's no PAMT to worry about.
> > > 
> > > The shutdown would be something like:
> > > 
> > >       1. TDX module shutdown
> > >       2. Deallocate/Convert PAMT
> > >       3. vmxoff
> > > 
> > > Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> > > to be missed.
> > 
> > The limitation is that once the TDX module is shut down, it cannot be
> > initialized again unless it is updated at runtime.
> > 
> > In the long term, if we go with this design then there might be other
> > problems when other kernel components are using TDX.  For example, the VT-d
> > driver will need to be changed to support TDX-IO, and it will need to enable
> > the TDX module much earlier than KVM to do some initialization.  It might
> > need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> > super familiar with TDX-IO, but it looks like we might have a problem here
> > if we go with such a design.
> 
> The burden for who does vmxon will simply need to change from KVM itself
> to some common code that KVM depends on.  Probably not dissimilar to
> those nutty (sorry folks, just calling it as I see 'em) multi-KVM module
> patches that are floating around.
> 

Right, we will need to move VMX on/off out of KVM for that purpose.  I think
the point is it's better not to assume how these kernel components will use
VMX on/off.  E.g., one may just choose to simply turn on VMX, do a SEAMCALL,
and then turn off VMX immediately, while the TDX module stays alive all the
time.

Or we require that they all: 1) enable VMX; 2) enable/use TDX; 3) disable
TDX when no longer needed; 4) disable VMX.

But I don't have a strong opinion here either.
Sean Christopherson Dec. 5, 2023, 2:04 a.m. UTC | #11
On Mon, Dec 04, 2023, Dave Hansen wrote:
> On 12/4/23 15:24, Huang, Kai wrote:
> > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> ...
> > In ancient times KVM used to enable VMX immediately when it was loaded, but
> > later it was changed to only enable VMX when there's an active VM, for the
> > above reason.
> > 
> > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

Huh, I always just assumed it was some backwards thinking about enabling VMX/SVM
being "dangerous" or something.

> Fine.  This doesn't need to change ... until you load TDX.  Once you
> initialize the TDX module, no more out-of-tree VMMs for you.

It's not just out-of-tree hypervisors, which IMO should be little more than an
afterthought.  The other, more important issue is that being post-VMXON blocks
INIT, i.e. VMX needs to be disabled before reboot, suspend, etc.  Forcing
kvm_usage_count would work, but it would essentially do away with "graceful"
reboots, i.e. reboots where the host isn't running VMs and thus VMX is already
disabled.  Having VMX be enabled so long as KVM is loaded would turn all
reboots into the "oh crap, the system is rebooting, quick do VMXOFF!" variety.

> That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
> dynamic TDX module initialization is a good idea.
> 
> >> It's not wrong to say that TDX is a KVM user.  If KVM wants
> >> 'kvm_usage_count' to go back to 0, it can shut down the TDX module.  Then
> >> there's no PAMT to worry about.
> >>
> >> The shutdown would be something like:
> >>
> >>       1. TDX module shutdown
> >>       2. Deallocate/Convert PAMT
> >>       3. vmxoff
> >>
> >> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> >> to be missed.
> > 
> > The limitation is that once the TDX module is shut down, it cannot be
> > initialized again unless it is updated at runtime.
> > 
> > In the long term, if we go with this design then there might be other
> > problems when other kernel components are using TDX.  For example, the VT-d
> > driver will need to be changed to support TDX-IO, and it will need to enable
> > the TDX module much earlier than KVM to do some initialization.  It might
> > need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> > super familiar with TDX-IO, but it looks like we might have a problem here
> > if we go with such a design.
> 
> The burden for who does vmxon will simply need to change from KVM itself
> to some common code that KVM depends on.  Probably not dissimilar to
> those nutty (sorry folks, just calling it as I see 'em) multi-KVM module

You misspelled "amazing" ;-)

> patches that are floating around.

Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
or do all that much initialization or teardown.

TDX on the other hand is quite different.  IIRC the PAMT is hundreds of MiB, maybe
over a GiB in most expected use cases?  And also IIRC, TDH.SYS.INIT is a
rather long-running operation that blocks IRQs, NMIs, (SMIs?), etc.

So rather than shove TDX ownership into KVM and force KVM to figure out how to
manage the TDX module, why not do what us nutty people are suggesting and move
hardware enabling and TDX-module management into a dedicated base module (bonus
points if you call it vac.ko ;-) ).

Alternatively, we could have a dedicated kernel module for TDX, e.g. tdx.ko, and
then have tdx.ko and kvm.ko depend on vac.ko.  But I think that ends up being
quite gross and unnecessary, e.g. in such a setup, kvm-intel.ko ideally wouldn't
take a hard dependency on tdx.ko, as auto-loading tdx.ko would defeat some of the
purpose of the split, and KVM shouldn't fail to load just because TDX isn't supported.
But that'd mean conditionally doing request_module("tdx") or whatever and would
create other conundrums.

(Oof, typing that out made me realize that KVM depends on the PSP driver if
CONFIG_KVM_AMD_SEV=y, even if if the platform owner has no intention of ever using
SEV/SEV-ES.  IIUC, it works because sp_mod_init() just registers a driver, i.e.
doesn't fail out if there's no PSP.  That's kinda gross).

Anyways, vac.ko provides an API to grab a reference to the TDX module, e.g. the
"create a VM" API gets extended to say "create a VM of the TDX variety", and then
vac.ko manages its refcounts to VMX and TDX accordingly.  And KVM obviously keeps
its existing behavior of getting and putting references for each VM.

That way userspace gets to decide when to (un)load tdx.ko without needing to add
a KVM module param or whatever to allow forcefully unloading tdx.ko (which would
be bizarre and probably quite difficult to implement correctly), and unloading
kvm-intel.ko wouldn't require unloading the TDX module.

The end behavior might not be all that different in the short term, but it would
give us more options, e.g. for this erratum, it would be quite easy for vac.ko to
let usersepace choose between keeping VMX "on" (while the TDX module is loaded)
and potentially having imperfect #MC messages.

And out-of-tree hypervisors could even use vac.ko's exported APIs to manage hardware
enabling if they so choose.
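
A sketch of the refcounting side of that (everything below, including
the names, is hypothetical):

static DEFINE_MUTEX(vac_lock);
static int vac_tdx_users;

int vac_get_tdx(void)
{
	int r = 0;

	mutex_lock(&vac_lock);
	if (!vac_tdx_users) {
		r = vac_vmxon_all_cpus();
		if (!r) {
			r = tdx_enable();
			if (r)
				vac_vmxoff_all_cpus();
		}
	}
	if (!r)
		vac_tdx_users++;
	mutex_unlock(&vac_lock);
	return r;
}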
Borislav Petkov Dec. 5, 2023, 2:25 p.m. UTC | #12
On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> +static const char *mce_memory_info(struct mce *m)
> +{
> +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> +		return NULL;
> +
> +	/*
> +	 * Certain initial generations of TDX-capable CPUs have an
> +	 * erratum.  A kernel non-temporal partial write to TDX private
> +	 * memory poisons that memory, and a subsequent read of that
> +	 * memory triggers #MC.
> +	 *
> +	 * However such #MC caused by software cannot be distinguished
> +	 * from the real hardware #MC.  Just print additional message
> +	 * to show such #MC may be result of the CPU erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return NULL;
> +
> +	return !tdx_is_private_mem(m->addr) ? NULL :
> +		"TDX private memory error. Possible kernel bug.";
> +}
> +
>  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  {
>  	struct llist_node *pending;
>  	struct mce_evt_llist *l;
>  	int apei_err = 0;
> +	const char *memmsg;
>  
>  	/*
>  	 * Allow instrumentation around external facilities usage. Not that it
> @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	}
>  	if (exp)
>  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> +	/*
> +	 * Confidential computing platforms such as TDX platforms
> +	 * may encounter MCEs due to incorrect accesses to confidential
> +	 * memory.  Print additional information for such errors.
> +	 */
> +	memmsg = mce_memory_info(final);
> +	if (memmsg)
> +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> +

No, this is not how this is done. First of all, this function should be
called something like

	mce_dump_aux_info()

or so to state that it is dumping some auxiliary info.

Then, it does:

	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return tdx_get_mce_info();

or so and you put that tdx_get_mce_info() function in TDX code and there
you do all your picking apart of things, what needs to be dumped or what
not, checking whether it is a memory error and so on.

Thx.
Dave Hansen Dec. 5, 2023, 4:36 p.m. UTC | #13
On 12/4/23 18:04, Sean Christopherson wrote:
> Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
> a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
> in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
> or do all that much initialization or teardown.

Yeah, you have a good point there.  We really do need some core code to
manage VMXON/OFF now that there is increased interest outside of
_purely_ running VMs.

For the purposes of _this_ patch, I think I'm happy to leave open the
possibility that SEAMCALL can simply fail due to VMXOFF.  For now, it
means that we can't attribute #MC's to the PAMT unless a VM is running
but that seems like a reasonable compromise for the moment.

Once TDX gains the ability to "pin" VMXON, the added precision here will
be appreciated.
Tony Luck Dec. 5, 2023, 4:36 p.m. UTC | #14
>> Fine.  This doesn't need to change ... until you load TDX.  Once you
>> initialize the TDX module, no more out-of-tree VMMs for you.
>
> It's not just out-of-tree hypervisors, which IMO should be little more than an
> afterthought.  The other more important issue is that being post-VMXON blocks INIT,

Does that make CPU offline a one-way process? Linux uses INIT to bring a CPU back
online again.

-Tony
Sean Christopherson Dec. 5, 2023, 4:53 p.m. UTC | #15
On Tue, Dec 05, 2023, Dave Hansen wrote:
> On 12/4/23 18:04, Sean Christopherson wrote:
> > Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
> > a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
> > in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
> > or do all that much initialization or teardown.
> 
> Yeah, you have a good point there.  We really do need some core code to
> manage VMXON/OFF now that there is increased interest outside of
> _purely_ running VMs.
> 
> For the purposes of _this_ patch, I think I'm happy to leave open the
> possibility that SEAMCALL can simply fail due to VMXOFF.  For now, it
> means that we can't attribute #MC's to the PAMT unless a VM is running
> but that seems like a reasonable compromise for the moment.

+1

> Once TDX gains the ability to "pin" VMXON, the added precision here will
> be appreciated.
Sean Christopherson Dec. 5, 2023, 4:57 p.m. UTC | #16
On Tue, Dec 05, 2023, Tony Luck wrote:
> >> Fine.  This doesn't need to change ... until you load TDX.  Once you
> >> initialize the TDX module, no more out-of-tree VMMs for you.
> >
> > It's not just out-of-tree hypervisors, which IMO should be little more than an
> > afterthought.  The other more important issue is that being post-VMXON blocks INIT,
> 
> Does that make CPU offline a one-way process? Linux uses INIT to bring a CPU back
> online again.

No, KVM does VMXOFF on the CPU being offlined, and then VMXON if/when the CPU is
onlined again.  This also handles secondary CPUs for suspend/resume (the primary
CPU hooks .suspend() and .resume()).

static int kvm_offline_cpu(unsigned int cpu)
{
	mutex_lock(&kvm_lock);
	if (kvm_usage_count)
		hardware_disable_nolock(NULL);
	mutex_unlock(&kvm_lock);
	return 0;
}


static int kvm_online_cpu(unsigned int cpu)
{
	int ret = 0;

	/*
	 * Abort the CPU online process if hardware virtualization cannot
	 * be enabled. Otherwise running VMs would encounter unrecoverable
	 * errors when scheduled to this CPU.
	 */
	mutex_lock(&kvm_lock);
	if (kvm_usage_count)
		ret = __hardware_enable_nolock();
	mutex_unlock(&kvm_lock);
	return ret;
}
Huang, Kai Dec. 5, 2023, 7:41 p.m. UTC | #17
On Tue, 2023-12-05 at 15:25 +0100, Borislav Petkov wrote:
> On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> > +static const char *mce_memory_info(struct mce *m)
> > +{
> > +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> > +		return NULL;
> > +
> > +	/*
> > +	 * Certain initial generations of TDX-capable CPUs have an
> > +	 * erratum.  A kernel non-temporal partial write to TDX private
> > +	 * memory poisons that memory, and a subsequent read of that
> > +	 * memory triggers #MC.
> > +	 *
> > +	 * However such #MC caused by software cannot be distinguished
> > +	 * from the real hardware #MC.  Just print additional message
> > +	 * to show such #MC may be result of the CPU erratum.
> > +	 */
> > +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > +		return NULL;
> > +
> > +	return !tdx_is_private_mem(m->addr) ? NULL :
> > +		"TDX private memory error. Possible kernel bug.";
> > +}
> > +
> >  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> >  {
> >  	struct llist_node *pending;
> >  	struct mce_evt_llist *l;
> >  	int apei_err = 0;
> > +	const char *memmsg;
> >  
> >  	/*
> >  	 * Allow instrumentation around external facilities usage. Not that it
> > @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> >  	}
> >  	if (exp)
> >  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> > +	/*
> > +	 * Confidential computing platforms such as TDX platforms
> > +	 * may encounter MCEs due to incorrect accesses to confidential
> > +	 * memory.  Print additional information for such errors.
> > +	 */
> > +	memmsg = mce_memory_info(final);
> > +	if (memmsg)
> > +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> > +
> 
> No, this is not how this is done. First of all, this function should be
> called something like
> 
> 	mce_dump_aux_info()
> 
> or so to state that it is dumping some auxiliary info.
> 
> Then, it does:
> 
> 	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 		return tdx_get_mce_info();
> 
> or so and you put that tdx_get_mce_info() function in TDX code and there
> you do all your picking apart of things, what needs to be dumped or what
> not, checking whether it is a memory error and so on.
> 
> Thx.
> 

Thanks Boris.  Looks good to me, with one exception: this is actually the TDX
host, not a TDX guest, so the above X86_FEATURE_TDX_GUEST CPU feature check
needs to be replaced with a host-side one.

Full incremental diff below.  Could you take a look at whether this is what
you want?

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a621721f63dd..0c02b66dcc41 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -115,13 +115,13 @@ bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 void tdx_reset_memory(void);
-bool tdx_is_private_mem(unsigned long phys);
+const char *tdx_get_mce_info(unsigned long phys);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline void tdx_reset_memory(void) { }
-static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
+static inline const char *tdx_get_mce_info(unsigned long phys) { return NULL; }
 #endif /* CONFIG_INTEL_TDX_HOST */

 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e33537cfc507..b7e650b5f7ef 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -229,26 +229,20 @@ static void wait_for_panic(void)
        panic("Panicing machine check CPU died");
 }

-static const char *mce_memory_info(struct mce *m)
+static const char *mce_dump_aux_info(struct mce *m)
 {
-       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
-               return NULL;
-
        /*
-        * Certain initial generations of TDX-capable CPUs have an
-        * erratum.  A kernel non-temporal partial write to TDX private
-        * memory poisons that memory, and a subsequent read of that
-        * memory triggers #MC.
-        *
-        * However such #MC caused by software cannot be distinguished
-        * from the real hardware #MC.  Just print additional message
-        * to show such #MC may be result of the CPU erratum.
+        * Confidential computing platforms such as TDX platforms
+        * may encounter MCEs due to incorrect accesses to confidential
+        * memory.  Print additional information for such errors.
         */
-       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
                return NULL;

-       return !tdx_is_private_mem(m->addr) ? NULL :
-               "TDX private memory error. Possible kernel bug.";
+       if (platform_tdx_enabled())
+               return tdx_get_mce_info(m->addr);
+
+       return NULL;
 }

 static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
@@ -256,7 +250,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
        struct llist_node *pending;
        struct mce_evt_llist *l;
        int apei_err = 0;
-       const char *memmsg;
+       const char *auxinfo;

        /*
         * Allow instrumentation around external facilities usage. Not that it
@@ -307,14 +301,10 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
        }
        if (exp)
                pr_emerg(HW_ERR "Machine check: %s\n", exp);
-       /*
-        * Confidential computing platforms such as TDX platforms
-        * may encounter MCEs due to incorrect accesses to confidential
-        * memory.  Print additional information for such errors.
-        */
-       memmsg = mce_memory_info(final);
-       if (memmsg)
-               pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
+       auxinfo = mce_dump_aux_info(final);
+       if (auxinfo)
+               pr_emerg(HW_ERR "Machine check: %s\n", auxinfo);

        if (!fake_panic) {
                if (panic_timeout == 0)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b84dcdf63cb..cfbaec0f43b2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1329,7 +1329,7 @@ static bool is_pamt_page(unsigned long phys)
  * because it cannot distinguish #MC between software bug and real
  * hardware error anyway.
  */
-bool tdx_is_private_mem(unsigned long phys)
+static bool tdx_is_private_mem(unsigned long phys)
 {
        struct tdx_module_args args = {
                .rcx = phys & PAGE_MASK,
@@ -1391,6 +1391,25 @@ bool tdx_is_private_mem(unsigned long phys)
        }
 }

+const char *tdx_get_mce_info(unsigned long phys)
+{
+       /*
+        * Certain initial generations of TDX-capable CPUs have an
+        * erratum.  A kernel non-temporal partial write to TDX private
+        * memory poisons that memory, and a subsequent read of that
+        * memory triggers #MC.
+        *
+        * However such #MC caused by software cannot be distinguished
+        * from the real hardware #MC.  Just print additional message
+        * to show such #MC may be result of the CPU erratum.
+        */
+       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+               return NULL;
+
+       return !tdx_is_private_mem(phys) ? NULL :
+               "TDX private memory error. Possible kernel bug.";
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                                            u32 *nr_tdx_keyids)
 {
Borislav Petkov Dec. 5, 2023, 7:56 p.m. UTC | #18
On Tue, Dec 05, 2023 at 07:41:41PM +0000, Huang, Kai wrote:
> -static const char *mce_memory_info(struct mce *m)
> +static const char *mce_dump_aux_info(struct mce *m)
>  {
> -       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> -               return NULL;
> -
>         /*
> -        * Certain initial generations of TDX-capable CPUs have an
> -        * erratum.  A kernel non-temporal partial write to TDX private
> -        * memory poisons that memory, and a subsequent read of that
> -        * memory triggers #MC.
> -        *
> -        * However such #MC caused by software cannot be distinguished
> -        * from the real hardware #MC.  Just print additional message
> -        * to show such #MC may be result of the CPU erratum.
> +        * Confidential computing platforms such as TDX platforms
> +        * may encounter MCEs due to incorrect accesses to confidential
> +        * memory.  Print additional information for such errors.
>          */
> -       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
>                 return NULL;

What's the point of doing this on !TDX? None.

> -       return !tdx_is_private_mem(m->addr) ? NULL :
> -               "TDX private memory error. Possible kernel bug.";
> +       if (platform_tdx_enabled())

So is this the "host is TDX" check?

Not a X86_FEATURE flag but something homegrown. And Kirill is trying to
switch the CC_ATTRs to X86_FEATURE_ flags for SEV but here you guys are
using something homegrown.

why not a X86_FEATURE_ flag?

The CC_ATTR things are for guests, I guess, but the host feature checks
should be X86_FEATURE_ flags things.

Hmmm.
Huang, Kai Dec. 5, 2023, 8:08 p.m. UTC | #19
On Tue, 2023-12-05 at 20:56 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 07:41:41PM +0000, Huang, Kai wrote:
> > -static const char *mce_memory_info(struct mce *m)
> > +static const char *mce_dump_aux_info(struct mce *m)
> >  {
> > -       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> > -               return NULL;
> > -
> >         /*
> > -        * Certain initial generations of TDX-capable CPUs have an
> > -        * erratum.  A kernel non-temporal partial write to TDX private
> > -        * memory poisons that memory, and a subsequent read of that
> > -        * memory triggers #MC.
> > -        *
> > -        * However such #MC caused by software cannot be distinguished
> > -        * from the real hardware #MC.  Just print additional message
> > -        * to show such #MC may be result of the CPU erratum.
> > +        * Confidential computing platforms such as TDX platforms
> > +        * may encounter MCEs due to incorrect accesses to confidential
> > +        * memory.  Print additional information for such errors.
> >          */
> > -       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > +       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> >                 return NULL;
> 
> What's the point of doing this on !TDX? None.

OK. I'll move this inside tdx_get_mce_info(). 

> 
> > -       return !tdx_is_private_mem(m->addr) ? NULL :
> > -               "TDX private memory error. Possible kernel bug.";
> > +       if (platform_tdx_enabled())
> 
> So is this the "host is TDX" check?
> 
> Not a X86_FEATURE flag but something homegrown. And Kirill is trying to
> switch the CC_ATTRs to X86_FEATURE_ flags for SEV but here you guys are
> using something homegrown.
> 
> why not a X86_FEATURE_ flag?
> 

The difference is that for the TDX host the kernel needs to initialize the
TDX module before TDX can be used.  The module initialization is done at
runtime, and platform_tdx_enabled() here only returns whether the BIOS has
enabled TDX.

IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
understanding the flag being present means the kernel has done some enabling
work and the feature is ready to use.
Borislav Petkov Dec. 5, 2023, 8:29 p.m. UTC | #20
On Tue, Dec 05, 2023 at 08:08:34PM +0000, Huang, Kai wrote:
> The difference is that for the TDX host the kernel needs to initialize the
> TDX module before TDX can be used.  The module initialization is done at
> runtime, and platform_tdx_enabled() here only returns whether the BIOS has
> enabled TDX.
> 
> IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
> understanding the flag being present means the kernel has done some enabling
> work and the feature is ready to use.

Which flag do you mean? X86_FEATURE_TDX_GUEST?

I mean, you would set a separate X86_FEATURE_TDX or so flag to denote
that the BIOS has enabled it, at the end of that tdx_init() in the first
patch.
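
I.e., something like this sketch (the flag name is made up;
setup_force_cpu_cap() is the usual way to set a synthetic bit):

/* In tdx_init(), once BIOS-side enabling is confirmed: */
setup_force_cpu_cap(X86_FEATURE_TDX_HOST);

/* Callers then use the ordinary feature test: */
if (!boot_cpu_has(X86_FEATURE_TDX_HOST))
	return NULL;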
Huang, Kai Dec. 5, 2023, 8:33 p.m. UTC | #21
On Tue, 2023-12-05 at 21:29 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:08:34PM +0000, Huang, Kai wrote:
> > The difference is that for the TDX host the kernel needs to initialize the
> > TDX module before TDX can be used.  The module initialization is done at
> > runtime, and platform_tdx_enabled() here only returns whether the BIOS has
> > enabled TDX.
> > 
> > IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
> > understanding the flag being present means the kernel has done some enabling
> > work and the feature is ready to use.
> 
> Which flag do you mean? X86_FEATURE_TDX_GUEST?
> 
> I mean, you would set a separate X86_FEATURE_TDX or so flag to denote
> that the BIOS has enabled it, at the end of that tdx_init() in the first
> patch.
> 

Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
fit, because when it is set the kernel hasn't actually done any enabling
work yet, so TDX is not available even though the flag is set.
Borislav Petkov Dec. 5, 2023, 8:41 p.m. UTC | #22
On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
> Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
> fit, because when it is set the kernel hasn't actually done any enabling
> work yet, so TDX is not available even though the flag is set.

You define a X86_FEATURE flag. You set it *when* TDX is available and
enabled. Then you query that flag. This is how synthetic flags work.
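
Something like this (a sketch; the flag name and the bit are made up):

  /* arch/x86/include/asm/cpufeatures.h */
  #define X86_FEATURE_TDX_HOST    (/* some free synthetic bit */)

  /* set it once, at the place TDX is known to be usable: */
  setup_force_cpu_cap(X86_FEATURE_TDX_HOST);

  /* query sites then simply do: */
  if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST))
          return NULL;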

In your patchset, when do you know that TDX is enabled? Point me to the
code place pls.
Dave Hansen Dec. 5, 2023, 8:49 p.m. UTC | #23
On 12/5/23 12:41, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
>> Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
>> fit, because when it is set the kernel hasn't actually done any enabling
>> work yet, so TDX is not available even though the flag is set.
> You define a X86_FEATURE flag. You set it *when* TDX is available and
> enabled. Then you query that flag. This is how synthetic flags work.
> 
> In your patchset, when do you know that TDX is enabled? Point me to the
> code place pls.

TDX can be "ready" in a couple of different ways:

1. The module is there and running SEAMCALLs (platform_tdx_enabled())
2. The module is initialized and ready to run guests.  This happens
   after init_tdmrs() and init_tdx_module() return success.

#1 is known at boot.
#2 doesn't happen until just before KVM runs the first TDX guest.
Here's the patch for #2:

> https://lore.kernel.org/all/566ff8b05090c935d980d5ace3389d31c7cce7df.1699527082.git.kai.huang@intel.com/
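
In code terms, roughly (a sketch; tdx_ready_for_guests() is a made-up
name, and tdx_module_status is internal to tdx.c in this series):

  static bool tdx_ready_for_guests(void)
  {
          /* #1: BIOS enabled TDX, so SEAMCALLs are possible at all... */
          if (!platform_tdx_enabled())
                  return false;

          /* ...#2: and the module itself has been fully initialized. */
          return tdx_module_status == TDX_MODULE_INITIALIZED;
  }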
Huang, Kai Dec. 5, 2023, 8:58 p.m. UTC | #24
On Tue, 2023-12-05 at 21:41 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
> > Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
> > fit, because when it is set the kernel hasn't actually done any enabling
> > work yet, so TDX is not available even though the flag is set.
> 
> You define a X86_FEATURE flag. You set it *when* TDX is available and
> enabled. Then you query that flag. This is how synthetic flags work.
> 
> In your patchset, when do you know that TDX is enabled? Point me to the
> code place pls.
> 

This patchset provides two functions to allow the user of TDX to enable TDX at
runtime when needed: tdx_cpu_enable() and tdx_enable().

Please see patch:

https://lore.kernel.org/lkml/cover.1699527082.git.kai.huang@intel.com/T/#m96cb9aaa4e323d4e29f7ff6c532f7d33a01995a7

So TDX will be available once tdx_enable() has completed successfully.

For now KVM is the only user of TDX, and tdx_enable() will be called by KVM on
demand at runtime.
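
The expected flow on the KVM side roughly looks like below (a sketch
only; kvm_init_tdx() is a made-up name and error handling is trimmed):

  static int kvm_init_tdx(void)
  {
          int ret;

          /*
           * tdx_cpu_enable() is per-cpu and must be run on each
           * online cpu with VMXON done, e.g. from KVM's per-cpu
           * hardware-enable path; shown once here for brevity.
           */
          ret = tdx_cpu_enable();
          if (ret)
                  return ret;

          /* One-shot, system-wide TDX module initialization. */
          return tdx_enable();
  }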

Hope I've made this clear.  Thanks.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index caca139e7022..a621721f63dd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -115,11 +115,13 @@  bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 void tdx_reset_memory(void);
+bool tdx_is_private_mem(unsigned long phys);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline void tdx_reset_memory(void) { }
+static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7b397370b4d6..e33537cfc507 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@ 
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/tdx.h>
 
 #include "internal.h"
 
@@ -228,11 +229,34 @@  static void wait_for_panic(void)
 	panic("Panicing machine check CPU died");
 }
 
+static const char *mce_memory_info(struct mce *m)
+{
+	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+		return NULL;
+
+	/*
+	 * Certain initial generations of TDX-capable CPUs have an
+	 * erratum.  A kernel non-temporal partial write to TDX private
+	 * memory poisons that memory, and a subsequent read of that
+	 * memory triggers #MC.
+	 *
+	 * However, such a software-caused #MC cannot be distinguished
+	 * from a real hardware #MC.  Just print an additional message
+	 * to note that such an #MC may be a result of the CPU erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return NULL;
+
+	return !tdx_is_private_mem(m->addr) ? NULL :
+		"TDX private memory error. Possible kernel bug.";
+}
+
 static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 {
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	const char *memmsg;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -283,6 +307,15 @@  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	}
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
+	/*
+	 * Confidential computing platforms such as TDX may generate
+	 * MCEs due to incorrect access to confidential memory.
+	 * Print additional information for such errors.
+	 */
+	memmsg = mce_memory_info(final);
+	if (memmsg)
+		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cc21a0f25bee..1b84dcdf63cb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1288,6 +1288,109 @@  void tdx_reset_memory(void)
 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
 }
 
+static bool is_pamt_page(unsigned long phys)
+{
+	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+	int i;
+
+	/*
+	 * This function is called from the #MC handler, and in theory
+	 * it could run in parallel with the TDX module initialization
+	 * on other logical cpus.  But it's not OK to hold a mutex
+	 * here, so just blindly check the module status to make sure
+	 * PAMTs/TDMRs are stable to access.
+	 *
+	 * This may return an inaccurate result in rare cases, e.g.,
+	 * when an #MC happens on a PAMT page during module
+	 * initialization, but this is fine as the #MC handler doesn't
+	 * need a 100% accurate result.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return false;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+		if (phys >= base && phys < (base + size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not.  Called from the #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish an #MC caused by a software bug from a
+ * real hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
+{
+	struct tdx_module_args args = {
+		.rcx = phys & PAGE_MASK,
+	};
+	u64 sret;
+
+	if (!platform_tdx_enabled())
+		return false;
+
+	/* Get page type from the TDX module */
+	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	/*
+	 * Handle the case where the CPU isn't in VMX operation.
+	 *
+	 * KVM guarantees no VM is running (thus no TDX guest)
+	 * when any online CPU isn't in VMX operation.  This means
+	 * there will be no TDX guest private memory or Secure-EPT
+	 * pages.  However the TDX module may have been initialized
+	 * and the memory page could be a PAMT page.
+	 */
+	if (sret == TDX_SEAMCALL_UD)
+		return is_pamt_page(phys);
+
+	/*
+	 * Any other failure means:
+	 *
+	 * 1) TDX module not loaded; or
+	 * 2) Memory page isn't managed by the TDX module.
+	 *
+	 * In either case, the memory page cannot be a TDX
+	 * private page.
+	 */
+	if (sret)
+		return false;
+
+	/*
+	 * SEAMCALL was successful -- read page type (via RCX):
+	 *
+	 *  - PT_NDA:	Page is not used by the TDX module
+	 *  - PT_RSVD:	Reserved for Non-TDX use
+	 *  - Others:	Page is used by the TDX module
+	 *
+	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
+	 * private memory.
+	 *
+	 * Note: Even if the page type is PT_NDA, the memory page could
+	 * still be associated with a TDX private KeyID if the kernel
+	 * hasn't explicitly used MOVDIR64B to clear the page.  Assume
+	 * KVM always does that after reclaiming any private page from
+	 * TDX guests.
+	 */
+	switch (args.rcx) {
+	case PT_NDA:
+		return false;
+	case PT_RSVD:
+		return is_pamt_page(phys);
+	default:
+		return true;
+	}
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index c0610f0bb88c..b701f69485d3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -14,6 +14,7 @@ 
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_PHYMEM_PAGE_RDMD	24
 #define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INIT		33
 #define TDH_SYS_RD		34
@@ -21,6 +22,10 @@ 
 #define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_CONFIG		45
 
+/* TDX page types */
+#define	PT_NDA		0x0
+#define	PT_RSVD		0x1
+
 /*
  * Global scope metadata field ID.
  *