Message ID | 5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | TDX host kernel support |
On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > index 8ff07256a515..0aa413b712e8 100644 > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > tdmr_pamt_base += pamt_size[pgsz]; > } > > + /* > + * tdx_memory_shutdown() also reads TDMR's PAMT during > + * kexec() or reboot, which could happen at anytime, even > + * during this particular code. Make sure pamt_4k_base > + * is firstly set otherwise tdx_memory_shutdown() may > + * get an invalid PAMT base when it sees a valid number > + * of PAMT pages. > + */ Hmm? What prevents compiler from messing this up. It can reorder as it wishes, no? Maybe add a proper locking? Anything that prevent preemption would do, right? > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > index 8ff07256a515..0aa413b712e8 100644 > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > tdmr_pamt_base += pamt_size[pgsz]; > > } > > > > + /* > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > + * kexec() or reboot, which could happen at anytime, even > > + * during this particular code. Make sure pamt_4k_base > > + * is firstly set otherwise tdx_memory_shutdown() may > > + * get an invalid PAMT base when it sees a valid number > > + * of PAMT pages. > > + */ > > Hmm? What prevents compiler from messing this up. It can reorder as it > wishes, no? Hmm.. Right. Sorry I missed. > > Maybe add a proper locking? Anything that prevent preemption would do, > right? > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > I think a simple memory barrier will do. How does below look? --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, * tdx_memory_shutdown() also reads TDMR's PAMT during * kexec() or reboot, which could happen at anytime, even * during this particular code. Make sure pamt_4k_base - * is firstly set otherwise tdx_memory_shutdown() may - * get an invalid PAMT base when it sees a valid number - * of PAMT pages. + * is firstly set and place a __mb() after it otherwise + * tdx_memory_shutdown() may get an invalid PAMT base + * when it sees a valid number of PAMT pages. */ tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + __mb();
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > > index 8ff07256a515..0aa413b712e8 100644 > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > > tdmr_pamt_base += pamt_size[pgsz]; > > > } > > > > > > + /* > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > > + * kexec() or reboot, which could happen at anytime, even > > > + * during this particular code. Make sure pamt_4k_base > > > + * is firstly set otherwise tdx_memory_shutdown() may > > > + * get an invalid PAMT base when it sees a valid number > > > + * of PAMT pages. > > > + */ > > > > Hmm? What prevents compiler from messing this up. It can reorder as it > > wishes, no? > > Hmm.. Right. Sorry I missed. > > > > > Maybe add a proper locking? Anything that prevent preemption would do, > > right? > > > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > > > > I think a simple memory barrier will do. How does below look? > > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > * tdx_memory_shutdown() also reads TDMR's PAMT during > * kexec() or reboot, which could happen at anytime, even > * during this particular code. Make sure pamt_4k_base > - * is firstly set otherwise tdx_memory_shutdown() may > - * get an invalid PAMT base when it sees a valid number > - * of PAMT pages. > + * is firstly set and place a __mb() after it otherwise > + * tdx_memory_shutdown() may get an invalid PAMT base > + * when it sees a valid number of PAMT pages. > */ > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > + __mb(); If you want to play with barriers, assign pamt_4k_base the last with smp_store_release() and read it first in tdmr_get_pamt() with smp_load_acquire(). If it is non-zero, all pamt_* fields are valid. Or just drop this non-sense and use a spin lock for serialization.
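[For illustration, the release/acquire pairing suggested here could look roughly like the following. This is only a sketch: the field names come from the patch, tdmr_get_pamt() is shown schematically, and the real helpers may differ.]

/* Writer side, tdmr_set_up_pamt(): publish pamt_4k_base last. */
tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
/* Release: all stores above are visible before the base becomes non-zero. */
smp_store_release(&tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);

/* Reader side, e.g. tdmr_get_pamt(): acquire the base first. */
u64 pamt_4k_base = smp_load_acquire(&tdmr->pamt_4k_base);
if (!pamt_4k_base)
        return;         /* PAMT not published yet, nothing to reset */
/* Every other pamt_* field read after this point is valid. */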
On Mon, 2023-06-12 at 10:58 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > > > index 8ff07256a515..0aa413b712e8 100644 > > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > > > tdmr_pamt_base += pamt_size[pgsz]; > > > > } > > > > > > > > + /* > > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > > > + * kexec() or reboot, which could happen at anytime, even > > > > + * during this particular code. Make sure pamt_4k_base > > > > + * is firstly set otherwise tdx_memory_shutdown() may > > > > + * get an invalid PAMT base when it sees a valid number > > > > + * of PAMT pages. > > > > + */ > > > > > > Hmm? What prevents compiler from messing this up. It can reorder as it > > > wishes, no? > > > > Hmm.. Right. Sorry I missed. > > > > > > > > Maybe add a proper locking? Anything that prevent preemption would do, > > > right? > > > > > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > > > > > > > I think a simple memory barrier will do. How does below look? > > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > * tdx_memory_shutdown() also reads TDMR's PAMT during > > * kexec() or reboot, which could happen at anytime, even > > * during this particular code. Make sure pamt_4k_base > > - * is firstly set otherwise tdx_memory_shutdown() may > > - * get an invalid PAMT base when it sees a valid number > > - * of PAMT pages. > > + * is firstly set and place a __mb() after it otherwise > > + * tdx_memory_shutdown() may get an invalid PAMT base > > + * when it sees a valid number of PAMT pages. > > */ > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > + __mb(); > > If you want to play with barriers, assign pamt_4k_base the last with > smp_store_release() and read it first in tdmr_get_pamt() with > smp_load_acquire(). If it is non-zero, all pamt_* fields are valid. > > Or just drop this non-sense and use a spin lock for serialization. > We don't need to guarantee when pamt_4k_base is valid, all other pamt_* are valid. Instead, we need to guarantee when (at least) _one_ of pamt_*_size is valid, the pamt_4k_base is valid. For example, pamt_4k_base -> valid pamt_4k_size -> invalid (0) pamt_2m_size -> invalid pamt_1g_size -> invalid and pamt_4k_base -> valid pamt_4k_size -> valid pamt_2m_size -> invalid pamt_1g_size -> invalid are both OK. The reason is the PAMTs are only written by the TDX module in init_tdmrs(). So if tdx_memory_shutdown() sees a part of PAMT (the second case above), those PAMT pages are not yet TDX private pages, thus converting part of PAMT is fine. The invalid case is when any pamt_*_size is valid, pamt_4k_base is invalid, e.g.: pamt_4k_base -> invalid pamt_4k_size -> valid pamt_2m_size -> invalid pamt_1g_size -> invalid as this case tdx_memory_shutdown() will convert a incorrect (not partial) PAMT area. 
So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base will be seen by other cpus. Does it make sense?
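[To make the invariant concrete, the shutdown-side reader being assumed looks roughly like this -- a simplified, hypothetical sketch, not the actual code:]

/* Shutdown path, simplified: a non-zero PAMT size implies a valid base. */
if (!tdmr->pamt_4k_size && !tdmr->pamt_2m_size && !tdmr->pamt_1g_size)
        return;         /* no PAMT published for this TDMR yet */
/*
 * At least one pamt_*_size store is visible.  Because pamt_4k_base is
 * stored (and ordered by the barrier) before any of the sizes, the base
 * is valid here.  At worst only part of the PAMT is accounted so far,
 * and converting a partial PAMT back to normal is harmless because the
 * TDX module has not yet written to those pages.
 */
/* ... convert [pamt_4k_base, pamt_4k_base + published size) back to normal ... */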
On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> Does it make sense?
I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
would be better), but it is not justified from complexity PoV. This
lockless exercise gave me a pause to understand.
Lockless doesn't buy you anything here, only increases complexity.
Just take a lock.
Kernel is big. I'm sure you'll find a better opportunity to be clever
about serialization :P
From: kirill.shutemov@linux.intel.com > Sent: 12 June 2023 12:49 > > On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote: > > Does it make sense? > > I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb() > would be better), but it is not justified from complexity PoV. Given that x86 performs writes pretty much in code order, do you need anything more than a compiler barrier? > This lockless exercise gave me a pause to understand. > > Lockless doesn't buy you anything here, only increases complexity. > Just take a lock. Indeed... David
On 6/12/23 03:27, Huang, Kai wrote: > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > will be seen by other cpus. > > Does it make sense? Just use a normal old atomic_t or set_bit()/test_bit(). They have built-in memory barriers and are less likely to get botched.
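[A rough sketch of the set_bit()/test_bit() idea follows; the flag variable and bit name are invented here, and explicit barriers are spelled out since non-value-returning bitops do not by themselves guarantee ordering on all architectures:]

static unsigned long tdx_pamt_flags;            /* hypothetical */
#define TDX_PAMT_VALID          0               /* hypothetical bit number */

/* Writer, after all tdmr->pamt_* fields have been filled in: */
smp_mb__before_atomic();                        /* order pamt_* stores before the flag */
set_bit(TDX_PAMT_VALID, &tdx_pamt_flags);

/* Reader (shutdown path): */
if (!test_bit(TDX_PAMT_VALID, &tdx_pamt_flags))
        return;                                 /* PAMT not set up, nothing to do */
smp_rmb();                                      /* pair with the writer's ordering */
tdmrs_reset_pamt_all(&tdx_tdmr_list);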
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > On 6/12/23 03:27, Huang, Kai wrote: > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > will be seen by other cpus. > > > > Does it make sense? > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. Thanks for the suggestion. Hi Dave, Kirill, I'd like to check with you whether we should introduce a mechanism to track TDX private pages for both this patch and the next. As you can see this patch only deals with PAMT pages due to a couple of reasons mentioned in the changelog. The next MCE patch handles all TDX private pages, but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is slow (probably doesn't matter, though); 2) it brings additional risk of triggering further #MC inside the TDX module, although such risk should be a theoretical thing. If we introduce a helper to mark a page as a TDX private page, then both above patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in this patch (we will need a way to loop over all TDX-usable memory pages, but this needs to be done anyway with TDX guests). I believe eventually we can end up with less code. In terms of how to do it: for PAMT pages, we can set page->private to a TDX magic number because they come out of the page allocator directly. Secure-EPT pages are like PAMT pages too. For TDX guest private pages, Sean is moving to implement KVM's own pseudo filesystem so they will have a unique mapping to identify. https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c And my thinking is in this TDX host series, we can just handle PAMT pages. Both Secure-EPT and TDX guest private pages can be handled later in the KVM TDX series. I think eventually we can have a function like below to tell whether a page is a TDX private page: bool page_is_tdx_private(struct page *page) { if (page->private == TDX_PRIVATE_MAGIC) return true; if (!page_mapping(page)) return false; return page_mapping(page)->a_ops == &kvm_gmem_ops; } How does this sound? Or any other comments? Thanks!
On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote: > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > On 6/12/23 03:27, Huang, Kai wrote: > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > will be seen by other cpus. > > > > > > Does it make sense? > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > Thanks for the suggestion. > > Hi Dave, Kirill, > > I'd like to check with you that whether we should introduce a mechanism to track > TDX private pages for both this patch and the next. > > As you can see this patch only deals PAMT pages due to couple of reasons that > mnentioned in the changelog. The next MCE patch handles all TDX private pages, > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is > slow (probably doesn't matter, though); 2) it brings additional risk of > triggering further #MC inside TDX module, although such risk should be a > theoretical thing. > > If we introduce a helper to mark a page as TDX private page, then both above > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in > this patch (we will need a way to loop all TDX-usable memory pages, but this > needs to be done anyway with TDX guests). I believe eventually we can end up > with less code. > > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic > number because they come out of page allocator directly. Secure-EPT pages are > like PAMT pages too. For TDX guest private pages, Sean is moving to implement > KVM's own pseudo filesystem so they will have a unique mapping to identify. > > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c > > And my thinking is in this TDX host series, we can just handle PAMT pages. Both > secure-EPT and TDX guest private pages can be handled later in KVM TDX series. > I think eventually we can have a function like below to tell whether a page is > TDX private page: > > bool page_is_tdx_private(struct page *page) > { > if (page->private == TDX_PRIVATE_MAGIC) > return true; > > if (!page_mapping(page)) > return false; > > return page_mapping(page)->a_ops == &kvm_gmem_ops; > } > > How does this sound? Or any other comments? Thanks! If you going to take this path it has to be supported natively by kvm_gmem_: it has to provide API for that. You should not assume that page->private is free to use. It is owned by kvm_gmmem.
On 6/12/23 17:51, Huang, Kai wrote:
> If we introduce a helper to mark a page as TDX private page,
Let me get this right: you have working, functional code for a
highly-unlikely scenario (kernel bugs or even more rare hardware
errors). But, you want to optimize this super-rare case? It's not fast
enough?
Is there any other motivation here that I'm missing?
On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > On 6/12/23 17:51, Huang, Kai wrote: > > If we introduce a helper to mark a page as TDX private page, > > Let me get this right: you have working, functional code for a > highly-unlikely scenario (kernel bugs or even more rare hardware > errors). But, you want to optimize this super-rare case? It's not fast > enough? > > Is there any other motivation here that I'm missing? > No, it's not about speed. The motivation is to have common code that yields fewer lines of code, though I don't have a clear number of how many LoC can be reduced.
On Tue, 2023-06-13 at 14:05 +0300, kirill.shutemov@linux.intel.com wrote: > On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote: > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > > On 6/12/23 03:27, Huang, Kai wrote: > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > > will be seen by other cpus. > > > > > > > > Does it make sense? > > > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > built-in memory barriers are are less likely to get botched. > > > > Thanks for the suggestion. > > > > Hi Dave, Kirill, > > > > I'd like to check with you that whether we should introduce a mechanism to track > > TDX private pages for both this patch and the next. > > > > As you can see this patch only deals PAMT pages due to couple of reasons that > > mnentioned in the changelog. The next MCE patch handles all TDX private pages, > > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is > > slow (probably doesn't matter, though); 2) it brings additional risk of > > triggering further #MC inside TDX module, although such risk should be a > > theoretical thing. > > > > If we introduce a helper to mark a page as TDX private page, then both above > > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in > > this patch (we will need a way to loop all TDX-usable memory pages, but this > > needs to be done anyway with TDX guests). I believe eventually we can end up > > with less code. > > > > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic > > number because they come out of page allocator directly. Secure-EPT pages are > > like PAMT pages too. For TDX guest private pages, Sean is moving to implement > > KVM's own pseudo filesystem so they will have a unique mapping to identify. > > > > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c > > > > And my thinking is in this TDX host series, we can just handle PAMT pages. Both > > secure-EPT and TDX guest private pages can be handled later in KVM TDX series. > > I think eventually we can have a function like below to tell whether a page is > > TDX private page: > > > > bool page_is_tdx_private(struct page *page) > > { > > if (page->private == TDX_PRIVATE_MAGIC) > > return true; > > > > if (!page_mapping(page)) > > return false; > > > > return page_mapping(page)->a_ops == &kvm_gmem_ops; > > } > > > > How does this sound? Or any other comments? Thanks! > > If you going to take this path it has to be supported natively by > kvm_gmem_: it has to provide API for that. > Yes. > You should not assume that > page->private is free to use. It is owned by kvm_gmmem. > page->private is only for PAMT and SEPT pages. kmem_gmem has it's own mapping which can be used to identify the pages owned by it. Hmm.. I think we should just leave them out for now, as they theoretically are owned by KVM thus can be handled by KVM, e.g., in it's reboot or shutdown or module unload code path. If we are fine to use SEAMCALL in the #MC handler code path, I think perhaps we can just keep using TDMRs to locate PAMTs.
On 6/13/23 16:18, Huang, Kai wrote: > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: >> On 6/12/23 17:51, Huang, Kai wrote: >>> If we introduce a helper to mark a page as TDX private page, >> Let me get this right: you have working, functional code for a >> highly-unlikely scenario (kernel bugs or even more rare hardware >> errors). But, you want to optimize this super-rare case? It's not fast >> enough? >> >> Is there any other motivation here that I'm missing? >> > No it's not about speed. The motivation is to have a common code to yield less > line of code, though I don't have clear number of how many LoC can be reduced. OK, so ... ballpark. How many lines of code are we going to _save_ for this super-rare case? 10? 100? 1000? The upside is saving X lines of code ... somewhere. The downside is adding Y lines of code ... somewhere else and maybe breaking things in the process. You've evidently done _some_ kind of calculus in your head to make this tradeoff worthwhile. I'd love to hear what your calculus is, even if it's just a gut feel. Could you share your logic here, please?
On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote: > On 6/13/23 16:18, Huang, Kai wrote: > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > > > On 6/12/23 17:51, Huang, Kai wrote: > > > > If we introduce a helper to mark a page as TDX private page, > > > Let me get this right: you have working, functional code for a > > > highly-unlikely scenario (kernel bugs or even more rare hardware > > > errors). But, you want to optimize this super-rare case? It's not fast > > > enough? > > > > > > Is there any other motivation here that I'm missing? > > > > > No it's not about speed. The motivation is to have a common code to yield less > > line of code, though I don't have clear number of how many LoC can be reduced. > > OK, so ... ballpark. How many lines of code are we going to _save_ for > this super-rare case? 10? 100? 1000? ~50 LoC I guess, certainly < 100. > > The upside is saving X lines of code ... somewhere. The downside is > adding Y lines of code ... somewhere else and maybe breaking things in > the process. > > You've evidently done _some_ kind of calculus in your head to make this > tradeoff worthwhile. I'd love to hear what your calculus is, even if > it's just a gut feel. > > Could you share your logic here, please? The logic is the whole tdx_is_private_mem() function in the next patch (#MC handling one) can be significantly reduced from 100 -> ~10, and we roughly needs some more code (<50 LoC) to mark PAMT as private.
On Wed, 2023-06-14 at 00:38 +0000, Huang, Kai wrote: > On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote: > > On 6/13/23 16:18, Huang, Kai wrote: > > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > > > > On 6/12/23 17:51, Huang, Kai wrote: > > > > > If we introduce a helper to mark a page as TDX private page, > > > > Let me get this right: you have working, functional code for a > > > > highly-unlikely scenario (kernel bugs or even more rare hardware > > > > errors). But, you want to optimize this super-rare case? It's not fast > > > > enough? > > > > > > > > Is there any other motivation here that I'm missing? > > > > > > > No it's not about speed. The motivation is to have a common code to yield less > > > line of code, though I don't have clear number of how many LoC can be reduced. > > > > OK, so ... ballpark. How many lines of code are we going to _save_ for > > this super-rare case? 10? 100? 1000? > > ~50 LoC I guess, certainly < 100. > > > > > The upside is saving X lines of code ... somewhere. The downside is > > adding Y lines of code ... somewhere else and maybe breaking things in > > the process. > > > > You've evidently done _some_ kind of calculus in your head to make this > > tradeoff worthwhile. I'd love to hear what your calculus is, even if > > it's just a gut feel. > > > > Could you share your logic here, please? > > The logic is the whole tdx_is_private_mem() function in the next patch (#MC > handling one) can be significantly reduced from 100 -> ~10, and we roughly needs > some more code (<50 LoC) to mark PAMT as private. > Apologize, should be "we roughly need some more code (<50 LoC) to mark PAMT and Secure-EPT and TDX guest private pages as TDX private pages". But now we only have PAMT.
On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > --- a/arch/x86/kernel/reboot.c > +++ b/arch/x86/kernel/reboot.c > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > #ifdef CONFIG_X86_64 > x86_platform.iommu_shutdown(); > + x86_platform.memory_shutdown(); > #endif > } Hi Kirill/Dave, I missed that this solution doesn't reset TDX private memory for emergency restart or when reboot_force is set, because machine_shutdown() isn't called for them. Is it acceptable? Or should we handle them too?
On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote: > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > > --- a/arch/x86/kernel/reboot.c > > +++ b/arch/x86/kernel/reboot.c > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > > > #ifdef CONFIG_X86_64 > > x86_platform.iommu_shutdown(); > > + x86_platform.memory_shutdown(); > > #endif > > } > > Hi Kirill/Dave, > > I missed that this solution doesn't reset TDX private for emergency restart or > when reboot_force is set, because machine_shutdown() isn't called for them. > > Is it acceptable? Or should we handle them too? Force reboot is not used in kexec path, right? And the platform has to handle erratum in BIOS to reset memory status on reboot anyway. I think we should be fine. But it worth mentioning it in the commit message.
On Wed, 2023-06-14 at 13:02 +0300, kirill.shutemov@linux.intel.com wrote: > On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote: > > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > > > --- a/arch/x86/kernel/reboot.c > > > +++ b/arch/x86/kernel/reboot.c > > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > > > > > #ifdef CONFIG_X86_64 > > > x86_platform.iommu_shutdown(); > > > + x86_platform.memory_shutdown(); > > > #endif > > > } > > > > Hi Kirill/Dave, > > > > I missed that this solution doesn't reset TDX private for emergency restart or > > when reboot_force is set, because machine_shutdown() isn't called for them. > > > > Is it acceptable? Or should we handle them too? > > Force reboot is not used in kexec path, right? > Correct. > And the platform has to > handle erratum in BIOS to reset memory status on reboot anyway. So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX private pages, and the BIOS needs to disable "warm reset". IIUC this means the kernel needs to depend on specific BIOS setting to work normally, and IIUC the kernel even cannot be aware of this setting? Should the kernel just reset all TDX private pages when erratum is present during reboot so the kernel doesn't depend on BIOS? > > I think we should be fine. But it worth mentioning it in the commit > message. > Agreed.
On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote: > > And the platform has to > > handle erratum in BIOS to reset memory status on reboot anyway. > > So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX > private pages, and the BIOS needs to disable "warm reset". > > IIUC this means the kernel needs to depend on specific BIOS setting to work > normally, and IIUC the kernel even cannot be aware of this setting? > > Should the kernel just reset all TDX private pages when erratum is present > during reboot so the kernel doesn't depend on BIOS? Kernel cannot really function if we don't trust BIOS to do its job. Kernel depends on BIOS services anyway. We cannot try to handle everything in kernel just in case BIOS drops the ball.
On Wed, 2023-06-14 at 14:08 +0300, kirill.shutemov@linux.intel.com wrote: > On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote: > > > And the platform has to > > > handle erratum in BIOS to reset memory status on reboot anyway. > > > > So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX > > private pages, and the BIOS needs to disable "warm reset". > > > > IIUC this means the kernel needs to depend on specific BIOS setting to work > > normally, and IIUC the kernel even cannot be aware of this setting? > > > > Should the kernel just reset all TDX private pages when erratum is present > > during reboot so the kernel doesn't depend on BIOS? > > Kernel cannot really function if we don't trust BIOS to do its job. Kernel > depends on BIOS services anyway. We cannot try to handle everything in > kernel just in case BIOS drops the ball. > In other words, I assume we just need to take care of kexec(). The current patch tries to handle reboot too, so I'll change to only cover kexec(), assuming the BIOS will always disable warm reset reboot for platforms with this erratum. Thanks.
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > On 6/12/23 03:27, Huang, Kai wrote: > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > will be seen by other cpus. > > > > Does it make sense? > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. Hi Dave, Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a little bit silly or overkill IMHO. Looking at the code, it seems arch_atomic_set() simply uses __WRITE_ONCE(): static __always_inline void arch_atomic_set(atomic_t *v, int i) { __WRITE_ONCE(v->counter, i); } Is it better to just use __WRITE_ONCE() or WRITE_ONCE() here? - tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + WRITE_ONCE(tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);
On 6/19/23 04:43, Huang, Kai wrote: > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: >> On 6/12/23 03:27, Huang, Kai wrote: >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base >>> will be seen by other cpus. >>> >>> Does it make sense? >> Just use a normal old atomic_t or set_bit()/test_bit(). They have >> built-in memory barriers are are less likely to get botched. > Hi Dave, > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > little bit silly or overkill IMHO. Looking at the code, it seems > arch_atomic_set() simply uses __WRITE_ONCE(): How about _adding_ a variable that protects tdmr->pamt_4k_base? Wouldn't that be more straightforward than mucking around with existing types?
On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote: > On 6/19/23 04:43, Huang, Kai wrote: > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > >> On 6/12/23 03:27, Huang, Kai wrote: > >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > >>> will be seen by other cpus. > >>> > >>> Does it make sense? > >> Just use a normal old atomic_t or set_bit()/test_bit(). They have > >> built-in memory barriers are are less likely to get botched. > > Hi Dave, > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > little bit silly or overkill IMHO. Looking at the code, it seems > > arch_atomic_set() simply uses __WRITE_ONCE(): > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > Wouldn't that be more straightforward than mucking around with existing > types? What's wrong with simple global spinlock that protects all tdmr->pamt_*? It is much easier to follow than a custom serialization scheme.
On Mon, 2023-06-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote: > > On 6/19/23 04:43, Huang, Kai wrote: > > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > > > On 6/12/23 03:27, Huang, Kai wrote: > > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > > > will be seen by other cpus. > > > > > > > > > > Does it make sense? > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > > built-in memory barriers are are less likely to get botched. > > > Hi Dave, > > > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > > little bit silly or overkill IMHO. Looking at the code, it seems > > > arch_atomic_set() simply uses __WRITE_ONCE(): > > > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > > Wouldn't that be more straightforward than mucking around with existing > > types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > It is much easier to follow than a custom serialization scheme. > For this patch I think it's overkill to use spinlock because when the rebooting cpu is reading this all other cpus have been stopped already, so there's no concurrent thing here. However I just recall that the next #MC handler patch can also take advantage of this too because #MC handler can truly run concurrently with module initialization. Currently that one reads tdx_module_status first but again we may have the same memory order issue. So having a spinlock makes sense from #MC handler patch's point of view. I'll change to use spinlock if Dave is fine? Thanks for feedback!
On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: >>> >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a >>> little bit silly or overkill IMHO. Looking at the code, it seems >>> arch_atomic_set() simply uses __WRITE_ONCE(): >> How about _adding_ a variable that protects tdmr->pamt_4k_base? >> Wouldn't that be more straightforward than mucking around with existing >> types? > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > It is much easier to follow than a custom serialization scheme. Quick, what prevents a: spin_lock() => #MC => spin_lock() deadlock? Plain old test/sets don't deadlock ever.
On Mon, 2023-06-19 at 16:41 -0700, Dave Hansen wrote: > On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: > > > > > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > > > little bit silly or overkill IMHO. Looking at the code, it seems > > > > arch_atomic_set() simply uses __WRITE_ONCE(): > > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > > > Wouldn't that be more straightforward than mucking around with existing > > > types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > > It is much easier to follow than a custom serialization scheme. > > Quick, what prevents a: > > spin_lock() => #MC => spin_lock() > > deadlock? > > Plain old test/sets don't deadlock ever. Agreed. So I think having any locking in #MC handle is kinda dangerous. Adding "a" variable has another advantage: We can have a more precise result of whether we need to reset PAMT pages, even those PAMTs are already allocated and set to the TDMRs, because the TDX module only starts to write PAMTs using global KeyID until some SEAMCALL. Any comments to below? +static bool tdx_private_mem_begin; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1141,6 +1143,8 @@ static int init_tdx_module(void) */ wbinvd_on_all_cpus(); + WRITE_ONCE(tdx_private_mem_begin, true); + /* Config the key of global KeyID on all packages */ ret = config_global_keyid(); if (ret) @@ -1463,6 +1467,14 @@ static void tdx_memory_shutdown(void) */ WARN_ON_ONCE(num_online_cpus() != 1); + /* + * It's not possible to have any TDX private pages if the TDX + * module hasn't started to write any memory using the global + * KeyID. + */ + if (!READ_ONCE(tdx_private_mem_begin)) + return; + tdmrs_reset_pamt_all(&tdx_tdmr_list);
On 6/19/23 17:56, Huang, Kai wrote: > Any comments to below? Nothing that I haven't already said in this thread: > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. I kinda made a point of literally suggesting "atomic_t or set_bit()/test_bit()". I even told you why: "built-in memory barriers". Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
On Mon, Jun 19, 2023 at 04:41:13PM -0700, Dave Hansen wrote: > On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: > >>> > >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > >>> little bit silly or overkill IMHO. Looking at the code, it seems > >>> arch_atomic_set() simply uses __WRITE_ONCE(): > >> How about _adding_ a variable that protects tdmr->pamt_4k_base? > >> Wouldn't that be more straightforward than mucking around with existing > >> types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > > It is much easier to follow than a custom serialization scheme. > > Quick, what prevents a: > > spin_lock() => #MC => spin_lock() > > deadlock? > > Plain old test/sets don't deadlock ever. Depends on what you mean; anything that spin-waits will deadlock, doesn't matter if its a test-and-set or not. The thing with these non-maskable exceptions/interrupts is that they must be wait-free. If serialization is required it needs to be try based and accept failure without waiting.
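[As an illustration of the try-based approach: the #MC-side reader never spins. This is purely a sketch; the lock and the helper are hypothetical.]

/* #MC handler path: never wait, give up if an update is in flight. */
bool priv;

if (!spin_trylock(&tdx_pamt_lock))
        return false;   /* treat the page as "unknown" rather than deadlock */
priv = page_is_tdx_private_locked(page);        /* hypothetical helper */
spin_unlock(&tdx_pamt_lock);
return priv;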
On Mon, Jun 19, 2023 at 06:06:30PM -0700, Dave Hansen wrote: > On 6/19/23 17:56, Huang, Kai wrote: > > Any comments to below? > > Nothing that I haven't already said in this thread: > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > I kinda made a point of literally suggesting "atomic_t or > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. x86 has built-in memory barriers for being TSO :-) Specifically all barriers provided by spinlock (acquire/release) are no-ops on x86. (strictly speaking locks imply stronger order than they have to because TSO atomic ops imply stronger ordering than required) There is one (and only the one) re-ordering possible on TSO and that is the store-buffer, later loads can fail to observe prior stores. If that is a concern, you need explicit barriers. This is #MC, much care and explicit open-coded crap is expected. Also, this is #MC, much broken is also expected :-( As in, the current #MC handler is a know pile of shit. Basically the whole of #MC should be noinstr -- it isn't and that's a significant problem. Also we still very much suffer the NMI <- #MC problem and the #MC latch is known broken garbage. Whatever you do, do it very carefully, double check and be more careful.
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> + __mb();
__mb() is not a valid interface to use.
On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote: > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > > + __mb(); > > __mb() is not a valid interface to use. Thanks for feedback! May I ask why, for education purpose? :)
On Tue, Jun 20, 2023 at 10:42:32AM +0000, Huang, Kai wrote: > On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote: > > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > > > > + __mb(); > > > > __mb() is not a valid interface to use. > > Thanks for feedback! > > May I ask why, for education purpose? :) It's the raw MFENCE wrapper, not one of the *many* documented barriers. Also, typically you do *not* want MFENCE; MFENCE bad.
On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote: > On 6/19/23 17:56, Huang, Kai wrote: > > Any comments to below? > > Nothing that I haven't already said in this thread: > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > I kinda made a point of literally suggesting "atomic_t or > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. > Hi Dave, Sorry to bring this up again. I thought more on this topic, and I think using atotmic_t is only necessary if we add it right after setting up tdmr->pamt_* in tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory barrier to make sure memory order (as Kirill commented in the first reply). However, if we add a new variable like below ... +static bool tdx_private_mem_begin; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1123,6 +1125,8 @@ static int init_tdx_module(void) */ wbinvd_on_all_cpus(); + tdx_private_mem_begin = true; ... then we don't need any more explicit barrier, because: 1) it's not possible for compiler to optimize the order between setting tdmr->pamt_* and tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a serializing instruction so the wbinvd_on_all_cpus() above has already implied memory barrier. Does this make sense?
On Sun, 2023-06-25 at 15:30 +0000, Huang, Kai wrote: > On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote: > > On 6/19/23 17:56, Huang, Kai wrote: > > > Any comments to below? > > > > Nothing that I haven't already said in this thread: > > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > built-in memory barriers are are less likely to get botched. > > > > I kinda made a point of literally suggesting "atomic_t or > > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. > > > > Hi Dave, > > Sorry to bring this up again. I thought more on this topic, and I think using > atotmic_t is only necessary if we add it right after setting up tdmr->pamt_* in > tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory > barrier to make sure memory order (as Kirill commented in the first reply). > > However, if we add a new variable like below ... > > +static bool tdx_private_mem_begin; > + > /* > * Wrapper of __seamcall() to convert SEAMCALL leaf function error code > * to kernel error code. @seamcall_ret and @out contain the SEAMCALL > @@ -1123,6 +1125,8 @@ static int init_tdx_module(void) > */ > wbinvd_on_all_cpus(); > > + tdx_private_mem_begin = true; > > > ... then we don't need any more explicit barrier, because: 1) it's not possible > for compiler to optimize the order between setting tdmr->pamt_* and > tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a > serializing instruction so the wbinvd_on_all_cpus() above has already implied > memory barrier. > > Does this make sense? Sorry please ignore this. I missed a corner case that the kexec() can happen when something goes wrong during module initialization and when PAMTs/TDMRs are being freed. We still need explicit memory barrier for this case. I will use atomic_t as suggested. Thanks!
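[For reference, a sketch of the atomic_t direction being settled on. The variable name follows the earlier draft; whether the _release/_acquire variants or plain atomic ops plus explicit barriers end up being used is left open here.]

static atomic_t tdx_private_mem_begin = ATOMIC_INIT(0);

/*
 * init_tdx_module(), before the TDX module can start dirtying memory
 * with the global KeyID (and before PAMTs are freed on an error path):
 */
atomic_set_release(&tdx_private_mem_begin, 1);

/* tdx_memory_shutdown(): */
if (!atomic_read_acquire(&tdx_private_mem_begin))
        return;         /* no TDX private memory can exist yet */
tdmrs_reset_pamt_all(&tdx_tdmr_list);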
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 88085f369ff6..d2c6742b185a 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -299,6 +299,7 @@ struct x86_platform_ops { void (*get_wallclock)(struct timespec64 *ts); int (*set_wallclock)(const struct timespec64 *ts); void (*iommu_shutdown)(void); + void (*memory_shutdown)(void); bool (*is_untracked_pat_range)(u64 start, u64 end); void (*nmi_init)(void); unsigned char (*get_nmi_reason)(void); diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index b3d0e015dae2..6aadfec8df7a 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -720,6 +720,7 @@ void native_machine_shutdown(void) #ifdef CONFIG_X86_64 x86_platform.iommu_shutdown(); + x86_platform.memory_shutdown(); #endif } diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index d82f4fa2f1bf..344250b35a5d 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -31,6 +31,7 @@ void x86_init_noop(void) { } void __init x86_init_uint_noop(unsigned int unused) { } static int __init iommu_init_noop(void) { return 0; } static void iommu_shutdown_noop(void) { } +static void memory_shutdown_noop(void) { } bool __init bool_x86_init_noop(void) { return false; } void x86_op_int_noop(int cpu) { } int set_rtc_noop(const struct timespec64 *now) { return -EINVAL; } @@ -142,6 +143,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .get_wallclock = mach_get_cmos_time, .set_wallclock = mach_set_cmos_time, .iommu_shutdown = iommu_shutdown_noop, + .memory_shutdown = memory_shutdown_noop, .is_untracked_pat_range = is_ISA_range, .nmi_init = default_nmi_init, .get_nmi_reason = default_get_nmi_reason, diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 8ff07256a515..0aa413b712e8 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, tdmr_pamt_base += pamt_size[pgsz]; } + /* + * tdx_memory_shutdown() also reads TDMR's PAMT during + * kexec() or reboot, which could happen at anytime, even + * during this particular code. Make sure pamt_4k_base + * is firstly set otherwise tdx_memory_shutdown() may + * get an invalid PAMT base when it sees a valid number + * of PAMT pages. + */ tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; @@ -1318,6 +1326,46 @@ static struct notifier_block tdx_memory_nb = { .notifier_call = tdx_memory_notifier, }; +static void tdx_memory_shutdown(void) +{ + /* + * Convert all TDX private pages back to normal if the platform + * has "partial write machine check" erratum. + * + * For now there's no existing infrastructure to tell whether + * a page is TDX private memory. Using SEAMCALL to query TDX + * module isn't feasible either because: 1) VMX has been turned + * off by reaching here so SEAMCALL cannot be made; 2) Even + * SEAMCALL can be made the result from TDX module may not be + * accurate (e.g., remote CPU can be stopped while the kernel + * is in the middle of reclaiming one TDX private page and doing + * MOVDIR64B). + * + * One solution could be just converting all memory pages, but + * this may bring non-trivial latency on large memory systems + * (especially when the number of TDX private pages is small). + * Looks eventually the kernel should track TDX private pages and + * only convert these. 
+ * + * Also, not all pages are mapped as writable in direct mapping, + * thus it's problematic to do so. It can be done by switching + * to the identical mapping page table built for kexec(), which + * maps all pages as writable, but the complexity looks overkill. + * + * Thus instead of doing something dramatic to convert all pages, + * only convert PAMTs for now as for now TDX private pages can + * only be PAMT. Converting TDX guest private pages and Secure + * EPT pages can be added later when the kernel has a proper way + * to track these pages. + * + * All other cpus are already dead, thus it's safe to read TDMRs + * to find PAMTs w/o holding any kind of locking here. + */ + WARN_ON_ONCE(num_online_cpus() != 1); + + tdmrs_reset_pamt_all(&tdx_tdmr_list); +} + static int __init tdx_init(void) { u32 tdx_keyid_start, nr_tdx_keyids; @@ -1356,6 +1404,15 @@ static int __init tdx_init(void) tdx_guest_keyid_start = ++tdx_keyid_start; tdx_nr_guest_keyids = --nr_tdx_keyids; + /* + * On the platform with erratum all TDX private pages need to + * be converted back to normal before rebooting (warm reset) or + * before kexec() booting to the new kernel, otherwise the (new) + * kernel may get unexpected SRAR machine check exception. + */ + if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) + x86_platform.memory_shutdown = tdx_memory_shutdown; + return 0; no_tdx: return -ENODEV;
The first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened.

== Background ==

Virtually all kernel memory access operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64-byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction of less than a cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA.

== Problem ==

A fast warm reset doesn't reset TDX private memory. Kexec() can also boot into the new kernel directly. Thus if the old kernel has enabled TDX on a platform with this erratum, the new kernel may get an unexpected machine check. Note that without this erratum, kernel reads/writes on TDX private memory should never cause a machine check, thus it's OK for the old kernel to leave TDX private pages as-is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all TDX private pages back to normal to give the new kernel a clean slate after either a fast warm reset or kexec().

There's no existing infrastructure to track TDX private pages, which could be PAMT pages, TDX guest private pages, or SEPT (secure EPT) pages. The latter two are yet to be implemented, so it's not yet certain how to track them. It's not feasible to query the TDX module either, because VMX has already been stopped when KVM receives the reboot notifier.

Another option is to blindly convert all memory pages. But this may bring non-trivial latency to machine reboot and kexec() on large memory systems (especially when the number of TDX private pages is small). A final solution should track TDX private pages and convert only them. Converting all memory pages is also problematic because not all pages are mapped as writable in the direct mapping; doing so would require switching to a page table that maps all pages as writable, either a new one or the identity-mapping table built for kexec(). Using either seems too dramatic, especially considering the kernel should eventually be able to track all TDX private pages, in which case the direct mapping can be used directly.

So for now just convert PAMT pages. Converting TDX guest private pages and SEPT pages can be added when support for TDX guests is added to the kernel.

Introduce a new "x86_platform_ops::memory_shutdown()" callback as a placeholder to convert all TDX private memory, and call it at the end of machine_shutdown() after all remote cpus have been stopped (thus no more TDX activity) and all dirty cachelines of TDX private memory have been flushed (thus no later cacheline writeback). Implement the default callback as a no-op. In TDX early boot-time initialization, replace the callback with TDX's own implementation when the platform has this erratum, so that only platforms with this erratum carry the additional memory conversion burden.
Signed-off-by: Kai Huang <kai.huang@intel.com> --- v10 -> v11: - New patch --- arch/x86/include/asm/x86_init.h | 1 + arch/x86/kernel/reboot.c | 1 + arch/x86/kernel/x86_init.c | 2 ++ arch/x86/virt/vmx/tdx/tdx.c | 57 +++++++++++++++++++++++++++++++++ 4 files changed, 61 insertions(+)