[v2,0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state

Message ID	20231121212016.1154303-1-mhklinux@outlook.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: mhkelley58@gmail.com To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, kirill.shutemov@linux.intel.com, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, luto@kernel.org, peterz@infradead.org, akpm@linux-foundation.org, urezki@gmail.com, hch@infradead.org, lstoakes@gmail.com, thomas.lendacky@amd.com, ardb@kernel.org, jroedel@suse.de, seanjc@google.com, rick.p.edgecombe@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev, linux-hyperv@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state Date: Tue, 21 Nov 2023 13:20:08 -0800 Message-Id: <20231121212016.1154303-1-mhklinux@outlook.com> Reply-To: mhklinux@outlook.com MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	x86/coco: Mark CoCo VM pages not present when changing encrypted state \| expand [v2,0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state [v2,1/8] x86/coco: Use slow_virt_to_phys() in page transition hypervisor callbacks [v2,2/8] x86/mm: Don't do a TLB flush if changing a PTE that isn't marked present [v2,3/8] x86/mm: Remove "static" from vmap_pages_range() [v2,4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address [v2,5/8] x86/mm: Mark CoCo VM pages not present while changing encrypted state [v2,6/8] x86/mm: Merge CoCo prepare and finish hypervisor callbacks [v2,7/8] x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable() [v2,8/8] x86/mm: Add comments about errors in set_memory_decrypted()/encrypted()

Message ID

20231121212016.1154303-1-mhklinux@outlook.com (mailing list archive)

Headers

From: mhkelley58@gmail.com
To: tglx@linutronix.de,
	mingo@redhat.com,
	bp@alien8.de,
	dave.hansen@linux.intel.com,
	x86@kernel.org,
	hpa@zytor.com,
	kirill.shutemov@linux.intel.com,
	kys@microsoft.com,
	haiyangz@microsoft.com,
	wei.liu@kernel.org,
	decui@microsoft.com,
	luto@kernel.org,
	peterz@infradead.org,
	akpm@linux-foundation.org,
	urezki@gmail.com,
	hch@infradead.org,
	lstoakes@gmail.com,
	thomas.lendacky@amd.com,
	ardb@kernel.org,
	jroedel@suse.de,
	seanjc@google.com,
	rick.p.edgecombe@intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev,
	linux-hyperv@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing
 encrypted state
Date: Tue, 21 Nov 2023 13:20:08 -0800
Message-Id: <20231121212016.1154303-1-mhklinux@outlook.com>
Reply-To: mhklinux@outlook.com
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

x86/coco: Mark CoCo VM pages not present when changing encrypted state | expand

Message

Michael Kelley Nov. 21, 2023, 9:20 p.m. UTC

From: Michael Kelley <mhklinux@outlook.com>

In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent.  Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could read a transitioning page during
the window.  In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use).  Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad().  Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
if the exceptions are routed to the paravisor.  The paravisor can't
do load_unaligned_zeropad() fixup, so the exceptions would need to
be forwarded from the paravisor to the Linux guest, but there are
no architectural specs for how to do that.

Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:

1.  Remove aliasing mappings
2.  Flush the data cache if needed
3.  Remove the PRESENT bit from the PTEs of all transitioning pages
4.  Set/clear the encryption attribute as appropriate
5.  Flush the TLB so the changed encryption attribute isn't visible
6.  Notify the hypervisor of the encryption status change
7.  Add back the PRESENT bit, making the changed attribute visible

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled.  CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.

Patch 1 handles implications of the hypervisor callbacks in Step 6
needing to do virt-to-phys translations on pages that are temporarily
marked not present.

Patch 2 ensures that Step 7 doesn't generate a TLB flush.  It is a
performance optimization only and is not necessary for correctness.

Patches 3 and 4 handle the case where SEV-SNP does PVALIDATE in
Step 6, which requires a valid virtual address.  But since the
PRESENT bit has been removed from the direct map virtual address,
the PVALIDATE fails.  These patches construct a temporary virtual
address to be used by PVALIDATE.  This code is SEV-SNP only because
the TDX and Hyper-V paravisor flavors operate on physical addresses.

Patches 5 and 6 are the core change that marks the transitioning
pages as not present.  Patch 6 is optional since retaining both
the "prepare" and "finish" callbacks doesn't hurt anything and
there might be an argument for retaining both for future
flexibility.  However, Patch 6 *does* eliminate about 75 lines of
code and comments.

Patch 7 is a somewhat tangential cleanup that removes an unnecessary
wrapper function in the path for doing a transition.

Patch 8 adds comments describing the implications of errors when
doing a transition.  These implications are discussed in the email
thread for the RFC patch[1] and a patch proposed by Rick Edgecombe.
[2][3]

With this change, the #VE and #VC exception handlers should no longer
be triggered for load_unaligned_zeropad() accesses, and the existing
code in those handlers to do the "fixup" shouldn't be needed. But I
have not removed that code in this patch set. Kirill Shutemov wants
to keep the code for TDX #VE, so the code for #VC on the SEV-SNP
side has also been kept.

This patch set is based on the linux-next20231117 code tree.

Changes in v2:
* Added Patches 3 and 4 to deal with the failure on SEV-SNP
  [Tom Lendacky]
* Split the main change into two separate patches (Patch 5 and
  Patch 6) to improve reviewability and to offer the option of
  retaining both hypervisor callbacks.
* Patch 5 moves set_memory_p() out of an #ifdef CONFIG_X86_64
  so that the code builds correctly for 32-bit, even though it
  is never executed for 32-bit [reported by kernel test robot]

[1] https://lore.kernel.org/lkml/1688661719-60329-1-git-send-email-mikelley@microsoft.com/
[2] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/lkml/20231024234829.1443125-1-rick.p.edgecombe@intel.com/

Michael Kelley (8):
  x86/coco: Use slow_virt_to_phys() in page transition hypervisor
    callbacks
  x86/mm: Don't do a TLB flush if changing a PTE that isn't marked
    present
  x86/mm: Remove "static" from vmap_pages_range()
  x86/sev: Enable PVALIDATE for PFNs without a valid virtual address
  x86/mm: Mark CoCo VM pages not present while changing encrypted state
  x86/mm: Merge CoCo prepare and finish hypervisor callbacks
  x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()
  x86/mm: Add comments about errors in
    set_memory_decrypted()/encrypted()

 arch/x86/boot/compressed/sev.c  |   2 +-
 arch/x86/coco/tdx/tdx.c         |  66 +----------------
 arch/x86/hyperv/ivm.c           |  15 ++--
 arch/x86/include/asm/sev.h      |   6 +-
 arch/x86/include/asm/x86_init.h |   4 --
 arch/x86/kernel/sev-shared.c    |  57 ++++++++++++---
 arch/x86/kernel/sev.c           |  43 ++++++-----
 arch/x86/kernel/x86_init.c      |   4 --
 arch/x86/mm/mem_encrypt_amd.c   |  23 +-----
 arch/x86/mm/pat/set_memory.c    | 122 ++++++++++++++++++++++----------
 include/linux/vmalloc.h         |   2 +
 mm/vmalloc.c                    |   2 +-
 12 files changed, 171 insertions(+), 175 deletions(-)

Comments

Kirill A. Shutemov Nov. 24, 2023, 10:06 a.m. UTC | #1

On Tue, Nov 21, 2023 at 01:20:08PM -0800, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change.

Strictly speaking it is not true for TDX. Conversion to shared can be
implicit: set shared bit and touch the page will do the conversion. MapGPA
is optional.

> Because there are two separate steps, there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could read a transitioning page during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad().  Unfortunately, this exception handling can't
> work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
> if the exceptions are routed to the paravisor.  The paravisor can't
> do load_unaligned_zeropad() fixup, so the exceptions would need to
> be forwarded from the paravisor to the Linux guest, but there are
> no architectural specs for how to do that.

Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
cross-page access to shared memory while no fixup entry for the page in
L1. It would give L2 chance to handle the situation in a transparent way.

Maybe I miss something, I donno.

Michael Kelley Nov. 28, 2023, 7:12 p.m. UTC | #2

From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Friday, November 24, 2023 2:06 AM
> 
> On Tue, Nov 21, 2023 at 01:20:08PM -0800, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change.
> 
> Strictly speaking it is not true for TDX. Conversion to shared can be
> implicit: set shared bit and touch the page will do the conversion. MapGPA
> is optional.

Interesting.  Given that, is there a reason to use the explicit
hypervisor callbacks in for private->shared transitions in 
__set_mem_enc_pgtable()?   It probably doesn't have direct relevance
to this patch series, but I'm just trying to understand the tradeoffs of
the implicit vs. explicit approach.  And am I correct that
shared->private transitions must use the explicit approach?

> 
> > Because there are two separate steps, there's
> > a window where the settings are inconsistent.  Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could read a transitioning page during
> > the window.  In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use).  Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad().  Unfortunately, this exception handling can't
> > work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
> > if the exceptions are routed to the paravisor.  The paravisor can't
> > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > be forwarded from the paravisor to the Linux guest, but there are
> > no architectural specs for how to do that.
> 
> Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
> cross-page access to shared memory while no fixup entry for the page in
> L1. It would give L2 chance to handle the situation in a transparent way.
> 
> Maybe I miss something, I donno.

I'm recounting what the Hyper-V paravisor folks say without knowing all the
details. :-(   But it seems like any kind of forwarding scheme needs to be a
well-defined contract that would work for both TDX and SEV-SNP.   The
paravisor in L1 might or might not be Linux-based, so the contract must be OS
independent.  And the L2 guest might or might not be Linux, so there's
potential for some other kind of error to be confused with a Linux
load_unaligned_zeropad() reference.

Maybe all that could be sorted out, but I come back to believing that it's
better now and in the long run to just avoid all this complexity by decoupling
private-shared page transitions and Linux load_unaligned_zeropad().
Unfortunately that decoupling hasn't been as simple as I first envisioned
because of SEV-SNP PVALIDATE needing a virtual address.  But doing the
decoupling only in the paravisor case still seems like the simpler approach.

Michael

> 
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Kirill A. Shutemov Nov. 29, 2023, 3:10 p.m. UTC | #3

On Tue, Nov 28, 2023 at 07:12:33PM +0000, Michael Kelley wrote:
> From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Friday, November 24, 2023 2:06 AM
> > 
> > On Tue, Nov 21, 2023 at 01:20:08PM -0800, mhkelley58@gmail.com wrote:
> > > From: Michael Kelley <mhklinux@outlook.com>
> > >
> > > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > be notified of the change.
> > 
> > Strictly speaking it is not true for TDX. Conversion to shared can be
> > implicit: set shared bit and touch the page will do the conversion. MapGPA
> > is optional.
> 
> Interesting.  Given that, is there a reason to use the explicit
> hypervisor callbacks in for private->shared transitions in 
> __set_mem_enc_pgtable()?   It probably doesn't have direct relevance
> to this patch series, but I'm just trying to understand the tradeoffs of
> the implicit vs. explicit approach.  And am I correct that
> shared->private transitions must use the explicit approach?

It must be explicit in sense, that the memory has to be accepted before
use. MapGPA() is still optional.

I don't like this implicit tricks. I spent a lot of time debugging an
issue that was obscured by this semantics.

But I think it is going to say :/

> > > Because there are two separate steps, there's
> > > a window where the settings are inconsistent.  Normally the code that
> > > initiates the transition (via set_memory_decrypted() or
> > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > during a transition, so the window of inconsistency is not a problem.
> > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > pages at arbitrary times, which could read a transitioning page during
> > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > (depending on the CoCo architecture in use).  Current code in those
> > > exception handlers recovers and does "fixup" on the result returned by
> > > load_unaligned_zeropad().  Unfortunately, this exception handling can't
> > > work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
> > > if the exceptions are routed to the paravisor.  The paravisor can't
> > > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > > be forwarded from the paravisor to the Linux guest, but there are
> > > no architectural specs for how to do that.
> > 
> > Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
> > cross-page access to shared memory while no fixup entry for the page in
> > L1. It would give L2 chance to handle the situation in a transparent way.
> > 
> > Maybe I miss something, I donno.
> 
> I'm recounting what the Hyper-V paravisor folks say without knowing all the
> details. :-(   But it seems like any kind of forwarding scheme needs to be a
> well-defined contract that would work for both TDX and SEV-SNP.   The
> paravisor in L1 might or might not be Linux-based, so the contract must be OS
> independent.  And the L2 guest might or might not be Linux, so there's
> potential for some other kind of error to be confused with a Linux
> load_unaligned_zeropad() reference.

Okay, fair enough. I have hard time reasoning if it is okay for L2 which
is not Linux.