[v8,0/8] x86: Show in sysfs if a memory node is able to do encryption

Message ID 20220429201717.1946178-1-martin.fernandez@eclypsium.com (mailing list archive)

Message

Martin Fernandez April 29, 2022, 8:17 p.m. UTC
Show for each node if every memory descriptor in that node has the
EFI_MEMORY_CPU_CRYPTO attribute.
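
As a rough illustration of that rule (not the kernel code from this series), the per-node value can be thought of as a logical AND over the node's memory descriptors. The `MemDescriptor` type and its `node` field are hypothetical: real EFI descriptors carry no node id, and the series gets there through e820 and memblock. The bit value is the one the UEFI specification assigns to EFI_MEMORY_CPU_CRYPTO.

```python
from dataclasses import dataclass

# Attribute bit defined by the UEFI specification.
EFI_MEMORY_CPU_CRYPTO = 0x80000


@dataclass
class MemDescriptor:
    node: int       # hypothetical: NUMA node the range belongs to
    attribute: int  # EFI memory attribute bits


def crypto_capable_nodes(descriptors):
    """Map node id -> True only if every descriptor in that node has the bit."""
    capable = {}
    for d in descriptors:
        has_bit = bool(d.attribute & EFI_MEMORY_CPU_CRYPTO)
        # A single descriptor without the bit makes the whole node "not capable".
        capable[d.node] = capable.get(d.node, True) and has_bit
    return capable
```

A node with one non-capable range is therefore reported as not capable, which is the conservative answer for the fwupd check.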

The fwupd project plans to use it as part of a check to see if users
have properly configured memory hardware encryption
capabilities. fwupd's people have seen cases where it seems like there
is memory encryption because all the hardware is capable of doing it,
but on a closer look there is not, either because of system firmware
or because some component requires updating to enable the feature.

It's planned to make it part of a specification that can be passed to
people purchasing hardware.

These checks will run at every boot. The specification is called Host
Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.

We chose to do it on a per-node basis because, although an ABI that
shows that the whole system memory is capable of encryption would be
useful for the fwupd use case, doing it on a per-node basis also gives
the user the capability to target allocations from applications to
NUMA nodes which have encryption capabilities.

I did some tests for some of the functions introduced (and modified)
in e820.c. Sadly KUnit is not able to test __init functions and data
so I had some warnings during the linking. There is a KUnit patch
already to fix that [1]. I wanted to wait for it to be merged but it
is taking more time than I expected so I'm sending this without tests
for now. I'm planning to add unit tests in the future to the e820
range update rework and e820_update_table.

[1] https://lore.kernel.org/lkml/20220419040515.43693-1-brendanhiggins@google.com/T/


Changes since v7:

Less kerneldocs

Less verbosity in the e820 code


Changes since v6:

Fixes in __e820__handle_range_update

Const correctness in e820.c

Correct alignment in memblock.h

Rework memblock_overlaps_region


Changes since v5:

Refactor e820__range_{update, remove, set_crypto_capable} in order to
avoid code duplication.

Warn the user when a node has both encryptable and non-encryptable
regions.

Check that e820_table has enough size to store both current e820_table
and EFI memmap.


Changes since v4:

Add enum to represent the cryptographic capabilities in e820:
e820_crypto_capabilities.

Revert __e820__range_update, only adding the new argument for
__e820__range_add about crypto capabilities.

Add a function __e820__range_update_crypto similar to
__e820__range_update but to only update this new field.


Changes since v3:

Update date in Doc/ABI file.

More information about the fwupd use case and the rationale behind
doing it per NUMA node.


Changes since v2:

e820__range_mark_crypto -> e820__range_mark_crypto_capable.

In e820__range_remove: Create a region with crypto capabilities
instead of creating one without it and then mark it.


Changes since v1:

Modify __e820__range_update to update the crypto capabilities of a
range; now this function will change the crypto capability of a range
if it's called with the same old_type and new_type. Rework
efi_mark_e820_regions_as_crypto_capable based on this.

Update do_add_efi_memmap to mark the regions as it creates them.

Change the type of crypto_capable in e820_entry from bool to u8.

Fix e820__update_table changes.

Remove memblock_add_crypto_capable. Now you have to add the region and
then mark it.

Better place for crypto_capable in pglist_data.

Martin Fernandez (8):
  mm/memblock: Tag memblocks with crypto capabilities
  mm/mmzone: Tag pg_data_t with crypto capabilities
  x86/e820: Add infrastructure to refactor e820__range_{update,remove}
  x86/e820: Refactor __e820__range_update
  x86/e820: Refactor e820__range_remove
  x86/e820: Tag e820_entry with crypto capabilities
  x86/efi: Mark e820_entries as crypto capable from EFI memmap
  drivers/node: Show in sysfs node's crypto capabilities

 Documentation/ABI/testing/sysfs-devices-node |  10 +
 arch/x86/include/asm/e820/api.h              |   1 +
 arch/x86/include/asm/e820/types.h            |  12 +-
 arch/x86/kernel/e820.c                       | 388 ++++++++++++++-----
 arch/x86/platform/efi/efi.c                  |  37 ++
 drivers/base/node.c                          |  10 +
 include/linux/memblock.h                     |   5 +
 include/linux/mmzone.h                       |   3 +
 mm/memblock.c                                |  62 +++
 mm/page_alloc.c                              |   1 +
 10 files changed, 431 insertions(+), 98 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

Comments

Borislav Petkov May 4, 2022, 4:38 p.m. UTC | #1
On Fri, Apr 29, 2022 at 05:17:09PM -0300, Martin Fernandez wrote:
> Show for each node if every memory descriptor in that node has the
> EFI_MEMORY_CPU_CRYPTO attribute.
> 
> fwupd project plans to use it as part of a check to see if the users
> have properly configured memory hardware encryption
> capabilities. fwupd's people have seen cases where it seems like there
> is memory encryption because all the hardware is capable of doing it,
> but on a closer look there is not, either because of system firmware
> or because some component requires updating to enable the feature.

Hm, so in the sysfs patch you have:

+               This value is 1 if all system memory in this node is
+               capable of being protected with the CPU's memory
+               cryptographic capabilities.

So this says the node is capable - so what is fwupd going to report -
that the memory is capable?

From your previous paragraph above it sounds to me like you wanna
say whether memory encryption is active or not, not that the node is
capable.

Or what is the use case?

> It's planned to make it part of a specification that can be passed to
> people purchasing hardware

So people are supposed to run that fwupd on that new hw to check whether
they can use memory encryption?

> These checks will run at every boot. The specification is called Host
> Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.
> 
> We chose to do it on a per-node basis because, although an ABI that
> shows that the whole system memory is capable of encryption would be
> useful for the fwupd use case, doing it on a per-node basis also gives
> the user the capability to target allocations from applications to
> NUMA nodes which have encryption capabilities.

That's another hmmm: what systems do not do full system memory
encryption and do only per-node?

From those I know, you encrypt the whole memory on the whole system and
that's it. Even if it is a hypervisor which runs a lot of guests, you
still want the hypervisor itself to run encrypted, i.e., what's called
SME in AMD's variant.

Thx.
Martin Fernandez May 4, 2022, 5:18 p.m. UTC | #2
On 5/4/22, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Apr 29, 2022 at 05:17:09PM -0300, Martin Fernandez wrote:
>> Show for each node if every memory descriptor in that node has the
>> EFI_MEMORY_CPU_CRYPTO attribute.
>>
>> fwupd project plans to use it as part of a check to see if the users
>> have properly configured memory hardware encryption
>> capabilities. fwupd's people have seen cases where it seems like there
>> is memory encryption because all the hardware is capable of doing it,
>> but on a closer look there is not, either because of system firmware
>> or because some component requires updating to enable the feature.
>
> Hm, so in the sysfs patch you have:
>
> +               This value is 1 if all system memory in this node is
> +               capable of being protected with the CPU's memory
> +               cryptographic capabilities.
>
> So this says the node is capable - so what is fwupd going to report -
> that the memory is capable?
>
> From your previous paragraph above it sounds to me like you wanna
> say whether memory encryption is active or not, not that the node is
> capable.
>
> Or what is the use case?

The use case is to know if a user is using hardware encryption or
not. With this new sysfs file, plus knowing whether TME/SEV is active,
you can be pretty sure about that.

>> It's planned to make it part of a specification that can be passed to
>> people purchasing hardware
>
> So people are supposed to run that fwupd on that new hw to check whether
> they can use memory encryption?

Yes

>> These checks will run at every boot. The specification is called Host
>> Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.
>>
>> We chose to do it on a per-node basis because, although an ABI that
>> shows that the whole system memory is capable of encryption would be
>> useful for the fwupd use case, doing it on a per-node basis also gives
>> the user the capability to target allocations from applications to
>> NUMA nodes which have encryption capabilities.
>
> That's another hmmm: what systems do not do full system memory
> encryption and do only per-node?
>
> From those I know, you encrypt the whole memory on the whole system and
> that's it. Even if it is a hypervisor which runs a lot of guests, you
> still want the hypervisor itself to run encrypted, i.e., what's called
> SME in AMD's variant.

Dave Hansen pointed those out in a previous patch series; here is the
quote:

> CXL devices will have normal RAM on them, be exposed as "System RAM" and
> they won't have encryption capabilities.  I think these devices were
> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
Borislav Petkov May 6, 2022, 12:44 p.m. UTC | #3
On Wed, May 04, 2022 at 02:18:30PM -0300, Martin Fernandez wrote:
> The use case is to know if a user is using hardware encryption or
> not. This new sysfs file plus knowing if tme/sev is active you can be
> pretty sure about that.

Then please explain it in detail and in the text so that it is clear. As
it is now, the reader is left wondering what that file is supposed to
state.

> Dave Hansen pointed those out in a previous patch series; here is the
> quote:
> 
> > CXL devices will have normal RAM on them, be exposed as "System RAM" and
> > they won't have encryption capabilities.  I think these devices were
> > probably the main motivation for EFI_MEMORY_CPU_CRYPTO.

So this would mean that if a system doesn't have CXL devices and has
TME/SME/SEV-* enabled, then it is running with encrypted memory.

Which would then also mean, you don't need any of that code - you only
need to enumerate CXL devices which, it seems, do not support memory
encryption, and then state that memory encryption is enabled on the
whole system, except for the memory of those devices.

I.e.,

$ dmesg | grep -i SME
[    1.783650] AMD Memory Encryption Features active: SME

Done - memory is encrypted on the whole system.

We could export it into /proc/cpuinfo so that you don't have to grep
dmesg and problem solved.
Mario Limonciello May 6, 2022, 2:18 p.m. UTC | #4
> On Wed, May 04, 2022 at 02:18:30PM -0300, Martin Fernandez wrote:
> > The use case is to know if a user is using hardware encryption or
> > not. This new sysfs file plus knowing if tme/sev is active you can be
> > pretty sure about that.
> 
> Then please explain it in detail and in the text so that it is clear. As
> it is now, the reader is left wondering what that file is supposed to
> state.
> 
> > Dave Hansen pointed those out in a previous patch series; here is the
> > quote:
> >
> > > CXL devices will have normal RAM on them, be exposed as "System RAM" and
> > > they won't have encryption capabilities.  I think these devices were
> > > probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
> 
> So this would mean that if a system doesn't have CXL devices and has
> TME/SME/SEV-* enabled, then it is running with encrypted memory.
> 
> Which would then also mean, you don't need any of that code - you only
> need to enumerate CXL devices which, it seems, do not support memory
> encryption, and then state that memory encryption is enabled on the
> whole system, except for the memory of those devices.
> 
> I.e.,
> 
> $ dmesg | grep -i SME
> [    1.783650] AMD Memory Encryption Features active: SME
> 
> Done - memory is encrypted on the whole system.
> 
> We could export it into /proc/cpuinfo so that you don't have to grep
> dmesg and problem solved.
> 

Actually, we solved that already for SME.  The kernel now only exposes
the feature in /proc/cpuinfo if it's active.
See kernel commit 08f253ec3767bcfafc5d32617a92cee57c63968e.

The fwupd code has been changed to match it too.  It will only trust the
presence of the sme flag with kernel 5.18.0 and newer.

https://github.com/fwupd/fwupd/commit/53a49b4ac1815572f242f85a1a1cc52a2d7ed50c
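
A minimal sketch of that kind of check, for illustration only (this is not the fwupd implementation): trust the `sme` flag in /proc/cpuinfo only when the running kernel is 5.18 or newer. The function names are made up for this example, and the caller is assumed to pass in the text of /proc/cpuinfo and the kernel release string (e.g. from `platform.release()`).

```python
import re


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def kernel_at_least(release, major, minor):
    """Compare the leading 'X.Y' of a kernel release string with (major, minor)."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        return False
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def sme_active(cpuinfo_text, release):
    # Older kernels showed the flag even when SME was not active,
    # so the flag is only meaningful from 5.18 on.
    return kernel_at_least(release, 5, 18) and "sme" in cpu_flags(cpuinfo_text)
```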
Dave Hansen May 6, 2022, 3:32 p.m. UTC | #5
On 5/6/22 05:44, Borislav Petkov wrote:
>> Dave Hansen pointed those out in a previous patch series; here is the
>> quote:
>>
>>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
>>> they won't have encryption capabilities.  I think these devices were
>>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
> So this would mean that if a system doesn't have CXL devices and has
> TME/SME/SEV-* enabled, then it is running with encrypted memory.
> 
> Which would then also mean, you don't need any of that code - you only
> need to enumerate CXL devices which, it seems, do not support memory
> encryption, and then state that memory encryption is enabled on the
> whole system, except for the memory of those devices.

CXL devices are just the easiest example to explain, but they are not
the only problem.

For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
since TDX requires integrity protection and NVDIMMs don't have metadata
space available.

Also, if this were purely a CXL problem, I would have expected this to
have been dealt with in the CXL spec alone.  But, this series is
actually driven by an ACPI spec.  That tells me that we'll see these
mismatched encryption capabilities in many more places than just CXL
devices.
Dan Williams May 6, 2022, 4 p.m. UTC | #6
On Fri, May 6, 2022 at 8:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/6/22 05:44, Borislav Petkov wrote:
> >> Dave Hansen pointed those out in a previous patch series; here is the
> >> quote:
> >>
> >>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
> >>> they won't have encryption capabilities.  I think these devices were
> >>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
> > So this would mean that if a system doesn't have CXL devices and has
> > TME/SME/SEV-* enabled, then it is running with encrypted memory.
> >
> > Which would then also mean, you don't need any of that code - you only
> > need to enumerate CXL devices which, it seems, do not support memory
> > encryption, and then state that memory encryption is enabled on the
> > whole system, except for the memory of those devices.
>
> CXL devices are just the easiest example to explain, but they are not
> the only problem.
>
> For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
> since TDX requires integrity protection and NVDIMMs don't have metadata
> space available.
>
> Also, if this were purely a CXL problem, I would have expected this to
> have been dealt with in the CXL spec alone.  But, this series is
> actually driven by an ACPI spec.  That tells me that we'll see these
> mismatched encryption capabilities in many more places than just CXL
> devices.

Yes, the problem is that encryption capabilities cut across multiple
specifications. For example, you might need to consult a CPU
vendor-specific manual, ACPI, EFI, PCI, and CXL specifications for a
single security feature.
Borislav Petkov May 6, 2022, 5:55 p.m. UTC | #7
On May 6, 2022 4:00:57 PM UTC, Dan Williams <dan.j.williams@intel.com> wrote:
>On Fri, May 6, 2022 at 8:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 5/6/22 05:44, Borislav Petkov wrote:
>> >> Dave Hansen pointed those out in a previous patch series; here is the
>> >> quote:
>> >>
>> >>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
>> >>> they won't have encryption capabilities.  I think these devices were
>> >>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
>> > So this would mean that if a system doesn't have CXL devices and has
>> > TME/SME/SEV-* enabled, then it is running with encrypted memory.
>> >
>> > Which would then also mean, you don't need any of that code - you only
>> > need to enumerate CXL devices which, it seems, do not support memory
>> > encryption, and then state that memory encryption is enabled on the
>> > whole system, except for the memory of those devices.
>>
>> CXL devices are just the easiest example to explain, but they are not
>> the only problem.
>>
>> For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
>> since TDX requires integrity protection and NVDIMMs don't have metadata
>> space available.
>>
>> Also, if this were purely a CXL problem, I would have expected this to
>> have been dealt with in the CXL spec alone.  But, this series is
>> actually driven by an ACPI spec.  That tells me that we'll see these
>> mismatched encryption capabilities in many more places than just CXL
>> devices.
>
>Yes, the problem is that encryption capabilities cut across multiple
>specifications. For example, you might need to consult a CPU
>vendor-specific manual, ACPI, EFI, PCI, and CXL specifications for a
>single security feature.

So here's the deal: we can say in the kernel that memory encryption is enabled and active. But then all those different devices and so on may or may not support encryption. IO devices do not support encryption either, afaict. And there you don't have node granularity etc. So you can't do this per-node thing anyway. Or you do it and it becomes insufficient soon after.

But that is not the question - they don't wanna say in fwupd whether every transaction was encrypted or not - they wanna say that encryption is active. And that we can give them now. 

Thx.
Dave Hansen May 6, 2022, 6:14 p.m. UTC | #8
On 5/6/22 10:55, Boris Petkov wrote:
> So here's the deal: we can say in the kernel that memory encryption
> is enabled and active.  But then all those different devices and so
> on,  can or cannot support encryption. IO devices do not support
> encryption either, afaict.

At least on MKTME platforms, if a device does DMA to a physical address
with the KeyID bits set, it gets memory encryption.  That's because all
the encryption magic is done in the memory controller itself.  The CPU's
memory controller doesn't actually care if the access comes from a
device or a CPU as long as the right physical bits are set.

The reason we're talking about this in terms of CXL devices is that CXL
devices have their *OWN* memory controllers.  Those memory controllers
might or might not support encryption.

> But that is not the question - they don't wanna say in fwupd whether
> every transaction was encrypted or not - they wanna say that
> encryption is active. And that we can give them now.

The reason we went down this per-node thing instead of something
system-wide is EFI_MEMORY_CPU_CRYPTO.  It's in the standard because EFI
systems are not expected to have uniform crypto capabilities across the
entire memory map.  Some memory will be capable of CPU crypto and some not.

As an example, if I were to build a system today with TDX and NVDIMMs,
I'd probably mark the RAM as EFI_MEMORY_CPU_CRYPTO=1 and the NVDIMMs as
EFI_MEMORY_CPU_CRYPTO=0.

I think you're saying that current AMD SEV systems have no need for
EFI_MEMORY_CPU_CRYPTO since their encryption capabilities *ARE* uniform.
I'm not challenging that at all.  This interface is total overkill for
systems with guaranteed uniform encryption capabilities.

But, this interface will *work* both for the uniform and non-uniform
systems alike.
Borislav Petkov May 6, 2022, 6:25 p.m. UTC | #9
On May 6, 2022 6:14:00 PM UTC, Dave Hansen <dave.hansen@intel.com> wrote:
>But, this interface will *work* both for the uniform and non-uniform
>systems alike.

And what would that additional information that some "node" - whatever "node" means nowadays - is not encrypted give you? 

Note that the fwupd use case is to be able to say that memory encryption is active - nothing more.
Dave Hansen May 6, 2022, 6:43 p.m. UTC | #10
On 5/6/22 11:25, Boris Petkov wrote:
> On May 6, 2022 6:14:00 PM UTC, Dave Hansen <dave.hansen@intel.com>
> wrote:
>> But, this interface will *work* both for the uniform and
>> non-uniform systems alike.
> And what would that additional information that some "node" -
> whatever "node" means nowadays - is not encrypted give you?

Tying it to the node ties it to the NUMA ABIs.  For instance, it lets
you say: "allocate memory with encryption capabilities" with a
set_mempolicy() to nodes that are enumerated as encryption-capable.
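
A rough userspace sketch of the enumeration half of that idea. It assumes the attribute added by this series is exposed as `/sys/devices/system/node/nodeX/crypto_capable` (the file name is an assumption here; this message does not spell it out), and the helper names are invented for the example.

```python
import os
import re


def crypto_capable_node_ids(node_root="/sys/devices/system/node"):
    """Return sorted ids of nodes whose (assumed) crypto_capable file reads 1."""
    ids = []
    for name in os.listdir(node_root):
        m = re.fullmatch(r"node(\d+)", name)
        if not m:
            continue  # skip entries such as "possible" or "online"
        path = os.path.join(node_root, name, "crypto_capable")
        try:
            with open(path) as f:
                if f.read().strip() == "1":
                    ids.append(int(m.group(1)))
        except OSError:
            continue  # attribute absent, e.g. a kernel without this series
    return sorted(ids)


def membind_arg(ids):
    """Format node ids as a comma-separated node list."""
    return ",".join(str(i) for i in ids)
```

The resulting list could then be handed to the existing NUMA policy tooling, for example as the node list for `numactl --membind=`, or turned into a nodemask for set_mempolicy().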

Imagine that we have a non-uniform system: some memory supports TDX (or
SEV-SNP) and some doesn't.  QEMU calls mmap() to allocate some guest
memory and then its ioctl()s to get its addresses stuffed into EPT/NPT.
The memory might be allocated from anywhere, CPU_CRYPTO-capable or not.
VM creation will fail because the (hardware-enforced) security checks
can't be satisfied on non-CPU_CRYPTO memory.

Userspace has no recourse to fix this.  It's just stuck.  In that case,
the *kernel* needs to be responsible for ensuring that the backing
physical memory supports TDX (or SEV).

This node attribute punts the problem back out to userspace.  It gives
userspace the ability to steer allocations to compatible NUMA nodes.  If
something goes wrong, they can use other NUMA ABIs to inspect the
situation, like /proc/$pid/numa_maps.
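
For the inspection step, a small sketch of how a tool might total up where a process's pages ended up, based on the `N<node>=<pages>` fields in /proc/$pid/numa_maps. The function name is made up for this example, and the caller is assumed to pass in the file's contents.

```python
import re
from collections import Counter


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields of /proc/<pid>/numa_maps per node."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] += int(pages)
    return dict(totals)
```

Any pages counted on a node that is not encryption-capable would point at an allocation that escaped the intended policy.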
Borislav Petkov May 6, 2022, 7:02 p.m. UTC | #11
On May 6, 2022 6:43:39 PM UTC, Dave Hansen <dave.hansen@intel.com> wrote:
>On 5/6/22 11:25, Boris Petkov wrote:
>> On May 6, 2022 6:14:00 PM UTC, Dave Hansen <dave.hansen@intel.com>
>> wrote:
>>> But, this interface will *work* both for the uniform and
>>> non-uniform systems alike.
>> And what would that additional information that some "node" -
>> whatever "node" means nowadays - is not encrypted give you?
>
>Tying it to the node ties it to the NUMA ABIs.  For instance, it lets
>you say: "allocate memory with encryption capabilities" with a
>set_mempolicy() to nodes that are enumerated as encryption-capable.

I was expecting something along those lines...

>Imagine that we have a non-uniform system: some memory supports TDX (or
>SEV-SNP) and some doesn't.  QEMU calls mmap() to allocate some guest
>memory and then its ioctl()s to get its addresses stuffed into EPT/NPT.
> The memory might be allocated from anywhere, CPU_CRYPTO-capable or not.
> VM creation will fail because the (hardware-enforced) security checks
>can't be satisfied on non-CPU_CRYPTO memory.
>
>Userspace has no recourse to fix this.  It's just stuck.  In that case,
> the *kernel* needs to be responsible for ensuring that the backing
>physical memory supports TDX (or SEV).
>
>This node attribute punts the problem back out to userspace.  It gives
>userspace the ability to steer allocations to compatible NUMA nodes.  If
>something goes wrong, they can use other NUMA ABIs to inspect the
>situation, like /proc/$pid/numa_maps.

That's all fine and dandy but I still don't see the *actual*, real-life use case of why something would request memory of particular encryption capabilities. Don't get me wrong  - I'm not saying there are not such use cases - I'm saying we should go all the way and fully define properly  *why* we're doing this whole hoopla.

Remember - this all started with "i wanna say that mem enc is active" and now we're so far deep down the rabbit hole...
Dave Hansen May 9, 2022, 6:47 p.m. UTC | #12
... adding some KVM/TDX folks

On 5/6/22 12:02, Boris Petkov wrote:
>> This node attribute punts the problem back out to userspace.  It
>> gives userspace the ability to steer allocations to compatible NUMA
>> nodes.  If something goes wrong, they can use other NUMA ABIs to
>> inspect the situation, like /proc/$pid/numa_maps.
> That's all fine and dandy but I still don't see the *actual*,
> real-life use case of why something would request memory of
> particular encryption capabilities. Don't get me wrong  - I'm not
> saying there are not such use cases - I'm saying we should go all the
> way and fully define properly  *why* we're doing this whole hoopla.

Let's say TDX is running on a system with mixed encryption
capabilities*.  Some NUMA nodes support TDX and some don't.  If that
happens, your guest RAM can come from anywhere.  When the host kernel
calls into the TDX module to add pages to the guest (via
TDH.MEM.PAGE.ADD) it might get an error back from the TDX module.  At
that point, the host kernel is stuck.  It's got a partially created
guest and no recourse to fix the error.

This new ABI provides a way to avoid that situation in the first place.
Userspace can look at sysfs to figure out which NUMA nodes support
"encryption" (aka. TDX) and can use the existing NUMA policy ABI to
avoid TDH.MEM.PAGE.ADD failures.

So, here's the question for the TDX folks: are these mixed-capability
systems a problem for you?  Does this ABI help you fix the problem?
Will your userspace (qemu and friends) actually consume this ABI?

* There are three ways we might hit a system with this issue:
  1. NVDIMMs that don't support TDX, e.g. because they lack memory
     integrity protection.
  2. CXL-attached memory controllers that can't do encryption at all
  3. Nominally TDX-compatible memory that was not covered/converted by
     the kernel for some reason (memory hot-add, or ran out of TDMR
     resources)
Borislav Petkov May 9, 2022, 10:17 p.m. UTC | #13
On Mon, May 09, 2022 at 11:47:43AM -0700, Dave Hansen wrote:
> ... adding some KVM/TDX folks

+ AMD SEV folks as they're going to probably need something like that
too.

> On 5/6/22 12:02, Boris Petkov wrote:
> >> This node attribute punts the problem back out to userspace.  It
> >> gives userspace the ability to steer allocations to compatible NUMA
> >> nodes.  If something goes wrong, they can use other NUMA ABIs to
> >> inspect the situation, like /proc/$pid/numa_maps.
> > That's all fine and dandy but I still don't see the *actual*,
> > real-life use case of why something would request memory of
> > particular encryption capabilities. Don't get me wrong  - I'm not
> > saying there are not such use cases - I'm saying we should go all the
> > way and fully define properly  *why* we're doing this whole hoopla.
> 
> Let's say TDX is running on a system with mixed encryption
> capabilities*.  Some NUMA nodes support TDX and some don't.  If that
> happens, your guest RAM can come from anywhere.  When the host kernel
> calls into the TDX module to add pages to the guest (via
> TDH.MEM.PAGE.ADD) it might get an error back from the TDX module.  At
> that point, the host kernel is stuck.  It's got a partially created
> guest and no recourse to fix the error.

Thanks for that detailed use case, btw!

> This new ABI provides a way to avoid that situation in the first place.
>  Userspace can look at sysfs to figure out which NUMA nodes support
> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
> avoid TDH.MEM.PAGE.ADD failures.
> 
> So, here's the question for the TDX folks: are these mixed-capability
> systems a problem for you?  Does this ABI help you fix the problem?

What I'm also not really sure about: is per-node granularity ok? I guess
it is but let me ask it anyway...

> Will your userspace (qemu and friends) actually consume this ABI?

Same question for SEV folks - do you guys think this interface would
make sense for the SEV side of things?

> * There are three ways we might hit a system with this issue:
>   1. NVDIMMs that don't support  TDX, like lack of memory integrity
>      protection.
>   2. CXL-attached memory controllers that can't do encryption at all
>   3. Nominally TDX-compatible memory that was not covered/converted by
>      the kernel for some reason (memory hot-add, or ran out of TDMR
>      resources)

And I think some of those might be of interest to the AMD side of things
too.

Thx.
Dave Hansen May 9, 2022, 10:56 p.m. UTC | #14
On 5/9/22 15:17, Borislav Petkov wrote:
> 
>> This new ABI provides a way to avoid that situation in the first place.
>>  Userspace can look at sysfs to figure out which NUMA nodes support
>> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
>> avoid TDH.MEM.PAGE.ADD failures.
>>
>> So, here's the question for the TDX folks: are these mixed-capability
>> systems a problem for you?  Does this ABI help you fix the problem?
> What I'm not really sure too is, is per-node granularity ok? I guess it
> is but let me ask it anyway...

I think nodes are the only sane granularity.

tl;dr: Zones might work in theory but have no existing useful ABI around
them and too many practical problems.  Nodes are the only other real
option without inventing something new and fancy.

--

What about zones (or any sub-node granularity really)?

Folks have, for instance, discussed adding new memory zones for this
purpose: have ZONE_NORMAL, and then ZONE_UNENCRYPTABLE (or something
similar).  Zones are great because they have their own memory allocation
pools and can be targeted directly from within the kernel using things
like GFP_DMA.  If you run out of ZONE_FOO, you can theoretically just
reclaim ZONE_FOO.

But, even a single new zone isn't necessarily good enough.  What if we
have some ZONE_NORMAL that's encryption-capable and some that's not?
The same goes for ZONE_MOVABLE.  We'd probably need at least:

	ZONE_NORMAL
	ZONE_NORMAL_UNENCRYPTABLE
	ZONE_MOVABLE
	ZONE_MOVABLE_UNENCRYPTABLE

Also, zones are (mostly) not exposed to userspace.  If we want userspace
to be able to specify encryption capabilities, we're talking about new
ABI for enumeration and policy specification.

Why node granularity?

First, for the majority of cases, nodes "just work".  ACPI systems with
an "HMAT" table already separate out different performance classes of
memory into different Proximity Domains (PXMs) which the kernel maps
into NUMA nodes.

This means that for NVDIMMs or virtually any CXL memory regions (one or
more CXL devices glued together) we can think of, they already get their
own NUMA node.  Those nodes have their own zones (implicitly) and can
lean on the existing NUMA ABI for enumeration and policy creation.

Basically, the firmware creates the NUMA nodes for the kernel.  All the
kernel has to do is acknowledge which of them can do encryption or not.

The one place where nodes fall down is if a memory hot-add occurs within
an existing node and the newly hot-added memory does not match the
encryption capabilities of the existing memory.  The kernel basically
has two options in that case:
 * Throw away the memory until the next reboot where the system might be
   reconfigured in a way to support more uniform capabilities (this is
   actually *likely* for a reboot of a TDX system)
 * Create a synthetic NUMA node to hold it

Neither one of those is a horrible option.  Throwing the memory away is
the most likely way TDX will handle this situation if it pops up.  For
now, the folks building TDX-capable BIOSes claim emphatically that such
a system won't be built.
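
The hot-add decision described above can be written down as a small
Python model. This is only a sketch of the policy as described in this
message, not kernel code; the function name, the enum and the
allow_synthetic_nodes switch are all hypothetical.

```python
from enum import Enum


class HotAddAction(Enum):
    ADD_TO_NODE = "add to the existing node"
    DISCARD = "throw away until the next reboot"
    SYNTHETIC_NODE = "place in a synthetic NUMA node"


def hot_add_action(node_is_capable, new_memory_is_capable,
                   allow_synthetic_nodes=False):
    """Pick what to do with memory hot-added into an existing node."""
    # Matching capabilities: the node stays uniform, nothing special to do.
    if node_is_capable == new_memory_is_capable:
        return HotAddAction.ADD_TO_NODE
    # Mismatch: one of the two options listed above applies.
    if allow_synthetic_nodes:
        return HotAddAction.SYNTHETIC_NODE
    return HotAddAction.DISCARD
```

With the default arguments a mismatch leads to DISCARD, which matches
the "throw the memory away" handling expected for TDX.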
Richard Hughes May 16, 2022, 8:39 a.m. UTC | #15
On Fri, 6 May 2022 at 20:02, Boris Petkov <bp@alien8.de> wrote:
> Remember - this all started with "i wanna say that mem enc is active" and now we're so far deep down the rabbit hole...

This is still something consumers need; at the moment users have no
idea if data is *actually* being encrypted. I think Martin has done an
admirable job going down the rabbit hole to add this functionality in
the proper manner -- so it's actually accurate and useful for use
cases other than fwupd's.

At the moment my professional advice to people asking about Intel
memory encryption is to assume there is none, as there's no way of
verifying that it's actually enabled and working. This is certainly a
shame for something so promising, touted as an enterprise security
feature.

Richard
Borislav Petkov May 18, 2022, 7:52 a.m. UTC | #16
On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> This is still something consumers need; at the moment users have no
> idea if data is *actually* being encrypted.

As it was already pointed out - that's in /proc/cpuinfo.
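
For reference, a minimal Python sketch of that check. It assumes the
relevant x86 feature flags appear in the "flags" line as "tme", "sme"
and "sev"; the helper name is hypothetical. It only reports what the
CPU flags say and makes no claim about which memory ranges are
covered.

```python
# Assumption: feature flag names as they appear in the cpuinfo "flags" line.
ENCRYPTION_FLAGS = ("tme", "sme", "sev")


def encryption_flags(cpuinfo_text):
    """Return the set of memory-encryption flags found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Exact token match, so e.g. "sev_es" does not count as "sev".
            found.update(f for f in value.split() if f in ENCRYPTION_FLAGS)
    return found
```

Typical use would be encryption_flags(open("/proc/cpuinfo").read()).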

> I think Martin has done an admirable job going down the rabbit hole
> to add this functionality in the proper manner -- so it's actually
> accurate and useful for use cases other than fwupd's.

Only after I scratched the surface as to why this is needed.

> At the moment my professional advice to people asking about Intel
> memory encryption

Well, what kind of memory encryption? Host, guest?
Dan Williams May 18, 2022, 6:28 p.m. UTC | #17
On Wed, May 18, 2022 at 12:53 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> > This is still something consumers need; at the moment users have no
> > idea if data is *actually* being encrypted.
>
> As it was already pointed out - that's in /proc/cpuinfo.

For TME you still need to compare it against the EFI memory map as
there are exclusion ranges for things like persistent memory. Given
that persistent memory can be forced into volatile "System RAM"
operation by various command line options and driver overrides, you
need to at least trim the assumptions of what is encrypted to the
default "conventional memory" conveyed by platform firmware / BIOS.
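
To make the comparison concrete, here is a small Python sketch that
checks whether every conventional-memory descriptor carries the
EFI_MEMORY_CPU_CRYPTO attribute. The type and attribute values are the
ones defined by the UEFI specification; the Descriptor tuple and the
function names are hypothetical, and how the descriptors are obtained
from the firmware memory map is left out.

```python
from typing import NamedTuple

EFI_CONVENTIONAL_MEMORY = 7      # EfiConventionalMemory (UEFI spec)
EFI_MEMORY_CPU_CRYPTO = 0x80000  # memory attribute bit (UEFI spec)


class Descriptor(NamedTuple):
    mem_type: int
    start: int
    num_pages: int
    attribute: int


def unencryptable_conventional(descriptors):
    """Conventional-memory descriptors lacking EFI_MEMORY_CPU_CRYPTO."""
    return [d for d in descriptors
            if d.mem_type == EFI_CONVENTIONAL_MEMORY
            and not d.attribute & EFI_MEMORY_CPU_CRYPTO]


def all_conventional_encryptable(descriptors):
    """True only if conventional memory exists and all of it is capable."""
    conventional = [d for d in descriptors
                    if d.mem_type == EFI_CONVENTIONAL_MEMORY]
    return bool(conventional) and not unencryptable_conventional(conventional)
```

Non-conventional ranges such as persistent memory are ignored here on
purpose, which is the trimming described above.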
Borislav Petkov May 18, 2022, 8:23 p.m. UTC | #18
On Wed, May 18, 2022 at 11:28:49AM -0700, Dan Williams wrote:
> On Wed, May 18, 2022 at 12:53 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> > > This is still something consumers need; at the moment users have no
> > > idea if data is *actually* being encrypted.
> >
> > As it was already pointed out - that's in /proc/cpuinfo.
> 
> For TME you still need to compare it against the EFI memory map as
> there are exclusion ranges for things like persistent memory. Given
> that persistent memory can be forced into volatile "System RAM"
> operation by various command line options and driver overrides, you
> need to at least trim the assumptions of what is encrypted to the
> default "conventional memory" conveyed by platform firmware / BIOS.

So SME/SEV also has some exceptions as to which memory is encrypted and
which is not. Doing device I/O would be one example where you simply
cannot encrypt.

But that wasn't the original question - the original question is whether
memory encryption is enabled on the system.

Now, the per-node way of describing what is encrypted and what is not
is not enough either when you want to determine whether an arbitrary
transaction is being done encrypted or not. You can do silly things
such as mapping a page decrypted even though the underlying hardware can
do encryption and every other page is encrypted, and still think that
that page is encrypted too. But that would be a lie.

So the whole problem space needs to be specified in a lot more detail:
what exact information userspace is going to need and how we can
provide it.