[RFC,v2,00/13] Multi-Key Total Memory Encryption API (MKTME)

Message ID: cover.1543903910.git.alison.schofield@intel.com

Message

Alison Schofield Dec. 4, 2018, 7:39 a.m. UTC
Hi Thomas, David,

Here is an updated RFC on the APIs to support MKTME.
(Multi-Key Total Memory Encryption)

This RFC presents the two API additions to support the creation and
usage of memory encryption keys (a rough usage sketch follows the list):
 1) Kernel Key Service type "mktme"
 2) System call encrypt_mprotect()
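
A rough usage sketch of the two pieces together (illustration only and
error handling omitted: the key payload options and the syscall number
macro below are assumptions about the proposed ABI, not its final form):

	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <keyutils.h>

	int main(void)
	{
		/* Hypothetical payload format for the "mktme" key type. */
		const char *opts = "type=cpu algorithm=aes-xts-128";
		size_t len = 2UL << 20;
		key_serial_t key;
		void *buf;

		key = add_key("mktme", "example-key", opts, strlen(opts),
			      KEY_SPEC_THREAD_KEYRING);

		buf = mmap(NULL, len, PROT_NONE,
			   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		/* No glibc wrapper; __NR_encrypt_mprotect stands in for
		 * whatever number the series wires up in the syscall tables. */
		syscall(__NR_encrypt_mprotect, buf, len,
			PROT_READ | PROT_WRITE, key);
		return 0;
	}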

This patchset is built upon Kirill Shutemov's work for the core MKTME
support.

David: Please let me know if the changes made, based on your review,
are reasonable. I don't think that the new changes touch key service
specific areas (much).

Thomas: Please provide feedback on encrypt_mprotect(). If not a
review, then a direction check would be helpful.

I picked up a few more Cc's this time from get_maintainer!

Thanks!
Alison


Changes in RFC2
Add a preparser to mktme key service. (dhowells)
Replace the key serial number with a key struct pointer in mktme_map. (dhowells)
Remove patch that inserts a special MKTME case in keyctl revoke. (dhowells)
Update key usage syntax in the documentation. (Kai)
Replace NO_PKEY and NO_KEYID with a single constant, NO_KEY. (Jarkko)
Clarify comments in the changelog and code. (Jarkko)
Add clear, no-encrypt, and update key support.
Add mktme_savekeys (Patch 12) to give the kernel permission to save key data.
Add CPU hotplug support. (Patch 13)

Alison Schofield (13):
  x86/mktme: Document the MKTME APIs
  mm: Generalize the mprotect implementation to support extensions
  syscall/x86: Wire up a new system call for memory encryption keys
  x86/mm: Add helper functions for MKTME memory encryption keys
  x86/mm: Set KeyIDs in encrypted VMAs
  mm: Add the encrypt_mprotect() system call
  x86/mm: Add helpers for reference counting encrypted VMAs
  mm: Use reference counting for encrypted VMAs
  mm: Restrict memory encryption to anonymous VMA's
  keys/mktme: Add the MKTME Key Service type for memory encryption
  keys/mktme: Program memory encryption keys on a system wide basis
  keys/mktme: Save MKTME data if kernel cmdline parameter allows
  keys/mktme: Support CPU Hotplug for MKTME keys

 Documentation/admin-guide/kernel-parameters.rst |   1 +
 Documentation/admin-guide/kernel-parameters.txt |  11 +
 Documentation/x86/mktme/index.rst               |  11 +
 Documentation/x86/mktme/mktme_demo.rst          |  53 +++
 Documentation/x86/mktme/mktme_encrypt.rst       |  58 +++
 Documentation/x86/mktme/mktme_keys.rst          | 109 +++++
 Documentation/x86/mktme/mktme_overview.rst      |  60 +++
 arch/x86/Kconfig                                |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl          |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl          |   1 +
 arch/x86/include/asm/mktme.h                    |  25 +
 arch/x86/mm/mktme.c                             | 179 ++++++++
 fs/exec.c                                       |   4 +-
 include/keys/mktme-type.h                       |  41 ++
 include/linux/key.h                             |   2 +
 include/linux/mm.h                              |  11 +-
 include/linux/syscalls.h                        |   2 +
 include/uapi/asm-generic/unistd.h               |   4 +-
 kernel/fork.c                                   |   2 +
 kernel/sys_ni.c                                 |   2 +
 mm/mprotect.c                                   |  91 +++-
 security/keys/Kconfig                           |  11 +
 security/keys/Makefile                          |   1 +
 security/keys/mktme_keys.c                      | 580 ++++++++++++++++++++++++
 24 files changed, 1249 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/x86/mktme/index.rst
 create mode 100644 Documentation/x86/mktme/mktme_demo.rst
 create mode 100644 Documentation/x86/mktme/mktme_encrypt.rst
 create mode 100644 Documentation/x86/mktme/mktme_keys.rst
 create mode 100644 Documentation/x86/mktme/mktme_overview.rst
 create mode 100644 include/keys/mktme-type.h
 create mode 100644 security/keys/mktme_keys.c

Comments

Peter Zijlstra Dec. 4, 2018, 9:25 a.m. UTC | #1
On Mon, Dec 03, 2018 at 11:39:47PM -0800, Alison Schofield wrote:
> (Multi-Key Total Memory Encryption)

I think that MKTME is a horrible name, and doesn't appear to accurately
describe what it does either. Specifically the 'total' seems out of
place, it doesn't require all memory to be encrypted.
kirill.shutemov@linux.intel.com Dec. 4, 2018, 9:46 a.m. UTC | #2
On Tue, Dec 04, 2018 at 09:25:50AM +0000, Peter Zijlstra wrote:
> On Mon, Dec 03, 2018 at 11:39:47PM -0800, Alison Schofield wrote:
> > (Multi-Key Total Memory Encryption)
> 
> I think that MKTME is a horrible name, and doesn't appear to accurately
> describe what it does either. Specifically the 'total' seems out of
> place, it doesn't require all memory to be encrypted.

MKTME implies TME. TME is enabled by BIOS and it encrypts all memory with
CPU-generated key. MKTME allows to use other keys or disable encryption
for a page.

But, yes, name is not good.
Andy Lutomirski Dec. 4, 2018, 7:19 p.m. UTC | #3
On Mon, Dec 3, 2018 at 11:37 PM Alison Schofield
<alison.schofield@intel.com> wrote:
>
> Hi Thomas, David,
>
> Here is an updated RFC on the API's to support MKTME.
> (Multi-Key Total Memory Encryption)
>
> This RFC presents the 2 API additions to support the creation and
> usage of memory encryption keys:
>  1) Kernel Key Service type "mktme"
>  2) System call encrypt_mprotect()
>
> This patchset is built upon Kirill Shutemov's work for the core MKTME
> support.
>
> David: Please let me know if the changes made, based on your review,
> are reasonable. I don't think that the new changes touch key service
> specific areas (much).
>
> Thomas: Please provide feedback on encrypt_mprotect(). If not a
> review, then a direction check would be helpful.
>

I'm not Thomas, but I think it's the wrong direction.  As it stands,
encrypt_mprotect() is an incomplete version of mprotect() (since it's
missing the protection key support), and it's also functionally just
MADV_DONTNEED.  In other words, the sole user-visible effect appears
to be that the existing pages are blown away.  The fact that it
changes the key in use doesn't seem terribly useful, since it's
anonymous memory, and the most secure choice is to use CPU-managed
keying, which appears to be the default anyway on TME systems.  It
also has totally unclear semantics WRT swap, and, off the top of my
head, it looks like it may have serious cache-coherency issues and
like swapping the pages might corrupt them, both because there are no
flushes and because the direct-map alias looks like it will use the
default key and therefore appear to contain the wrong data.

I would propose a very different direction: don't try to support MKTME
at all for anonymous memory, and instead figure out the important use
cases and support them directly.  The use cases that I can think of
off the top of my head are:

1. pmem.  This should probably use a very different API.

2. Some kind of VM hardening, where a VM's memory can be protected a
little tiny bit from the main kernel.  But I don't see why this is any
better than XPO (eXclusive Page-frame Ownership), which brings to
mind:

The main implementation concern I have with this patch set is cache
coherency and handling of the direct map.  Unless I missed something,
you're not doing anything about the direct map, which means that you
have RW aliases of the same memory with different keys.  For use case
#2, this probably means that you need to either get rid of the direct
map and make get_user_pages() fail, or you need to change the key on
the direct map as well, probably using the pageattr.c code.

As for caching, as far as I can tell from reading the preliminary
docs, Intel's MKTME, much like AMD's SME, is basically invisible to
the hardware cache coherency mechanism.  So, if you modify a physical
address with one key (or SME-enable bit), and you read it with
another, you get garbage unless you flush.  And, if you modify memory
with one key then remap it with a different key without flushing in
the mean time, you risk corruption.  And, what's worse, if I'm reading
between the lines in the docs correctly, if you use PCONFIG to change
a key, you may need to do a bunch of cache flushing to ensure you get
reasonable effects.  (If you have dirty cache lines for some (PA, key)
and you PCONFIG to change the underlying key, you get different
results depending on whether the writeback happens before or after the
package doing the writeback notices the PCONFIG.)

Finally, If you're going to teach the kernel how to have some user
pages that aren't in the direct map, you've essentially done XPO,
which is nifty but expensive.  And I think that doing this gets you
essentially all the benefit of MKTME for the non-pmem use case.  Why
exactly would any software want to use anything other than a
CPU-managed key for anything other than pmem?

--Andy
Andy Lutomirski Dec. 4, 2018, 8 p.m. UTC | #4
On Tue, Dec 4, 2018 at 11:19 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Mon, Dec 3, 2018 at 11:37 PM Alison Schofield
> <alison.schofield@intel.com> wrote:
> >

> Finally, If you're going to teach the kernel how to have some user
> pages that aren't in the direct map, you've essentially done XPO,
> which is nifty but expensive.  And I think that doing this gets you
> essentially all the benefit of MKTME for the non-pmem use case.  Why
> exactly would any software want to use anything other than a
> CPU-managed key for anything other than pmem?
>

Let me say this less abstractly.  Here's a somewhat concrete actual
proposal.  Make a new memfd_create() flag like MEMFD_ISOLATED.  The
semantics are that the underlying pages are made not-present in the
direct map when they're allocated (which is hideously slow, but so be
it), and that anything that tries to get_user_pages() the resulting
pages fails.  And then make sure we have all the required APIs so that
QEMU can still map this stuff into a VM.
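
(As a userspace sketch of that flow, in the same spirit as the snippets
later in this thread; MEMFD_ISOLATED and guest_size are placeholders,
not part of any existing uapi:)

	int fd = memfd_create("vm-ram", MEMFD_ISOLATED);  /* hypothetical flag */
	ftruncate(fd, guest_size);
	/* pages backing this mapping are dropped from the direct map at
	   allocation time, and get_user_pages() on them fails */
	void *guest_mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
	/* ... hand guest_mem to the hypervisor as guest RAM ... */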

If there is indeed a situation in which MKTME-ifying the memory adds
some value, then we can consider doing that.

And maybe we get fancy and encrypt this memory when it's swapped, but
maybe we should just encrypt everything when it's swapped.
Dave Hansen Dec. 4, 2018, 8:32 p.m. UTC | #5
On 12/4/18 12:00 PM, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 11:19 AM Andy Lutomirski <luto@kernel.org> wrote:
>> On Mon, Dec 3, 2018 at 11:37 PM Alison Schofield <alison.schofield@intel.com> wrote:
>> Finally, If you're going to teach the kernel how to have some user
>> pages that aren't in the direct map, you've essentially done XPO,
>> which is nifty but expensive.  And I think that doing this gets you
>> essentially all the benefit of MKTME for the non-pmem use case.  Why
>> exactly would any software want to use anything other than a
>> CPU-managed key for anything other than pmem?
> 
> Let me say this less abstractly.  Here's a somewhat concrete actual
> proposal.  Make a new memfd_create() flag like MEMFD_ISOLATED.  The
> semantics are that the underlying pages are made not-present in the
> direct map when they're allocated (which is hideously slow, but so be
> it), and that anything that tries to get_user_pages() the resulting
> pages fails.  And then make sure we have all the required APIs so that
> QEMU can still map this stuff into a VM.

I think we need get_user_pages().  We want direct I/O to work, *and* we
really want direct device assignment into VMs.

> And maybe we get fancy and encrypt this memory when it's swapped, but
> maybe we should just encrypt everything when it's swapped.

We decided long ago (and this should be in the patches somewhere) that
we wouldn't force memory to be encrypted in swap.  We would just
recommend it in the documentation as a best practice, especially when
using MKTME.

We can walk that back, of course, but that's what we're doing at the moment.
Jarkko Sakkinen Dec. 5, 2018, 8:30 p.m. UTC | #6
On Mon, 2018-12-03 at 23:39 -0800, Alison Schofield wrote:
> Hi Thomas, David,
> 
> Here is an updated RFC on the API's to support MKTME.
> (Multi-Key Total Memory Encryption)
> 
> This RFC presents the 2 API additions to support the creation and
> usage of memory encryption keys:
>  1) Kernel Key Service type "mktme"
>  2) System call encrypt_mprotect()
> 
> This patchset is built upon Kirill Shutemov's work for the core MKTME
> support.

Please, explain what MKTME is right here.

No references, no explanations... Even with a reference, a short
summary would be really nice to have.

/Jarkko
Jarkko Sakkinen Dec. 5, 2018, 8:32 p.m. UTC | #7
On Tue, 2018-12-04 at 12:46 +0300, Kirill A. Shutemov wrote:
> On Tue, Dec 04, 2018 at 09:25:50AM +0000, Peter Zijlstra wrote:
> > On Mon, Dec 03, 2018 at 11:39:47PM -0800, Alison Schofield wrote:
> > > (Multi-Key Total Memory Encryption)
> > 
> > I think that MKTME is a horrible name, and doesn't appear to accurately
> > describe what it does either. Specifically the 'total' seems out of
> > place, it doesn't require all memory to be encrypted.
> 
> MKTME implies TME. TME is enabled by BIOS and it encrypts all memory with
> CPU-generated key. MKTME allows to use other keys or disable encryption
> for a page.

When you say "disable encryption to a page" does the encryption get
actually disabled or does the CPU just decrypt it transparently i.e.
what happens physically?

> But, yes, name is not good.

/Jarkko
Jarkko Sakkinen Dec. 5, 2018, 10:19 p.m. UTC | #8
On Tue, 2018-12-04 at 11:19 -0800, Andy Lutomirski wrote:
> I'm not Thomas, but I think it's the wrong direction.  As it stands,
> encrypt_mprotect() is an incomplete version of mprotect() (since it's
> missing the protection key support), and it's also functionally just
> MADV_DONTNEED.  In other words, the sole user-visible effect appears
> to be that the existing pages are blown away.  The fact that it
> changes the key in use doesn't seem terribly useful, since it's
> anonymous memory, and the most secure choice is to use CPU-managed
> keying, which appears to be the default anyway on TME systems.  It
> also has totally unclear semantics WRT swap, and, off the top of my
> head, it looks like it may have serious cache-coherency issues and
> like swapping the pages might corrupt them, both because there are no
> flushes and because the direct-map alias looks like it will use the
> default key and therefore appear to contain the wrong data.
> 
> I would propose a very different direction: don't try to support MKTME
> at all for anonymous memory, and instead figure out the important use
> cases and support them directly.  The use cases that I can think of
> off the top of my head are:
> 
> 1. pmem.  This should probably use a very different API.
> 
> 2. Some kind of VM hardening, where a VM's memory can be protected a
> little tiny bit from the main kernel.  But I don't see why this is any
> better than XPO (eXclusive Page-frame Ownership), which brings to
> mind:

What is the threat model anyway for AMD and Intel technologies?

For me it looks like that you can read, write and even replay 
encrypted pages both in SME and TME. 

/Jarkko
Dave Hansen Dec. 5, 2018, 11:49 p.m. UTC | #9
On 12/4/18 11:19 AM, Andy Lutomirski wrote:
> I'm not Thomas, but I think it's the wrong direction.  As it stands,
> encrypt_mprotect() is an incomplete version of mprotect() (since it's
> missing the protection key support),

I thought about this when I added mprotect_pkey().  We start with:

	mprotect(addr, len, prot);

then

	mprotect_pkey(addr, len, prot);

then

	mprotect_pkey_encrypt(addr, len, prot, key);

That doesn't scale because we eventually have
mprotect_and_a_history_of_mm_features(). :)

What I was hoping to see was them do this (apologies for the horrible
indentation):

	ptr = mmap(..., PROT_NONE);
	mprotect_pkey(   addr, len, PROT_NONE, pkey);
	mprotect_encrypt(addr, len, PROT_NONE, keyid);
	mprotect(        addr, len, real_prot);

The point is that you *can* stack these things and don't have to have an
mprotect_kitchen_sink() if you use PROT_NONE for intermediate
permissions during setup.

> and it's also functionally just MADV_DONTNEED.  In other words, the
> sole user-visible effect appears to be that the existing pages are
> blown away.  The fact that it changes the key in use doesn't seem
> terribly useful, since it's anonymous memory,

It's functionally MADV_DONTNEED, plus a future promise that your writes
will never show up as plaintext on the DIMM.

We also haven't settled on the file-backed properties.  For file-backed,
my hope was that you could do:

	ptr = mmap(fd, size, prot);
	printf("ciphertext: %x\n", *ptr);
	mprotect_encrypt(ptr, len, prot, keyid);
	printf("plaintext: %x\n", *ptr);

> and the most secure choice is to use CPU-managed keying, which
> appears to be the default anyway on TME systems.  It also has totally
> unclear semantics WRT swap, and, off the top of my head, it looks
> like it may have serious cache-coherency issues and like swapping the
> pages might corrupt them, both because there are no flushes and
> because the direct-map alias looks like it will use the default key
> and therefore appear to contain the wrong data.

I think we fleshed this out on IRC a bit, but the other part of the
implementation is described here: https://lwn.net/Articles/758313/, and
contains a direct map per keyid.  When you do phys_to_virt() and
friends, you get the direct-map alias that is appropriate for the
physical page, i.e. the one that gives the correct, decrypted view.
And, yes, this has very consequential security implications.

> I would propose a very different direction: don't try to support MKTME
> at all for anonymous memory, and instead figure out the important use
> cases and support them directly.  The use cases that I can think of
> off the top of my head are:
> 
> 1. pmem.  This should probably use a very different API.
> 
> 2. Some kind of VM hardening, where a VM's memory can be protected a
> little tiny bit from the main kernel.  But I don't see why this is any
> better than XPO (eXclusive Page-frame Ownership), which brings to
> mind:

The XPO approach is "fun", and would certainly be a way to keep the
direct map from being exploited to get access to plain-text mappings of
ciphertext.

But, it also has massive performance implications and we didn't want
to go there quite yet.

> The main implementation concern I have with this patch set is cache
> coherency and handling of the direct map.  Unless I missed something,
> you're not doing anything about the direct map, which means that you
> have RW aliases of the same memory with different keys.  For use case
> #2, this probably means that you need to either get rid of the direct
> map and make get_user_pages() fail, or you need to change the key on
> the direct map as well, probably using the pageattr.c code.

The current, public hardware spec has a description of what's required
to maintain cache coherency.  Basically, you can keep as many mappings
of a physical page as you want, but only write to one mapping at a time,
and clflush the old one when you want to write to a new one.
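
(A minimal sketch of that rule, assuming kva_old and kva_new are two
kernel aliases of the same physical page mapped with different KeyIDs,
and using the existing clflush_cache_range() helper:)

	/* write only through the old-KeyID alias ... */
	memset(kva_old, 0, PAGE_SIZE);
	/* ... flush it before the page is touched through any other alias ... */
	clflush_cache_range(kva_old, PAGE_SIZE);
	/* ... and only then start writing through the new-KeyID alias */
	memset(kva_new, 0, PAGE_SIZE);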

> As for caching, As far as I can tell from reading the preliminary
> docs, Intel's MKTME, much like AMD's SME, is basically invisible to
> the hardware cache coherency mechanism.  So, if you modify a physical
> address with one key (or SME-enable bit), and you read it with
> another, you get garbage unless you flush.  And, if you modify memory
> with one key then remap it with a different key without flushing in
> the mean time, you risk corruption.

Yes, all true (at least with respect to Intel's implementation).

> And, what's worse, if I'm reading
> between the lines in the docs correctly, if you use PCONFIG to change
> a key, you may need to do a bunch of cache flushing to ensure you get
> reasonable effects.  (If you have dirty cache lines for some (PA, key)
> and you PCONFIG to change the underlying key, you get different
> results depending on whether the writeback happens before or after the
> package doing the writeback notices the PCONFIG.)

We're not going to allow a key to be PCONFIG'd while there are any
physical pages still associated with it.  There are per-VMA refcounts
tied back to the keyid slots, IIRC.  So, before PCONFIG can happen, we
just need to make sure that all the VMAs are gone, all the pages are
freed, and all dirty cachelines have been clflushed.
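
(A sketch of that ordering; the names below are made up for illustration
and are not the helpers used in the patches:)

	/* conceptually, before reprogramming a hardware keyid slot: */
	if (encrypt_count[keyid])		/* encrypted VMAs still exist */
		return -EBUSY;
	free_keyid_pages(keyid);		/* free the pages, clflush their lines */
	mktme_program_key(keyid, new_payload);	/* only now issue PCONFIG */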

This is where get_user_pages() is our mortal enemy, though.  I hope we
got that right.  Kirill/Alison, we should chat about this one. :)

> Finally, If you're going to teach the kernel how to have some user
> pages that aren't in the direct map, you've essentially done XPO,
> which is nifty but expensive.  And I think that doing this gets you
> essentially all the benefit of MKTME for the non-pmem use case.  Why
> exactly would any software want to use anything other than a
> CPU-managed key for anything other than pmem?

It is handy, for one, to let you "cluster" key usage.  If you have 5
Pepsi VMs and 5 Coke VMs, each Pepsi one using the same key and each
Coke one using the same key, you can boil it down to only 2 hardware
keyid slots that get used, and do this transparently.

But, I think what you're implying is that the security properties of
user-supplied keys can only be *worse* than using CPU-generated keys
(assuming the CPU does a good job generating it).  So, why bother
allowing user-specified keys in the first place?

It's a good question and I don't have a solid answer for why folks want
this.  I'll find out.
Andy Lutomirski Dec. 6, 2018, 1:09 a.m. UTC | #10
On Wed, Dec 5, 2018 at 3:49 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 12/4/18 11:19 AM, Andy Lutomirski wrote:
> > I'm not Thomas, but I think it's the wrong direction.  As it stands,
> > encrypt_mprotect() is an incomplete version of mprotect() (since it's
> > missing the protection key support),
>
> I thought about this when I added mprotect_pkey().  We start with:
>
>         mprotect(addr, len, prot);
>
> then
>
>         mprotect_pkey(addr, len, prot);
>
> then
>
>         mprotect_pkey_encrypt(addr, len, prot, key);
>
> That doesn't scale because we eventually have
> mprotect_and_a_history_of_mm_features(). :)
>
> What I was hoping to see was them do this (apologies for the horrible
> indentation:
>
>         ptr = mmap(..., PROT_NONE);
>         mprotect_pkey(   addr, len, PROT_NONE, pkey);
>         mprotect_encrypt(addr, len, PROT_NONE, keyid);
>         mprotect(        addr, len, real_prot);
>
> The point is that you *can* stack these things and don't have to have an
> mprotect_kitchen_sink() if you use PROT_NONE for intermediate
> permissions during setup.

Sure, but then why call it mprotect at all?  How about:

mmap(..., PROT_NONE);
mencrypt(..., keyid);
mprotect_pkey(...);

But wouldn't this be much nicer:

int fd = memfd_create(...);
memfd_set_tme_key(fd, keyid);  /* fails if len != 0 */
mmap(fd, ...);

>
> > and it's also functionally just MADV_DONTNEED.  In other words, the
> > sole user-visible effect appears to be that the existing pages are
> > blown away.  The fact that it changes the key in use doesn't seem
> > terribly useful, since it's anonymous memory,
>
> It's functionally MADV_DONTNEED, plus a future promise that your writes
> will never show up as plaintext on the DIMM.

But that's mostly vacuous.  If I read the docs right, MKTME systems
also support TME, so you *already* have that promise, unless the
firmware totally blew it.  If we want a boot option to have the kernel
use MKTME to forcibly encrypt everything regardless of what the TME
MSRs say, I'd be entirely on board.  Heck, the implementation would be
quite simple because we mostly reuse the SME code.

>
> We also haven't settled on the file-backed properties.  For file-backed,
> my hope was that you could do:
>
>         ptr = mmap(fd, size, prot);
>         printf("ciphertext: %x\n", *ptr);
>         mprotect_encrypt(ptr, len, prot, keyid);
>         printf("plaintext: %x\n", *ptr);

Why would you ever want the plaintext?  Also, how does this work on a
normal fs, where relocation of the file would cause the ciphertext to
get lost?  It really seems to be that it should look more like
dm-crypt where you encrypt a filesystem.  Maybe you'd just configure
the pmem device to be encrypted before you mount it, or you'd get a
new pmem-mktme device node instead.  This would also avoid some nasty
multiple-copies-of-the-direct-map issue, since you'd only ever have
one of them mapped.

>
> > The main implementation concern I have with this patch set is cache
> > coherency and handling of the direct map.  Unless I missed something,
> > you're not doing anything about the direct map, which means that you
> > have RW aliases of the same memory with different keys.  For use case
> > #2, this probably means that you need to either get rid of the direct
> > map and make get_user_pages() fail, or you need to change the key on
> > the direct map as well, probably using the pageattr.c code.
>
> The current, public hardware spec has a description of what's required
> to maintain cache coherency.  Basically, you can keep as many mappings
> of a physical page as you want, but only write to one mapping at a time,
> and clflush the old one when you want to write to a new one.

Surely you at least have to clflush the old mapping and then the new
mapping, since the new mapping could have been speculatively read.

> > Finally, If you're going to teach the kernel how to have some user
> > pages that aren't in the direct map, you've essentially done XPO,
> > which is nifty but expensive.  And I think that doing this gets you
> > essentially all the benefit of MKTME for the non-pmem use case.  Why
> > exactly would any software want to use anything other than a
> > CPU-managed key for anything other than pmem?
>
> It is handy, for one, to let you "cluster" key usage.  If you have 5
> Pepsi VMs and 5 Coke VMs, each Pepsi one using the same key and each
> Coke one using the same key, you can boil it down to only 2 hardware
> keyid slots that get used, and do this transparently.

I understand this from a marketing perspective but not a security
perspective.  Say I'm Coke and you've sold me some VMs that are
"encrypted with a Coke-specific key and no other VMs get to use that
key."  I can't think of *any* not-exceedingly-contrived attack in
which this makes the slightest difference.  If Pepsi tries to attack
Coke without MKTME, then they'll either need to get the hypervisor to
leak Coke's data through the direct map or they'll have to find some
way to corrupt a page table or use something like L1TF to read from a
physical address Coke owns.  With MKTME, if they can read through the
host direct map, then they'll get Coke's cleartext, and if they can
corrupt a page table or use L1TF to read from your memory, they'll get
Coke's cleartext.

TME itself provides a ton of protection -- you can't just barge into
the datacenter, refrigerate the DIMMs, walk away with them, and read
off everyone's data.

Am I missing something?

>
> But, I think what you're implying is that the security properties of
> user-supplied keys can only be *worse* than using CPU-generated keys
> (assuming the CPU does a good job generating it).  So, why bother
> allowing user-specified keys in the first place?

That too :)
Dan Williams Dec. 6, 2018, 1:25 a.m. UTC | #11
[ only responding to the pmem side of things... ]

On Wed, Dec 5, 2018 at 5:09 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Dec 5, 2018 at 3:49 PM Dave Hansen <dave.hansen@intel.com> wrote:
[..]
> > We also haven't settled on the file-backed properties.  For file-backed,
> > my hope was that you could do:
> >
> >         ptr = mmap(fd, size, prot);
> >         printf("ciphertext: %x\n", *ptr);
> >         mprotect_encrypt(ptr, len, prot, keyid);
> >         printf("plaintext: %x\n", *ptr);
>
> Why would you ever want the plaintext?  Also, how does this work on a
> normal fs, where relocation of the file would cause the ciphertext to
> get lost?  It really seems to be that it should look more like
> dm-crypt where you encrypt a filesystem.  Maybe you'd just configure
> the pmem device to be encrypted before you mount it, or you'd get a
> new pmem-mktme device node instead.  This would also avoid some nasty
> multiple-copies-of-the-direct-map issue, since you'd only ever have
> one of them mapped.

Yes, this is really the only way it can work. Otherwise you need to
teach the filesystem that "these blocks can't move without the key
because encryption", and have an fs-feature flag to say "you can't
mount this legacy / encryption unaware filesystem from an older kernel
because we're not sure you'll move something and break the
encryption".

So pmem namespaces (volumes) would be encrypted providing something
similar to dm-crypt, although we're looking at following the lead of
the fscrypt key management scheme.
Kirill A . Shutemov Dec. 6, 2018, 11:22 a.m. UTC | #12
On Wed, Dec 05, 2018 at 08:32:52PM +0000, Sakkinen, Jarkko wrote:
> On Tue, 2018-12-04 at 12:46 +0300, Kirill A. Shutemov wrote:
> > On Tue, Dec 04, 2018 at 09:25:50AM +0000, Peter Zijlstra wrote:
> > > On Mon, Dec 03, 2018 at 11:39:47PM -0800, Alison Schofield wrote:
> > > > (Multi-Key Total Memory Encryption)
> > > 
> > > I think that MKTME is a horrible name, and doesn't appear to accurately
> > > describe what it does either. Specifically the 'total' seems out of
> > > place, it doesn't require all memory to be encrypted.
> > 
> > MKTME implies TME. TME is enabled by BIOS and it encrypts all memory with
> > CPU-generated key. MKTME allows to use other keys or disable encryption
> > for a page.
> 
> When you say "disable encryption to a page" does the encryption get
> actually disabled or does the CPU just decrypt it transparently i.e.
> what happens physically?

Yes, it gets disabled. Physically. It overrides TME encryption.
Dave Hansen Dec. 6, 2018, 2:59 p.m. UTC | #13
On 12/6/18 3:22 AM, Kirill A. Shutemov wrote:
>> When you say "disable encryption to a page" does the encryption get
>> actually disabled or does the CPU just decrypt it transparently i.e.
>> what happens physically?
> Yes, it gets disabled. Physically. It overrides TME encryption.

I know MKTME itself has a runtime overhead and we expect it to have a
performance impact in the low single digits.  Does TME have that
overhead?  Presumably MKTME plus no-encryption is not expected to have
the overhead.

We should probably mention that in the changelogs too.
Dave Hansen Dec. 6, 2018, 3:39 p.m. UTC | #14
On 12/5/18 5:09 PM, Andy Lutomirski wrote:
> On Wed, Dec 5, 2018 at 3:49 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> What I was hoping to see was them do this (apologies for the horrible
>> indentation:
>>
>>         ptr = mmap(..., PROT_NONE);
>>         mprotect_pkey(   addr, len, PROT_NONE, pkey);
>>         mprotect_encrypt(addr, len, PROT_NONE, keyid);
>>         mprotect(        addr, len, real_prot);
>>
>> The point is that you *can* stack these things and don't have to have an
>> mprotect_kitchen_sink() if you use PROT_NONE for intermediate
>> permissions during setup.
> 
> Sure, but then why call it mprotect at all?  How about:
> 
> mmap(..., PROT_NONE);
> mencrypt(..., keyid);
> mprotect_pkey(...);

That would totally work too.  I just like the idea of a family of
mprotect() syscalls that do mprotect() plus one other thing.  What
you're proposing is totally equivalent where we have mprotect_pkey()
always being the *last* thing that gets called, plus a family of things
that we expect to get called on something that's probably PROT_NONE.

> But wouldn't this be much nicer:
> 
> int fd = memfd_create(...);
> memfd_set_tme_key(fd, keyid);  /* fails if len != 0 */
> mmap(fd, ...);

No. :)

One really big advantage with protection keys, or this implementation, is
that you don't have to implement an allocator.  You can use it with any
old malloc() as long as you own a whole page.

The pages also fundamentally *stay* anonymous in the VM and get all the
goodness that comes with that, like THP.

>>> and it's also functionally just MADV_DONTNEED.  In other words, the
>>> sole user-visible effect appears to be that the existing pages are
>>> blown away.  The fact that it changes the key in use doesn't seem
>>> terribly useful, since it's anonymous memory,
>>
>> It's functionally MADV_DONTNEED, plus a future promise that your writes
>> will never show up as plaintext on the DIMM.
> 
> But that's mostly vacuous.  If I read the docs right, MKTME systems
> also support TME, so you *already* have that promise, unless the
> firmware totally blew it.  If we want a boot option to have the kernel
> use MKTME to forcibly encrypt everything regardless of what the TME
> MSRs say, I'd be entirely on board.  Heck, the implementation would be
> quite simple because we mostly reuse the SME code.

Yeah, that's true.  I seem to always forget about the TME case! :)

"It's functionally MADV_DONTNEED, plus a future promise that your writes
will never be written to the DIMM with the TME key."

But, this gets us back to your very good question about what good this
does in the end.  What value does _that_ scheme provide over TME?  We're
admittedly weak on specific examples there, but I'm working on it.

>>> the direct map as well, probably using the pageattr.c code.
>>
>> The current, public hardware spec has a description of what's required
>> to maintain cache coherency.  Basically, you can keep as many mappings
>> of a physical page as you want, but only write to one mapping at a time,
>> and clflush the old one when you want to write to a new one.
> 
> Surely you at least have to clflush the old mapping and then the new
> mapping, since the new mapping could have been speculatively read.

Nope.  The coherency is "fine" unless you have writeback of an older
cacheline that blows away newer data.  CPUs that support MKTME are
guaranteed to never do writeback of the lines that could be established
speculatively or from prefetching.

>>> Finally, If you're going to teach the kernel how to have some user
>>> pages that aren't in the direct map, you've essentially done XPO,
>>> which is nifty but expensive.  And I think that doing this gets you
>>> essentially all the benefit of MKTME for the non-pmem use case.  Why
>>> exactly would any software want to use anything other than a
>>> CPU-managed key for anything other than pmem?
>>
>> It is handy, for one, to let you "cluster" key usage.  If you have 5
>> Pepsi VMs and 5 Coke VMs, each Pepsi one using the same key and each
>> Coke one using the same key, you can boil it down to only 2 hardware
>> keyid slots that get used, and do this transparently.
> 
> I understand this from a marketing perspective but not a security
> perspective.  Say I'm Coke and you've sold me some VMs that are
> "encrypted with a Coke-specific key and no other VMs get to use that
> key."  I can't think of *any* not-exceedingly-contrived attack in
> which this makes the slightest difference.  If Pepsi tries to attack
> Coke without MKTME, then they'll either need to get the hypervisor to
> leak Coke's data through the direct map or they'll have to find some
> way to corrupt a page table or use something like L1TF to read from a
> physical address Coke owns.  With MKTME, if they can read through the
> host direct map, then they'll get Coke's cleartext, and if they can
> corrupt a page table or use L1TF to read from your memory, they'll get
> Coke's cleartext.

The design definitely has the hypervisor in the trust boundary.  If the
hypervisor is evil, or if someone evil compromises the hypervisor, MKTME
obviously provides less protection.

I guess the question ends up being if this makes its protections weak
enough that we should not bother merging it in its current form.

I still have the homework assignment to go figure out why folks want the
protections as they stand.
Andy Lutomirski Dec. 6, 2018, 7:10 p.m. UTC | #15
> On Dec 6, 2018, at 7:39 AM, Dave Hansen <dave.hansen@intel.com> wrote:

>>>> the direct map as well, probably using the pageattr.c code.
>>>
>>> The current, public hardware spec has a description of what's required
>>> to maintain cache coherency.  Basically, you can keep as many mappings
>>> of a physical page as you want, but only write to one mapping at a time,
>>> and clflush the old one when you want to write to a new one.
>>
>> Surely you at least have to clflush the old mapping and then the new
>> mapping, since the new mapping could have been speculatively read.
>
> Nope.  The coherency is "fine" unless you have writeback of an older
> cacheline that blows away newer data.  CPUs that support MKTME are
> guaranteed to never do writeback of the lines that could be established
> speculatively or from prefetching.

How is that sufficient?  Suppose I have some physical page mapped with
keys 1 and 2. #1 is logically live and I write to it.  Then I prefetch
or otherwise populate mapping 2 into the cache (in the S state,
presumably).  Now I clflush mapping 1 and read 2.  It contains garbage
in the cache, but the garbage in the cache is inconsistent with the
garbage in memory.  This can’t be a good thing, even if no writeback
occurs.

I suppose the right fix is to clflush the old mapping and then to zero
the new mapping.

>
>>>> Finally, If you're going to teach the kernel how to have some user
>>>> pages that aren't in the direct map, you've essentially done XPO,
>>>> which is nifty but expensive.  And I think that doing this gets you
>>>> essentially all the benefit of MKTME for the non-pmem use case.  Why
>>>> exactly would any software want to use anything other than a
>>>> CPU-managed key for anything other than pmem?
>>>
>>> It is handy, for one, to let you "cluster" key usage.  If you have 5
>>> Pepsi VMs and 5 Coke VMs, each Pepsi one using the same key and each
>>> Coke one using the same key, you can boil it down to only 2 hardware
>>> keyid slots that get used, and do this transparently.
>>
>> I understand this from a marketing perspective but not a security
>> perspective.  Say I'm Coke and you've sold me some VMs that are
>> "encrypted with a Coke-specific key and no other VMs get to use that
>> key."  I can't think of *any* not-exceedingly-contrived attack in
>> which this makes the slightest difference.  If Pepsi tries to attack
>> Coke without MKTME, then they'll either need to get the hypervisor to
>> leak Coke's data through the direct map or they'll have to find some
>> way to corrupt a page table or use something like L1TF to read from a
>> physical address Coke owns.  With MKTME, if they can read through the
>> host direct map, then they'll get Coke's cleartext, and if they can
>> corrupt a page table or use L1TF to read from your memory, they'll get
>> Coke's cleartext.
>
> The design definitely has the hypervisor in the trust boundary.  If the
> hypervisor is evil, or if someone evil compromises the hypervisor, MKTME
> obviously provides less protection.
>
> I guess the question ends up being if this makes its protections weak
> enough that we should not bother merging it in its current form.

Indeed, but I’d ask another question too: I expect that MKTME is weak
enough that it will be improved, and without seeing the improvement,
it seems quite plausible that using the improvement will require
radically reworking the kernel implementation.

As a straw man, suppose we get a way to say “this key may only be
accessed through such-and-such VPID or by using a special new
restricted facility for the hypervisor to request access”.  Now we
have some degree of serious protection, but it doesn’t work, by
design, for anonymous memory.  Similarly, something that looks more
like AMD's SEV would be very very awkward to support with anything
like the current API proposal.

>
> I still have the homework assignment to go figure out why folks want the
> protections as they stand.
Dave Hansen Dec. 6, 2018, 7:31 p.m. UTC | #16
On 12/6/18 11:10 AM, Andy Lutomirski wrote:
>> On Dec 6, 2018, at 7:39 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>The coherency is "fine" unless you have writeback of an older
>> cacheline that blows away newer data.  CPUs that support MKTME are
>> guaranteed to never do writeback of the lines that could be established
>> speculatively or from prefetching.
> 
> How is that sufficient?  Suppose I have some physical page mapped with
> keys 1 and 2. #1 is logically live and I write to it.  Then I prefetch
> or otherwise populate mapping 2 into the cache (in the S state,
> presumably).  Now I clflush mapping 1 and read 2.  It contains garbage
> in the cache, but the garbage in the cache is inconsistent with the
> garbage in memory.  This can’t be a good thing, even if no writeback
> occurs.
> 
> I suppose the right fix is to clflush the old mapping and then to zero
> the new mapping.

Yep.  Practically, you need to write to the new mapping to give it any
meaning.  Those writes effectively blow away any previously cached,
garbage contents.

I think you're right, though, that the cached data might not be
_consistent_ with what is in memory.  It feels really dirty, but I can't
think of any problems that it actually causes.
Jarkko Sakkinen Dec. 6, 2018, 9:23 p.m. UTC | #17
On Thu, 2018-12-06 at 14:22 +0300, Kirill A. Shutemov wrote:
> > When you say "disable encryption to a page" does the encryption get
> > actually disabled or does the CPU just decrypt it transparently i.e.
> > what happens physically?
> 
> Yes, it gets disabled. Physically. It overrides TME encryption.

OK, thanks for confirmation. BTW, how much is the penalty to keep it
always enabled? Is it something that would not make sense for some
other reasons?

/Jarkko
Huang, Kai Dec. 7, 2018, 1:55 a.m. UTC | #18
> 
> TME itself provides a ton of protection -- you can't just barge into
> the datacenter, refrigerate the DIMMs, walk away with them, and read
> off everyone's data.
> 
> Am I missing something?

I think we can make such an assumption in most cases, but I think it's better that we don't make any
assumption at all. For example, the admin of a data center (or anyone) who has physical access to the
servers may do something malicious. I am not an expert, but there should be other physical attack
methods besides the cold-boot attack, if a malicious employee can get physical access to a server
without being detected.

> 
> > 
> > But, I think what you're implying is that the security properties of
> > user-supplied keys can only be *worse* than using CPU-generated keys
> > (assuming the CPU does a good job generating it).  So, why bother
> > allowing user-specified keys in the first place?
> 
> That too :)

I think one use of a user-specified key is for NVDIMM, since the CPU key will be gone after a machine
reboot; therefore, if the NVDIMM is encrypted with the CPU key we are not able to retrieve the data
after a shutdown/reboot, etc.

There are some other use cases that already require the tenant to send a key to the CSP. For example,
the VM image can be provided by the tenant and encrypted with the tenant's own key, and the tenant
needs to send that key to the CSP when asking the CSP to run the encrypted image. But the tenant will
need to trust the CSP in such a case, which raises the question of why the tenant wants to use his own
image in the first place (I have to say I myself am not convinced of the value of such a use case). I
think there are two levels of trust involved here: 1) the tenant needs to trust the CSP anyway; 2) but
the CSP needs to convince the tenant that the CSP can be trusted, i.e., by proving it can prevent a
potential attack from a malicious employee (e.g., by raising the bar with MKTME), etc.

Thanks,
-Kai
Huang, Kai Dec. 7, 2018, 2:05 a.m. UTC | #19
On Wed, 2018-12-05 at 22:19 +0000, Sakkinen, Jarkko wrote:
> On Tue, 2018-12-04 at 11:19 -0800, Andy Lutomirski wrote:
> > I'm not Thomas, but I think it's the wrong direction.  As it stands,
> > encrypt_mprotect() is an incomplete version of mprotect() (since it's
> > missing the protection key support), and it's also functionally just
> > MADV_DONTNEED.  In other words, the sole user-visible effect appears
> > to be that the existing pages are blown away.  The fact that it
> > changes the key in use doesn't seem terribly useful, since it's
> > anonymous memory, and the most secure choice is to use CPU-managed
> > keying, which appears to be the default anyway on TME systems.  It
> > also has totally unclear semantics WRT swap, and, off the top of my
> > head, it looks like it may have serious cache-coherency issues and
> > like swapping the pages might corrupt them, both because there are no
> > flushes and because the direct-map alias looks like it will use the
> > default key and therefore appear to contain the wrong data.
> > 
> > I would propose a very different direction: don't try to support MKTME
> > at all for anonymous memory, and instead figure out the important use
> > cases and support them directly.  The use cases that I can think of
> > off the top of my head are:
> > 
> > 1. pmem.  This should probably use a very different API.
> > 
> > 2. Some kind of VM hardening, where a VM's memory can be protected a
> > little tiny bit from the main kernel.  But I don't see why this is any
> > better than XPO (eXclusive Page-frame Ownership), which brings to
> > mind:
> 
> What is the threat model anyway for AMD and Intel technologies?
> 
> For me it looks like that you can read, write and even replay 
> encrypted pages both in SME and TME. 

Right. Neither of them (including MKTME) prevents replay attacks. But in my understanding SEV doesn't
prevent replay attacks either since it doesn't have integrity protection.

Thanks,
-Kai
Dave Hansen Dec. 7, 2018, 4:23 a.m. UTC | #20
On 12/6/18 5:55 PM, Huang, Kai wrote:
> I think one usage of user-specified key is for NVDIMM, since CPU key
> will be gone after machine reboot, therefore if NVDIMM is encrypted
> by CPU key we are not able to retrieve it once shutdown/reboot, etc.

I think we all agree that the NVDIMM uses are really useful.

But, these patches don't implement that.  So, if NVDIMMs are the only
reasonable use case, we shouldn't merge these patches until we add
NVDIMM support.
Jarkko Sakkinen Dec. 7, 2018, 6:48 a.m. UTC | #21
On Thu, Dec 06, 2018 at 06:05:50PM -0800, Huang, Kai wrote:
> On Wed, 2018-12-05 at 22:19 +0000, Sakkinen, Jarkko wrote:
> > On Tue, 2018-12-04 at 11:19 -0800, Andy Lutomirski wrote:
> > > I'm not Thomas, but I think it's the wrong direction.  As it stands,
> > > encrypt_mprotect() is an incomplete version of mprotect() (since it's
> > > missing the protection key support), and it's also functionally just
> > > MADV_DONTNEED.  In other words, the sole user-visible effect appears
> > > to be that the existing pages are blown away.  The fact that it
> > > changes the key in use doesn't seem terribly useful, since it's
> > > anonymous memory, and the most secure choice is to use CPU-managed
> > > keying, which appears to be the default anyway on TME systems.  It
> > > also has totally unclear semantics WRT swap, and, off the top of my
> > > head, it looks like it may have serious cache-coherency issues and
> > > like swapping the pages might corrupt them, both because there are no
> > > flushes and because the direct-map alias looks like it will use the
> > > default key and therefore appear to contain the wrong data.
> > > 
> > > I would propose a very different direction: don't try to support MKTME
> > > at all for anonymous memory, and instead figure out the important use
> > > cases and support them directly.  The use cases that I can think of
> > > off the top of my head are:
> > > 
> > > 1. pmem.  This should probably use a very different API.
> > > 
> > > 2. Some kind of VM hardening, where a VM's memory can be protected a
> > > little tiny bit from the main kernel.  But I don't see why this is any
> > > better than XPO (eXclusive Page-frame Ownership), which brings to
> > > mind:
> > 
> > What is the threat model anyway for AMD and Intel technologies?
> > 
> > For me it looks like that you can read, write and even replay 
> > encrypted pages both in SME and TME. 
> 
> Right. Neither of them (including MKTME) prevents replay attack. But
> in my understanding SEV doesn't prevent replay attack either since it
> doesn't have integrity protection.

Yep, it doesn't :-) That's why, after seeing presentations concerning
SME and SEV, I've been wondering what they are good for.

Cold boot attacks are definitely at least something where these
techs can help...

/Jarkko
Huang, Kai Dec. 7, 2018, 10:12 a.m. UTC | #22
On Thu, 2018-12-06 at 06:59 -0800, Dave Hansen wrote:
> On 12/6/18 3:22 AM, Kirill A. Shutemov wrote:
> > > When you say "disable encryption to a page" does the encryption get
> > > actually disabled or does the CPU just decrypt it transparently i.e.
> > > what happens physically?
> > 
> > Yes, it gets disabled. Physically. It overrides TME encryption.
> 
> I know MKTME itself has a runtime overhead and we expect it to have a
> performance impact in the low single digits.  Does TME have that
> overhead?  Presumably MKTME plus no-encryption is not expected to have
> the overhead.
> 
> We should probably mention that in the changelogs too.
> 

I believe in terms of hardware crypto overhead MKTME and TME should be the same (except the MKTME no-
encrypt case?). But MKTME might have additional overhead from the software implementation in the kernel?

Thanks,
-Kai
Kirill A . Shutemov Dec. 7, 2018, 11:54 a.m. UTC | #23
On Thu, Dec 06, 2018 at 09:23:20PM +0000, Sakkinen, Jarkko wrote:
> On Thu, 2018-12-06 at 14:22 +0300, Kirill A. Shutemov wrote:
> > > When you say "disable encryption to a page" does the encryption get
> > > actually disabled or does the CPU just decrypt it transparently i.e.
> > > what happens physically?
> > 
> > Yes, it gets disabled. Physically. It overrides TME encryption.
> 
> OK, thanks for confirmation. BTW, how much is the penalty to keep it
> always enabled? Is it something that would not make sense for some
> other reasons?

We don't have any numbers to share at this point.
Kirill A . Shutemov Dec. 7, 2018, 11:57 a.m. UTC | #24
On Wed, Dec 05, 2018 at 10:19:20PM +0000, Sakkinen, Jarkko wrote:
> On Tue, 2018-12-04 at 11:19 -0800, Andy Lutomirski wrote:
> > I'm not Thomas, but I think it's the wrong direction.  As it stands,
> > encrypt_mprotect() is an incomplete version of mprotect() (since it's
> > missing the protection key support), and it's also functionally just
> > MADV_DONTNEED.  In other words, the sole user-visible effect appears
> > to be that the existing pages are blown away.  The fact that it
> > changes the key in use doesn't seem terribly useful, since it's
> > anonymous memory, and the most secure choice is to use CPU-managed
> > keying, which appears to be the default anyway on TME systems.  It
> > also has totally unclear semantics WRT swap, and, off the top of my
> > head, it looks like it may have serious cache-coherency issues and
> > like swapping the pages might corrupt them, both because there are no
> > flushes and because the direct-map alias looks like it will use the
> > default key and therefore appear to contain the wrong data.
> > 
> > I would propose a very different direction: don't try to support MKTME
> > at all for anonymous memory, and instead figure out the important use
> > cases and support them directly.  The use cases that I can think of
> > off the top of my head are:
> > 
> > 1. pmem.  This should probably use a very different API.
> > 
> > 2. Some kind of VM hardening, where a VM's memory can be protected a
> > little tiny bit from the main kernel.  But I don't see why this is any
> > better than XPO (eXclusive Page-frame Ownership), which brings to
> > mind:
> 
> What is the threat model anyway for AMD and Intel technologies?
> 
> For me it looks like that you can read, write and even replay 
> encrypted pages both in SME and TME. 

What replay attack are you talking about? MKTME uses AES-XTS with physical
address tweak. So the data is tied to the place in physical address space
and replacing one encrypted page with another encrypted page from
different address will produce garbage on decryption.
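
Roughly, as a sketch of the property being described (not the exact
hardware algorithm):

	E  = AES-XTS-encrypt(K, tweak = PA1, P)
	P' = AES-XTS-decrypt(K, tweak = PA2, E)    /* PA2 != PA1 */

so P' is garbage: an encrypted page copied to a different physical
address does not decrypt back to the original plaintext.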
Jarkko Sakkinen Dec. 7, 2018, 9:59 p.m. UTC | #25
On Fri, 2018-12-07 at 14:57 +0300, Kirill A. Shutemov wrote:
> > What is the threat model anyway for AMD and Intel technologies?
> > 
> > For me it looks like that you can read, write and even replay 
> > encrypted pages both in SME and TME. 
> 
> What replay attack are you talking about? MKTME uses AES-XTS with physical
> address tweak. So the data is tied to the place in physical address space and
> replacing one encrypted page with another encrypted page from different
> address will produce garbage on decryption.

Just trying to understand how this works.

So you use physical address like a nonce/version for the page and
thus prevent replay? Was not aware of this.

/Jarkko
Eric Rannaud Dec. 7, 2018, 11:35 p.m. UTC | #26
On Fri, Dec 7, 2018 at 3:57 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > What is the threat model anyway for AMD and Intel technologies?
> >
> > For me it looks like that you can read, write and even replay
> > encrypted pages both in SME and TME.
>
> What replay attack are you talking about? MKTME uses AES-XTS with physical
> address tweak. So the data is tied to the place in physical address space
> and replacing one encrypted page with another encrypted page from
> different address will produce garbage on decryption.

What if you have some control over the physical addresses you write
the stolen encrypted page to? For instance, VM_Eve might manage to use
physical address space previously used by VM_Alice by getting the
hypervisor to move memory around (memory pressure, force other VMs out
via some type of DOS attack, etc.).

Say:
    C is VM_Alice's clear text at hwaddr
    E = mktme_encrypt(VM_Alice_key, hwaddr, C)
    Eve somehow stole the encrypted bits E

Eve would need to write the page E without further encryption, to make
sure that the DRAM contains the original stolen bits E rather than bits
encrypted again with VM_Eve's key; otherwise mktme_encrypt(VM_Eve_key,
hwaddr, E) would be present in the DRAM, which is not helpful. But with
MKTME under the current proposal VM_Eve can disable encryption for a
given mapping, right? (See also Note 1)

Eve gets the HV to move VM_Alice back over the same physical address,
Eve "somehow" gets VM_Alice to read that page and use its content
(which would likely be a use of uninitialized memory bug, from
VM_Alice's perspective) and you have a replay attack?

For TME, this doesn't work as you cannot partially disable encryption,
so if Eve tries to write the stolen encrypted bits E, even in the
"right place", they get encrypted again to tme_encrypt(hwaddr, E).
Upon decryption, VM_Alice will get E, not C.

Note 1: Actually, even if MKTME did not let you disable encryption,
*if* Eve knows her own key, Eve can always write a preimage P that the
CPU encrypts to E for VM_Alice to read back and decrypt:
    P = mktme_decrypt(VM_Eve_key, hwaddr, E)

This is not possible with TME as Eve doesn't know the key used by the
CPU and cannot compute P.
Jarkko Sakkinen Dec. 7, 2018, 11:45 p.m. UTC | #27
On Fri, 2018-12-07 at 13:59 -0800, Jarkko Sakkinen wrote:
> On Fri, 2018-12-07 at 14:57 +0300, Kirill A. Shutemov wrote:
> > > What is the threat model anyway for AMD and Intel technologies?
> > > 
> > > For me it looks like that you can read, write and even replay 
> > > encrypted pages both in SME and TME. 
> > 
> > What replay attack are you talking about? MKTME uses AES-XTS with physical
> > address tweak. So the data is tied to the place in physical address space
> > and
> > replacing one encrypted page with another encrypted page from different
> > address will produce garbage on decryption.
> 
> Just trying to understand how this works.
> 
> So you use physical address like a nonce/version for the page and
> thus prevent replay? Was not aware of this.

The brutal fact is that a physical address is an astronomical stretch
from a random value or increasing counter. Thus, it is fair to say that
MKTME provides only naive measures against replay attacks...

/Jarkko
Andy Lutomirski Dec. 7, 2018, 11:48 p.m. UTC | #28
On Fri, Dec 7, 2018 at 3:45 PM Sakkinen, Jarkko
<jarkko.sakkinen@intel.com> wrote:
>
> On Fri, 2018-12-07 at 13:59 -0800, Jarkko Sakkinen wrote:
> > On Fri, 2018-12-07 at 14:57 +0300, Kirill A. Shutemov wrote:
> > > > What is the threat model anyway for AMD and Intel technologies?
> > > >
> > > > For me it looks like that you can read, write and even replay
> > > > encrypted pages both in SME and TME.
> > >
> > > What replay attack are you talking about? MKTME uses AES-XTS with physical
> > > address tweak. So the data is tied to the place in physical address space
> > > and
> > > replacing one encrypted page with another encrypted page from different
> > > address will produce garbage on decryption.
> >
> > Just trying to understand how this works.
> >
> > So you use physical address like a nonce/version for the page and
> > thus prevent replay? Was not aware of this.
>
> The brutal fact is that a physical address is an astronomical stretch
> from a random value or increasing counter. Thus, it is fair to say that
> MKTME provides only naive measures against replay attacks...
>

And this is potentially a big deal, since there are much simpler
replay attacks that can compromise the system.  For example, if I can
replay the contents of a page table, I can write to freed memory.

--Andy
Andy Lutomirski Dec. 7, 2018, 11:53 p.m. UTC | #29
> On Dec 6, 2018, at 5:55 PM, Huang, Kai <kai.huang@intel.com> wrote:
>
>
>>
>> TME itself provides a ton of protection -- you can't just barge into
>> the datacenter, refrigerate the DIMMs, walk away with them, and read
>> off everyone's data.
>>
>> Am I missing something?
>
> I think we can make such an assumption in most cases, but it's better that we don't make any
> assumption at all. For example, the admin of the data center (or anyone else) who has physical
> access to the servers may do something malicious. I am not an expert, but there are probably other
> physical attack methods besides the cold boot attack, if a malicious employee can get physical
> access to a server w/o being detected.
>
>>
>>>
>>> But, I think what you're implying is that the security properties of
>>> user-supplied keys can only be *worse* than using CPU-generated keys
>>> (assuming the CPU does a good job generating it).  So, why bother
>>> allowing user-specified keys in the first place?
>>
>> That too :)
>
> I think one use of a user-specified key is for NVDIMM, since the CPU-generated key will be gone
> after a machine reboot; if the NVDIMM is encrypted with the CPU key we are not able to retrieve
> the data after a shutdown/reboot, etc.
>
> There are some other use cases that already require the tenant to send a key to the CSP. For
> example, the VM image can be provided by the tenant and encrypted with the tenant's own key, and
> the tenant needs to send the key to the CSP when asking the CSP to run that encrypted image.


I can imagine a few reasons why one would want to encrypt one’s image.
For example, the CSP could issue a public key and state, or even
attest, that the key is wrapped and locked to particular PCRs of their
TPM or otherwise protected by an enclave that verifies that the key is
only used to decrypt the image for the benefit of a hypervisor.

I don’t see what MKTME has to do with this.  The only remotely
plausible way I can see to use MKTME for this is to have the
hypervisor load a TPM (or other enclave) protected key into an MKTME
user key slot and to load customer-provided ciphertext into the
corresponding physical memory (using an MKTME no-encrypt slot).  But
this has three major problems.  First, it's effectively just a fancy
way to avoid one AES pass over the data.  Second, a sensible scheme for
this type of VM image protection would use *authenticated* encryption
or at least verify a signature, which MKTME can't do.  The third
problem is the real show-stopper, though: this scheme requires that
the ciphertext go into predetermined physical addresses, which would
be a giant mess.
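
For the sake of illustration, here is a minimal Python sketch (using the
'cryptography' package; purely illustrative, and not something MKTME can
express) of what authenticated encryption buys you here: tampering with
the ciphertext is detected at decrypt time instead of silently producing
garbage.

    # Authenticated encryption detects modified ciphertext; bare XTS does not.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.exceptions import InvalidTag

    key = AESGCM.generate_key(bit_length=128)
    nonce = os.urandom(12)
    image = b"tenant-provided VM image bytes..."

    ct = AESGCM(key).encrypt(nonce, image, None)   # no associated data

    # Flip one bit of the ciphertext, as an attacker might:
    tampered = bytearray(ct)
    tampered[0] ^= 0x01

    try:
        AESGCM(key).decrypt(nonce, bytes(tampered), None)
    except InvalidTag:
        print("tampering detected")                # AES-GCM refuses to decrypt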
Dave Hansen Dec. 8, 2018, 1:11 a.m. UTC | #30
On 12/7/18 3:53 PM, Andy Lutomirski wrote:
> The third problem is the real show-stopper, though: this scheme
> requires that the ciphertext go into predetermined physical
> addresses, which would be a giant mess.

There's a more fundamental problem than that.  The tweak fed into the
actual AES-XTS operation is determined by the firmware, programmed into
the memory controller, and is not visible to software.  So, not only
would you need to put stuff at a fixed physical address, the tweaks can
change from boot-to-boot, so whatever you did would only be good for one
boot.
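
A rough sketch of why that is fatal for such a scheme (a toy Python model
again; mixing a per-boot value into the tweak is an assumption made for
illustration, not the documented construction): ciphertext prepared
against one boot's tweak decrypts to garbage on the next boot.

    # Toy model: the effective tweak mixes a hidden per-boot value with the
    # physical address, so precomputed ciphertext only "works" for one boot.
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def xts_at(key, boot_secret, hwaddr, data, encrypt):
        tweak = bytes(a ^ b for a, b in
                      zip(boot_secret, hwaddr.to_bytes(16, 'little')))
        c = Cipher(algorithms.AES(key), modes.XTS(tweak))
        op = c.encryptor() if encrypt else c.decryptor()
        return op.update(data) + op.finalize()

    key = os.urandom(32)
    hwaddr = 0x7f000000
    boot1, boot2 = os.urandom(16), os.urandom(16)  # changes across reboots

    pt = b"ciphertext prepared offline....."
    ct = xts_at(key, boot1, hwaddr, pt, True)
    assert xts_at(key, boot2, hwaddr, ct, False) != pt   # next boot: garbage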
Huang, Kai Dec. 8, 2018, 1:33 a.m. UTC | #31
On Fri, 2018-12-07 at 23:45 +0000, Sakkinen, Jarkko wrote:
> On Fri, 2018-12-07 at 13:59 -0800, Jarkko Sakkinen wrote:
> > On Fri, 2018-12-07 at 14:57 +0300, Kirill A. Shutemov wrote:
> > > > What is the threat model anyway for AMD and Intel technologies?
> > > > 
> > > > For me it looks like that you can read, write and even replay 
> > > > encrypted pages both in SME and TME. 
> > > 
> > > What replay attack are you talking about? MKTME uses AES-XTS with physical
> > > address tweak. So the data is tied to the place in physical address space
> > > and
> > > replacing one encrypted page with another encrypted page from different
> > > address will produce garbage on decryption.
> > 
> > Just trying to understand how this works.
> > 
> > So you use physical address like a nonce/version for the page and
> > thus prevent replay? Was not aware of this.
> 
> The brutal fact is that a physical address is an astronomical stretch
> from a random value or increasing counter. Thus, it is fair to say that
> MKTME provides only naive measures against replay attacks...
> 
> /Jarkko

Currently there's no nonce to protect a cache line, so TME/MKTME is not able to prevent the replay
attack you mentioned. Currently MKTME only involves AES-XTS-128 encryption and nothing else. But
like I said, if I understand correctly even SEV doesn't have integrity protection, so it is not
able to prevent replay attacks either.

Thanks,
-Kai
Huang, Kai Dec. 8, 2018, 2:07 a.m. UTC | #32
> > There are some other use cases that already require the tenant to send a key to the CSP. For
> > example, the VM image can be provided by the tenant and encrypted with the tenant's own key,
> > and the tenant needs to send the key to the CSP when asking the CSP to run that encrypted image.
> 
> 
> I can imagine a few reasons why one would want to encrypt one’s image.
> For example, the CSP could issue a public key and state, or even
> attest, that the key is wrapped and locked to particular PCRs of their
> TPM or otherwise protected by an enclave that verifies that the key is
> only used to decrypt the image for the benefit of a hypervisor.

Right. I think before the tenant releases a key to the CSP, it should always use an attestation
authority to verify the trustworthiness of the compute node. I can understand that the key can be
wrapped by the TPM before being sent to the CSP, but I need to catch up on the enclave part.

The thing is, the compute node being trusted doesn't mean it cannot be attacked, and it doesn't
mean it can prevent, say, a malicious admin from getting the tenant key even by legitimate means.
There are many SW components involved here. Anyway, this is not related to MKTME itself, like you
mentioned below. The point is, since we already see that MKTME itself provides very weak security
protection, we need to see whether MKTME has value from the whole use case's point of view
(including all the things you mentioned above) -- we define the whole use case, we clearly state
who/what should be in the trust boundary, what we can prevent, etc.

> 
> I don’t see what MKTME has to do with this. The only remotely
> plausible way I can see to use MKTME for this is to have the
> hypervisor load a TPM (or other enclave) protected key into an MKTME
> user key slot and to load customer-provided ciphertext into the
> corresponding physical memory (using an MKTME no-encrypt slot).  But
> this has three major problems.  First, it's effectively just a fancy
> way to avoid one AES pass over the data.  Second, a sensible scheme for
> this type of VM image protection would use *authenticated* encryption
> or at least verify a signature, which MKTME can't do.  The third
> problem is the real show-stopper, though: this scheme requires that
> the ciphertext go into predetermined physical addresses, which would
> be a giant mess.

My intention was to say that if we are already sending a key to the CSP, then we may prefer to use
the key for MKTME VM runtime protection as well. But like you said, we may not have a real security
gain here compared to TME, so I agree we need to find a specific use case to prove that.

Thanks,
-Kai
Jarkko Sakkinen Dec. 8, 2018, 3:53 a.m. UTC | #33
On Sat, 2018-12-08 at 09:33 +0800, Huang, Kai wrote:
> Currently there's no nonce to protect a cache line, so TME/MKTME is not able
> to prevent the replay attack you mentioned. Currently MKTME only involves
> AES-XTS-128 encryption and nothing else. But like I said, if I understand
> correctly even SEV doesn't have integrity protection, so it is not able to
> prevent replay attacks either.

You're absolutely correct.

There's also a good paper on SEV subversion:

https://arxiv.org/pdf/1805.09604.pdf

I don't think this makes MKTME or SEV useless, but yeah, it is a
constraint that needs to be taken into consideration when finding the
best way to use these technologies in Linux.

/Jarkko
Jarkko Sakkinen Dec. 12, 2018, 3:31 p.m. UTC | #34
On Fri, 2018-12-07 at 15:45 -0800, Jarkko Sakkinen wrote:
> The brutal fact is that a physical address is an astronomical stretch
> from a random value or increasing counter. Thus, it is fair to say that
> MKTME provides only naive measures against replay attacks...

I'll try to summarize how I understand the high-level security
model of MKTME (it would be a good idea to document it).

Assumptions:

1. The hypervisor has not been infiltrated.
2. The hypervisor does not leak secrets.

When (1) and (2) hold [1], we harden VMs in two different ways:

A. VMs cannot leak data to each other or can they with L1TF when HT
   is enabled?
B. Protects against cold boot attacks.

Isn't this what this about in the nutshell roughly?

[1] XPFO could potentially be an opt-in feature that reduces the
    damage when either of these assumptions has been broken.

/Jarkko
Andy Lutomirski Dec. 12, 2018, 4:29 p.m. UTC | #35
On Wed, Dec 12, 2018 at 7:31 AM Sakkinen, Jarkko
<jarkko.sakkinen@intel.com> wrote:
>
> On Fri, 2018-12-07 at 15:45 -0800, Jarkko Sakkinen wrote:
> > The brutal fact is that a physical address is an astronomical stretch
> > from a random value or increasing counter. Thus, it is fair to say that
> > MKTME provides only naive measures against replay attacks...
>
> I'll try to summarize how I understand the high-level security
> model of MKTME (it would be a good idea to document it).
>
> Assumptions:
>
> 1. The hypervisor has not been infiltrated.
> 2. The hypervisor does not leak secrets.
>
> When (1) and (2) hold [1], we harden VMs in two different ways:
>
> A. VMs cannot leak data to each other or can they with L1TF when HT
>    is enabled?

I strongly suspect that, on L1TF-vulnerable CPUs, MKTME provides no
protection whatsoever.  It sounds like MKTME is implemented in the
memory controller -- as far as the rest of the CPU and the cache
hierarchy are concerned, the MKTME key selection bits are just part of
the physical address.  So an attack like L1TF that leaks a cacheline
that's selected by physical address will leak the cleartext if the key
selection bits are set correctly.

(I suppose that, if the attacker needs to brute-force the physical
address, then MKTME makes it a bit harder because the effective
physical address space is larger.)
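
Roughly, the addressing model being described looks like the sketch
below (toy Python; the actual number of KeyID bits and their position
are enumerated and configured at boot, so the values here are made up
for illustration):

    # Toy model: KeyID bits live above the "real" physical address bits, so
    # to the cache hierarchy an encrypted page is just a page at a different
    # physical address.  Bit positions below are assumptions, not real ones.
    PHYS_BITS  = 46        # assumed usable physical address width
    KEYID_BITS = 6         # assumed number of KeyID bits

    def pte_phys_addr(hwaddr, keyid):
        assert keyid < (1 << KEYID_BITS)
        return hwaddr | (keyid << (PHYS_BITS - KEYID_BITS))

    addr = 0x12345000
    # Same page, two KeyIDs: the cache/coherency fabric sees two distinct
    # physical addresses; only the memory controller knows the difference.
    print(hex(pte_phys_addr(addr, 0)))   # 0x12345000
    print(hex(pte_phys_addr(addr, 3)))   # 0x30012345000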

> B. Protects against cold boot attacks.

TME does this, AFAIK.  MKTME does, too, unless the "user" mode is
used, in which case the protection is weaker.

>
> Isn't this what this about in the nutshell roughly?
>
> [1] XPFO could potentially be an opt-in feature that reduces the
>     damage when either of these assumptions has been broken.
>
> /Jarkko
Jarkko Sakkinen Dec. 12, 2018, 4:43 p.m. UTC | #36
On Wed, 2018-12-12 at 08:29 -0800, Andy Lutomirski wrote:
> On Wed, Dec 12, 2018 at 7:31 AM Sakkinen, Jarkko
> <jarkko.sakkinen@intel.com> wrote:
> > On Fri, 2018-12-07 at 15:45 -0800, Jarkko Sakkinen wrote:
> > > The brutal fact is that a physical address is an astronomical stretch
> > > from a random value or increasing counter. Thus, it is fair to say that
> > > MKTME provides only naive measures against replay attacks...
> > 
> > I'll try to summarize how I understand the high-level security
> > model of MKTME (it would be a good idea to document it).
> > 
> > Assumptions:
> > 
> > 1. The hypervisor has not been infiltrated.
> > 2. The hypervisor does not leak secrets.
> > 
> > When (1) and (2) hold [1], we harden VMs in two different ways:
> > 
> > A. VMs cannot leak data to each other or can they with L1TF when HT
> >    is enabled?
> 
> I strongly suspect that, on L1TF-vulnerable CPUs, MKTME provides no
> protection whatsoever.  It sounds like MKTME is implemented in the
> memory controller -- as far as the rest of the CPU and the cache
> hierarchy are concerned, the MKTME key selection bits are just part of
> the physical address.  So an attack like L1TF that leaks a cacheline
> that's selected by physical address will leak the cleartext if the key
> selection bits are set correctly.
> 
> (I suppose that, if the attacker needs to brute-force the physical
> address, then MKTME makes it a bit harder because the effective
> physical address space is larger.)
> 
> > B. Protects against cold boot attacks.
> 
> TME does this, AFAIK.  MKTME does, too, unless the "user" mode is
> used, in which case the protection is weaker.
> 
> > Isn't this what this about in the nutshell roughly?
> > 
> > [1] XPFO could potentially be an opt-in feature that reduces the
> >     damage when either of these assumptions has been broken.

This all should be summarized in the documentation (high-level model
and corner cases).

/Jarkko
Huang, Kai Dec. 12, 2018, 11:24 p.m. UTC | #37
> I strongly suspect that, on L1TF-vulnerable CPUs, MKTME provides no
> protection whatsoever.  It sounds like MKTME is implemented in the
> memory controller -- as far as the rest of the CPU and the cache hierarchy
> are concerned, the MKTME key selection bits are just part of the physical
> address.  So an attack like L1TF that leaks a cacheline that's selected by
> physical address will leak the cleartext if the key selection bits are set
> correctly.

Right. MKTME doesn't prevent cache-based attacks. Data in the cache is in the clear.

Thanks,
-Kai
Huang, Kai Dec. 12, 2018, 11:27 p.m. UTC | #38
> This all should be summarized in the documentation (high-level model and
> corner cases).

I am not sure whether it is necessary to document L1TF explicitly, since it is quite obvious that MKTME doesn't prevent that. IMHO if needed we only need to mention MKTME doesn't prevent any sort of cache based attack, since data in cache is in clear.

In fact SGX doesn't prevent this either..

Thanks,
-Kai
Jarkko Sakkinen Dec. 13, 2018, 5:49 a.m. UTC | #39
On Thu, 2018-12-13 at 07:27 +0800, Huang, Kai wrote:
> > This all should be summarized in the documentation (high-level model and
> > corner cases).
> 
> I am not sure whether it is necessary to document L1TF explicitly, since it is
> quite obvious that MKTME doesn't prevent that. IMHO if needed we only need to
> mention MKTME doesn't prevent any sort of cache based attack, since data in
> cache is in clear.
> 
> In fact SGX doesn't prevent this either..

Sorry, was a bit unclear. I meant the assumptions and goals.

/Jarkko
Jarkko Sakkinen Dec. 13, 2018, 5:52 a.m. UTC | #40
On Thu, 2018-12-13 at 07:49 +0200, Jarkko Sakkinen wrote:
> On Thu, 2018-12-13 at 07:27 +0800, Huang, Kai wrote:
> > > This all should be summarized in the documentation (high-level model and
> > > corner cases).
> > 
> > I am not sure whether it is necessary to document L1TF explicitly, since it
> > is
> > quite obvious that MKTME doesn't prevent that. IMHO if needed we only need
> > to
> > mention MKTME doesn't prevent any sort of cache based attack, since data in
> > cache is in clear.
> > 
> > In fact SGX doesn't prevent this either..
> 
> Sorry, was a bit unclear. I meant the assumptions and goals.

I.e. what I put in my earlier response: what belongs to the TCB and what
types of adversaries we aim to protect against.

/Jarkko