[v10,0/9] KVM: mm: fd-based approach for supporting KVM

Message ID 20221202061347.1070246-1-chao.p.peng@linux.intel.com

Message

Chao Peng Dec. 2, 2022, 6:13 a.m. UTC
This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can happen, which can
further crash the running host system; this is terrible for
multi-tenant configurations. The host accesses include those from KVM
userspace like QEMU. This series addresses KVM userspace-induced
crashes by introducing new mm and KVM interfaces so that KVM userspace
can still manage guest memory via an fd-based approach, but can never
access the guest memory content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these
patches. Any other reviews are always welcome.
  - 01: mm change, target for mm tree
  - 02-09: KVM change, target for KVM tree

Since KVM is currently the only user of the mm part, I have chatted
with Paolo and he is OK with merging the mm change through the KVM
tree, but reviewed-by/acked-by tags are still expected from the mm
people.

The patches have been verified in an Intel TDX environment, and Vishal
has done excellent work on the selftests[4] dedicated to this series,
making it possible to test it without special hardware or the fancy
steps of building a VM environment. See the Test section below for
more info.


Introduction
============
KVM userspace being able to crash the host is horrible. Under the
current KVM architecture, all guest memory is inherently accessible
from KVM userspace and is thus exposed to the crash issue mentioned
above. The goal of this series is to align mm and KVM on an approach
that exposes guest memory without making it accessible to userspace.

Normally, KVM populates the secondary page tables (e.g. EPT) using a
host virtual address (hva) from the core mm page tables (e.g. the x86
userspace page tables). This requires guest memory to be mmaped into
KVM userspace, but that is also where the mentioned crash can
originate. In theory, apart from the 'shared' memory used for device
emulation etc., guest memory does not have to be mmaped into KVM
userspace at all.

This series introduces fd-based guest memory which is never mmaped
into KVM userspace. KVM populates the secondary page tables using an
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM
interacts with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.

Kirill had a patch [2] addressing the same issue in a different way.
It tracks guest encrypted memory at the 'struct page' level and relies
on HWPOISON to reject userspace access. That patch was discussed in
several online and offline threads and resulted in a design document
[3], which is also the original proposal for this series. The series
has since evolved as more comments were received from the community,
but the major concepts in [3] still hold true, so it is recommended
reading.

The patch series may also be useful for other use cases; for example,
a pure software approach could use it to harden itself against
unintentional access to guest memory. The series is designed with such
use cases in mind, but it does not contain code to support them
directly; extensions might be needed.


mm change
=========
Introduces a new memfd_restricted system call which creates a memory
file that is restricted from userspace access via normal file
operations like read(), write() or mmap(). The only way to use it is
to pass it to a kernel component like KVM, which accesses the fd
through the newly added restrictedmem kernel interface. The
restrictedmem interface bridges the memory file subsystems
(tmpfs/hugetlbfs etc.) and their users (KVM in this case) and provides
bi-directional communication between them.


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory
from an fd. With this extension, a single memslot can maintain both
private memory, through a private fd (restricted_fd/restricted_offset),
and shared (unencrypted) memory, through a userspace-mmaped host
virtual address (userspace_addr). For a particular guest page, the
corresponding page in the KVM memslot can only be either private or
shared, and only one of the shared/private parts of the memslot is
visible to the guest. For how this new extension is used in QEMU,
please refer to kvm_set_phys_mem() in the TDX-enabled QEMU repo below.
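
Below is a rough sketch of how userspace might set up such a memslot,
assuming the extended region layout implied by the field names above.
The struct name, its padding and the KVM_MEM_PRIVATE value are
assumptions here, not necessarily the final uAPI:

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #ifndef KVM_MEM_PRIVATE
    #define KVM_MEM_PRIVATE (1UL << 2)    /* assumed flag value */

    struct kvm_userspace_memory_region_ext {    /* assumed layout */
        struct kvm_userspace_memory_region region;
        __u64 restricted_offset;
        __u32 restricted_fd;
        __u32 pad1;
        __u64 pad2[14];
    };
    #endif

    static void set_private_memslot(int vm_fd, int restricted_fd,
                                    void *shared_va, __u64 gpa, __u64 size)
    {
        struct kvm_userspace_memory_region_ext region = {
            .region = {
                .slot            = 0,
                .flags           = KVM_MEM_PRIVATE,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                /* shared (unencrypted) part: normal mmaped memory */
                .userspace_addr  = (__u64)(unsigned long)shared_va,
            },
            /* private part: fd from memfd_restricted(), never mmaped */
            .restricted_fd     = restricted_fd,
            .restricted_offset = 0,
        };

        /* per this series, KVM_MEM_PRIVATE makes KVM read the extension */
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region))
            err(1, "KVM_SET_USER_MEMORY_REGION");
    }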

Introduces a new KVM_EXIT_MEMORY_FAULT exit that gives userspace the
chance to make decisions about shared <-> private memory conversion.
The exit can be triggered by an implicit conversion in the KVM page
fault handler or by an explicit conversion request from the guest OS.
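
For illustration, a userspace VMM might dispatch this exit from its
vcpu run loop (when run->exit_reason == KVM_EXIT_MEMORY_FAULT) roughly
as below. This assumes the series' patched headers; the kvm_run field
and flag names follow the description here and may differ, and
convert_range() is a hypothetical helper, sketched after the next
paragraph:

    #include <stdbool.h>
    #include <linux/kvm.h>

    /* hypothetical helper, sketched below */
    static void convert_range(int vm_fd, __u64 gpa, __u64 size, bool to_private);

    static void handle_memory_fault(int vm_fd, struct kvm_run *run)
    {
        /* assumed fields: the exit carries the faulting range and flags */
        __u64 gpa  = run->memory.gpa;
        __u64 size = run->memory.size;
        bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

        /*
         * Userspace decides whether to honor the conversion: typically it
         * adjusts the backing stores and flips the page attributes so that
         * the other (private or shared) half of the memslot is used.
         */
        convert_range(vm_fd, gpa, size, to_private);
    }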

Introduces a new KVM ioctl, KVM_SET_MEMORY_ATTRIBUTES, to maintain
whether a page is private or shared. This ioctl allows userspace to
convert a page between private <-> shared. The maintained data is the
source of truth for whether a guest page is private or shared, and it
is used in the KVM page fault handler to decide whether the private or
the shared part of the memslot is visible to the guest.
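
Continuing the sketch above, the hypothetical convert_range() helper
flips the attribute with the new ioctl. The struct layout, ioctl
number and attribute bit below are illustrative fallbacks for
unpatched headers, not necessarily the final uAPI:

    #include <err.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #ifndef KVM_SET_MEMORY_ATTRIBUTES
    struct kvm_memory_attributes {    /* assumed layout */
        __u64 address;
        __u64 size;
        __u64 attributes;
        __u64 flags;
    };
    /* assumed ioctl number and attribute bit */
    #define KVM_SET_MEMORY_ATTRIBUTES    _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)
    #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
    #endif

    static void convert_range(int vm_fd, __u64 gpa, __u64 size, bool to_private)
    {
        struct kvm_memory_attributes attrs = {
            .address    = gpa,
            .size       = size,
            .attributes = to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
        };

        /* issued on the VM fd; covers the range [gpa, gpa + size) */
        if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
            err(1, "KVM_SET_MEMORY_ATTRIBUTES");
    }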


Test
====
Two kinds of tests have been run:
  - Selftests [4] from Vishal and VM boot tests in a non-TDX environment.
    The code is also in the repo below: https://github.com/chao-p/linux/tree/privmem-v10

  - Functional tests in a TDX-capable environment.
    Tested the new functionality in a TDX environment. Code repos:
    Linux: https://github.com/chao-p/linux/tree/privmem-v10-tdx
    QEMU: https://github.com/chao-p/qemu/tree/privmem-v10

    An example QEMU command line for TDX test:
    -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
    -machine confidential-guest-support=tdx \
    -object memory-backend-memfd-private,id=ram1,size=${mem} \
    -machine memory-backend=ram1


TODO
====
  - Page accounting and limiting for encrypted memory
  - hugetlbfs support


Changelog
=========
v10:
  - mm: hook up restricted_memfd to memory failure and route it to
    kernel users through the .error() callback.
  - mm: call the invalidate() notifier only for FALLOC_FL_PUNCH_HOLE,
    i.e. not for allocation.
  - KVM: introduce a new ioctl KVM_SET_MEMORY_ATTRIBUTES for memory
    conversion instead of reusing KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
  - KVM: refine the gfn-based mmu_notifier_retry() mechanism.
  - KVM: improve the lpage_info updating code.
  - KVM: fix a bug in private memory handling where a private fault
    could fall into a non-private memslot.
  - KVM: handle memory machine check errors for fd-based memory.
v9:
  - mm: move inaccessible memfd into a separate syscall.
  - mm: return the page instead of pfn_t for inaccessible_get_pfn and
    remove inaccessible_put_pfn.
  - KVM: rename inaccessible/private to restricted and change the
    CONFIG to make the code friendly to pKVM.
  - KVM: add an invalidate_begin/end pair to fix race contention and
    revise the lock protection for the invalidation path.
  - KVM: optimize setting lpage_info for levels above 2M by directly
    accessing the lower level's result.
  - KVM: avoid loading the xarray in kvm_mmu_max_mapping_level() and
    instead let the caller pass in is_private.
  - KVM: API doc improvements.
v8:
  - mm: redesign the mm part by introducing a shim layer
    (inaccessible_memfd) in memfd to avoid touching the memory file
    systems directly.
  - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
    causes confusion in this series; it will be sent out separately.
  - doc: exclude the man page change; it's not a kernel patch and will
    be sent out separately.
  - KVM: adapt to use the new mm inaccessible_memfd interface.
  - KVM: update lpage_info when setting mem_attr_array to support
    large pages.
  - KVM: change from xa_store_range to xa_store for mem_attr_array
    because xa_store_range overrides all entries, which is not the
    intended behavior for us.
  - KVM: refine the mmu_invalidate_retry_gfn mechanism for private pages.
  - KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and the private
    page handling code as suggested by Sean.
v7:
  - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
    private/shared info.
  - KVM: use a sync mechanism between the zap/page-fault paths similar
    to mmu_notifier for memfile_notifier based invalidation.
v6:
  - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
    feature consistency among all memfile_notifier users and get rid of
    internal flags like SHM_F_INACCESSIBLE.
  - mm: make the pfn_ops callbacks members of memfile_backing_store and
    refer to them directly in memfile_notifier.
  - mm: remove backing store unregister.
  - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
  - KVM: reorganize the patch sequence for page fault handling and
    private memory enabling.
v5:
  - Add a man page for the MFD_INACCESSIBLE flag and improve the KVM
    API doc for the new memslot extensions.
  - mm: introduce memfile_{un}register_backing_store to allow a memory
    backing store to register/unregister itself with memfile_notifier.
  - mm: remove F_SEAL_INACCESSIBLE and use an in-kernel flag
    (SHM_F_INACCESSIBLE for shmem) instead.
  - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
    MFD_INACCESSIBLE memory.
  - KVM: remove the overlap check for mapping the same file+offset into
    multiple gfns for performance reasons; this is warned about in the
    documentation.
v4:
  - mm: rename memfd_ops to memfile_notifier and separate it from
    memfd.c into a standalone memfile-notifier.c.
  - KVM: move pfn_ops from per-vm scope to per-memslot scope and allow
    registering multiple memslots to the same memory backing store.
  - KVM: add a 'kvm' reference in the memslot so that we can recover
    kvm in the memfile_notifier handlers.
  - KVM: add a 'private_' prefix to the new fields in the memslot.
  - KVM: reshape the 'type' into a 'flag' for kvm_memory_exit.
v3:
  - Remove 'RFC' prefix.
  - Fix race condition between memfile_notifier handlers and kvm destroy.
  - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
    setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
    pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/ 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/  
[4] Selftest:
https://lore.kernel.org/all/20221111014244.1714148-1-vannapurve@google.com/


Chao Peng (8):
  KVM: Introduce per-page memory attributes
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Unmap existing mappings when change the memory attributes
  KVM: Update lpage info when private/shared memory are mixed
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm: Introduce memfd_restricted system call to create restricted user
    memory

 Documentation/virt/kvm/api.rst         | 125 ++++++-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/asm/kvm_host.h        |   9 +
 arch/x86/kvm/Kconfig                   |   3 +
 arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
 arch/x86/kvm/mmu/mmutrace.h            |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
 arch/x86/kvm/x86.c                     |  17 +-
 include/linux/kvm_host.h               | 103 +++++-
 include/linux/restrictedmem.h          |  71 ++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/kvm.h               |  53 +++
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/memory-failure.c                    |   3 +
 mm/restrictedmem.c                     | 318 +++++++++++++++++
 virt/kvm/Kconfig                       |   6 +
 virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
 23 files changed, 1323 insertions(+), 93 deletions(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c


base-commit: df0bb47baa95aad133820b149851d5b94cbc6790

Comments

Sean Christopherson Jan. 14, 2023, 12:37 a.m. UTC | #1
On Fri, Dec 02, 2022, Chao Peng wrote:
> [...]

A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:

  git@github.com:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested.  There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?

Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
(and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream dependencies
(SEV and TDX).  For tests, I want to build a list of tests that are required for
merging so that the criteria for merging are clear, and so that if the list is large
(haven't thought much yet), the work of writing and running tests can be distributed.

Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together.  Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that we
don't miss an opportunity to make things simpler.  The patches in the SNP series to
add "legacy" SEV support for UPM in particular made me slightly rethink some minor
details.  Nothing remotely major, but something that needs attention since it'll
be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.

Thanks!
Kirill A. Shutemov Jan. 16, 2023, 1:48 p.m. UTC | #2
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

Minor build fix:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
 	int level;
 	bool mixed;
 
-	lockdep_assert_held_write(kvm->mmu_lock);
-	lockdep_assert_held(kvm->slots_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->slots_lock);
 
 	/*
 	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	lockdep_assert_held(kvm->mmu_lock);
+	lockdep_assert_held(&kvm->mmu_lock);
 
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
Chao Peng Jan. 17, 2023, 1:19 p.m. UTC | #3
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion; I have
replied to those individually.

> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

I have integrated this into my local TDX repo with some changes (as I
replied individually), and the new code basically still works with TDX.

I have also asked other TDX folks to take a look.

> 
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).
> 
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX).  For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
> 
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.  Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler.  The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details.  Nothing remotely major, but something that needs attention since it'll
> be uAPI.
> 
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.

Appreciate your effort. As for the next steps, if you see something we
can do in parallel, feel free to let me know.

Thanks,
Chao
Fuad Tabba Jan. 17, 2023, 2:32 p.m. UTC | #4
Hi Sean,

On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).

Thanks for sharing this. I've had a look at the patches, and have
ported them to work with pKVM. At a high level, the new interface
seems fine and it works with the arm64/pKVM port. I have a couple of
comments regarding some of the details, but they can wait until v11 is
posted.

Cheers,
/fuad



Isaku Yamahata Jan. 19, 2023, 11:13 a.m. UTC | #5
On Sat, Jan 14, 2023 at 12:37:59AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, Dec 02, 2022, Chao Peng wrote:
> > [...]
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
> 
> [...]

Although I'm still debugging TDX KVM, I needed the following.
kvm_faultin_pfn() is called without mmu_lock held; the race to change
private/shared is handled by mmu_seq. Maybe a dedicated function only
for kvm_faultin_pfn() is needed.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-       lockdep_assert_held(&kvm->mmu_lock);
+       // lockdep_assert_held(&kvm->mmu_lock);
 
        return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
Sean Christopherson Jan. 19, 2023, 3:25 p.m. UTC | #6
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > [...]
> 
> Although I'm still debugging TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held; the race to change
> private/shared is handled by mmu_seq. Maybe a dedicated function only
> for kvm_faultin_pfn() is needed.

Gah, you're not on the other thread where this was discussed[*].  Simply deleting
the lockdep assertion is safe; for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
Isaku Yamahata Jan. 19, 2023, 10:37 p.m. UTC | #7
On Thu, Jan 19, 2023 at 03:25:08PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > [...]
> > 
> > Although I'm still debugging TDX KVM, I needed the following.
> > kvm_faultin_pfn() is called without mmu_lock held; the race to change
> > private/shared is handled by mmu_seq. Maybe a dedicated function only
> > for kvm_faultin_pfn() is needed.
> 
> Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> the lockdep assertion is safe; for guest types that rely on the attributes to
> define shared vs. private, KVM rechecks the attributes under the protection of
> mmu_seq.
> 
> I'll get a fixed version pushed out today.
> 
> [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com

Now I have TDX KVM working. I've uploaded it at the following:
It's rebased to v6.2-rc3.
        git@github.com:yamahata/linux.git tdx/upm
        git@github.com:yamahata/qemu.git tdx/upm

kvm_mmu_do_page_fault() needs the following change.
kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
make sense. This change would belong in the TDX KVM patches, though.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0da8e27e0..f45ac438bbf4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -430,7 +430,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                .max_level = vcpu->kvm->arch.tdp_max_page_level,
                .req_level = PG_LEVEL_4K,
                .goal_level = PG_LEVEL_4K,
-               .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+               .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
        };
        int r;
Sean Christopherson Jan. 24, 2023, 1:27 a.m. UTC | #8
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > [...]
> > > 
> > > Although I'm still debugging TDX KVM, I needed the following.
> > > kvm_faultin_pfn() is called without mmu_lock held; the race to change
> > > private/shared is handled by mmu_seq. Maybe a dedicated function only
> > > for kvm_faultin_pfn() is needed.
> > 
> > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > the lockdep assertion is safe; for guest types that rely on the attributes to
> > define shared vs. private, KVM rechecks the attributes under the protection of
> > mmu_seq.
> > 
> > I'll get a fixed version pushed out today.
> > 
> > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> 
> Now I have TDX KVM working. I've uploaded it at the following:
> It's rebased to v6.2-rc3.
>         git@github.com:yamahata/linux.git tdx/upm
>         git@github.com:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again to):

  git@github.com:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out.  I'm done
working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
to resolve any remaining todos (that no one else tackles) and to do the whole
"merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change.
> kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> make sense. This change would belong in the TDX KVM patches, though.

Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.
Liam Merwick Jan. 24, 2023, 4:08 p.m. UTC | #9
On 14/01/2023 00:37, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> [...]
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>    git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 

When running LTP (https://github.com/linux-test-project/ltp) on the v10
bits (and also with Sean's branch above) I encounter the following NULL
pointer dereference with testcases/kernel/syscalls/madvise/madvise01
(100% reproducible).

It appears that in restrictedmem_error_page() 
inode->i_mapping->private_data is NULL
in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
but I don't know why.


[ 5365.177168] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 5365.178881] #PF: supervisor read access in kernel mode
[ 5365.180006] #PF: error_code(0x0000) - not-present page
[ 5365.181322] PGD 8000000109dad067 P4D 8000000109dad067 PUD 107707067 PMD 0
[ 5365.183474] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5365.184792] CPU: 0 PID: 22086 Comm: madvise01 Not tainted 6.1.0-1.el8.seanjcupm.x86_64 #1
[ 5365.186572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[ 5365.188816] RIP: 0010:restrictedmem_error_page+0xc7/0x1b0
[ 5365.190081] Code: 99 00 48 8b 55 00 48 8b 02 48 8d 8a e8 fe ff ff 48 2d 18 01 00 00 48 39 d5 0f 84 8a 00 00 00 48 8b 51 30 48 8b 92 b8 00 00 00 <48> 8b 4a 28 4c 39 b1 d8 00 00 00 74 22 48 8b 88 18 01 00 00 48 8d
[ 5365.193984] RSP: 0018:ffff9b7343c07d80 EFLAGS: 00010206
[ 5365.195142] RAX: ffff8e5b410cfc70 RBX: 0000000000000001 RCX: ffff8e5b4048e580
[ 5365.196888] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5365.198399] RBP: ffff8e5b410cfd88 R08: 0000000000000000 R09: 0000000000000000
[ 5365.200200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5365.201843] R13: ffff8e5b410cfd80 R14: ffff8e5b47cc7618 R15: ffffd49d44c05080
[ 5365.203472] FS:  00007fc96de9b5c0(0000) GS:ffff8e5deda00000(0000) knlGS:0000000000000000
[ 5365.205485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5365.206791] CR2: 0000000000000028 CR3: 000000012047e002 CR4: 0000000000170ef0
[ 5365.208131] Call Trace:
[ 5365.208752]  <TASK>
[ 5365.209229]  me_pagecache_clean+0x58/0x100
[ 5365.210196]  identify_page_state+0x84/0xd0
[ 5365.211180]  memory_failure+0x231/0x8b0
[ 5365.212148]  madvise_inject_error.cold+0x8d/0xa4
[ 5365.213317]  do_madvise+0x363/0x3a0
[ 5365.214177]  __x64_sys_madvise+0x2c/0x40
[ 5365.215159]  do_syscall_64+0x3f/0xa0
[ 5365.216016]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5365.217130] RIP: 0033:0x7fc96d8399ab
[ 5365.217953] Code: 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 54 38 00 f7 d8 64 89 01 48
[ 5365.222323] RSP: 002b:00007fff62a99b18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 5365.224026] RAX: ffffffffffffffda RBX: 000000000041ce00 RCX: 00007fc96d8399ab
[ 5365.225375] RDX: 0000000000000064 RSI: 000000000000a000 RDI: 00007fc96de8e000
[ 5365.226999] RBP: 00007fc96de9b540 R08: 0000000000000001 R09: 0000000000415c80
[ 5365.228641] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000008
[ 5365.230074] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Regards,
Liam

Sean Christopherson Jan. 25, 2023, 12:20 a.m. UTC | #10
On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > [...]
> > 
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> > 
> >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > 
> > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> > 
> 
> When running LTP (https://github.com/linux-test-project/ltp) on the v10
> bits (and also with Sean's branch above) I encounter the following NULL
> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> (100% reproducible).
> 
> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> is NULL
> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> but I don't know why.

Kirill, can you take a look?  Or pass the buck to someone who can? :-)
Kirill A. Shutemov Jan. 25, 2023, 12:53 p.m. UTC | #11
On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> On Tue, Jan 24, 2023, Liam Merwick wrote:
> > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > [...]
> > > 
> > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > is available here:
> > > 
> > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > 
> > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > a WIP.
> > > 
> > 
> > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > bits (and also with Sean's branch above) I encounter the following NULL
> > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > (100% reproducible).
> > 
> > It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> > is NULL
> > in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> > but I don't know why.
> 
> Kirill, can you take a look?  Or pass the buck to someone who can? :-)

The patch below should help.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		struct restrictedmem *rm = inode->i_mapping->private_data;
 		struct restrictedmem_notifier *notifier;
-		struct file *memfd = rm->memfd;
+		struct restrictedmem *rm;
 		unsigned long index;
+		struct file *memfd;
 
-		if (memfd->f_mapping != mapping)
+		if (atomic_read(&inode->i_count))
 			continue;
 
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		rm = inode->i_mapping->private_data;
+		memfd = rm->memfd;
+
+		if (memfd->f_mapping != mapping) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		spin_unlock(&inode->i_lock);
+
 		xa_for_each_range(&rm->bindings, index, notifier, start, end)
 			notifier->ops->error(notifier, start, end);
 		break;
Liam Merwick Jan. 25, 2023, 4:01 p.m. UTC | #12
On 25/01/2023 12:53, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
>> On Tue, Jan 24, 2023, Liam Merwick wrote:
>>> On 14/01/2023 00:37, Sean Christopherson wrote:
>>>> On Fri, Dec 02, 2022, Chao Peng wrote:
...
>>>
>>> When running LTP (https://github.com/linux-test-project/ltp) on the v10
>>> bits (and also with Sean's branch above) I encounter the following NULL
>>> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
>>> (100% reproducible).
>>>
>>> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
>>> is NULL
>>> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
>>> but I don't know why.
>>
>> Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> 
> The patch below should help.

Thanks, this works for me.

Regards,
Liam

Isaku Yamahata Feb. 8, 2023, 12:24 p.m. UTC | #13
On Tue, Jan 24, 2023 at 01:27:50AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > [...]
> > 
> > Now I have TDX KVM working. I've uploaded it at the following:
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 

OK, I rebased the TDX part onto the updated branch.
        git@github.com:yamahata/linux.git tdx/upm
        git@github.com:yamahata/qemu.git tdx/upm

It's now based on v6.2-rc7.
QEMU needs more patches to avoid registering a memory slot for SMM.
Michael Roth Feb. 13, 2023, 1:01 p.m. UTC | #14
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a list of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> > > > private/shared is handled by mmu_seq.  Maybe a dedicated function only
> > > > for kvm_faultin_pfn() is needed.
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > > the lockdep assertion is safe; for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > > 
> > > I'll get a fixed version pushed out today.
> > > 
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> > 
> > Now I have TDX KVM working. I've uploaded it at the following locations.
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> Took longer than expected to get the memslot restrictions sussed out.  I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.
> 
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
> 
> Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.

Hi Sean,

We've rebased the SEV+SNP support onto your updated UPM base support
tree and things seem to be working okay, but we needed some fixups on
top of the base support to get things working, along with one workaround
for an issue that hasn't been root-caused yet:

  https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip

  *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
  *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
  *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
  *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
  *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
  *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes

We plan to post an updated RFC for v8 soon, but also wanted to share
the staging tree in case you end up looking at the UPM integration aspects
before then.

-Mike
Mike Rapoport Feb. 16, 2023, 5:13 a.m. UTC | #15
Hi,

On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.

Sorry for jumping late.

Unless I'm missing something, hibernation will also cause a machine check
when there is TDX-protected memory in the system. When hibernation
creates a memory snapshot it essentially walks all physical pages and saves
their contents, so for TDX memory this will trigger a machine check, right?
 
>  Documentation/virt/kvm/api.rst         | 125 ++++++-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/asm/kvm_host.h        |   9 +
>  arch/x86/kvm/Kconfig                   |   3 +
>  arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
>  arch/x86/kvm/mmu/mmutrace.h            |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
>  arch/x86/kvm/x86.c                     |  17 +-
>  include/linux/kvm_host.h               | 103 +++++-
>  include/linux/restrictedmem.h          |  71 ++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/kvm.h               |  53 +++
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++
>  virt/kvm/Kconfig                       |   6 +
>  virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
>  23 files changed, 1323 insertions(+), 93 deletions(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
David Hildenbrand Feb. 16, 2023, 9:41 a.m. UTC | #16
On 16.02.23 06:13, Mike Rapoport wrote:
> Hi,
> 
> On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
> 
> Sorry for jumping late.
> 
> Unless I'm missing something, hibernation will also cause a machine check
> when there is TDX-protected memory in the system. When hibernation
> creates a memory snapshot it essentially walks all physical pages and saves
> their contents, so for TDX memory this will trigger a machine check, right?

I recall bringing that up in the past (also memory access due to kdump,
/proc/kcore) and was told that the main focus for now is preventing
unprivileged users from crashing the system, that is, not mapping such
memory into user space (e.g., QEMU). In the long run, we'll want to
also handle such pages properly in the other events where the kernel
might access them.
Chao Peng Feb. 21, 2023, 12:11 p.m. UTC | #17
> Hi Sean,
> 
> We've rebased the SEV+SNP support onto your updated UPM base support
> tree and things seem to be working okay, but we needed some fixups on
> top of the base support to get things working, along with one workaround
> for an issue that hasn't been root-caused yet:
> 
>   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> 
>   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
>   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is that Slot#3 gets added first and then deleted. When it
gets added, Slot#0 already has the same range bound to restrictedmem, which
triggers the exclusivity check. This check is exactly what the current code
is for.
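
Concretely, the failing sequence boils down to two binds of the same range;
argument names here are abbreviated/illustrative, with 'exclusive' being the
flag from this series:

  restrictedmem_bind(file, start, end, slot0_notifier, true);  /* Slot#0: ok */
  restrictedmem_bind(file, start, end, slot3_notifier, true);  /* Slot#3: rejected */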

>   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
>   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using an
exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind and
the notifier callbacks) but fix it up internally in restrictedmem, e.g. in
all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
See below for the change.

>   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index for start/end in
restrictedmem_get_gfn_range() is the correct fix.
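
For the record, a minimal sketch of what that corrected calculation looks
like; slot->restrictedmem.index is the binding offset from this series, but
the function body below is illustrative rather than the literal patch:

  /*
   * Illustrative only: translate a restrictedmem offset range [start, end)
   * back to the GFN range it backs, accounting for where the slot's
   * binding starts within the file.
   */
  static void restrictedmem_get_gfn_range(struct kvm_memory_slot *slot,
                                          pgoff_t start, pgoff_t end,
                                          struct kvm_gfn_range *range)
  {
          range->start = slot->base_gfn + (start - slot->restrictedmem.index);
          range->end   = slot->base_gfn + (end - slot->restrictedmem.index);
  }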

>   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> 
> We plan to post an updated RFC for v8 soon, but also wanted to share
> the staging tree in case you end up looking at the UPM integration aspects
> before then.
> 
> -Mike

This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
         */
        down_read(&rm->lock);
 
-       xa_for_each_range(&rm->bindings, index, notifier, start, end)
+       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                notifier->ops->invalidate_start(notifier, start, end);
 
        ret = memfd->f_op->fallocate(memfd, mode, offset, len);
 
-       xa_for_each_range(&rm->bindings, index, notifier, start, end)
+       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                notifier->ops->invalidate_end(notifier, start, end);
 
        up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
                }
                spin_unlock(&inode->i_lock);
 
-               xa_for_each_range(&rm->bindings, index, notifier, start, end)
+               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                        notifier->ops->error(notifier, start, end);
                break;
        }
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
                if (exclusive != rm->exclusive)
                        goto out_unlock;
 
-               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+               if (exclusive &&
+                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
                        goto out_unlock;
        }
 
-       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
        rm->exclusive = exclusive;
        ret = 0;
 out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
        struct restrictedmem *rm = file->f_mapping->private_data;
 
        down_write(&rm->lock);
-       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
        synchronize_rcu();
        up_write(&rm->lock);
 }
Sean Christopherson Feb. 22, 2023, 9:53 p.m. UTC | #18
On Thu, Feb 16, 2023, David Hildenbrand wrote:
> On 16.02.23 06:13, Mike Rapoport wrote:
> > Hi,
> > 
> > On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > 
> > Sorry for jumping late.
> > 
> > Unless I'm missing something, hibernation will also cause a machine check
> > when there is TDX-protected memory in the system. When hibernation
> > creates a memory snapshot it essentially walks all physical pages and saves
> > their contents, so for TDX memory this will trigger a machine check, right?

For hibernation specifically, I think that should be handled elsewhere as hibernation
is simply incompatible with TDX, SNP, pKVM, etc. without paravirtualizing the
guest, as none of those technologies support auto-export a la s390.  I suspect
the right approach is to disallow hibernation if KVM is running any protected guests.
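
A rough sketch of that idea using the generic PM notifier chain; the
kvm_has_protected_guests() helper is hypothetical, everything else is the
stock notifier API:

  #include <linux/notifier.h>
  #include <linux/suspend.h>

  static int kvm_pm_notifier_call(struct notifier_block *nb,
                                  unsigned long event, void *unused)
  {
          /* Veto hibernation while any TDX/SNP/pKVM guest exists. */
          if (event == PM_HIBERNATION_PREPARE && kvm_has_protected_guests())
                  return NOTIFY_BAD;
          return NOTIFY_DONE;
  }

  static struct notifier_block kvm_pm_nb = {
          .notifier_call = kvm_pm_notifier_call,
  };

  /* registered once at init: register_pm_notifier(&kvm_pm_nb); */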

> I recall bringing that up in the past (also memory access due to kdump,
> /proc/kcore) and was told that the main focus for now is preventing
> unprivileged users from crashing the system, that is, not mapping such
> memory into user space (e.g., QEMU). In the long run, we'll want to also
> handle such pages properly in the other events where the kernel might
> access them.

Ya, unless someone strongly objects, the plan is to essentially treat "attacks"
from privileged users as out of scope for initial support, and then iterate
as needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be ok for TDX and SNP as they both play
nice with "bad" reads.  pKVM is a different beast though as I believe any access
to guest private memory will fault.  But my understanding is that this series
would be a big step forward for pKVM, which currently doesn't have any safeguards.
Michael Roth March 23, 2023, 1:27 a.m. UTC | #19
On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > Hi Sean,
> > 
> > We've rebased the SEV+SNP support onto your updated UPM base support
> > tree and things seem to be working okay, but we needed some fixups on
> > top of the base support to get things working, along with one workaround
> > for an issue that hasn't been root-caused yet:
> > 
> >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > 
> >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> 
> What I'm seeing is that Slot#3 gets added first and then deleted. When it
> gets added, Slot#0 already has the same range bound to restrictedmem, which
> triggers the exclusivity check. This check is exactly what the current code
> is for.

With the following change in QEMU, we no longer trigger this check:

  diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
  index 20da121374..849b5de469 100644
  --- a/hw/pci-host/q35.c
  +++ b/hw/pci-host/q35.c
  @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
       memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
                                mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
                                MCH_HOST_BRIDGE_SMRAM_C_SIZE);
  +    memory_region_set_enabled(&mch->open_high_smram, false);
       memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
                                           &mch->open_high_smram, 1);
  -    memory_region_set_enabled(&mch->open_high_smram, false);

I'm not sure if QEMU is actually doing something wrong here though or if
this check is putting tighter restrictions on userspace than what was
expected before. Will look into it more.

> 
> >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> 
> As many kernel APIs treat 'end' as exclusive, I would rather keep using an
> exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind and
> the notifier callbacks) but fix it up internally in restrictedmem, e.g. in
> all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> See below for the change.

Yes I did feel like I was fighting the kernel a bit on that; your
suggestion seems like it would be a better fit.

> 
> >   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> 
> Subtracting slot->restrictedmem.index for start/end in
> restrictedmem_get_gfn_range() is the correct fix.
> 
> >   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > 
> > We plan to post an updated RFC for v8 soon, but also wanted to share
> > the staging tree in case you end up looking at the UPM integration aspects
> > before then.
> > 
> > -Mike
> 
> This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

Looks good.

Thanks!

-Mike

> 
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
>          */
>         down_read(&rm->lock);
>  
> -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                 notifier->ops->invalidate_start(notifier, start, end);
>  
>         ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>  
> -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                 notifier->ops->invalidate_end(notifier, start, end);
>  
>         up_read(&rm->lock);
> @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
>                 }
>                 spin_unlock(&inode->i_lock);
>  
> -               xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                         notifier->ops->error(notifier, start, end);
>                 break;
>         }
> @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>                 if (exclusive != rm->exclusive)
>                         goto out_unlock;
>  
> -               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> +               if (exclusive &&
> +                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
>                         goto out_unlock;
>         }
>  
> -       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> +       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
>         rm->exclusive = exclusive;
>         ret = 0;
>  out_unlock:
> @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
>         struct restrictedmem *rm = file->f_mapping->private_data;
>  
>         down_write(&rm->lock);
> -       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> +       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
>         synchronize_rcu();
>         up_write(&rm->lock);
>  }
Chao Peng March 24, 2023, 2:13 a.m. UTC | #20
On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > > 
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support to get things working, along with one workaround
> > > for an issue that hasn't been root-caused yet:
> > > 
> > >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > > 
> > >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> > 
> > What I'm seeing is that Slot#3 gets added first and then deleted. When it
> > gets added, Slot#0 already has the same range bound to restrictedmem, which
> > triggers the exclusivity check. This check is exactly what the current code
> > is for.
> 
> With the following change in QEMU, we no longer trigger this check:
> 
>   diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>   index 20da121374..849b5de469 100644
>   --- a/hw/pci-host/q35.c
>   +++ b/hw/pci-host/q35.c
>   @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
>        memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
>                                 mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
>                                 MCH_HOST_BRIDGE_SMRAM_C_SIZE);
>   +    memory_region_set_enabled(&mch->open_high_smram, false);
>        memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
>                                            &mch->open_high_smram, 1);
>   -    memory_region_set_enabled(&mch->open_high_smram, false);
> 
> I'm not sure if QEMU is actually doing something wrong here though or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream. It may break
functionality for 'normal' VMs.

The UPM check does put a tighter restriction on userspace: you can't bind
the same fd range to more than one memslot. SMRAM in QEMU, however,
violates this restriction. The right 'fix' is to disable SMM in QEMU for
UPM usages rather than trying to work around it. There is more discussion
in the link below:

  https://lore.kernel.org/all/Y8bOB7VuVIsxoMcn@google.com/
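
For illustration, disabling SMM on the q35 machine type is a one-line
change on the QEMU command line; shown here as an example, not as the
final UPM userspace requirement:

  # With SMM off, QEMU creates no SMRAM regions, so no second memslot
  # ever binds the same restrictedmem range.
  qemu-system-x86_64 -machine q35,smm=off ...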

Chao

> 
> > 
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using an
> > exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind and
> > the notifier callbacks) but fix it up internally in restrictedmem, e.g. in
> > all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
> 
> > 
> > >   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> > 
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> > 
> > >   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > > 
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > > 
> > > -Mike
> > 
> > This is the restrictedmem fix to solve 'end' being stored and checked in xarray:
> 
> Looks good.
> 
> Thanks!
> 
> -Mike
> 
> > 
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> >          */
> >         down_read(&rm->lock);
> >  
> > -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                 notifier->ops->invalidate_start(notifier, start, end);
> >  
> >         ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >  
> > -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                 notifier->ops->invalidate_end(notifier, start, end);
> >  
> >         up_read(&rm->lock);
> > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
> >                 }
> >                 spin_unlock(&inode->i_lock);
> >  
> > -               xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                         notifier->ops->error(notifier, start, end);
> >                 break;
> >         }
> > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> >                 if (exclusive != rm->exclusive)
> >                         goto out_unlock;
> >  
> > -               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> > +               if (exclusive &&
> > +                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
> >                         goto out_unlock;
> >         }
> >  
> > -       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> > +       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
> >         rm->exclusive = exclusive;
> >         ret = 0;
> >  out_unlock:
> > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> >         struct restrictedmem *rm = file->f_mapping->private_data;
> >  
> >         down_write(&rm->lock);
> > -       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > +       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
> >         synchronize_rcu();
> >         up_write(&rm->lock);
> >  }
Sean Christopherson April 12, 2023, 10:01 p.m. UTC | #21
On Wed, Mar 22, 2023, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using an
> > exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind and
> > the notifier callbacks) but fix it up internally in restrictedmem, e.g. in
> > all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.

Comically belated +1; XArray is the odd one here.
Sean Christopherson April 13, 2023, 1:07 a.m. UTC | #22
On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > > 
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > >    - 01: mm change, target for mm tree
> > > > >    - 02-09: KVM change, target for KVM tree
> > > > 
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > > 
> > > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > 
> > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > > > 
> > > 
> > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > bits (and also with Sean's branch above) I encounter the following NULL
> > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > (100% reproducible).
> > > 
> > > It appears that in restrictedmem_error_page()
> > > inode->i_mapping->private_data is NULL in the
> > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > don't know why.
> > 
> > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> 
> The patch below should help.
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>  
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
>  		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> +		struct restrictedmem *rm;
>  		unsigned long index;
> +		struct file *memfd;
>  
> -		if (memfd->f_mapping != mapping)
> +		if (atomic_read(&inode->i_count))

Kirill, should this be

		if (!atomic_read(&inode->i_count))
			continue;

i.e. skip unreferenced inodes, not skip referenced inodes?
Kirill A. Shutemov April 13, 2023, 4:04 p.m. UTC | #23
On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >    - 01: mm change, target for mm tree
> > > > > >    - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > 
> > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > > bits (and also with Sean's branch above) I encounter the following NULL
> > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > > (100% reproducible).
> > > > 
> > > > It appears that in restrictedmem_error_page()
> > > > inode->i_mapping->private_data is NULL in the
> > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > > don't know why.
> > > 
> > > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> > 
> > The patch below should help.
> > 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index 15c52301eeb9..39ada985c7c0 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> >  
> >  	spin_lock(&sb->s_inode_list_lock);
> >  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > -		struct restrictedmem *rm = inode->i_mapping->private_data;
> >  		struct restrictedmem_notifier *notifier;
> > -		struct file *memfd = rm->memfd;
> > +		struct restrictedmem *rm;
> >  		unsigned long index;
> > +		struct file *memfd;
> >  
> > -		if (memfd->f_mapping != mapping)
> > +		if (atomic_read(&inode->i_count))
> 
> Kirill, should this be
> 
> 		if (!atomic_read(&inode->i_count))
> 			continue;
> 
> i.e. skip unreferenced inodes, not skip referenced inodes?

Ouch. Yes.

But looking at other instances of s_inodes usage, I think we can drop the
check altogether. An inode cannot be completely freed until it is removed
from the s_inodes list.

While there, replace list_for_each_entry_safe() with
list_for_each_entry(), as we don't remove anything from the list.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..8e8a4420d3d1 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping,
 					struct page *page)
 {
 	struct super_block *sb = restrictedmem_mnt->mnt_sb;
-	struct inode *inode, *next;
+	struct inode *inode;
 	pgoff_t start, end;
 
 	start = page->index;
 	end = start + thp_nr_pages(page);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct restrictedmem_notifier *notifier;
 		struct restrictedmem *rm;
 		unsigned long index;
 		struct file *memfd;
 
-		if (atomic_read(&inode->i_count))
-			continue;
-
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
Chao Peng April 17, 2023, 2:37 p.m. UTC | #24
On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a list of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> > > > private/shared is handled by mmu_seq.  Maybe a dedicated function only
> > > > for kvm_faultin_pfn() is needed.
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > > the lockdep assertion is safe; for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > > 
> > > I'll get a fixed version pushed out today.
> > > 
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> > 
> > Now I have TDX KVM working. I've uploaded it at the following locations.
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> Took longer than expected to get the memslot restrictions sussed out.  I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.

Hi Sean,

In case you have started working on the code again: I have a branch [1],
originally planned as the v11 candidate, in which I believe I addressed all
the discussions we had for v10 except the very latest one [2], and
integrated all the newly added selftests from Ackerley and myself. The
branch was based on your original upm_base_support and then rebased onto
your kvm-x86/mmu head. Feel free to take anything you think is useful (most
of it is trivial, but there are also some bug fixes).

[1] https://github.com/chao-p/linux/commits/privmem-v11.6
[2] https://lore.kernel.org/all/20230413160405.h6ov2yl6l3i7mvsj@box.shutemov.name/

Chao
> 
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
> 
> Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.
Sean Christopherson April 17, 2023, 3:01 p.m. UTC | #25
On Mon, Apr 17, 2023, Chao Peng wrote:
> In case you have started working on the code again: I have a branch [1],
> originally planned as the v11 candidate, in which I believe I addressed all
> the discussions we had for v10 except the very latest one [2], and
> integrated all the newly added selftests from Ackerley and myself. The
> branch was based on your original upm_base_support and then rebased onto
> your kvm-x86/mmu head. Feel free to take anything you think is useful (most
> of it is trivial, but there are also some bug fixes).

Nice!  I am going to work on splicing together the various series this week, I'll
make sure to grab your work.

Thanks much!
Sean Christopherson April 17, 2023, 3:40 p.m. UTC | #26
What do y'all think about renaming "restrictedmem" to "guardedmem"?

I want to start referring to the code/patches by its syscall/implementation name
instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
and not just the non-KVM code, and (c) will likely be confusing for future reviewers
since there's nothing in the code that mentions "UPM" in any way.

But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
already used to refer to "reserved memory".

Renaming the syscall to "guardedmem"...

  1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
     the kernel (see "reserved memory" above).
 
  2. Provides a stronger hint as to its purpose.  "Restricted" conveys that the
     allocated memory is limited in some way, but doesn't capture how the memory
     is restricted, e.g. "restricted" could just as easily mean that the allocation
     can be restricted to certain types of backing stores or something.  "Guarded"
     on the other hand captures that the memory has extra defenses of some form.

  3. Is shorter to type and speak.  Work smart, not hard :-)

  4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
     "guest" when reading mail and whatnot.


P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
     spamming folks that don't (yet) care about this stuff with another potentially
     lengthy thread.  Feel free to add (back) any people/lists.
David Hildenbrand April 17, 2023, 3:48 p.m. UTC | #27
On 17.04.23 17:40, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?

Yeay, let's add more confusion :D

If we're at renaming, I'd appreciate if we could find a terminology that 
does look/sound less horrible.

> 
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
> 
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
> 
> Renaming the syscall to "guardedmem"...

restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
Sean Christopherson April 17, 2023, 4:40 p.m. UTC | #28
On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> > 
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> > 
> > Renaming the syscall to "guardedmem"...
> 
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

I'm definitely open to other suggestions, but I suspect it's going to be difficult
to be more precise than something like "guarded".

E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
just not via mmap().  And it's not just about mappings, e.g. read() and its many
variants are all disallowed too, despite the kernel direct map still being live
(modulo SNP requirements).
David Hildenbrand April 17, 2023, 5:09 p.m. UTC | #29
On 17.04.23 18:40, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> 
> I'm definitely open to other suggestions, but I suspect it's going to be difficult
> to be more precise than something like "guarded".

Guardedmem is just as bad as restrictedmem IMHO, sorry.


Restricted: what's restricted? how does the restriction manifest? 
secretmem also has its restrictions/limitations (pinning), why does
that one not fall under the same category?

Make a stranger guess what "restrictedmem" is and I can guarantee that 
it has nothing to do with the concept we're introducing here.


Guarded: what's guarded? From whom? For which purpose? How does the 
"guarding" manifest?

Again, make a stranger guess what "guardedmem" is and I can guarantee 
that it has nothing to do with the concept we're introducing here.

If, at all, the guess might be "guarded storage" [1] on s390x, which, of 
course, has nothing to do with the concept here. (storage on s390x is 
just the dinosaur slang for memory)


Often, if we fail to find a good name, the concept is either unclear or 
not well defined.

So what are the characteristics we want to generalize under that new 
name? We want to have an fd, that

(a) cannot be mapped into user space (mmap)
(b) cannot be accessed using ordinary system calls (read/write)
(c) can still be managed like other fds (fallocate, future NUMA
     policies?)
(d) can be consumed by some special entities that are allowed to
     read/write/map.

So the fd content is inaccessible using the ordinary POSIX syscalls. 
It's only accessible by special entities (e.g., KVM).

Most probably I am forgetting something. But maybe that will help to 
find a more expressive name. Maybe :)
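
To make (a)-(d) concrete, here is a hypothetical userspace probe of such an
fd; memfd_restricted() is the syscall this series adds (invoked raw, as
there is no libc wrapper yet), and the flags/sizes are purely illustrative:

  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[16];
          /* __NR_memfd_restricted comes from this series' header changes. */
          int fd = syscall(__NR_memfd_restricted, 0);

          /* (a) cannot be mapped into user space */
          assert(mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED);
          /* (b) cannot be accessed using ordinary syscalls */
          assert(read(fd, buf, sizeof(buf)) < 0);
          /* (c) can still be managed like other fds */
          assert(fallocate(fd, 0, 0, 4096) == 0);
          /* (d) only special entities (KVM) consume the contents, via a
           * memslot carrying an <fd, offset> pair instead of an hva */
          return 0;
  }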

> 
> E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
> just not via mmap().  And it's not just about mappings, e.g. read() and its many
> variants are all disallowed too, despite the kernel direct map still being live
> (modulo SNP requirements).
> 

[1] https://man.archlinux.org/man/s390_guarded_storage.2.en
Sean Christopherson April 17, 2023, 6:17 p.m. UTC | #30
On Mon, Apr 17, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its
> > > syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to
> > > the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing
> > > for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows
> > > that "rmem" is

Your mail client appears to be wrapping too aggressively and mangling quotes.  I'm
guessing gmail is to blame?

> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
> 
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
> > > me ...
> 
> > I'm definitely open to other suggestions, but I suspect it's going to be
> > difficult
> > to be more precise than something like "guarded".
> 
> > E.g. we discussed "unmappable" at one point, but the memory can still be
> > mapped,
> > just not via mmap().  And it's not just about mappings, e.g. read() and
> > its many
> > variants are all disallowed too, despite the kernel direct map still
> > being live
> > (modulo SNP requirements).
> 
> I'm for renaming the concept because restrictedmem is quite a
> mouthful. :)
> 
> How about "concealedmem" or "obscuredmem" to highlight the idea of this
> memory being hidden/unreadable/unmappable from userspace?

I'm hesitant to use something like "concealed" because it's too close to secretmem,
e.g. might be misconstrued as concealing the memory from anything _but_ the process
that creates the file.

Obscured has similar problems, and obscure often suggests that something is unclear,
as opposed to outright unreachable.

The other aspect of hidden/concealed/etc is that the memory isn't necessarily
concealed from the user.  Long term, I hope to get to the point where even "normal"
VMs use restricted/guarded/??? memory, e.g. to guard (heh) against _unintentional_
access from userspace.  In that use case, the memory isn't truly concealed, especially
if the user is both the "admin" and the consumer.

Though by that argument, "guarded" is also a poor choice.  And now I'm remembering
how we ended up with "restricted"...

> Guarded is better than restricted but doesn't really highlight how/in
> what way it is being guarded.

Ya, though in practice I think it's infeasible for us to get a name that is both
precise and succinct.
Sean Christopherson April 17, 2023, 7:16 p.m. UTC | #31
On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 18:40, Sean Christopherson wrote:
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > > 
> > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > > > already used to refer to "reserved memory".
> > > > 
> > > > Renaming the syscall to "guardedmem"...
> > > 
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> > 
> > I'm definitely open to other suggestions, but I suspect it's going to be difficult
> > to be more precise than something like "guarded".
> 
> Guardedmem is just as bad as restrictedmem IMHO, sorry.
> 
> 
> Restricted: what's restricted? how does the restriction manifest? secretmem
> also has its restrictions/limitations (pinning), why does that one not fall
> under the same category?
> 
> Make a stranger guess what "restrictedmem" is and I can guarantee that it
> has nothing to do with the concept we're introducing here.
> 
> 
> Guarded: what's guarded? From whom? For which purpose? How does the
> "guarding" manifest?

I completely agree that "guarded" lacks precision, but as above, I've pretty much
given up hope of being super precise.  I actually like "restricted", I just don't
like that I can't shorten the name.

Hmm, maybe that won't be a huge problem in practice.  I can't say I've ever heard
anyone use "rmem" in verbal or written communication; it's primarily just "rmem" in
code that we can't use, and I don't mind having to use restrictedmem for the namespace.
So maybe we can use "rmem", just not in code?

Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
to be how I refer to it in my internal dialogue if we keep "restricted" :-)

> Again, make a stranger guess what "guardedmem" is and I can guarantee that
> it has nothing to do with the concept we're introducing here.
> 
> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
> course, has nothing to do with the concept here.

Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.

> (storage on s390x is just the dinosaur slang for memory)
> 
> 
> Often, if we fail to find a good name, the concept is either unclear or not
> well defined.
> 
> So what are the characteristics we want to generalize under that new name?
> We want to have an fd, that
> 
> (a) cannot be mapped into user space (mmap)
> (b) cannot be accessed using ordinary system calls (read/write)
> (c) can still be managed like other fds (fallocate, future NUMA
>     policies?)
> (d) can be consumed by some special entities that are allowed to
>     read/write/map.
> 
> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> only accessible by special entities (e.g., KVM).
> 
> Most probably I am forgetting something. But maybe that will help to find a
> more expressive name. Maybe :)

Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
and depending on the use case, the memory may not actually be concealed from the
user that controls the VMM.

Restricted - "rmem" collides with "reserved memory" in code.

Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.

Inaccessible - Many of the same problems as "hidden".

Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
the memory is still mappable, just not via mmap().

Secured - I'm not getting anywhere near this one :-)
Fuad Tabba April 18, 2023, 8:53 a.m. UTC | #32
On Mon, Apr 17, 2023 at 8:16 PM Sean Christopherson <seanjc@google.com> wrote:
 ....

> > So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> > only accessible by special entities (e.g., KVM).
> >
> > Most probably I am forgetting something. But maybe that will help to find a
> > more expressive name. Maybe :)
>
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
>
> Restricted - "rmem" collides with "reserved memory" in code.
>
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>
> Inaccessible - Many of the same problems as "hidden".
>
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
>
> Secured - I'm not getting anywhere near this one :-)

How about "protected" ;)? _ducks_

To me the name doesn't matter much, but fwiw I have developed a liking
to "restricted", more than the previous "private", since of all of the
one-word suggestions I think it captures most of what it's trying to
do.

Cheers,
/fuad
David Hildenbrand April 18, 2023, 9:10 a.m. UTC | #33
On 17.04.23 21:16, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 18:40, Sean Christopherson wrote:
>>> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> I'm definitely open to other suggestions, but I suspect it's going to be difficult
>>> to be more precise than something like "guarded".
>>
>> Guardedmem is just as bad as restrictedmem IMHO, sorry.
>>
>>
>> Restricted: what's restricted? how does the restriction manifest? secretmem
>> also has its restrictions/limitations (pinning), why does that one not fall
>> under the same category?
>>
>> Make a stranger guess what "restrictedmem" is and I can guarantee that it
>> has nothing to do with the concept we're introducing here.
>>
>>
>> Guarded: what's guarded? From whom? For which purpose? How does the
>> "guarding" manifest?
> 
> I completely agree that "guarded" lacks precision, but as above, I've pretty much
> given up hope of being super precise.  I actually like "restricted", I just don't
> like that I can't shorten the name.
> 
> Hmm, maybe that won't be a huge problem in practice.  I can't say I've ever heard
> any use "rmem" in verbale or written communication, it's primarily just "rmem" in
> code that we can't use, and I don't mind having to use restrictedmem for the namespace.
> So maybe we can use "rmem", just not in code?
> 
> Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
> to be how I refer to it in my internal dialogue if we keep "restricted" :-)

:)

> 
>> Again, make a stranger guess what "guardedmem" is and I can guarantee that
>> it has nothing to do with the concept we're introducing here.
>>
>> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
>> course, has nothing to do with the concept here.
> 
> Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.
> 
>> (storage on s390x is just the dinosaur slang for memory)
>>
>>
>> Often, if we fail to find a good name, the concept is either unclear or not
>> well defined.
>>
>> So what are the characteristics we want to generalize under that new name?
>> We want to have an fd, that
>>
>> (a) cannot be mapped into user space (mmap)
>> (b) cannot be accessed using ordinary system calls (read/write)
>> (c) can still be managed like other fds (fallocate, future NUMA
>>      policies?)
>> (d) can be consumed by some special entities that are allowed to
>>      read/write/map.
>>
>> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
>> only accessible by special entities (e.g., KVM).
>>
>> Most probably I am forgetting something. But maybe that will help to find a
>> more expressive name. Maybe :)
> 
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
> 
> Restricted - "rmem" collides with "reserved memory" in code.
> 
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
> 
> Inaccessible - Many of the same problems as "hidden".
> 
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
> 
> Secured - I'm not getting anywhere near this one :-)

The thing about "secretmem" that I kind-of like (a little) is that it 
explains what it's used for: storing secrets. We don't call it 
"unmapped" memory because we unmap it from the directmap or "unpinnable" 
memory or "inaccessible" memory ... or even "restricted" because it has 
restrictions ... how the secrets are protected is kind of an 
implementation detail.

So instead of describing *why*/*how* restrictedmem is the weird kid 
(restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather 
describe what it's used for?

I know, I know, "there are other use cases where it will be used outside 
of VM context". I really don't care. "memfd_vm" / "vm_mem" would be sooo 
(feel free to add some more o's here) much easier to get. It's a special 
fd to be used to back VM memory. Depending on the VM type 
(encrypted/protected/whatever), restrictions might apply (not able to 
mmap, not able to read/write ...). For example, there really is no need 
to disallow mmap/read/write when using that memory to back a simple VM 
where all we want to do is avoid user-space page tables.
Sean Christopherson April 19, 2023, 12:47 a.m. UTC | #34
On Tue, Apr 18, 2023, David Hildenbrand wrote:
> On 17.04.23 21:16, Sean Christopherson wrote:
> > Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> > and depending on the use case, the memory may not actually be concealed from the
> > user that controls the VMM.
> > 
> > Restricted - "rmem" collides with "reserved memory" in code.
> > 
> > Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
> > 
> > Inaccessible - Many of the same problems as "hidden".
> > 
> > Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> > the memory is still mappable, just not via mmap().
> > 
> > Secured - I'm not getting anywhere near this one :-)
> 
> The thing about "secretmem" that I kind-of like (a little) is that it
> explains what it's used for: storing secrets. We don't call it "unmapped"
> memory because we unmap it from the directmap or "unpinnable" memory or
> "inaccessible" memory ... or even "restricted" because it has restrictions
> ... how the secrets are protected is kind of an implementation detail.
> 
> So instead of describing *why*/*how* restrictedmem is the weird kid
> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
> describe what it's used for?
> 
> I know, I know, "there are other use cases where it will be used outside of
> VM context". I really don't care.

Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)

> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> much easier to get. It's a special fd to be used to back VM memory. Depending
> on the VM type (encrypted/protected/whatever), restrictions might apply (not
> able to mmap, not able to read/write ...). For example, there really is no
> need to disallow mmap/read/write when using that memory to back a simple VM
> where all we want to do is avoid user-space page tables.

In seriousness, I do agree with Jason's very explicit objection[2] against naming
a non-KVM uAPI "guest", or any variation thereof.

An alternative that we haven't considered since the very early days is making the
uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
code for "pure shim" implementation[3], that's actually not that crazy of an idea.

I don't know that I love the idea of burying this in KVM, but there are benefits
to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
that I created).

The big benefit is that the layer of indirection goes away.  That simplifies things
like enhancing restrictedmem to allow host userspace access for debug purposes,
batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
access, likely the whole "share with a device" story if/when we get there, etc.

The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
that's likely to be true in practice regardless of where the code lands, and
(b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
someone reinventing a similar solution.

If we can get Gunyah on board and they don't need substantial changes to the
restrictedmem implementation, then I'm all for continuing on the path we're on.
But if Gunyah wants to do their own thing, and the lightweight shim approach is
viable, then it's awfully tempting to put this all behind a KVM ioctl().

[1] https://lore.kernel.org/all/df11d753-6242-8f7c-cb04-c095f68b41fa@redhat.com
[2] https://lore.kernel.org/all/20211123171723.GD5112@ziepe.ca
[3] https://lore.kernel.org/all/ZDiCG%2F7OgDI0SwMR@google.com
[4] https://lore.kernel.org/all/Y%2FkI66qQFJJ6bkTq@google.com
[5] https://lore.kernel.org/all/20230304010632.2127470-13-quic_eberman@quicinc.com
David Hildenbrand April 19, 2023, 7:21 a.m. UTC | #35
On 19.04.23 02:47, Sean Christopherson wrote:
> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>> On 17.04.23 21:16, Sean Christopherson wrote:
>>> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
>>> and depending on the use case, the memory may not actually be concealed from the
>>> user that controls the VMM.
>>>
>>> Restricted - "rmem" collides with "reserved memory" in code.
>>>
>>> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>>>
>>> Inaccessible - Many of the same problems as "hidden".
>>>
>>> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
>>> the memory is still mappable, just not via mmap().
>>>
>>> Secured - I'm not getting anywhere near this one :-)
>>
>> The thing about "secretmem" that I kind-of like (a little) is that it
>> explains what it's used for: storing secrets. We don't call it "unmapped"
>> memory because we unmap it from the directmap or "unpinnable" memory or
>> "inaccessible" memory ... or even "restricted" because it has restrictions
>> ... how the secrets are protected is kind of an implementation detail.
>>
>> So instead of describing *why*/*how* restrictedmem is the weird kid
>> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
>> describe what it's used for?
>>
>> I know, I know, "there are other use cases where it will be used outside of
>> VM context". I really don't care.
> 
> Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)
> 
>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>> much easier to get. It's a special fd to be used to back VM memory. Depending
>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>> able to mmap, not able to read/write ...). For example, there really is no
>> need to disallow mmap/read/write when using that memory to back a simple VM
>> where all we want to do is avoid user-space page tables.
> 
> In seriousness, I do agree with Jason's very explicit objection[2] against naming
> a non-KVM uAPI "guest", or any variation thereof.

While I agree, it's all better than the naming we use right now ...


Let me throw "tee_mem" / "memfd_tee" into the picture. That could 
eventually catch what we want to have.

Or "coco_mem" / "memfd_coco".

Of course, both expect that people know the terminology (just like what 
"vm" stands for), but it's IMHO significantly better than 
restricted/guarded/opaque/whatsoever.

Again, expresses what it's used for, not why it behaves in weird ways.


> 
> An alternative that we haven't considered since the very early days is making the
> uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
> code for "pure shim" implementation[3], that's actually not that crazy of an idea.

Yes.

> 
> I don't know that I love the idea of burying this in KVM, but there are benefits
> to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> that I created).

Yes, it's all better than jumping through hoops to come up with a bad 
name like "restrictedmem".

> 
> The big benefit is that the layer of indirection goes away.  That simplifies things
> like enhancing restrictedmem to allow host userspace access for debug purposes,
> batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> access, likely the whole "share with a device" story if/when we get there, etc.
> 
> The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> that's likely to be true in practice regardless of where the code lands, and

Yes.

> (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> someone reinventing a similar solution.

I agree. But if it's as simple as providing an ioctl for that hypervisor 
that simply wires up the existing implementation, it's not too bad.

> 
> If we can get Gunyah on board and they don't need substantial changes to the
> restrictedmem implementation, then I'm all for continuing on the path we're on.
> But if Gunyah wants to do their own thing, and the lightweight shim approach is
> viable, then it's awfully tempting to put this all behind a KVM ioctl().

Right. Or we still succeed in finding a name that's not as bad as what 
we had so far.
Sean Christopherson April 19, 2023, 3:17 p.m. UTC | #36
On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > On Tue, Apr 18, 2023, David Hildenbrand wrote:
> > > "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> > > much easier to get. It's a special fd to be used to back VM memory. Depending
> > > on the VM type (encrypted/protected/whatever), restrictions might apply (not
> > > able to mmap, not able to read/write ...). For example, there really is no
> > > need to disallow mmap/read/write when using that memory to back a simple VM
> > > where all we want to do is avoid user-space page tables.
> > 
> > In seriousness, I do agree with Jason's very explicit objection[2] against naming
> > a non-KVM uAPI "guest", or any variation thereof.
> 
> While I agree, it's all better than the naming we use right now ...
> 
> 
> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
> catch what we want to have.
> 
> Or "coco_mem" / "memfd_coco".
> 
> Of course, both expect that people know the terminology (just like what "vm"
> stands for), but it's IMHO significantly better than
> restricted/guarded/opaque/whatsoever.
> 
> Again, expresses what it's used for, not why it behaves in weird ways.

I don't want to explicitly tie this to trusted execution or confidential compute,
as there is value in backing "normal" guests with memory that cannot be accessed
by the host userspace without jumping through a few extra hoops, e.g. to add a
layer of protection against data corruption due to host userspace bugs.

> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
> 
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.

Yeah, my mind was wandering in this direction too.  The absolute worst case
scenario seems to be that we do end up creating a generic syscall that is a
superset of KVM's functionality, in which case KVM would end up with an ioctl()
that is just a redirect/wrapper.
David Hildenbrand April 19, 2023, 3:27 p.m. UTC | #37
On 19.04.23 17:17, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, David Hildenbrand wrote:
>> On 19.04.23 02:47, Sean Christopherson wrote:
>>> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>>>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>>>> much easier to get. It's a special fd to be used to back VM memory. Depending
>>>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>>>> able to mmap, not able to read/write ...). For example, there really is no
>>>> need to disallow mmap/read/write when using that memory to back a simple VM
>>>> where all we want to do is avoid user-space page tables.
>>>
>>> In seriousness, I do agree with Jason's very explicit objection[2] against naming
>>> a non-KVM uAPI "guest", or any variation thereof.
>>
>> While I agree, it's all better than the naming we use right now ...
>>
>>
>> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
>> catch what we want to have.
>>
>> Or "coco_mem" / "memfd_coco".
>>
>> Of course, both expect that people know the terminology (just like what "vm"
>> stands for), but it's IMHO significantly better than
>> restricted/guarded/opaque/whatsoever.
>>
>> Again, expresses what it's used for, not why it behaves in weird ways.
> 
> I don't want to explicitly tie this to trusted execution or confidential compute,
> as there is value in backing "normal" guests with memory that cannot be accessed
> by the host userspace without jumping through a few extra hoops, e.g. to add a
> layer of protection against data corruption due to host userspace bugs.

Nothing speaks against using tee_mem for the same purpose I guess. I 
like the sound of it after all. :)
Sean Christopherson April 22, 2023, 1:33 a.m. UTC | #38
+Christian and Hugh

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > An alternative that we haven't considered since the very early days is making the
> > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
> > code for "pure shim" implementation[3], that's actually not that crazy of an idea.
> 
> Yes.
> 
> > 
> > I don't know that I love the idea of burying this in KVM, but there are benefits
> > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> > that I created).
> 
> Yes, it's all better than jumping through hoops to come up with a bad name
> like "restrictedmem".
> 
> > 
> > The big benefit is that the layer of indirection goes away.  That simplifies things
> > like enhancing restrictedmem to allow host userspace access for debug purposes,
> > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> > access, likely the whole "share with a device" story if/when we get there, etc.
> > 
> > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> > that's likely to be true in practice regardless of where the code lands, and
> 
> Yes.
> 
> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
> 
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.
> 
> > 
> > If we can get Gunyah on board and they don't need substantial changes to the
> > restrictedmem implementation, then I'm all for continuing on the path we're on.
> > But if Gunyah wants to do their own thing, and the lightweight shim approach is
> > viable, then it's awfully tempting to put this all behind a KVM ioctl().
> 
> Right. Or we still succeed in finding a name that's not as bad as what we
> had so far.

Okie dokie.  So as not to bury the lede:

I think we should provide a KVM ioctl() and have KVM provide its own low-level
implementation, i.e. not wrap shmem.  As much as I don't want to have KVM serving
up memory to userspace, by trying to keep KVM out of the memory management business,
we're actually making things *more* complex and harder to maintain (and merge).

Hugh said something to this effect quite early on[1], it just unfortunately took
me a long time to understand the benefits.

 : QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
 : the fallocate syscall interface itself) to allocate and free the memory,
 : ioctl for initializing some of it too.  KVM in control of whether that
 : fd can be read or written or mmap'ed or whatever, no need to prevent it
 : in shmem.c, no need for flags, seals, notifications to and fro because
 : KVM is already in control and knows the history.  If shmem actually has
 : value, call into it underneath - somewhat like SysV SHM, and /dev/zero
 : mmap, and i915/gem make use of it underneath.  If shmem has nothing to
 : add, just allocate and free kernel memory directly, recorded in your
 : own xarray.
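
As a sketch, one possible shape of the flow Hugh describes, using a
hypothetical VM-scoped KVM_CREATE_GUEST_MEMFD ioctl in place of a separate
"/dev/kvm_something" node (the struct and ioctl number are made up):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/falloc.h>
	#include <linux/kvm.h>

	/* Hypothetical uAPI, purely for illustration. */
	struct kvm_create_guest_memfd { __u64 size; __u64 flags; };
	#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

		struct kvm_create_guest_memfd args = { .size = 1ULL << 30 };
		int gmem = ioctl(vm, KVM_CREATE_GUEST_MEMFD, &args);

		/* Allocate and free the backing memory via fallocate(), with
		 * KVM in control; no read/write/mmap f_ops need ever exist. */
		fallocate(gmem, 0, 0, args.size);
		fallocate(gmem, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, args.size);
		return 0;
	}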

Christian also suggested that we stop trying to be lazy and create a proper API[2].

 : I find this hard to navigate tbh and it feels like taking a shortcut to
 : avoid building a proper api. If you only care about a specific set of
 : operations specific to memfd restricted that need to be available to
 : in-kernel consumers, I wonder if you shouldn't just go one step further
 : than your proposal below and build a dedicated minimal ops api. Idk,
 : sketching like a madman on a drawing board here with no claim to
 : feasibility from a mm perspective whatsoever

The key point from Hugh is that, except for a few minor things that are trivially
easy to replicate, the things that make shmem "shmem" don't provide any value for
the KVM use case:

  - We have no plans to support swap, and migration support is dubious at best.
    Swap in particular brings in a lot of complexity for no benefit (to us).  That
    complexity doesn't mean depending on shmem is inherently bad, but it does mean
    rolling our own implementation is highly unlikely to result in reinventing
    shmem's wheel.

  - Sharing a backing file between arbitrary processes is _unwanted_ for the initial
    use cases.  There may come a time when mutually trusted VMs can share "private"
    data, but (a) that's a distant future problem and (b) it'll likely require even
    more KVM control over the memory.

  - All of the interfaces for read/write/mmap/etc. are dead weight from our
    perspective.  Even worse, we have to actively avoid those interfaces.  That
    can kinda sorta be done without polluting shmem code by using a shim, but
    that has problems of its own (see below).

And Christian pointed out several flaws with wrapping shmem:

  - Implementing a partial overlay filesystem leads to inconsistencies because
    only some of the ops are changed, e.g. poking at the inode_operations or
    super_operations will show shmem stuff, whereas address_space_operations and
    file_operations will show restrictedmem stuff.


  - Usurping just f_ops and a_ops without creating a full blown filesystem
    avoids the partial overlay issues, but partially overriding ops isn't really
    supported (because it's weird and brittle), e.g. blindly forwarding a
    fallocate() after restrictedmem does its thing "works", but only because
    we very carefully and deliberately designed restrictedmem on top of shmem.

On top of the points raised by Hugh and Christian, wrapping shmem isn't really
any simpler, just different.  E.g. there's a very subtle bug in the shim variant:
by passing SGP_WRITE, shmem skips zeroing the page because restrictedmem is telling
shmem that it (restrictedmem) will immediately write the page.  For TDX and SNP,
that's a "feature" because in those cases trusted firmware will zero (or init)
memory when it's assigned to the guest, but it's a nasty flaw for other use cases.
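
Paraphrasing the shim's fetch path (shmem_get_folio() and SGP_WRITE are
the real shmem interface; the surrounding code is illustrative):

	struct folio *folio;

	/* SGP_WRITE == "caller will overwrite the page", so shmem skips
	 * zeroing it.  Fine when trusted firmware inits the page (TDX/SNP),
	 * but stale data leaks through for any other use case. */
	int err = shmem_get_folio(inode, index, &folio, SGP_WRITE);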

I'm not saying that we'll magically avoid such bugs by avoiding shmem, just pointing
out that using shmem requires understanding exactly how shmem works, i.e. using
shmem isn't necessarily any easier than building directly on filemap and/or folio
APIs.  And I gotta imagine it will be a similar story if/when we want to add
hugetlbfs support.  Building on filemap/folio will definitely have its own challenges,
but after prototyping an implementation I can confidently say that none of the
problems will be insurmountable.  KVM has _way_ more complex code in its memslot
and MMU implementations.

And another benefit of building on filemap and/or folio APIs is that, because we
would be reinventing the wheel to some extent, when we do inevitably run into
problems, it will be easier to get help solving those problems because (a) we won't
be doing weird things no one wants to deal with and (b) because the problems will
likely be things others have already dealt with.

The other angle I've been looking at is whether or not having KVM provide its
own implementation will lead to maintenance problems in the future, specifically
if we get to the point where we want to support "fancy" things like swap and
migration.  For migration, I'm quite confident that a dedicated KVM ioctl() versus
wrapping shmem would be at worst a wash, and more than likely simpler if KVM owns
everything.  E.g. migrating pages under TDX and SNP requires invoking magic
instructions, and so we'd be overriding ->migrate_folio() no matter what.

As for swap, I think we should put a stake in the ground and say that KVM will
never support swap for KVM's ioctl().  Sooo much of the infrastructure around
swap/reclaim is tied to userspace mappings, e.g. knowing which pages are LRU and/or
cold.  I poked around a bit to see how we could avoid reinventing all of that
infrastructure for fd-only memory, and the best idea I could come up with is
basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
allow "mapping" fd-only memory, but ensure that memory is never actually present
from hardware's perspective.

But on top of the various problems with that approach, the only use cases I can
think of for using fd-only memory to back non-confidential VMs are to guard against
spurious writes/reads to guest memory and/or to avoid the memory overhead of
mapping guest memory into the user page tables.  Avoiding memory overhead is completely defeated
if the page tables are populated PROT_NONE, which just leaves the "harden against
guest data corruption use case".  And for that specific use case, _if_ swap is
desirable, it would be far, far easier to teach the kernel to allow KVM to follow
PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel and/or
KVM how to swap fd-only memory.

In other words, fd-only memory is purely for slice-of-hardware VMs.  If someone
wants to overcommit VMs, then they use KVM's traditional API for mapping memory
into the guest.

Regarding putting this in KVM, as mentioned (way) above in the quote, the worst
case scenario of making this a KVM ioctl() seems to be that KVM will end up with
an ioctl() that redirects to common kernel code.  On the flip side, implementing
this in KVM gives us a much clearer path to getting all of this merged.  There will
be a few small non-KVM patches, but they will either be trivial, e.g. exporting
APIs (that aren't contentious) or not strictly required, e.g. adding a flag to
mark an address space as completely unmigratable (for improved performance).  I.e.
the non-KVM patches will be small and be very actionable for their maintainers,
and other than that we control our own destiny.

And on the topic of control, making this a KVM ioctl() and implementation gives
KVM a _lot_ of control.  E.g. we can make the file size immutable to simplify the
implementation, bind the fd to a single VM at creation time, add KVM-defined flags
for controlling hugepage behavior, etc.
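
In uAPI terms, that degree of control could look something like this
(a hypothetical sketch, not a posted interface; the flag name is made up):

	/* Issued on a VM fd => the file is bound to that VM at creation. */
	struct kvm_create_guest_memfd {
		__u64 size;		/* immutable: no ftruncate() after creation */
		__u64 flags;		/* e.g. a made-up KVM_GUEST_MEMFD_HUGEPAGE */
		__u64 reserved[6];	/* room for future KVM-defined knobs */
	};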

Last but not least, I think forcing us KVM folks to get our hands at least a little
dirty with MM and FS stuff would be a *good* thing.  KVM has had more than a few
bugs that we missed in no small part because most of the people that work on KVM
have almost zero knowledge of MM and FS, and especially at the boundaries between
those two.

Note, I implemented the POC on top of the filemap APIs because that was much faster
and less error prone than re-implementing xarray management myself.  We may or may
not want to actually make the kernel at large aware of these allocations, i.e. it
may be better to follow Hugh's suggestion and use the folio APIs directly instead
of piggybacking filemap+folios.  E.g. I _think_ migration becomes a complete non-issue
if the pages are "raw" allocations and never get marked as LRU-friendly.  What I'm
not so sure about is if there is anything substantial that we'd lose out on by not
reusing the filemap stuff.

The POC also doesn't play nice with Ackerley's patches to allocate files on a
user-defined mount.  AIUI, that's largely a non-issue as something like fbind()
would provide a superset of that functionality and more than one maintainer has
expressed (handwavy) support for a generic fbind().

Code is available here if folks want to take a look before any kind of formal
posting:

	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

[1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
[2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
[3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Jarkko Sakkinen April 23, 2023, 1:14 p.m. UTC | #39
On Mon Apr 17, 2023 at 6:40 PM EEST, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
>
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
>
> Renaming the syscall to "guardedmem"...
>
>   1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
>      the kernel (see "reserved memory above").
>  
>   2. Provides a stronger hint as to its purpose.  "Restricted" conveys that the
>      allocated memory is limited in some way, but doesn't capture how the memory
>      is restricted, e.g. "restricted" could just as easily mean that the allocation
>      can be restricted to certain types of backing stores or something.  "Guarded"
>      on the other hand captures that the memory has extra defenses of some form.
>
>   3. Is shorter to type and speak.  Work smart, not hard :-)
>
>   4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
>      "guest" when reading mail and whatnot.
>
>
> P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
>      spamming folks that don't (yet) care about this stuff with another potentially
>      lengthy thread.  Feel free to add (back) any people/lists.

I guess 'guarded' could be a good noun in the sense that it does not
easily get mixed up with anything pre-existing, and it does give an idea
of the purpose.

BR, Jarkko
Jarkko Sakkinen April 23, 2023, 1:28 p.m. UTC | #40
On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> Yeay, let's add more confusion :D
>
> If we're at renaming, I'd appreciate if we could find a terminology that 
> does look/sound less horrible.
>
> > 
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> > 
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> > 
> > Renaming the syscall to "guardedmem"...
>
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

In the world of TEEs and confidential computing it is fairly common to
call memory areas enclaves, even outside the SGX context. So in that sense
enclave memory would be the most correct terminology.

BR, Jarkko
David Hildenbrand May 5, 2023, 8 p.m. UTC | #41
On 23.04.23 15:28, Jarkko Sakkinen wrote:
> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>
>> Yeay, let's add more confusion :D
>>
>> If we're at renaming, I'd appreciate if we could find a terminology that
>> does look/sound less horrible.
>>
>>>
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> 
> In the world of TEEs and confidential computing it is fairly common to
> call memory areas enclaves, even outside the SGX context. So in that sense
> enclave memory would be the most correct terminology.

I was also thinking along the lines of isolated_mem or imem ... 
essentially, isolated from (unprivileged) user space.

... if we still want to have a common syscall for it.
Sean Christopherson May 6, 2023, 12:55 a.m. UTC | #42
On Fri, May 05, 2023, Ackerley Tng wrote:
> 
> Hi Sean,
> 
> Thanks for implementing this POC!
> 
> I’ve started porting the selftests (both Chao’s and those I added [1]).
> 
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
> 
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
>   so that guest mem shows up separately on /proc/meminfo and similar
>   places.
> 
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
>    VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
> 
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.
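
In selftest-style C, that simpler sequence is roughly the following (the
creation ioctl and struct are assumptions based on the POC branch):

	int kvm = open("/dev/kvm", O_RDWR);
	int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

	/* Assumed POC interface: create a guest mem fd on the VM. */
	struct kvm_create_guest_memfd args = { .size = 2UL << 20 };
	int gmem = ioctl(vm, KVM_CREATE_GUEST_MEMFD, &args);

	close(vm);	/* last reference dropped: the VM is destroyed... */

	/* ...but gmem->kvm still points at the dead VM, and the PUNCH_HOLE
	 * invalidation path dereferences it => use-after-free. */
	fallocate(gmem, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, args.size);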

Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
when the last binding is being removed.  A similar (also ugly) solution would be
to nullify gmem->kvm when KVM dies.

I don't love either approach idea because it means a file created in the context
of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
that it can't do anything with except close().  I doubt that matters in practice
though, e.g. when the VM dies, all memory can be freed so that the file ends up
being little more than a shell.  And if we go that route, there's no need to grab
a reference to the file during bind, KVM can just grab a longterm reference when
the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).

Blech, another wart is that I believe gmem would need to do __module_get() during
file creation to prevent kvm.ko from being unloaded after the last VM dies.  Ah,
but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
it's not _that_ ugly.

Exchanging references (at binding or at creation) doesn't work, because that
creates a circular dependency, i.e. gmem and KVM would pin each other. 

A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
nasty because of how KVM's memslots work.  The least awful approach I can think of
would be to delete the associated memslot(s) when the file is released, possibly
via deferred work to avoid deadlock issues.  Not the prettiest thing ever and in
some ways that'd yield an even worse ABI.

Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away.  The easy
solution for that one is to add another synchronize_srcu_expedited() after unbinding.
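
In sketch form (kvm_gmem_unbind() is the POC's helper; the surrounding
code paraphrases the memslot DELETE/MOVE path):

	kvm_swap_active_memslots(kvm, as_id);
	/* kvm_swap_active_memslots() already does one
	 * synchronize_srcu_expedited(&kvm->srcu), but that happens before... */
	kvm_gmem_unbind(old_slot);
	/* ...the unbind, so readers may still be in
	 * kvm_gmem_invalidate_begin().  The easy fix: */
	synchronize_srcu_expedited(&kvm->srcu);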

> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
and not a common syscall.  If the file isn't tightly coupled to a single VM, then
punching a hole is further complicated by needing to deal with invalidating multiple
regions that are bound to different @kvm instances.  It's not super complex, but
AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
having one VM own the memory will complicate things even if/when we get to the point where
VMs can share "private" memory, and the gmem code would still need to deal with
grabbing a module reference.
Vishal Annapurve May 6, 2023, 1:17 a.m. UTC | #43
On Fri, May 5, 2023 at 5:55 PM Sean Christopherson <seanjc@google.com> wrote:
>
> ...
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall.  If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances.  It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate things even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

Copyless migration would be a scenario where "private" memory may need
to be shared between source and target VMs depending on how migration
support is implemented.

Regards,
Vishal
Vlastimil Babka May 6, 2023, 7:44 a.m. UTC | #44
On 5/5/23 22:00, David Hildenbrand wrote:
> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>
>>> Yeay, let's add more confusion :D
>>>
>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>> does look/sound less horrible.
>>>
>>>>
>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>
>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>> already used to refer to "reserved memory".
>>>>
>>>> Renaming the syscall to "guardedmem"...
>>>
>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>> 
>> In the world of TEEs and confidential computing it is fairly common to
>> call memory areas enclaves, even outside the SGX context. So in that sense
>> enclave memory would be the most correct terminology.
> 
> I was also thinking along the lines of isolated_mem or imem ... 
> essentially, isolated from (unprivileged) user space.
> 
> ... if we still want to have a common syscall for it.

I'm a fan of the ioctl, if it has a chance of working out.
David Hildenbrand May 6, 2023, 9:16 a.m. UTC | #45
On 06.05.23 09:44, Vlastimil Babka wrote:
> On 5/5/23 22:00, David Hildenbrand wrote:
>> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>>
>>>> Yeay, let's add more confusion :D
>>>>
>>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>>> does look/sound less horrible.
>>>>
>>>>>
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> In the world of TEEs and confidential computing it is fairly common to
>>> call memory areas enclaves, even outside the SGX context. So in that sense
>>> enclave memory would be the most correct terminology.
>>
>> I was also thinking along the lines of isolated_mem or imem ...
>> essentially, isolated from (unprivileged) user space.
>>
>> ... if we still want to have a common syscall for it.
> 
> I'm fan of the ioctl, if it has a chance of working out.
Yes, me too.
Chao Peng May 9, 2023, 12:44 p.m. UTC | #46
On Fri, May 05, 2023 at 07:39:36PM +0000, Ackerley Tng wrote:
> 
> Hi Sean,
> 
> Thanks for implementing this POC!
> 
> I’ve started porting the selftests (both Chao’s and those I added [1]).

Hi Sean/Ackerley,

Thanks for doing that. Overall making gmem a KVM ioctl() looks good to
me and it should also play nice with Intel TDX. Besides what Ackerley
mentioned below, I think we haven't discussed device assignment, which
will need to be supported in the not-too-distant future. The current
VFIO_IOMMU_MAP_DMA consumes a virtual address, so that needs to be fixed
for fd-based memory anyway, and the fix looks unrelated to whether this
is a syscall() or a KVM ioctl(). There will be some initialization
sequence dependency, e.g. if gmem ends up as a VM-scoped ioctl() then we
need the VM created before we can map fd-based memory in VFIO, but that
doesn't sound like an issue at all.

I also see Vlastimil/David expressed their preference for the ioctl. So
maybe we can move forward with your current PoC. Do you already have a plan to
post a formal version?

Chao

> 
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
> 
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
>   so that guest mem shows up separately on /proc/meminfo and similar
>   places.
> 
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
>    VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
> 
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.
> 
> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?
> 
> [1]
> https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
> 
> Ackerley
Vishal Annapurve May 10, 2023, 5:26 p.m. UTC | #47
On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> ...
> cold.  I poked around a bit to see how we could avoid reinventing all of that
> infrastructure for fd-only memory, and the best idea I could come up with is
> basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> allow "mapping" fd-only memory, but ensure that memory is never actually present
> from hardware's perspective.
>

I am most likely missing a lot of context here and possibly venturing
into an infeasible/already-shot-down direction. But I would still
like to get this discussed before we move on.

I am wondering if it would make sense to implement the
restricted_mem/guest_mem file to expose both private and shared memory
regions, in line with Kirill's original proposal, now that the file
implementation is controlled by KVM.

Thinking from userspace perspective:
1) Userspace creates guest mem files and is able to mmap them, but all
accesses to these files result in faults as no memory is allowed to
be mapped into userspace VMM pagetables.
2) Userspace registers mmaped HVA ranges with KVM with additional
KVM_MEM_PRIVATE flag
3) Userspace converts memory attributes and this memory conversion
allows userspace to access shared ranges of the file because those are
allowed to be faulted in from guest_mem. Shared to private conversion
unmaps the file ranges from userspace VMM pagetables.
4) Granularity of userspace pagetable mappings for shared ranges will
have to be dictated by KVM guest_mem file implementation.

Caveat here is that once private pages are mapped into userspace view.

Benefits here:
1) Userspace view remains consistent while still being able to use HVA ranges
2) It would be possible to use HVA based APIs from userspace to do
things like binding.
3) Double allocation wouldn't be a concern since hva ranges and gpa
ranges possibly map to the same HPA ranges.

>
> Code is available here if folks want to take a look before any kind of formal
> posting:
>
>         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
>
> [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Vishal Annapurve May 10, 2023, 8:23 p.m. UTC | #48
On Wed, May 10, 2023 at 10:26 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
>
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already-shot-down direction. But I would still
> like to get this discussed before we move on.
>
> I am wondering if it would make sense to implement the
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, in line with Kirill's original proposal, now that the file
> implementation is controlled by KVM.
>
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them, but all
> accesses to these files result in faults as no memory is allowed to
> be mapped into userspace VMM pagetables.
> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
>
> Caveat here is that once private pages are mapped into userspace view.
>
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.
>

I realize now that VFIO IOMMU mappings are not associated with the
userspace MMU in any way. So this approach does leave the gap of not
being able to handle modifications of IOMMU mappings when HVA mappings
are modified in userspace page tables. I guess this could be a good
enough reason to give up on this option.


> >
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Sean Christopherson May 10, 2023, 9:39 p.m. UTC | #49
On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
> 
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already-shot-down direction.

Both :-)

> But I would still like to get this discussed before we move on.
> 
> I am wondering if it would make sense to implement the
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, in line with Kirill's original proposal, now that the file
> implementation is controlled by KVM.
> 
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them, but all
> accesses to these files result in faults as no memory is allowed to
> be mapped into userspace VMM pagetables.

Never mapping anything into the userspace page table is infeasible.  Technically
it's doable, but it'd effectively require all of the work of an fd-based approach
(and probably significantly more), _and_ it'd require touching core mm code.

VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
RWX protections, etc.  But a VMA doesn't track the physical address, that info
is all managed through the userspace page tables.

To make it possible to allow userspace to mmap() but not access memory (without
redoing how the kernel fundamentally manages virtual=>physical mappings), the
simplest approach is to install PTEs into userspace page tables, but never mark
them Present in hardware, i.e. prevent actually accessing the backing memory.
This is exactly what Kirill's series in link [3] below implemented.
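
The net userspace-visible model of [3] resembles a PROT_NONE mapping that
only KVM is taught to follow (a sketch; guest_fd and size are placeholders):

	/* The VMA exists, so KVM can resolve hva=>pfn through it, but the
	 * PTEs are never made Present for the CPU... */
	void *hva = mmap(NULL, size, PROT_NONE, MAP_SHARED, guest_fd, 0);

	/* ...so any direct access from the VMM faults. */
	((char *)hva)[0] = 1;	/* => SIGSEGV */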

Issues that led to us abandoning the "map with special !Present PTEs" approach:

 - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
   is inefficient and inflexible compared to software defined structures, especially
   for the expected use cases for CoCo guests.

 - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
   let alone a 1:1 pfn:gfn mapping.
 
 - Does not work for memory that isn't backed by 'struct page', e.g. if devices
   gain support for exposing encrypted memory regions to guests.

 - Poking into the VMAs to convert memory would likely be less performant due
   to using infrastructure that is much "heavier", e.g. would require taking
   mmap_lock for write.

In short, shoehorning this into mmap() requires fighting how the kernel works at
pretty much every step, and in the end, adding e.g. fbind() is a lot easier.

> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
> 
> Caveat here is that once private pages are mapped into userspace view.
> 
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.

#3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
and then the guest converts that memory from shared to private, the backing pages
for the previously shared mapping will still be mapped by process B unless userspace
ensures that process B also unmaps on conversion.

#3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
and possibly even if the guest converts a few MiB of memory.

> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Vishal Annapurve May 10, 2023, 11:03 p.m. UTC | #50
On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 10, 2023, Vishal Annapurve wrote:
> > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > ...
> > > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > > infrastructure for fd-only memory, and the best idea I could come up with is
> > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > > from hardware's perspective.
> > >
> >
> > I am most likely missing a lot of context here and possibly venturing
> > into an infeasible/already shot down direction here.
>
> Both :-)
>
> > But I would still like to get this discussed here before we move on.
> >
> > I am wondering if it would make sense to implement the
> > restricted_mem/guest_mem file to expose both private and shared memory
> > regions, in line with Kirill's original proposal, now that the file
> > implementation is controlled by KVM.
> >
> > Thinking from userspace perspective:
> > 1) Userspace creates guest mem files and is able to mmap them, but all
> > accesses to these files result in faults as no memory is allowed to
> > be mapped into userspace VMM pagetables.
>
> Never mapping anything into the userspace page table is infeasible.  Technically
> it's doable, but it'd effectively require all of the work of an fd-based approach
> (and probably significantly more), _and_ it'd require touching core mm code.
>
> VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
> things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
> anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> RWX protections, etc.  But a VMA doesn't track the physical address, that info
> is all managed through the userspace page tables.
>
> To make it possible to allow userspace to mmap() but not access memory (without
> redoing how the kernel fundamentally manages virtual=>physical mappings), the
> simplest approach is to install PTEs into userspace page tables, but never mark
> them Present in hardware, i.e. prevent actually accessing the backing memory.
> This is exactly what Kirill's series in link [3] below implemented.
>

Maybe it's simpler to do when mmaped regions are backed with files.

I see that shmem has fault handlers for accesses to VMA regions
associated with the files. In theory, a file implementation can always
choose not to allocate physical pages for such faults (similar to the
F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).
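
As a very rough sketch (all names here are hypothetical, nothing from the
posted code; filemap_fault() just stands in for the shared-path handler):

  static vm_fault_t gmem_fault(struct vm_fault *vmf)
  {
          struct file *file = vmf->vma->vm_file;

          /* Never populate userspace mappings of private ranges. */
          if (gmem_range_is_private(file, vmf->pgoff))
                  return VM_FAULT_SIGBUS;

          /* Shared ranges fault in through the regular page cache path. */
          return filemap_fault(vmf);
  }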

> Issues that led to us abandoning the "map with special !Present PTEs" approach:
>
>  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
>    is inefficient and inflexible compared to software defined structures, especially
>    for the expected use cases for CoCo guests.
>
>  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
>    let alone a 1:1 pfn:gfn mapping.

Maybe KVM can ensure that each page of the guest_mem file is
associated with a single memslot. HVAs, when they are registered, can be
associated with offsets into guest_mem files.

>
>  - Does not work for memory that isn't backed by 'struct page', e.g. if devices
>    gain support for exposing encrypted memory regions to guests.
>
>  - Poking into the VMAs to convert memory would likely be less performant due
>    to using infrastructure that is much "heavier", e.g. would require taking
>    mmap_lock for write.

Converting memory doesn't necessarily need to poke holes into VMAs; it
can instead just unmap pagetables, like what happens when mmaped files
are hole-punched to free the backing file offsets.

>
> In short, shoehorning this into mmap() requires fighting how the kernel works at
> pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
>
> > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > KVM_MEM_PRIVATE flag
> > 3) Userspace converts memory attributes and this memory conversion
> > allows userspace to access shared ranges of the file because those are
> > allowed to be faulted in from guest_mem. Shared to private conversion
> > unmaps the file ranges from userspace VMM pagetables.
> > 4) Granularity of userspace pagetable mappings for shared ranges will
> > have to be dictated by KVM guest_mem file implementation.
> >
> > Caveat here is the risk that private pages end up mapped into the userspace view.
> >
> > Benefits here:
> > 1) Userspace view remains consistent while still being able to use HVA ranges
> > 2) It would be possible to use HVA based APIs from userspace to do
> > things like binding.
> > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > ranges possibly map to the same HPA ranges.
>
> #3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
> and then the guest converts that memory from shared to private, the backing pages
> for the previously shared mapping will still be mapped by process B unless userspace
> ensures that process B also unmaps on conversion.
>

This should ideally be handled by something like unmap_mapping_range().
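
E.g. something along these lines on a shared->private conversion (purely
illustrative; offset/len describe the converted file range):

  /*
   * Zap userspace PTEs for the range in every process that mmaped the
   * file; even_cows=1 also zaps private COW copies of the pages.
   */
  unmap_mapping_range(file->f_mapping, offset, len, 1);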

> #3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
> the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> and possibly even if the guest converts a few MiB of memory.

Maybe this caveat can be lived with, as shared ranges most likely will
not be backed by 1G pages anyway, possibly causing IO performance to
take a hit. This likely needs more discussion about the conversion
granularity used by guests.

>
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > >
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Sean Christopherson May 11, 2023, 8:22 p.m. UTC | #51
On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > > But I would still like to get this discussed here before we move on.
> > >
> > > I am wondering if it would make sense to implement
> > > restricted_mem/guest_mem file to expose both private and shared memory
> > > regions, in line with Kirill's original proposal now that the file
> > > implementation is controlled by KVM.
> > >
> > > Thinking from userspace perspective:
> > > 1) Userspace creates guest mem files and is able to mmap them but all
> > > accesses to these files result in faults as no memory is allowed to
> > > be mapped into userspace VMM pagetables.
> >
> > Never mapping anything into the userspace page table is infeasible.  Technically
> > it's doable, but it'd effectively require all of the work of an fd-based approach
> > (and probably significantly more), _and_ it'd require touching core mm code.
> >
> > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> > the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
> > things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
> > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> > RWX protections, etc.  But a VMA doesn't track the physical address, that info
> > is all managed through the userspace page tables.
> >
> > To make it possible to allow userspace to mmap() but not access memory (without
> > redoing how the kernel fundamentally manages virtual=>physical mappings), the
> > simplest approach is to install PTEs into userspace page tables, but never mark
> > them Present in hardware, i.e. prevent actually accessing the backing memory.
> > This is exactly what Kirill's series in link [3] below implemented.
> >
> 
> Maybe it's simpler to do when mmaped regions are backed with files.
> 
> I see that shmem has fault handlers for accesses to VMA regions
> associated with the files. In theory, a file implementation can always
> choose not to allocate physical pages for such faults (similar to the
> F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).

Ah, you're effectively suggesting a hybrid model where the file is the single
source of truth for what's private versus shared, and KVM gets pfns through
direct communication with the backing store via the file descriptor, but userspace
can still control things via mmap() and friends.

If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
then we're back at Kirill's series, because otherwise there's no easy way for KVM
to retrieve the pfn.

A form of this was also discussed, though I don't know how much of the discussion
happened on-list.
 
KVM actually does something like this for s390's Ultravisor (UV), which is quite
a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
much more gracefully.  Specifically, when the untrusted host attempts to access a
secure page, a fault occurs and the kernel responds by telling UV to export the
page.  The fault is gracefully handled even for kernel accesses
(see do_secure_storage_access()).  The kernel does BUG() if the export fails when
handling fault from kernel context, but my understanding is that export can fail
if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
success, and goes straight to BUG()/panic() if something goes wrong.

On the guest side, accesses to exported (swapped) secure pages generate intercepts
and KVM faults in the page.  To do so, KVM freezes the page/folio refcount, tells
the UV to import the page, and then unfreezes the page/folio.  But very crucially,
when _anything_ in the untrusted host attempts to access the secure page, the
above fault handling for untrusted host accesses kicks in.  In other words, the
guest can cause thrash, but can't bring down the host.
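
For reference, the import side looks roughly like this (a simplified sketch
modeled on gmap_make_secure() in arch/s390/kernel/uv.c, not verbatim code):

  /* Freeze the refcount so nothing else can grab the page... */
  if (!page_ref_freeze(page, expected_refs))
          return -EBUSY;
  /* ...ask the Ultravisor to import (re-secure) the page... */
  rc = uv_call(0, (unsigned long)&uvcb);
  /* ...then unfreeze so normal refcounting resumes. */
  page_ref_unfreeze(page, expected_refs);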

TDX on the other hand silently poisons memory, i.e. doesn't even generate a
synchronous fault.  Thus the kernel needs to be 100% perfect on preventing _any_
accesses to private memory from the host, and doing that is non-trivial and
invasive.

SNP does synchronously fault, but automatically converting in the #PF handler
got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
guest can trigger concurrent RMP modifications.  So the end result ends up being
the same as TDX, host accesses need to be completely prevented.

Again, this is all doable, but costly.  And IMO, provides very little value.

Allowing things like mbind() is nice-to-have at best.  Implementing fbind()
isn't straightforward, though it's arguably valuable to have irrespective of this
discussion, e.g. to allow userspace to say "use this policy regardless of what
process maps the file".
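
For reference, an fbind() would presumably mirror mbind(2), but keyed on a
file range instead of a virtual address range.  Purely hypothetical shape,
no such syscall exists today:

  long fbind(int fd, loff_t offset, loff_t len, unsigned long mode,
             const unsigned long *nodemask, unsigned long maxnode,
             unsigned int flags);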

Using a common memory pool (same physical page is used for both shared and private)
is a similar story.  There are plenty of existing controls to limit userspace/guest
memory usage and to deal with OOM scenarios, so barring egregious host accounting
and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
is that a VM is terminated because host userspace is buggy.  On the flip side, using
a common pool brings complexity into the kernel, as backing stores would need to
be taught to deny access to a subset of pages in their mappings, and in multiple
paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

[*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

> > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> >
> >  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> >    is inefficient and inflexible compared to software defined structures, especially
> >    for the expected use cases for CoCo guests.
> >
> >  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> >    let alone a 1:1 pfn:gfn mapping.
> 
> Maybe KVM can ensure that each page of the guest_mem file is
> associated with a single memslot.

This is a hard NAK.  Guest physical address space is guaranteed to have holes
and/or be discontiguous, e.g. for the PCI hole at the top of lower memory.  Allowing
only a single binding would prevent userspace from backing all (or large chunks)
of guest memory with a single file.

> HVAs, when they are registered, can be associated with offsets into guest_mem files.

Enforcing 1:1 associations is doable if KVM inserts a shim/interposer, e.g. essentially
implements the exclusivity bits of restrictedmem.  But that's adding even more
complexity.

> >  - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> >    gain support for exposing encrypted memory regions to guests.
> >
> >  - Poking into the VMAs to convert memory would likely be less performant due
> >    to using infrastructure that is much "heavier", e.g. would require taking
> >    mmap_lock for write.
> 
> Converting memory doesn't necessarily need to poke holes into VMAs; it
> can instead just unmap pagetables, like what happens when mmaped files
> are hole-punched to free the backing file offsets.

Sorry, bad choice of word on my part.  I didn't intend to imply poking holes; in
this case I used "poking" to mean "modifying".  munmap(), mprotect(), etc. all
require modifying VMAs, which means taking mmap_lock for write.

> > In short, shoehorning this into mmap() requires fighting how the kernel works at
> > pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
> >
> > > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > > KVM_MEM_PRIVATE flag
> > > 3) Userspace converts memory attributes and this memory conversion
> > > allows userspace to access shared ranges of the file because those are
> > > allowed to be faulted in from guest_mem. Shared to private conversion
> > > unmaps the file ranges from userspace VMM pagetables.
> > > 4) Granularity of userspace pagetable mappings for shared ranges will
> > > have to be dictated by KVM guest_mem file implementation.
> > >
> > > Caveat here is the risk that private pages end up mapped into the userspace view.
> > >
> > > Benefits here:
> > > 1) Userspace view remains consistent while still being able to use HVA ranges
> > > 2) It would be possible to use HVA based APIs from userspace to do
> > > things like binding.
> > > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > > ranges possibly map to the same HPA ranges.
> >
> > #3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
> > and then the guest converts that memory from shared to private, the backing pages
> > for the previously shared mapping will still be mapped by process B unless userspace
> > ensures that process B also unmaps on conversion.
> >
> 
> This should ideally be handled by something like unmap_mapping_range().

That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
not for a generic VMA-based implementation.  If the file isn't the single source
of truth, then forcing all mappings to go away simply can't work.
 
> > #3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
> > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > and possibly even if the guest converts a few MiB of memory.
> 
> Maybe this caveat can be lived with, as shared ranges most likely will
> not be backed by 1G pages anyway, possibly causing IO performance to
> take a hit. This likely needs more discussion about the conversion
> granularity used by guests.

Yes, it's not the end of the world.  My point is that separating shared and private
memory provides more flexibility.  Maybe that flexibility never ends up being
super important, but at the same time we shouldn't willingly paint ourselves into
a corner.
Michael Roth May 12, 2023, 12:21 a.m. UTC | #52
On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> 
> Code is available here if folks want to take a look before any kind of formal
> posting:
> 
> 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

Hi Sean,

I've been working on getting the SNP patches ported to this but I'm having
some trouble working out a reasonable scheme for how to work the
RMPUPDATE hooks into the proposed design.

One of the main things is kvm_gmem_punch_hole(): this can free pages
back to the host whenever userspace feels like it. Pages that are still
marked private in the RMP table will blow up the host if they aren't returned
to the normal state before handing them back to the kernel. So I'm trying to
add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
e.g.:

  static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
                                  loff_t len)
  {
          struct kvm_gmem *gmem = file->private_data;
          pgoff_t start = offset >> PAGE_SHIFT;
          pgoff_t end = (offset + len) >> PAGE_SHIFT;
          struct kvm *kvm = gmem->kvm;
  
          /*
           * Bindings must be stable across invalidation to ensure the start+end
           * are balanced.
           */
          filemap_invalidate_lock(file->f_mapping);
          kvm_gmem_invalidate_begin(kvm, gmem, start, end);
  
          /* Handle arch-specific cleanups before releasing pages */
          kvm_arch_gmem_invalidate(kvm, gmem, start, end);
          truncate_inode_pages_range(file->f_mapping, offset, offset + len);
  
          kvm_gmem_invalidate_end(kvm, gmem, start, end);
          filemap_invalidate_unlock(file->f_mapping);
  
          return 0;
  }

But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
the page in its intended state in the RMP table prior to mapping it into the
guest's NPT. Currently I'm calling that hook via
kvm_vm_ioctl_set_mem_attributes(), just after kvm->mem_attr_array is updated
based on the ioctl. The reasoning there is that KVM MMU can then rely on the
existing mmu_invalidate_seq logic to ensure both the state in the
mem_attr_array and the RMP table are in sync and up-to-date once MMU lock is
acquired and MMU is ready to map it, or retry #NPF otherwise.

But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
result in something like the following sequence if I implement things as above:

  CPU0: kvm_gmem_punch_hole():
          kvm_gmem_invalidate_begin()
          kvm_arch_gmem_invalidate()         // set pages to default/shared state in RMP table before free'ing
  CPU1: kvm_vm_ioctl_set_mem_attributes():
          kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
  CPU0:   truncate_inode_pages_range()       // HOST BLOWS UP TOUCHING PRIVATE PAGES
          kvm_gmem_invalidate_end()

One quick and lazy solution is to rely on the fact that
kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
entire begin()/end() portion of the invalidation sequence, and to similarly
hold the kvm->slots_lock throughout the begin()/end() sequence in
kvm_gmem_punch_hole() to prevent any interleaving.

But I'd imagine overloading kvm->slots_lock is not the proper approach. Would
introducing a similar mutex to keep these operations grouped/atomic be
a reasonable approach to you, or should we be doing something else entirely
here?
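
Something like the following, where gmem->attr_lock is a hypothetical field
(not in your tree), taken by both kvm_gmem_punch_hole() and the
set-memory-attributes path across their begin()/end() windows:

          mutex_lock(&gmem->attr_lock);
          kvm_gmem_invalidate_begin(kvm, gmem, start, end);

          kvm_arch_gmem_invalidate(kvm, gmem, start, end);
          truncate_inode_pages_range(file->f_mapping, offset, offset + len);

          kvm_gmem_invalidate_end(kvm, gmem, start, end);
          mutex_unlock(&gmem->attr_lock);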

Keep in mind that RMP updates can't be done while holding kvm->mmu_lock
spinlock, because we also need to unmap pages from the directmap, which can
lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
need to work around.

Thanks!

-Mike

[1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a

> 
> [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Sean Christopherson May 12, 2023, 6:01 p.m. UTC | #53
On Thu, May 11, 2023, Michael Roth wrote:
> On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > 
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> > 
> > 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> 
> Hi Sean,
> 
> I've been working on getting the SNP patches ported to this but I'm having
> some trouble working out a reasonable scheme for how to work the
> RMPUPDATE hooks into the proposed design.
> 
> One of the main things is kvm_gmem_punch_hole(): this can free pages
> back to the host whenever userspace feels like it. Pages that are still
> marked private in the RMP table will blow up the host if they aren't returned
> to the normal state before handing them back to the kernel. So I'm trying to
> add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> e.g.:
> 
>   static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
>                                   loff_t len)
>   {
>           struct kvm_gmem *gmem = file->private_data;
>           pgoff_t start = offset >> PAGE_SHIFT;
>           pgoff_t end = (offset + len) >> PAGE_SHIFT;
>           struct kvm *kvm = gmem->kvm;
>   
>           /*
>            * Bindings must be stable across invalidation to ensure the start+end
>            * are balanced.
>            */
>           filemap_invalidate_lock(file->f_mapping);
>           kvm_gmem_invalidate_begin(kvm, gmem, start, end);
>   
>           /* Handle arch-specific cleanups before releasing pages */
>           kvm_arch_gmem_invalidate(kvm, gmem, start, end);
>           truncate_inode_pages_range(file->f_mapping, offset, offset + len);
>   
>           kvm_gmem_invalidate_end(kvm, gmem, start, end);
>           filemap_invalidate_unlock(file->f_mapping);
>   
>           return 0;
>   }
> 
> But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> the page in its intended state in the RMP table prior to mapping it into the
> guest's NPT.

IMO, this approach is wrong.  kvm->mem_attr_array is the source of truth for whether
userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
array has zero insight into the _host_ physical pages.  I.e. SNP shouldn't hook
kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
wrong.

A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
memslots, but it's a good analogy for KVM internals).  KVM needs to zap all mappings
for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
Same for changing attributes; unmap, but never map.

As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
establishing a new NPT mapping.  And mmu_invalidate_in_progress will reach '0' only
after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
can create new mappings only when both kvm->mem_attr_array and any relevant
guest_mem bindings have reached steady state.

That leaves the question of when/where to do RMP updates.  Off the cuff, I think
RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
the context of either (a) file truncation (make the page host owned, a.k.a. TDX reclaim)
or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
installing a shared mapping should _never_ need to touch the RMP because pages
coming from the normal (non-guest_mem) pool must already be host owned.
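
E.g. hand-waving at the exact arguments, something like this placement,
using the RMPUPDATE helpers from the proposed SNP host patches:

  /* (b) kvm_gmem_get_folio(): pfn is newly handed to the guest. */
  rc = rmp_make_private(pfn, gfn_to_gpa(gfn), level, asid, false);

  /* (a) file truncation: pfn is going back to the host. */
  rc = rmp_make_shared(pfn, level);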

> Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just
> after kvm->mem_attr_array is updated based on the ioctl. The reasoning there
> is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to
> ensure both the state in the mem_attr_array and the RMP table are in sync and
> up-to-date once MMU lock is acquired and MMU is ready to map it, or retry
> #NPF otherwise.
> 
> But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
> result in something like the following sequence if I implement things as above:
> 
>   CPU0: kvm_gmem_punch_hole():
>           kvm_gmem_invalidate_begin()
>           kvm_arch_gmem_invalidate()         // set pages to default/shared state in RMP table before free'ing
>   CPU1: kvm_vm_ioctl_set_mem_attributes():
>           kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
>   CPU0:   truncate_inode_pages_range()       // HOST BLOWS UP TOUCHING PRIVATE PAGES
>           kvm_gmem_invalidate_end()
> 
> One quick and lazy solution is to rely on the fact that
> kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
> entire begin()/end() portion of the invalidation sequence, and to similarly
> hold the kvm->slots_lock throughout the begin()/end() sequence in
> kvm_gmem_punch_hole() to prevent any interleaving.
> 
> But I'd imagine overloading kvm->slots_lock is not the proper approach. Would
> introducing a similar mutex to keep these operations grouped/atomic be
> a reasonable approach to you, or should we be doing something else entirely
> here?
> 
> Keep in mind that RMP updates can't be done while holding kvm->mmu_lock
> spinlock, because we also need to unmap pages from the directmap, which can
> lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> need to work around.
> 
> Thanks!
> 
> -Mike
> 
> [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> 
> > 
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Sean Christopherson May 15, 2023, 11:46 p.m. UTC | #54
On Fri, May 05, 2023, Sean Christopherson wrote:
> On Fri, May 05, 2023, Ackerley Tng wrote:
> > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> > not cleaned up, and hence it is possible to crash the host kernel in the
> > following way
> > 
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Create a memslot with the guest mem fd (hence binding the fd to the
> >    VM)
> > 4. Close/destroy the KVM VM
> > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> >    when it tries to do invalidation.
> > 
> > I then tried to clean up the gmem->kvm pointer during unbinding when the
> > KVM VM is destroyed.
> > 
> > That works, but then I realized there’s a simpler way to use the pointer
> > after freeing:
> > 
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Close/destroy the KVM VM
> > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> >    when it tries to do invalidation.
> > 
> > Perhaps binding should mean setting the gmem->kvm pointer in addition to
> > gmem->bindings. This makes binding and unbinding symmetric and avoids
> > the use-after-frees described above.
> 
> Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
> when the last binding is being removed.  A similar (also ugly) solution would be
> to nullify gmem->kvm when KVM dies.
> 
I don't love either idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close().  I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell.  And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> 
> Blech, another wart is that I believe gmem would need to do __module_get() during
> file creation to prevent kvm.ko from being unloaded after the last VM dies.  Ah,
> but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
> it's not _that_ ugly.
> 
> Exchanging references (at binding or at creation) doesn't work, because that
> creates a circular dependency, i.e. gmem and KVM would pin each other. 
> 
> A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
> nasty because of how KVM's memslots work.  The least awful approach I can think of
> would be to delete the associated memslot(s) when the file is released, possibly
> via deferred work to avoid deadlock issues.  Not the prettiest thing ever and in
> some ways that'd yield an even worse ABI.

Circling back to this.  Pending testing, the "proper" refcounting approach seems
to be the way to go.  KVM's existing memslots actually work this way, e.g. if
userspace does munmap() on an active memslot, KVM zaps any PTEs but the memslot
stays alive.  A similar approach can be taken here, the only wrinkle is that the
association between gmem and the memslot is stronger than between VMAs and memslots,
specifically KVM binds the file and not simply the file descriptor.  This is
necessary because not binding to an exact file would let userspace install a
different file at the file descriptor.

That's solvable without having to delete memslots though, e.g. by protecting the
file pointer in the memslot with RCU and directly bumping the refcount in the two
places where KVM needs to get at gmem (the file) via the memslot (unbind and
faulting in a page).  E.g.

  static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
  {
	struct file *file;

	rcu_read_lock();

	file = rcu_dereference(slot->gmem.file);
	if (file && !get_file_rcu(file))
		file = NULL;
	rcu_read_unlock();

	return file;
  }

The gotcha is that ->release could race with memslot deletion, as kvm_gmem_unbind()
won't be able to differentiate between "file was deleted" and "file is currently
being deleted".  That's easy enough to deal with though, kvm_gmem_release() can
take slots_lock to prevent the memslot from going away when nullifying and
invalidating ranges for the memslot.
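
E.g. a rough sketch, reusing the bindings xarray and slot->gmem.file from
above (error handling and the actual nullify+invalidate work are elided):

  static int kvm_gmem_release(struct inode *inode, struct file *file)
  {
	struct kvm_gmem *gmem = file->private_data;
	struct kvm_memory_slot *slot;
	unsigned long index;

	/* Block unbind(), i.e. memslot deletion, until release completes. */
	mutex_lock(&gmem->kvm->slots_lock);

	xa_for_each(&gmem->bindings, index, slot)
		rcu_assign_pointer(slot->gmem.file, NULL);
	synchronize_rcu();

	/* ... invalidate and zap ranges for each (former) binding ... */

	mutex_unlock(&gmem->kvm->slots_lock);
	return 0;
  }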

> Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
> call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
> i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away.  The easy
> solution for that one is to add another synchronize_srcu_expedited() after unbinding.

There's a bug here, but not the one I pointed out.  Acquiring kvm->srcu doesn't
provide any protection, the binding already has a pointer to the memslot, i.e.
isn't doing an SRCU-protected lookup in the memslots.  The actual protection is
provided by the filemap invalidate lock, which prevents unbinding a memslot until
all invalidations complete, i.e. acquiring kvm->srcu in the punch hole path is
completely unnecessary.
Vishal Annapurve May 19, 2023, 1:07 a.m. UTC | #55
On Thu, May 11, 2023 at 1:22 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> Ah, you're effectively suggesting a hybrid model where the file is the single
> source of truth for what's private versus shared, ad KVM gets pfns through
> direct communication with the backing store via the file descriptor, but userspace
> can still control things via mmap() and friends.
>
> If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
> then we're back at Kirill's series, because otherwise there's no easy way for KVM
> to retrieve the pfn.
>

Yeah, I was hinting towards using the backdoor, where KVM still gets
private pfns via HVAs.

> A form of this was also discussed, though I don't know how much of the discussion
> happened on-list.
>
> KVM actually does something like this for s390's Ultravisor (UV), which is quite
> a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
> much more gracefully.  Specifically, when the untrusted host attempts to access a
> secure page, a fault occurs and the kernel responds by telling UV to export the
> page.  The fault is gracefully handled even for kernel accesses
> (see do_secure_storage_access()).  The kernel does BUG() if the export fails when
> handling fault from kernel context, but my understanding is that export can fail
> if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
> success, and goes straight to BUG()/panic() if something goes wrong.
>
> On the guest side, accesses to exported (swapped) secure pages generate intercepts
> and KVM faults in the page.  To do so, KVM freezes the page/folio refcount, tells
> the UV to import the page, and then unfreezes the page/folio.  But very crucially,
> when _anything_ in the untrusted host attempts to access the secure page, the
> above fault handling for untrusted host accesses kicks in.  In other words, the
> guest can cause thrash, but can't bring down the host.
>

Yeah, this is very similar to what I was trying to propose, except in
this case the backing store, i.e. guest_mem, will have to leave the fault
unhandled for untrusted host accesses to private ranges of the
guest_mem file.

> TDX on the other hand silently poisons memory, i.e. doesn't even generate a
> synchronous fault.  Thus the kernel needs to be 100% perfect on preventing _any_
> accesses to private memory from the host, and doing that is non-trivial and
> invasive.
>
> SNP does synchronously fault, but automatically converting in the #PF handler
> got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
> guest can trigger concurrent RMP modifications.  So the end result ends up being
> the same as TDX, host accesses need to be completely prevented.
>
> Again, this is all doable, but costly.  And IMO, provides very little value.

With this hybrid approach, with backdoor access to pfns from KVM,
do we see a scenario where the host can bypass the guest_mem restrictions
and still be able to access the private ranges using HVA ranges? One
possibility is that these pages are mapped in the IOMMU (when they are
shared) and then get converted to private without getting unmapped
from IOMMU. Maybe KVM can disallow converting the ranges which are
pinned for DMA (not sure if there is a way to do that).

A few additional benefits here:
1) Possibly handle the pkvm usecase in this series without the need
for additional modifications.
2) Handling UPM for normal VMs possibly could get simpler as this
hybrid approach can allow preserving the contents across conversions.

>
> Allowing things like mbind() is nice-to-have at best.  Implementing fbind()
> isn't straightforward, though it's arguably valuable to have irrespective of this
> discussion, e.g. to allow userspace to say "use this policy regardless of what
> process maps the file".
>

Agreed, having mbind supported is not a significant gain given the cost here.

> Using a common memory pool (same physical page is used for both shared and private)
> is a similar story.  There are plenty of existing controls to limit userspace/guest
> memory usage and to deal with OOM scenarios, so barring egregious host accounting
> and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
> is that a VM is terminated because host userspace is buggy.  On the flip side, using
> a common pool brings complexity into the kernel, as backing stores would need to
> be taught to deny access to a subset of pages in their mappings, and in multiple
> paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

In this case the backing store that needs to be modified would just be
guest_mem though.

>
> [*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
>
> > > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> > >
> > >  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> > >    is inefficient and inflexible compared to software defined structures, especially
> > >    for the expected use cases for CoCo guests.
> > >
> > >  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> > >    let alone a 1:1 pfn:gfn mapping.
> >
> > Maybe KVM can ensure that each page of the guest_mem file is
> > associated with a single memslot.
>
> This is a hard NAK.  Guest physical address space is guaranteed to have holes
> and/or be discontiguous, e.g. for the PCI hole at the top of lower memory.  Allowing
> only a single binding would prevent userspace from backing all (or large chunks)
> of guest memory with a single file.
>

Poor choice of words from my side. I meant to suggest that KVM can
ensure that ANY page of the guest_mem file is associated with at most
one memslot.
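
E.g. a hypothetical check at bind time (xa_for_each_range() and
xa_store_range() are the real xarray APIs, the rest is illustrative):

  /* Any hit in the range means an overlapping binding already exists. */
  xa_for_each_range(&gmem->bindings, index, other, start, end - 1)
          return -EEXIST;

  xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);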

> ...
> That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
> not for a generic VMA-based implementation.  If the file isn't the single source
> of truth, then forcing all mappings to go away simply can't work.
>
> > > #3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
> > > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > > and possibly even if the guest converts a few MiB of memory.
> >
> > Maybe this caveat can be lived with, as shared ranges most likely will
> > not be backed by 1G pages anyway, possibly causing IO performance to
> > take a hit. This likely needs more discussion about the conversion
> > granularity used by guests.
>
> Yes, it's not the end of the world.  My point is that separating shared and private
> memory provides more flexibility.  Maybe that flexibility never ends up being
> super important, but at the same time we shouldn't willingly paint ourselves into
> a corner.

There are some performance implications here with the split approach.

This flexibility actually comes at the cost of having to manage double
allocation effectively. As the granularity of mappings increases for
the shared memory, it gets difficult to cap the amount of double
allocation. So effectively it comes down to always using a smaller
granularity for shared memory, and also for the private memory of
converted ranges. In general, performance requirements will always
push for higher mapping granularities depending on the scale of
usage.

In general, private memory (and also the shared memory on respective
conversions) will always need to be hole-punched to ensure that
double allocation won't happen. And so, even if this is something for
the future, using hugetlbfs pages to back private memory with the
split model effectively makes it impossible to cap the double
allocation. I am not sure if 1G pages can be handled better with
the hybrid model, but maybe it's worth checking.

The split shared/private mem approach also increases the uncertainties
around memory management in general, where the same amount of memory
which was available earlier is first freed to the system and then
allocated back from the system. E.g. even if hugepages were around
when private memory was initially allocated, further allocations keep
increasing the possibility of not being able to use a huge page to
back the memory, even if the whole huge page is private/shared.
Michael Roth May 22, 2023, 1:50 p.m. UTC | #56
On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> On Thu, May 11, 2023, Michael Roth wrote:
> > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > > 
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > > 
> > > 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > 
> > Hi Sean,
> > 
> > I've been working on getting the SNP patches ported to this but I'm having
> > some trouble working out a reasonable scheme for how to work the
> > RMPUPDATE hooks into the proposed design.
> > 
> > One of the main things is kvm_gmem_punch_hole(): this can free pages
> > back to the host whenever userspace feels like it. Pages that are still
> > marked private in the RMP table will blow up the host if they aren't returned
> > to the normal state before handing them back to the kernel. So I'm trying to
> > add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> > e.g.:
> > 
> >   static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> >                                   loff_t len)
> >   {
> >           struct kvm_gmem *gmem = file->private_data;
> >           pgoff_t start = offset >> PAGE_SHIFT;
> >           pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >           struct kvm *kvm = gmem->kvm;
> >   
> >           /*
> >            * Bindings must be stable across invalidation to ensure the start+end
> >            * are balanced.
> >            */
> >           filemap_invalidate_lock(file->f_mapping);
> >           kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> >   
> >           /* Handle arch-specific cleanups before releasing pages */
> >           kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> >           truncate_inode_pages_range(file->f_mapping, offset, offset + len);
> >   
> >           kvm_gmem_invalidate_end(kvm, gmem, start, end);
> >           filemap_invalidate_unlock(file->f_mapping);
> >   
> >           return 0;
> >   }
> > 
> > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> > the page in its intended state in the RMP table prior to mapping it into the
> > guest's NPT.
> 
> IMO, this approach is wrong.  kvm->mem_attr_array is the source of truth for whether
> userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
> array has zero insight into the _host_ physical pages.  I.e. SNP shouldn't hook
> kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
> wrong.
> 
> A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
> memslots, but it's a good analogy for KVM internals).  KVM needs to zap all mappings
> for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
> Same for changing attributes; unmap, but never map.
> 
> As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
> NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
> establishing a new NPT mapping.  And mmu_invalidate_in_progress will reach '0' only
> after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
> can create new mappings only when both kvm->mem_attr_array and any relevant
> guest_mem bindings have reached steady state.
> 
> That leaves the question of when/where to do RMP updates.  Off the cuff, I think
> RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
> the context of either (a) file truncation (make the page host owned, a.k.a. TDX reclaim)
> or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
> Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
> installing a shared mapping should _never_ need to touch the RMP because pages
> coming from the normal (non-guest_mem) pool must already be host owned.

Hi Sean, thanks for the suggestions.

I reworked things based on this approach and things seem to work out
pretty nicely for SNP.

I needed to add the hook to kvm_gmem_get_pfn() instead of
kvm_gmem_get_folio() because SNP needs to know the GFN in order to mark
the page as private in the RMP table, but otherwise I think things are
the same as what you had in mind. One downside to this approach is that since
the hook always gets called during kvm_gmem_get_pfn(), we need to do an
extra RMP lookup to determine whether or not that page has already been
set to private state, vs. being able to assume it's already been put in
the expected state, but it's only a memory access so not a huge
overhead. Not sure if that would be a concern or not on the TDX side
though.

I put together a tree with some fixups that are needed against the
kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
preparing the initial private state as suggested above, and a
platform-configurable mask that the x86 MMU code can use for determining
whether a fault is for private vs. shared pages.

  KVM: x86: Determine shared/private faults using a configurable mask
  ^ for TDX we could trivially add an inverted analogue of the mask/logic
  KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
  KVM: x86: Add platform hooks for private memory invalidations
  KVM: x86: Add platform hook for initializing private memory
  *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
  *fixup (kvm_gmem_solo): KVM: selftests: update kvm_create_guest_memfd struct usage

  https://github.com/mdroth/linux/commits/kvm_gmem_solo_x86

I'm hoping these are similarly usable for TDX, but could use some input
from TDX folks on that aspect.

> > 
> > Keep in mind that RMP updates can't be done while holding kvm->mmu_lock
> > spinlock, because we also need to unmap pages from the directmap, which can
> > lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> > need to work around.

This concern also ends up going away, since __GFP_RECLAIM allocations have similar
issues when done under kvm->mmu_lock, so having the hook in
kvm_gmem_get_pfn() sort of guarantees we wouldn't hit issues with this.

-Mike

> > 
> > Thanks!
> > 
> > -Mike
> > 
> > [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> > 
> > > 
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Sean Christopherson May 22, 2023, 5:09 p.m. UTC | #57
On Mon, May 22, 2023, Michael Roth wrote:
> On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Michael Roth wrote:
> I put together a tree with some fixups that are needed against the
> kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> preparing the initial private state as suggested above, and a
> platform-configurable mask that the x86 MMU code can use for determining
> whether a fault is for private vs. shared pages.
> 
>   KVM: x86: Determine shared/private faults using a configurable mask
>   ^ for TDX we could trivially add an inverted analogue of the mask/logic
>   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
>   KVM: x86: Add platform hooks for private memory invalidations

Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
being zapped is private.

That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
there's a way to cleanly handle that case without another hook.  I think it's worth
mapping out exactly what we want unbind() to look like anyways, e.g. right now the
code subtly relies on private memslots being immutable.

>   KVM: x86: Add platform hook for initializing private memory

This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

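Roughly like so (a sketch, not code from the POC; the argument list for
kvm_gmem_get_pfn() is approximate, and the RMP conversion helper is a
hypothetical stand-in):

	int r, max_order;

	if (fault->is_private) {
		r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
				     &fault->pfn, &max_order);
		if (r)
			return r;

		/* RMPUPDATE can sleep, so convert before taking mmu_lock. */
		r = snp_make_pfn_private(vcpu->kvm, fault->gfn, fault->pfn);
		if (r)
			return r;
	}

	read_lock(&vcpu->kvm->mmu_lock);
	/* ... map fault->pfn into the NPT ... */
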
The only reason to add another arch hook would be if we wanted to convert the
RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
userspace to effectively prefault guest memory, e.g. to setup the RMP before
running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
e.g. on the dest during live migration to reduce jitter.

>   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations

There was another bug in this path.  The math for handling a non-zero offset into
the file was wrong.  The code now looks like:

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + start - slot->gmem.index,
			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
			.slot = slot,
			.pte = __pte(0),
			.may_block = true,
		};

		if (WARN_ON_ONCE(start < slot->gmem.index ||
				 end > slot->gmem.index + slot->npages))
			continue;

		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);

		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
	}
Michael Roth May 22, 2023, 11:58 p.m. UTC | #58
On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> On Mon, May 22, 2023, Michael Roth wrote:
> > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > On Thu, May 11, 2023, Michael Roth wrote:
> > I put together a tree with some fixups that are needed against the
> > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > preparing the initial private state as suggested above, and a
> > platform-configurable mask that the x86 MMU code can use for determining
> > whether a fault is for private vs. shared pages.
> > 
> >   KVM: x86: Determine shared/private faults using a configurable mask
> >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> >   KVM: x86: Add platform hooks for private memory invalidations
> 
> Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> being zapped is private.

kvm_unmap_gfn_range() does, however, get called with kvm->mmu_lock held, so
it might be tricky to tie RMP updates into that path.

> 
> That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> there's a way to cleanly handle that case without another hook.  I think it's worth
> mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> code subtly relies on private memslots being immutable.

I thought the direction you were sort of driving at was to completely decouple
RMP updates for physical pages from the KVM MMU map/unmap paths since the
life-cycles of those backing pages and associated RMP state are somewhat
separate from the state of the GFNs and kvm->mem_attr_array. It seems to
make sense when dealing with things like this unbind() case.

There's also cases like userspaces that opt to not discard memory after
conversions because they highly favor performance over memory usage. In
those cases it would make sense to defer marking the pages as shared in
the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
KVM MMU invalidation path after a conversion.

> 
> >   KVM: x86: Add platform hook for initializing private memory
> 
> This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

I think that approach would work fine. The way I was thinking of things
is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
page before mapping it into the guest, so moving it out into an explicit
call should work just as well. That would also drop the need for the
__kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
need to access the PFN prior to making it private.

> 
> The only reason to add another arch hook would be if we wanted to convert the
> RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> userspace to effectively prefault guest memory, e.g. to setup the RMP before
> running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> e.g. on the dest during live migration to reduce jitter.

Agreed, deferring the RMPUPDATE until it's actually needed would give us
more flexibility on optimizing for things like lazy-acceptance.

For less-common scenarios like preallocation it makes sense to make that
an opt-in sort of thing for userspace to configure explicitly.

> 
> >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> 
> There was another bug in this path.  The math for handling a non-zero offset into
> the file was wrong.  The code now looks like:
> 
> 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> 		struct kvm_gfn_range gfn_range = {
> 			.start = slot->base_gfn + start - slot->gmem.index,

Sorry if I'm missing something here, but isn't there a risk that:

  start - slot->gmem.index

would be less than zero? E.g. starting GFN was 0, but current slot is bound
at some non-zero offset in the same gmem instance. I guess the warning below
should catch that, but it seems like a real scenario.

Since 'index' corresponds to the gmem offset of the current slot, is there any
reason not to do something like this?:

  .start = slot->base_gfn + index - slot->gmem.index,

But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
we can just simplify to this?:

  .start = slot->base_gfn,

-Mike

> 			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
> 			.slot = slot,
> 			.pte = __pte(0),
> 			.may_block = true,
> 		};
> 
> 		if (WARN_ON_ONCE(start < slot->gmem.index ||
> 				 end > slot->gmem.index + slot->npages))
> 			continue;
> 
> 		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> 
> 		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
> 	}
Sean Christopherson May 23, 2023, 12:21 a.m. UTC | #59
On Mon, May 22, 2023, Michael Roth wrote:
> On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> > On Mon, May 22, 2023, Michael Roth wrote:
> > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Michael Roth wrote:
> > > I put together a tree with some fixups that are needed for against the
> > > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > > preparing the initial private state as suggested above, and a
> > > platform-configurable mask that the x86 MMU code can use for determining
> > > whether a fault is for private vs. shared pages.
> > > 
> > >   KVM: x86: Determine shared/private faults using a configurable mask
> > >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > >   KVM: x86: Add platform hooks for private memory invalidations
> > 
> > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> > a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> > kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> > being zapped is private.
> 
> kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
> it might be tricky to tie RMP updates into that path.

Gah, I caught the mmu_lock issue before the end of my email, but forgot to go back
and rethink the first half.

> > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> > if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> > there's a way to cleanly handle that case without another hook.  I think it's worth
> > mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> > code subtly relies on private memslots being immutable.
> I thought the direction you were sort of driving at was to completely decouple
> RMP updates for physical pages from the KVM MMU map/unmap paths since the
> life-cycles of those backing pages and associated RMP state are somewhat
> separate from the state of the GFNs and kvm->mem_attr_array. It seems to
> make sense when dealing with things like this unbind() case.
> 
> There's also cases like userspaces that opt to not discard memory after
> conversions because they highly favor performance over memory usage. In
> those cases it would make sense to defer marking the pages as shared in
> the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
> KVM MMU invalidation path after a conversion.

Hmm, right.  I got overzealous in my desire to avoid new hooks.

> > >   KVM: x86: Add platform hook for initializing private memory
> > 
> > This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> > code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> > e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().
> 
> I think that approach would work fine. The way I was thinking of things
> is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
> page before mapping it into the guest, so moving it out into an explicit
> call should work just as well. That would also drop the need for the
> __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
> need to access the PFN prior to making it private.
> 
> > 
> > The only reason to add another arch hook would be if we wanted to converted the
> > RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> > waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> > userspace to effectively prefault guest memory, e.g. to setup the RMP before
> > running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> > e.g. on the dest during live migration to reduce jitter.
> 
> Agreed, deferring the RMPUPDATE until it's actually needed would give us
> more flexibility on optimizing for things like lazy-acceptance.
> 
> For less-common scenarios like preallocation it makes sense to make that
> an opt-in sort of thing for userspace to configure explicitly.
> 
> > 
> > >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> > 
> > There was another bug in this path.  The math for handling a non-zero offsets into
> > the file was wrong.  The code now looks like:
> > 
> > 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> > 		struct kvm_gfn_range gfn_range = {
> > 			.start = slot->base_gfn + start - slot->gmem.index,
> 
> Sorry if I'm missing something here, but isn't there a risk that:
> 
>   start - slot->gmem.index
> 
> would be less than zero? E.g. the starting GFN was 0, but the current slot is
> bound at some non-zero offset in the same gmem instance. I guess the warning
> below shouldn't catch that, but it seems like a real scenario.

Heh, only if there's a testcase for it.  Assuming start >= the slot offset does
seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
will have index==slot->gmem.index > start.

> Since 'index' corresponds to the gmem offset of the current slot, is there any
> reason not to do something like this?:
> 
>   .start = slot->base_gfn + index - slot->gmem.index,
> 
> But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
> we can just simplify to this?:
> 
>   .start = slot->base_gfn,

No, e.g. if start is partway through a memslot, there's no need to invalidate
the entire memslot.  I'll stare at this tomorrow when my brain is hopefully a
bit more functional, I suspect there is a min() and/or max() needed somewhere.
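
E.g. clamping both ends of the range to the slot's binding, something like
this (completely untested sketch against the kvm_gmem_solo code):

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.index;

		struct kvm_gfn_range gfn_range = {
			/* Only invalidate the overlap of [start, end) and the slot. */
			.start = slot->base_gfn + max(pgoff, start) - pgoff,
			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
			.slot = slot,
			.may_block = true,
		};
		...
	}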
Sean Christopherson June 6, 2023, 11:25 p.m. UTC | #60
On Tue, Jun 06, 2023, Ackerley Tng wrote:
> 
> I've ported selftests from Chao and me [1] while working on hugetlb support
> for guest_mem [2].
> 
> In the process, I found some bugs and have some suggestions for guest_mem.
> Please see separate commits at [3].
> 
> Here are some highlights/questions:
> 
> + "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
>     + Generally, the uptodate flag means that the contents of this page match the
>       backing store. Since gmem is memory-backed, does "uptodate" for gmem mean
>       "zeroed"?

Don't read too much into the code, my POC was very much a "beat on it until it
works" scenario.

> + "KVM: guest_mem: Don't re-mark accessed after getting a folio" and "KVM:
>   guest_mem: Don't set dirty flag for folio"
>     + Do we need to folio_mark_accessed(), when it was created with
>       FGP_ACCESSED?

Probably not.  And as you note below, it's all pretty nonsensical anyways.

>     + What is the significance of these LRU flags when gmem doesn't support
>       swapping/eviction?

Likely none.  I used the filemap APIs in my POC because it was easy, not because
it was necessarily the best approach, i.e. that the folios/pages show up in the
LRUs is an unwanted side effect, not a feature.  If guest_memfd only needs a small
subset of the filemap support, going with a true from-scratch implementation on
top of xarray might be cleaner overall, e.g. would avoid the need for a new flag
to say "this folio can't be migrated even though it's on the LRUs".

> + "KVM: guest_mem: Align so that at least 1 page is allocated"
>     + Bug in current implementation: without this alignment, fallocate() of
>       a size less than the gmem page size will result in no allocation at all

I'm not convinced this is a bug.  I don't see any reason to allow allocating and
punching holes in sub-page granularity.
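
I.e. rather than aligning the range the way shmem/hugetlbfs do, just reject
misaligned requests up front.  Sketch:

	static long kvm_gmem_fallocate(struct file *file, int mode,
				       loff_t offset, loff_t len)
	{
		/* PAGE_SIZE here would really be the gmem page size, e.g.
		 * the hugepage size for hugepage-backed gmem. */
		if (!IS_ALIGNED(offset, PAGE_SIZE) || !IS_ALIGNED(len, PAGE_SIZE))
			return -EINVAL;
		...
	}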

>     + Both shmem and hugetlbfs perform this alignment
> + "KVM: guest_mem: Add alignment checks"
>     + Implemented the alignment checks for guest_mem because hugetlb on gmem
>       would hit a BUG_ON without this check
> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
>     + Sean fixed a bug in the offset-to-gfn conversion in
>       kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

As Mike pointed out, there's likely still a bug here[*].  I was planning on
diving into that last week, but that never happened.  If you or anyone else can
take a peek and/or write a testcase, that would be awesome.

 : Heh, only if there's a testcase for it.  Assuming start >= the slot offset does
 : seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
 : will have index==slot->gmem.index > start.
 : 
 : > Since 'index' corresponds to the gmem offset of the current slot, is there any
 : > reason not to do something like this?:
 : >
 : >   .start = slot->base_gfn + index - slot->gmem.index,
 : >
 : > But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
 : > we can just simplify to this?:
 : >
 : >   .start = slot->base_gfn,
 : 
 : No, e.g. if start is partway through a memslot, there's no need to invalidate
 : the entire memslot.  I'll stare at this tomorrow when my brain is hopefully a
 : bit more functional, I suspect there is a min() and/or max() needed somewhere.

[*] https://lore.kernel.org/all/20230512002124.3sap3kzxpegwj3n2@amd.com

>     + Code will always hit WARN_ON_ONCE() when the entire file is closed and
>       all offsets are invalidated, so WARN_ON_ONCE() should be removed
>     + Vishal noticed that the conversion might result in an overflow, so I
>       fixed that
> + And of course, hugetlb support! Please let me know what you think of the
>   approach proposed at [2].
> 
> [1] https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
> [2] https://lore.kernel.org/lkml/cover.1686077275.git.ackerleytng@google.com/T/
> [3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
Sean Christopherson July 14, 2023, 7:29 p.m. UTC | #61
On Thu, Jul 13, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Fri, May 05, 2023, Ackerley Tng wrote:
> >>
> >> Hi Sean,
> >>
> >> Thanks for implementing this POC!
> >>
> >> ... snip ...
> >>
> >
> > I don't love either approach because it means a file created in the context
> > of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> > that it can't do anything with except close().  I doubt that matters in practice
> > though, e.g. when the VM dies, all memory can be freed so that the file ends up
> > being little more than a shell.  And if we go that route, there's no need to grab
> > a reference to the file during bind, KVM can just grab a longterm reference when
> > the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> >
> > ... snip ...
> >
> > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> > and not a common syscall.  If the file isn't tightly coupled to a single VM, then
> > punching a hole is further complicated by needing to deal with invalidating multiple
> > regions that are bound to different @kvm instances.  It's not super complex, but
> > AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> > having one VM own the memory will complicate things even if/when we get to the point where
> > VMs can share "private" memory, and the gmem code would still need to deal with
> > grabbing a module reference.
> 
> I’d like to follow up on this discussion about a guest_mem file
> outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.
> 
> Here's a POC of delayed binding of a guest_mem file to a memslot, where
> the guest_mem file outlives the VM [1].
> 
> I also hope to raise some points before we do the first integration of
> guest_mem patches!
> 
> A use case for guest_mem inodes outliving the VM is when the host VMM
> needs to be upgraded.

Translation: to support intra-host migration, a.k.a. KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM

> The guest_mem inode is passed between two VMs on the same host machine and
> all memory associated with the inode needs to be retained.
> 
> To support the above use case, binding of memslots is delayed until
> first use, so that the following inode passing flow can be used:
> 
> 1. Source (old version of host VMM) process passes guest_mem inodes to
>    destination (new version of host VMM) process via unix sockets.
> 2. Destination process initializes memslots identical to source process.
> 3. Destination process invokes ioctl to migrate guest_mem inode over to
>    destination process by unbinding all memslots from the source VM and
>    binding them to the destination VM. (The kvm pointer is updated in
>    this process)
> 
> Without delayed binding, step 2 will fail since initialization of
> memslots would check and find that the kvm pointer in the guest_mem
> inode points to the kvm in the source process.
> 
> 
> These two patches contain the meat of the changes required to support
> delayed binding:
> 
> https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
> https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df
> 
> Some things to highlight for the approach set out in these two patches:
> 
> 1. Previously, closing the guest_mem file in userspace is taken to mean
>    that all associated memory is to be removed and cleared. With these
>    two patches, each memslot also holds a reference to the file (and
>    therefore inode) and so even if the host VMM closes the fd, the VM
>    will be able to continue to function.
> 
>    This is desirable to userspace since closing the file should not be
>    interpreted as a command to clear memory.

100% agreed.  However, after more thought since we first discussed this, I think
that a deferred binding is the wrong way to solve this particular problem.  More
below.

>    This is aligned with the
>    way tmpfs files are used with KVM before guest_mem: when the file is
>    closed in userspace, the memory contents are still mapped and can
>    still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
>    should command memory to be removed, just like munmap() would be used
>    to remove memory from use by KVM.
>
> 2. Creating a guest mem file no longer depends on a specific VM and
>    hence the guest_mem creation ioctl can be a system ioctl instead of a
>    VM specific ioctl. This will also address Chao's concern at [3].

That concern is a non-issue for QEMU as memory backends are created after
accelerators are initialized, and AFAICT Google's VMM behaves similarly.

And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
to play nice with the required ordering.  If userspace can get at /dev/kvm, then
it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
the only dependencies for KVM_CREATE_VM should be known/resolved long before the
VMM knows it wants to use gmem.

Using a non-KVM syscall would eliminate any such dependencies, but in my very
strong opinion, that is not a good reason to go with a syscall.

> I also separated cleaning up files vs inodes in
> https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
> which I believe is more aligned with how files and inodes are cleaned up
> in FSes.

I didn't take these, though I am in the process of incorporating parts of the
underlying feedback (see below).

> This alignment makes it easier to extend gmem to hugetlb, for one.

I am not convinced that utilizing hugetlb is the best way to provide 1GiB support
in gmem.  I'm not necessarily against it, but there's a fair bit of uncertainty
around the future of hugetlb, and there are fundamental aspects of hugetlb that
may be non-goals for the gmem use case, e.g. mapping the memory with 1GiB pages
in the userspace page tables likely isn't necessary, and might even be
undesirable.

And practically speaking, the changes aren't _that_ invasive, i.e. punting any 
necessary refactoring should not substantially increase the size/complexity of
hugetlb support (*if* we end up adding it).

That said, I do think we'll implement .evict_inode() very early on in order to
support SNP and TDX, because (in addition to PUNCH_HOLE) that's when the backing
memory will be freed and thus reclaimed, i.e. unassigned in the RMP (SNP) / zeroed
with the shared key ID (TDX).
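
A sketch of that eviction path (the arch hook name below is a placeholder,
not an existing API):

	static void kvm_gmem_evict_inode(struct inode *inode)
	{
		/*
		 * Placeholder: unassign the pages in the RMP (SNP) or zero
		 * them with the shared key ID (TDX) before freeing them.
		 */
		kvm_arch_gmem_invalidate(inode);

		truncate_inode_pages_final(inode->i_mapping);
		clear_inode(inode);
	}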

> While working on this, I was also wondering if we should perhaps be
> storing the inode pointer in slot->gmem instead of the file pointer? The
> memory is associated with an inode->mapping rather than the file. Are we
> binding to a userspace handle on the inode (store file pointer), or are
> we really referencing the inode (store inode pointer)?

Conceptually, I think KVM should bind to the file.  The inode is effectively
the raw underlying physical storage, while the file is the VM's view of that
storage. 

Practically, I think that gives us a clean, intuitive way to handle intra-host
migration.  Rather than transfer ownership of the file, instantiate a new file
for the target VM, using the gmem inode from the source VM, i.e. create a hard
link.  That'd probably require new uAPI, but I don't think that will be hugely
problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
memslots/bindings are identical), but that should be easy enough to enforce.
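
E.g. a hypothetical uAPI for that (the name, ioctl number, and layout below
are all invented for illustration):

	struct kvm_link_guest_memfd {
		__u64 fd;	/* guest_memfd of the source VM */
		__u64 flags;	/* must be zero */
	};

	/*
	 * Hypothetical VM ioctl: returns a new guest_memfd for this VM that
	 * hard links to the source file's inode, i.e. shares the underlying
	 * memory and its contents with the source VM.
	 */
	#define KVM_LINK_GUEST_MEMFD	_IOWR(KVMIO, 0xd5, struct kvm_link_guest_memfd)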

That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
the memory and the *contents* of memory to outlive the VM, i.e. be effectively
transfered to the new target VM.  And we'll maintain the invariant that each
guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into a
VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the
possibility of multiple files per inode, e.g. to zap mappings across all files
when punching a hole, but the extra complexity is quite small, e.g. we can use
address_space.private_list to keep track of the guest_memfd instances associated
with the inode.
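
E.g. punching a hole in the inode would then walk that list, something like
this sketch (assumes each gmem instance hangs itself on the list at open):

	static void kvm_gmem_invalidate_all(struct inode *inode,
					    pgoff_t start, pgoff_t end)
	{
		struct kvm_gmem *gmem;

		/* Zap SPTEs in every VM that has a file on this inode. */
		list_for_each_entry(gmem, &inode->i_mapping->private_list, entry)
			kvm_gmem_invalidate_begin(gmem, start, end);
	}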

Setting aside TDX and SNP for the moment, as it's not clear how they'll support
memory that is "private" but shared between multiple VMs, I think per-VM files
would work well for sharing gmem between two VMs.  E.g. it would allow a given
page to be bound to a different gfn for each VM, and would allow different
permissions for each file (e.g. to allow fallocate() only from the original owner).
Vishal Annapurve July 14, 2023, 11:09 p.m. UTC | #62
On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> to play nice with the required ordering.  If userspace can get at /dev/kvm, then
> it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> VMM knows it wants to use gmem.

I am not sure about the benefits of tying gmem creation to any given
kvm instance. I think the most important requirement here is that a
given gmem range is always tied to a single VM - This can be enforced
when memslots are bound to the gmem files.

I believe "Required ordering" is that gmem files are created first and
then supplied while creating the memslots whose gpa ranges can
generate private memory accesses.
Is there any other ordering we want to enforce here?

> ...
> Practically, I think that gives us a clean, intuitive way to handle intra-host
> migration.  Rather than transfer ownership of the file, instantiate a new file
> for the target VM, using the gmem inode from the source VM, i.e. create a hard
> link.  That'd probably require new uAPI, but I don't think that will be hugely
> problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
> until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> memslots/bindings are identical), but that should be easy enough to enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> transfered to the new target VM.  And we'll maintain the invariant that each
> guest_memfd is bound 1:1 with a single VM.
>
> As above, that should also help us draw the line between mapping memory into a
> VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice with the
> possibility of multiple files per inode, e.g. to zap mappings across all files
> when punching a hole, but the extra complexity is quite small, e.g. we can use
> address_space.private_list to keep track of the guest_memfd instances associated
> with the inode.

Are we talking about a different usecase of sharing gmem fd across VMs
other than intra-host migration?
If not, ideally only one of the files should be catering to the guest
> memory mappings at any given time, i.e. any inode should ideally be
> bound to (through the file) a single kvm instance, as we are
planning to ensure that guest_memfd can't be mapped until
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.
Sean Christopherson July 15, 2023, 12:30 a.m. UTC | #63
On Fri, Jul 14, 2023, Vishal Annapurve wrote:
> On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> > ...
> > And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> > the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> > to play nice with the required ordering.  If userspace can get at /dev/kvm, then
> > it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> > the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> > VMM knows it wants to use gmem.
> 
> I am not sure about the benefits of tying gmem creation to any given
> kvm instance.

IMO, making gmem->kvm immutable is very nice to have, e.g. gmem->kvm will always be
valid and the refcounting rules are fairly straightforward.  

> I think the most important requirement here is that a given gmem range is always
> tied to a single VM 

I'm not convinced that that requirement will always hold true (see below).

> This can be enforced when memslots are bound to the gmem files.

Yeah, but TBH, waiting until the guest faults in memory to detect an invalid memslot
is gross.  And looking more closely, taking filemap_invalidate_lock(), i.e. taking
a semaphore for write, in the page fault path is a complete non-starter.  The
"if (existing_slot == slot)" check is likely a non-starter, because KVM handles
FLAGS_ONLY memslot updates, e.g. toggling dirty logging, by duplicating and
replacing the memslot, not by updating the live memslot.
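
I.e. do the validation once, at bind time, under filemap_invalidate_lock().
Sketch (file refcounting and other bookkeeping elided):

	static int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
				 unsigned int fd, loff_t offset)
	{
		struct file *file = fget(fd);
		struct kvm_gmem *gmem;
		pgoff_t start, end;
		int r = -EINVAL;

		if (!file)
			return -EBADF;

		gmem = file->private_data;

		filemap_invalidate_lock(file->f_mapping);

		start = offset >> PAGE_SHIFT;
		end = start + slot->npages;

		/* Reject overlaps with existing bindings at bind time, not
		 * at fault time. */
		if (!xa_empty(&gmem->bindings) &&
		    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
			goto out;

		r = xa_err(xa_store_range(&gmem->bindings, start, end - 1,
					  slot, GFP_KERNEL));
	out:
		filemap_invalidate_unlock(file->f_mapping);
		fput(file);
		return r;
	}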

> I believe "Required ordering" is that gmem files are created first and
> then supplied while creating the memslots whose gpa ranges can
> generate private memory accesses.
> Is there any other ordering we want to enforce here?

I wasn't talking about enforcing arbitrary ordering, I was simply talking about
what userspace literally needs to be able to do KVM_CREATE_GUEST_MEMFD.

> > Practically, I think that gives us a clean, intuitive way to handle intra-host
> > migration.  Rather than transfer ownership of the file, instantiate a new file
> > for the target VM, using the gmem inode from the source VM, i.e. create a hard
> > link.  That'd probably require new uAPI, but I don't think that will be hugely
> > problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
> > until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> > memslots/bindings are identical), but that should be easy enough to enforce.
> >
> > That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> > the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> > transfered to the new target VM.  And we'll maintain the invariant that each
> > guest_memfd is bound 1:1 with a single VM.
> >
> > As above, that should also help us draw the line between mapping memory into a
> > VM (file), and freeing/reclaiming the memory (inode).
> >
> > There will be extra complexity/overhead as we'll have to play nice with the
> > possibility of multiple files per inode, e.g. to zap mappings across all files
> > when punching a hole, but the extra complexity is quite small, e.g. we can use
> > address_space.private_list to keep track of the guest_memfd instances associated
> > with the inode.
> 
> Are we talking about a different usecase of sharing gmem fd across VMs
> other than intra-host migration?

Well, I am :-)  I don't want to build all of this on an assumption that we'll
never ever want to share a guest_memfd across multiple VMs.  E.g. SEV (and SEV-ES?)
already have the migration helper concept, and I've heard more than a few rumblings
of TDX utilizing helper TDs.  IMO, it's not far-fetched at all to think that there
will eventually be a need to let multiple VMs share a guest_memfd.

> If not, ideally only one of the files should be catering to the guest
> memory mappings at any given time. i.e. any inode should be ideally
> bound to (through the file) a single kvm instance,

Why?  Honest question, what does it buy us?

For TDX and SNP intra-host migration, it should be easy enough to ensure the new
VM can't create mappings before migration, and that the old VM can't create mappings
or run after migration.  I don't see that being any harder if the source and
dest use different files.

FWIW, it might be easier to hold off on this discussion until I post the RFC
(which is going to happen on Monday at this point), as then we'll have actual code
to discuss.

> as we are planning to ensure that guest_memfd can't be mapped until
> KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.