mbox series

[v7,00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

Message ID 20220706082016.2603916-1-chao.p.peng@linux.intel.com (mailing list archive)
Headers show
Series KVM: mm: fd-based approach for supporting KVM guest private memory | expand

Message

Chao Peng July 6, 2022, 8:20 a.m. UTC
This is the v7 of this series which tries to implement the fd-based KVM
guest private memory. The patches are based on latest kvm/queue branch
commit:

  b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
split_desc_cache only by default capacity

Introduction
------------
In general this patch series introduce fd-based memslot which provides
guest memory through memory file descriptor fd[offset,size] instead of
hva/size. The fd can be created from a supported memory filesystem
like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
and the the memory backing store exchange callbacks when such memslot
gets created. At runtime KVM will call into callbacks provided by the
backing store to get the pfn with the fd+offset. Memory backing store
will also call into KVM callbacks when userspace punch hole on the fd
to notify KVM to unmap secondary MMU page table entries.

Comparing to existing hva-based memslot, this new type of memslot allows
guest memory unmapped from host userspace like QEMU and even the kernel
itself, therefore reduce attack surface and prevent bugs.

Based on this fd-based memslot, we can build guest private memory that
is going to be used in confidential computing environments such as Intel
TDX and AMD SEV. When supported, the memory backing store can provide
more enforcement on the fd and KVM can use a single memslot to hold both
the private and shared part of the guest memory. 

mm extension
---------------------
Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
created with these flags cannot read(), write() or mmap() etc via normal
MMU operations. The file content can only be used with the newly
introduced memfile_notifier extension.

The memfile_notifier extension provides two sets of callbacks for KVM to
interact with the memory backing store:
  - memfile_notifier_ops: callbacks for memory backing store to notify
    KVM when memory gets invalidated.
  - backing store callbacks: callbacks for KVM to call into memory
    backing store to request memory pages for guest private memory.

The memfile_notifier extension also provides APIs for memory backing
store to register/unregister itself and to trigger the notifier when the
bookmarked memory gets invalidated.

The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
prevent double allocation caused by unintentional guest when we only
have a single side of the shared/private memfds effective.

memslot extension
-----------------
Add the private fd and the fd offset to existing 'shared' memslot so
that both private/shared guest memory can live in one single memslot.
A page in the memslot is either private or shared. Whether a guest page
is private or shared is maintained through reusing existing SEV ioctls
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.

Test
----
To test the new functionalities of this patch TDX patchset is needed.
Since TDX patchset has not been merged so I did two kinds of test:

-  Regresion test on kvm/queue (this patchset)
   Most new code are not covered. Code also in below repo:
   https://github.com/chao-p/linux/tree/privmem-v7

-  New Funational test on latest TDX code
   The patch is rebased to latest TDX code and tested the new
   funcationalities. See below repos:
   Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
   QEMU: https://github.com/chao-p/qemu/tree/privmem-v7

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1

Changelog
----------
v7:
  - Move the private/shared info from backing store to KVM.
  - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - Rework on the sync mechanism between zap/page fault paths.
  - Addressed other comments in v6.
v6:
  - Re-organzied patch for both mm/KVM parts.
  - Added flags for memfile_notifier so its consumers can state their
    features and memory backing store can check against these flags.
  - Put a backing store reference in the memfile_notifier and move pfn_ops
    into backing store.
  - Only support boot time backing store register.
  - Overall KVM part improvement suggested by Sean and some others.
v5:
  - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
    in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
    be created by MFD_INACCESSIBLE.
  - Introduced new APIs for backing store to register itself to
    memfile_notifier instead of direct function call.
  - Added the accounting and restriction for MFD_INACCESSIBLE memory.
  - Added KVM API doc for new memslot extensions and man page for the new
    MFD_INACCESSIBLE flag.
  - Removed the overlap check for mapping the same file+offset into
    multiple gfns due to perf consideration, warned in document.
  - Addressed other comments in v4.
v4:
  - Decoupled the callbacks between KVM/mm from memfd and use new
    name 'memfile_notifier'.
  - Supported register multiple memslots to the same backing store.
  - Added per-memslot pfn_ops instead of per-system.
  - Reworked the invalidation part.
  - Improved new KVM uAPIs (private memslot extension and memory
    error) per Sean's suggestions.
  - Addressed many other minor fixes for comments from v3.
v3:
  - Added locking protection when calling
    invalidate_page_range/fallocate callbacks.
  - Changed memslot structure to keep use useraddr for shared memory.
  - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
  - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
  - Commit message improvement.
  - Many small fixes for comments from the last version.

Links to previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
[3] Patch v5: https://lkml.org/lkml/2022/5/19/861

Chao Peng (12):
  mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
  mm: Introduce memfile_notifier
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Rename mmu_notifier_*
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Register/unregister the guest private memory regions
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm/shmem: Support memfile_notifier

 Documentation/virt/kvm/api.rst             |  77 +++++-
 arch/arm64/kvm/mmu.c                       |   8 +-
 arch/mips/include/asm/kvm_host.h           |   2 +-
 arch/mips/kvm/mmu.c                        |  10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
 arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
 arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
 arch/riscv/kvm/mmu.c                       |   4 +-
 arch/x86/include/asm/kvm_host.h            |   3 +-
 arch/x86/kvm/Kconfig                       |   3 +
 arch/x86/kvm/mmu.h                         |   2 -
 arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
 arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
 arch/x86/kvm/mmu/mmutrace.h                |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
 arch/x86/kvm/x86.c                         |   2 +-
 include/linux/kvm_host.h                   | 105 +++++---
 include/linux/memfile_notifier.h           |  91 +++++++
 include/linux/shmem_fs.h                   |   2 +
 include/uapi/linux/fcntl.h                 |   1 +
 include/uapi/linux/kvm.h                   |  37 +++
 include/uapi/linux/memfd.h                 |   1 +
 mm/Kconfig                                 |   4 +
 mm/Makefile                                |   1 +
 mm/memfd.c                                 |  18 +-
 mm/memfile_notifier.c                      | 123 ++++++++++
 mm/shmem.c                                 | 125 +++++++++-
 tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
 virt/kvm/Kconfig                           |   3 +
 virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
 virt/kvm/pfncache.c                        |  14 +-
 35 files changed, 1074 insertions(+), 127 deletions(-)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

Comments

Gupta, Pankaj July 13, 2022, 3:58 a.m. UTC | #1
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM

Thinking a bit, As host side fd on tmpfs or shmem will store memory on 
host page cache instead of mapping pages into userspace address space. 
Can we hit double (un-coordinated) page cache problem with this when 
guest page cache is also used?

Thanks,
Pankaj

> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>    - memfile_notifier_ops: callbacks for memory backing store to notify
>      KVM when memory gets invalidated.
>    - backing store callbacks: callbacks for KVM to call into memory
>      backing store to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
> 
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
> 
> -  Regresion test on kvm/queue (this patchset)
>     Most new code are not covered. Code also in below repo:
>     https://github.com/chao-p/linux/tree/privmem-v7
> 
> -  New Funational test on latest TDX code
>     The patch is rebased to latest TDX code and tested the new
>     funcationalities. See below repos:
>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> 
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
> 
> Changelog
> ----------
> v7:
>    - Move the private/shared info from backing store to KVM.
>    - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
>    - Rework on the sync mechanism between zap/page fault paths.
>    - Addressed other comments in v6.
> v6:
>    - Re-organzied patch for both mm/KVM parts.
>    - Added flags for memfile_notifier so its consumers can state their
>      features and memory backing store can check against these flags.
>    - Put a backing store reference in the memfile_notifier and move pfn_ops
>      into backing store.
>    - Only support boot time backing store register.
>    - Overall KVM part improvement suggested by Sean and some others.
> v5:
>    - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
>      in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
>      be created by MFD_INACCESSIBLE.
>    - Introduced new APIs for backing store to register itself to
>      memfile_notifier instead of direct function call.
>    - Added the accounting and restriction for MFD_INACCESSIBLE memory.
>    - Added KVM API doc for new memslot extensions and man page for the new
>      MFD_INACCESSIBLE flag.
>    - Removed the overlap check for mapping the same file+offset into
>      multiple gfns due to perf consideration, warned in document.
>    - Addressed other comments in v4.
> v4:
>    - Decoupled the callbacks between KVM/mm from memfd and use new
>      name 'memfile_notifier'.
>    - Supported register multiple memslots to the same backing store.
>    - Added per-memslot pfn_ops instead of per-system.
>    - Reworked the invalidation part.
>    - Improved new KVM uAPIs (private memslot extension and memory
>      error) per Sean's suggestions.
>    - Addressed many other minor fixes for comments from v3.
> v3:
>    - Added locking protection when calling
>      invalidate_page_range/fallocate callbacks.
>    - Changed memslot structure to keep use useraddr for shared memory.
>    - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>    - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>    - Commit message improvement.
>    - Many small fixes for comments from the last version.
> 
> Links to previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> 
> Chao Peng (12):
>    mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
>    selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
>    mm: Introduce memfile_notifier
>    mm/memfd: Introduce MFD_INACCESSIBLE flag
>    KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
>    KVM: Use gfn instead of hva for mmu_notifier_retry
>    KVM: Rename mmu_notifier_*
>    KVM: Extend the memslot to support fd-based private memory
>    KVM: Add KVM_EXIT_MEMORY_FAULT exit
>    KVM: Register/unregister the guest private memory regions
>    KVM: Handle page fault for private memory
>    KVM: Enable and expose KVM_MEM_PRIVATE
> 
> Kirill A. Shutemov (1):
>    mm/shmem: Support memfile_notifier
> 
>   Documentation/virt/kvm/api.rst             |  77 +++++-
>   arch/arm64/kvm/mmu.c                       |   8 +-
>   arch/mips/include/asm/kvm_host.h           |   2 +-
>   arch/mips/kvm/mmu.c                        |  10 +-
>   arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
>   arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
>   arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
>   arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
>   arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
>   arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
>   arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
>   arch/riscv/kvm/mmu.c                       |   4 +-
>   arch/x86/include/asm/kvm_host.h            |   3 +-
>   arch/x86/kvm/Kconfig                       |   3 +
>   arch/x86/kvm/mmu.h                         |   2 -
>   arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
>   arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
>   arch/x86/kvm/mmu/mmutrace.h                |   1 +
>   arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
>   arch/x86/kvm/x86.c                         |   2 +-
>   include/linux/kvm_host.h                   | 105 +++++---
>   include/linux/memfile_notifier.h           |  91 +++++++
>   include/linux/shmem_fs.h                   |   2 +
>   include/uapi/linux/fcntl.h                 |   1 +
>   include/uapi/linux/kvm.h                   |  37 +++
>   include/uapi/linux/memfd.h                 |   1 +
>   mm/Kconfig                                 |   4 +
>   mm/Makefile                                |   1 +
>   mm/memfd.c                                 |  18 +-
>   mm/memfile_notifier.c                      | 123 ++++++++++
>   mm/shmem.c                                 | 125 +++++++++-
>   tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
>   virt/kvm/Kconfig                           |   3 +
>   virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
>   virt/kvm/pfncache.c                        |  14 +-
>   35 files changed, 1074 insertions(+), 127 deletions(-)
>   create mode 100644 include/linux/memfile_notifier.h
>   create mode 100644 mm/memfile_notifier.c
>
Chao Peng July 13, 2022, 7:57 a.m. UTC | #2
On Wed, Jul 13, 2022 at 05:58:32AM +0200, Gupta, Pankaj wrote:
> 
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> > 
> >    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > split_desc_cache only by default capacity
> > 
> > Introduction
> > ------------
> > In general this patch series introduce fd-based memslot which provides
> > guest memory through memory file descriptor fd[offset,size] instead of
> > hva/size. The fd can be created from a supported memory filesystem
> > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> 
> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
> page cache instead of mapping pages into userspace address space. Can we hit
> double (un-coordinated) page cache problem with this when guest page cache
> is also used?

This is my understanding: in host it will be indeed in page cache (in
current shmem implementation) but that's just the way it allocates and
provides the physical memory for the guest. In guest, guest OS will not
see this fd (absolutely), it only sees guest memory, on top of which it
can build its own page cache system for its own file-mapped content but
that is unrelated to host page cache.

Chao
> 
> Thanks,
> Pankaj
>
Gupta, Pankaj July 13, 2022, 10:35 a.m. UTC | #3
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>> split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>
>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>> page cache instead of mapping pages into userspace address space. Can we hit
>> double (un-coordinated) page cache problem with this when guest page cache
>> is also used?
> 
> This is my understanding: in host it will be indeed in page cache (in
> current shmem implementation) but that's just the way it allocates and
> provides the physical memory for the guest. In guest, guest OS will not
> see this fd (absolutely), it only sees guest memory, on top of which it
> can build its own page cache system for its own file-mapped content but
> that is unrelated to host page cache.

yes. If guest fills its page cache with file backed memory, this at host 
side(on shmem fd backend) will also fill the host page cache fast. This 
can have an impact on performance of guest VM's if host goes to memory 
pressure situation sooner. Or else we end up utilizing way less System 
RAM.

Thanks,
Pankaj
Chao Peng July 13, 2022, 11:59 p.m. UTC | #4
On Wed, Jul 13, 2022 at 12:35:56PM +0200, Gupta, Pankaj wrote:
> 
> > > > This is the v7 of this series which tries to implement the fd-based KVM
> > > > guest private memory. The patches are based on latest kvm/queue branch
> > > > commit:
> > > > 
> > > >     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > > split_desc_cache only by default capacity
> > > > 
> > > > Introduction
> > > > ------------
> > > > In general this patch series introduce fd-based memslot which provides
> > > > guest memory through memory file descriptor fd[offset,size] instead of
> > > > hva/size. The fd can be created from a supported memory filesystem
> > > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > 
> > > Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
> > > page cache instead of mapping pages into userspace address space. Can we hit
> > > double (un-coordinated) page cache problem with this when guest page cache
> > > is also used?
> > 
> > This is my understanding: in host it will be indeed in page cache (in
> > current shmem implementation) but that's just the way it allocates and
> > provides the physical memory for the guest. In guest, guest OS will not
> > see this fd (absolutely), it only sees guest memory, on top of which it
> > can build its own page cache system for its own file-mapped content but
> > that is unrelated to host page cache.
> 
> yes. If guest fills its page cache with file backed memory, this at host
> side(on shmem fd backend) will also fill the host page cache fast. This can
> have an impact on performance of guest VM's if host goes to memory pressure
> situation sooner. Or else we end up utilizing way less System RAM.

(Currently), the file backed guest private memory is long-term pinned
and not reclaimable, it's in page cache anyway once we allocated it for
guest. This does not depend on how guest use it (e.g. use it for guest
page cache or not). 

Chao
> 
> Thanks,
> Pankaj
>
Andy Lutomirski July 14, 2022, 4:29 a.m. UTC | #5
On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote:
>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>> commit:
>>>>
>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>> split_desc_cache only by default capacity
>>>>
>>>> Introduction
>>>> ------------
>>>> In general this patch series introduce fd-based memslot which provides
>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>> hva/size. The fd can be created from a supported memory filesystem
>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>
>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>> page cache instead of mapping pages into userspace address space. Can we hit
>>> double (un-coordinated) page cache problem with this when guest page cache
>>> is also used?
>> 
>> This is my understanding: in host it will be indeed in page cache (in
>> current shmem implementation) but that's just the way it allocates and
>> provides the physical memory for the guest. In guest, guest OS will not
>> see this fd (absolutely), it only sees guest memory, on top of which it
>> can build its own page cache system for its own file-mapped content but
>> that is unrelated to host page cache.
>
> yes. If guest fills its page cache with file backed memory, this at host 
> side(on shmem fd backend) will also fill the host page cache fast. This 
> can have an impact on performance of guest VM's if host goes to memory 
> pressure situation sooner. Or else we end up utilizing way less System 
> RAM.

Is this in any meaningful way different from a regular VM?

--Andy
Gupta, Pankaj July 14, 2022, 4:39 a.m. UTC | #6
>>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>
>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>>> page cache instead of mapping pages into userspace address space. Can we hit
>>>> double (un-coordinated) page cache problem with this when guest page cache
>>>> is also used?
>>>
>>> This is my understanding: in host it will be indeed in page cache (in
>>> current shmem implementation) but that's just the way it allocates and
>>> provides the physical memory for the guest. In guest, guest OS will not
>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>> can build its own page cache system for its own file-mapped content but
>>> that is unrelated to host page cache.
>>
>> yes. If guest fills its page cache with file backed memory, this at host
>> side(on shmem fd backend) will also fill the host page cache fast. This can
>> have an impact on performance of guest VM's if host goes to memory pressure
>> situation sooner. Or else we end up utilizing way less System RAM.
> 
> (Currently), the file backed guest private memory is long-term pinned
> and not reclaimable, it's in page cache anyway once we allocated it for
> guest. This does not depend on how guest use it (e.g. use it for guest
> page cache or not).

Even if host shmem backed memory always be always un-reclaimable, we end 
up utilizing double RAM (both in guest & host page cache) for guest disk 
accesses?

I am considering this a serious design decision before we commit to this 
approach.

Happy to be enlightened on this and know the thoughts from others as well.

Thanks,
Pankaj
Gupta, Pankaj July 14, 2022, 5:06 a.m. UTC | #7
>>>>>> This is the v7 of this series which tries to implement the 
>>>>>> fd-based KVM
>>>>>> guest private memory. The patches are based on latest kvm/queue 
>>>>>> branch
>>>>>> commit:
>>>>>>
>>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>>> split_desc_cache only by default capacity
>>>>>>
>>>>>> Introduction
>>>>>> ------------
>>>>>> In general this patch series introduce fd-based memslot which 
>>>>>> provides
>>>>>> guest memory through memory file descriptor fd[offset,size] 
>>>>>> instead of
>>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>>
>>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory 
>>>>> on host
>>>>> page cache instead of mapping pages into userspace address space. 
>>>>> Can we hit
>>>>> double (un-coordinated) page cache problem with this when guest 
>>>>> page cache
>>>>> is also used?
>>>>
>>>> This is my understanding: in host it will be indeed in page cache (in
>>>> current shmem implementation) but that's just the way it allocates and
>>>> provides the physical memory for the guest. In guest, guest OS will not
>>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>>> can build its own page cache system for its own file-mapped content but
>>>> that is unrelated to host page cache.
>>>
>>> yes. If guest fills its page cache with file backed memory, this at host
>>> side(on shmem fd backend) will also fill the host page cache fast. 
>>> This can
>>> have an impact on performance of guest VM's if host goes to memory 
>>> pressure
>>> situation sooner. Or else we end up utilizing way less System RAM.
>>
>> (Currently), the file backed guest private memory is long-term pinned
>> and not reclaimable, it's in page cache anyway once we allocated it for
>> guest. This does not depend on how guest use it (e.g. use it for guest
>> page cache or not).
> 
> Even if host shmem backed memory always be always un-reclaimable, we end 
> up utilizing double RAM (both in guest & host page cache) for guest disk 
> accesses?

Answering my own question:

We wont use double RAM, just view of guest & host structures would 
change as per the code path taken. If we we don't care about reclaim 
situations we should be good, else we have to think something to 
coordinate page cache between guest & host (that could be an 
optimization for later).

Thanks,
Pankaj
Gupta, Pankaj July 14, 2022, 5:13 a.m. UTC | #8
>>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>
>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>>> page cache instead of mapping pages into userspace address space. Can we hit
>>>> double (un-coordinated) page cache problem with this when guest page cache
>>>> is also used?
>>>
>>> This is my understanding: in host it will be indeed in page cache (in
>>> current shmem implementation) but that's just the way it allocates and
>>> provides the physical memory for the guest. In guest, guest OS will not
>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>> can build its own page cache system for its own file-mapped content but
>>> that is unrelated to host page cache.
>>
>> yes. If guest fills its page cache with file backed memory, this at host
>> side(on shmem fd backend) will also fill the host page cache fast. This
>> can have an impact on performance of guest VM's if host goes to memory
>> pressure situation sooner. Or else we end up utilizing way less System
>> RAM.
> 
> Is this in any meaningful way different from a regular VM?

After thinking a bit, Seems 'No'. Except the reclaim decisions system 
would take under memory pressure and also will have to see how well this 
gets stitched with memory tiers in future. But all these are future topics.

Sorry! for the noise.

Thanks,
Pankaj
Nikunj A. Dadhania Aug. 11, 2022, 10:02 a.m. UTC | #9
On 06/07/22 13:50, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets invalidated.
>   - backing store callbacks: callbacks for KVM to call into memory
>     backing store to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
> 
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
> 
> -  Regresion test on kvm/queue (this patchset)
>    Most new code are not covered. Code also in below repo:
>    https://github.com/chao-p/linux/tree/privmem-v7
> 
> -  New Funational test on latest TDX code
>    The patch is rebased to latest TDX code and tested the new
>    funcationalities. See below repos:
>    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7

While debugging an issue with SEV+UPM, found that fallocate() returns 
an error in QEMU which is not handled (EINTR). With the below handling 
of EINTR subsequent fallocate() succeeds:


diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
index af8fb0c957..e8597ed28d 100644
--- a/backends/hostmem-memfd-private.c
+++ b/backends/hostmem-memfd-private.c
@@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     MachineState *machine = MACHINE(qdev_get_machine());
     uint32_t ram_flags;
     char *name;
-    int fd, priv_fd;
+    int fd, priv_fd, ret;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 
-    fallocate(priv_fd, 0, 0, backend->size);
+again:
+    ret = fallocate(priv_fd, 0, 0, backend->size);
+    if (ret) {
+           perror("Fallocate failed: \n");
+           if (errno == EINTR)
+                   goto again;
+           else
+                   exit(1);
+    }

However, fallocate() preallocates full guest memory before starting the guest.
With this behaviour guest memory is *not* demand pinned. Is there a way to 
prevent fallocate() from reserving full guest memory?

> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
> 

Regards,
Nikunj
Gupta, Pankaj Aug. 11, 2022, 11:30 a.m. UTC | #10
>> This is the v7 of this series which tries to implement the fd-based KVM
>> guest private memory. The patches are based on latest kvm/queue branch
>> commit:
>>
>>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>> split_desc_cache only by default capacity
>>
>> Introduction
>> ------------
>> In general this patch series introduce fd-based memslot which provides
>> guest memory through memory file descriptor fd[offset,size] instead of
>> hva/size. The fd can be created from a supported memory filesystem
>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>> and the the memory backing store exchange callbacks when such memslot
>> gets created. At runtime KVM will call into callbacks provided by the
>> backing store to get the pfn with the fd+offset. Memory backing store
>> will also call into KVM callbacks when userspace punch hole on the fd
>> to notify KVM to unmap secondary MMU page table entries.
>>
>> Comparing to existing hva-based memslot, this new type of memslot allows
>> guest memory unmapped from host userspace like QEMU and even the kernel
>> itself, therefore reduce attack surface and prevent bugs.
>>
>> Based on this fd-based memslot, we can build guest private memory that
>> is going to be used in confidential computing environments such as Intel
>> TDX and AMD SEV. When supported, the memory backing store can provide
>> more enforcement on the fd and KVM can use a single memslot to hold both
>> the private and shared part of the guest memory.
>>
>> mm extension
>> ---------------------
>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>> created with these flags cannot read(), write() or mmap() etc via normal
>> MMU operations. The file content can only be used with the newly
>> introduced memfile_notifier extension.
>>
>> The memfile_notifier extension provides two sets of callbacks for KVM to
>> interact with the memory backing store:
>>    - memfile_notifier_ops: callbacks for memory backing store to notify
>>      KVM when memory gets invalidated.
>>    - backing store callbacks: callbacks for KVM to call into memory
>>      backing store to request memory pages for guest private memory.
>>
>> The memfile_notifier extension also provides APIs for memory backing
>> store to register/unregister itself and to trigger the notifier when the
>> bookmarked memory gets invalidated.
>>
>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>> prevent double allocation caused by unintentional guest when we only
>> have a single side of the shared/private memfds effective.
>>
>> memslot extension
>> -----------------
>> Add the private fd and the fd offset to existing 'shared' memslot so
>> that both private/shared guest memory can live in one single memslot.
>> A page in the memslot is either private or shared. Whether a guest page
>> is private or shared is maintained through reusing existing SEV ioctls
>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>
>> Test
>> ----
>> To test the new functionalities of this patch TDX patchset is needed.
>> Since TDX patchset has not been merged so I did two kinds of test:
>>
>> -  Regresion test on kvm/queue (this patchset)
>>     Most new code are not covered. Code also in below repo:
>>     https://github.com/chao-p/linux/tree/privmem-v7
>>
>> -  New Funational test on latest TDX code
>>     The patch is rebased to latest TDX code and tested the new
>>     funcationalities. See below repos:
>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> 
> While debugging an issue with SEV+UPM, found that fallocate() returns
> an error in QEMU which is not handled (EINTR). With the below handling
> of EINTR subsequent fallocate() succeeds:
> 
> 
> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
> index af8fb0c957..e8597ed28d 100644
> --- a/backends/hostmem-memfd-private.c
> +++ b/backends/hostmem-memfd-private.c
> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>       MachineState *machine = MACHINE(qdev_get_machine());
>       uint32_t ram_flags;
>       char *name;
> -    int fd, priv_fd;
> +    int fd, priv_fd, ret;
>   
>       if (!backend->size) {
>           error_setg(errp, "can't create backend with size 0");
> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>                                      backend->size, ram_flags, fd, 0, errp);
>       g_free(name);
>   
> -    fallocate(priv_fd, 0, 0, backend->size);
> +again:
> +    ret = fallocate(priv_fd, 0, 0, backend->size);
> +    if (ret) {
> +           perror("Fallocate failed: \n");
> +           if (errno == EINTR)
> +                   goto again;
> +           else
> +                   exit(1);
> +    }
> 
> However, fallocate() preallocates full guest memory before starting the guest.
> With this behaviour guest memory is *not* demand pinned. Is there a way to
> prevent fallocate() from reserving full guest memory?

Isn't the pinning being handled by the corresponding host memory backend 
with mmu notifier and architecture support while doing the memory 
operations e.g page migration and swapping/reclaim (not supported
currently AFAIU). But yes, we need to allocate entire guest memory with 
the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.


Thanks,
Pankaj

> 
>> An example QEMU command line for TDX test:
>> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
>> -machine confidential-guest-support=tdx \
>> -object memory-backend-memfd-private,id=ram1,size=${mem} \
>> -machine memory-backend=ram1
>>
> 
> Regards,
> Nikunj
> 
>
Chao Peng Aug. 11, 2022, 1:32 p.m. UTC | #11
On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
> 
> > > This is the v7 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > > 
> > >    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > split_desc_cache only by default capacity
> > > 
> > > Introduction
> > > ------------
> > > In general this patch series introduce fd-based memslot which provides
> > > guest memory through memory file descriptor fd[offset,size] instead of
> > > hva/size. The fd can be created from a supported memory filesystem
> > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > and the the memory backing store exchange callbacks when such memslot
> > > gets created. At runtime KVM will call into callbacks provided by the
> > > backing store to get the pfn with the fd+offset. Memory backing store
> > > will also call into KVM callbacks when userspace punch hole on the fd
> > > to notify KVM to unmap secondary MMU page table entries.
> > > 
> > > Comparing to existing hva-based memslot, this new type of memslot allows
> > > guest memory unmapped from host userspace like QEMU and even the kernel
> > > itself, therefore reduce attack surface and prevent bugs.
> > > 
> > > Based on this fd-based memslot, we can build guest private memory that
> > > is going to be used in confidential computing environments such as Intel
> > > TDX and AMD SEV. When supported, the memory backing store can provide
> > > more enforcement on the fd and KVM can use a single memslot to hold both
> > > the private and shared part of the guest memory.
> > > 
> > > mm extension
> > > ---------------------
> > > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > > created with these flags cannot read(), write() or mmap() etc via normal
> > > MMU operations. The file content can only be used with the newly
> > > introduced memfile_notifier extension.
> > > 
> > > The memfile_notifier extension provides two sets of callbacks for KVM to
> > > interact with the memory backing store:
> > >    - memfile_notifier_ops: callbacks for memory backing store to notify
> > >      KVM when memory gets invalidated.
> > >    - backing store callbacks: callbacks for KVM to call into memory
> > >      backing store to request memory pages for guest private memory.
> > > 
> > > The memfile_notifier extension also provides APIs for memory backing
> > > store to register/unregister itself and to trigger the notifier when the
> > > bookmarked memory gets invalidated.
> > > 
> > > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > > prevent double allocation caused by unintentional guest when we only
> > > have a single side of the shared/private memfds effective.
> > > 
> > > memslot extension
> > > -----------------
> > > Add the private fd and the fd offset to existing 'shared' memslot so
> > > that both private/shared guest memory can live in one single memslot.
> > > A page in the memslot is either private or shared. Whether a guest page
> > > is private or shared is maintained through reusing existing SEV ioctls
> > > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> > > 
> > > Test
> > > ----
> > > To test the new functionalities of this patch TDX patchset is needed.
> > > Since TDX patchset has not been merged so I did two kinds of test:
> > > 
> > > -  Regresion test on kvm/queue (this patchset)
> > >     Most new code are not covered. Code also in below repo:
> > >     https://github.com/chao-p/linux/tree/privmem-v7
> > > 
> > > -  New Funational test on latest TDX code
> > >     The patch is rebased to latest TDX code and tested the new
> > >     funcationalities. See below repos:
> > >     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> > >     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> > 
> > While debugging an issue with SEV+UPM, found that fallocate() returns
> > an error in QEMU which is not handled (EINTR). With the below handling
> > of EINTR subsequent fallocate() succeeds:

QEMU code has not well-tested so it's not strange you met problem. But
from the man page, there is signal was caught for EINTR, do you know
the signal number?

Thanks for you patch but before we change it in QEMU I want to make sure
it's indeed a QEMU issue (e.g. not a kernel isssue).

> > 
> > 
> > diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
> > index af8fb0c957..e8597ed28d 100644
> > --- a/backends/hostmem-memfd-private.c
> > +++ b/backends/hostmem-memfd-private.c
> > @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> >       MachineState *machine = MACHINE(qdev_get_machine());
> >       uint32_t ram_flags;
> >       char *name;
> > -    int fd, priv_fd;
> > +    int fd, priv_fd, ret;
> >       if (!backend->size) {
> >           error_setg(errp, "can't create backend with size 0");
> > @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> >                                      backend->size, ram_flags, fd, 0, errp);
> >       g_free(name);
> > -    fallocate(priv_fd, 0, 0, backend->size);
> > +again:
> > +    ret = fallocate(priv_fd, 0, 0, backend->size);
> > +    if (ret) {
> > +           perror("Fallocate failed: \n");
> > +           if (errno == EINTR)
> > +                   goto again;
> > +           else
> > +                   exit(1);
> > +    }
> > 
> > However, fallocate() preallocates full guest memory before starting the guest.
> > With this behaviour guest memory is *not* demand pinned. Is there a way to
> > prevent fallocate() from reserving full guest memory?
> 
> Isn't the pinning being handled by the corresponding host memory backend
> with mmu notifier and architecture support while doing the memory operations
> e.g page migration and swapping/reclaim (not supported
> currently AFAIU). But yes, we need to allocate entire guest memory with the
> new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

Right.

> 
> 
> Thanks,
> Pankaj
> 
> > 
> > > An example QEMU command line for TDX test:
> > > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > > -machine confidential-guest-support=tdx \
> > > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > > -machine memory-backend=ram1
> > > 
> > 
> > Regards,
> > Nikunj
> > 
> >
Nikunj A. Dadhania Aug. 11, 2022, 5:18 p.m. UTC | #12
On 11/08/22 17:00, Gupta, Pankaj wrote:
> 
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>> split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>> and the the memory backing store exchange callbacks when such memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn with the fd+offset. Memory backing store
>>> will also call into KVM callbacks when userspace punch hole on the fd
>>> to notify KVM to unmap secondary MMU page table entries.
>>>
>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>> itself, therefore reduce attack surface and prevent bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>> the private and shared part of the guest memory.
>>>
>>> mm extension
>>> ---------------------
>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>> created with these flags cannot read(), write() or mmap() etc via normal
>>> MMU operations. The file content can only be used with the newly
>>> introduced memfile_notifier extension.
>>>
>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>> interact with the memory backing store:
>>>    - memfile_notifier_ops: callbacks for memory backing store to notify
>>>      KVM when memory gets invalidated.
>>>    - backing store callbacks: callbacks for KVM to call into memory
>>>      backing store to request memory pages for guest private memory.
>>>
>>> The memfile_notifier extension also provides APIs for memory backing
>>> store to register/unregister itself and to trigger the notifier when the
>>> bookmarked memory gets invalidated.
>>>
>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>> prevent double allocation caused by unintentional guest when we only
>>> have a single side of the shared/private memfds effective.
>>>
>>> memslot extension
>>> -----------------
>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>> that both private/shared guest memory can live in one single memslot.
>>> A page in the memslot is either private or shared. Whether a guest page
>>> is private or shared is maintained through reusing existing SEV ioctls
>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Test
>>> ----
>>> To test the new functionalities of this patch TDX patchset is needed.
>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>
>>> -  Regresion test on kvm/queue (this patchset)
>>>     Most new code are not covered. Code also in below repo:
>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>
>>> -  New Funational test on latest TDX code
>>>     The patch is rebased to latest TDX code and tested the new
>>>     funcationalities. See below repos:
>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>
>> While debugging an issue with SEV+UPM, found that fallocate() returns
>> an error in QEMU which is not handled (EINTR). With the below handling
>> of EINTR subsequent fallocate() succeeds:
>>
>>
>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>> index af8fb0c957..e8597ed28d 100644
>> --- a/backends/hostmem-memfd-private.c
>> +++ b/backends/hostmem-memfd-private.c
>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>       MachineState *machine = MACHINE(qdev_get_machine());
>>       uint32_t ram_flags;
>>       char *name;
>> -    int fd, priv_fd;
>> +    int fd, priv_fd, ret;
>>         if (!backend->size) {
>>           error_setg(errp, "can't create backend with size 0");
>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>                                      backend->size, ram_flags, fd, 0, errp);
>>       g_free(name);
>>   -    fallocate(priv_fd, 0, 0, backend->size);
>> +again:
>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>> +    if (ret) {
>> +           perror("Fallocate failed: \n");
>> +           if (errno == EINTR)
>> +                   goto again;
>> +           else
>> +                   exit(1);
>> +    }
>>
>> However, fallocate() preallocates full guest memory before starting the guest.
>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>> prevent fallocate() from reserving full guest memory?
> 
> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

That is correct, but the question is when does the memory allocated, as these flags are set,
memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
allocated.

Regards
Nikunj
Nikunj A. Dadhania Aug. 11, 2022, 5:28 p.m. UTC | #13
On 11/08/22 19:02, Chao Peng wrote:
> On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
>>

>>>> Test
>>>> ----
>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>
>>>> -  Regresion test on kvm/queue (this patchset)
>>>>     Most new code are not covered. Code also in below repo:
>>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>>
>>>> -  New Funational test on latest TDX code
>>>>     The patch is rebased to latest TDX code and tested the new
>>>>     funcationalities. See below repos:
>>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
> 
> QEMU code has not well-tested so it's not strange you met problem. But
> from the man page, there is signal was caught for EINTR, do you know
> the signal number?

I haven't check that, but that should be fairly straight forward to get.
I presume that you are referring to signal_pending() in the shmem_fallocate()

> Thanks for you patch but before we change it in QEMU I want to make sure
> it's indeed a QEMU issue (e.g. not a kernel isssue).

As per the manual fallocate() can return EINTR, and this should be handled 
by the user space.

Regards
Nikunj
Gupta, Pankaj Aug. 11, 2022, 11:02 p.m. UTC | #14
On 8/11/2022 7:18 PM, Nikunj A. Dadhania wrote:
> On 11/08/22 17:00, Gupta, Pankaj wrote:
>>
>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>> commit:
>>>>
>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>> split_desc_cache only by default capacity
>>>>
>>>> Introduction
>>>> ------------
>>>> In general this patch series introduce fd-based memslot which provides
>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>> hva/size. The fd can be created from a supported memory filesystem
>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>> and the the memory backing store exchange callbacks when such memslot
>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>
>>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>
>>>> Based on this fd-based memslot, we can build guest private memory that
>>>> is going to be used in confidential computing environments such as Intel
>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>>> the private and shared part of the guest memory.
>>>>
>>>> mm extension
>>>> ---------------------
>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>> created with these flags cannot read(), write() or mmap() etc via normal
>>>> MMU operations. The file content can only be used with the newly
>>>> introduced memfile_notifier extension.
>>>>
>>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>>> interact with the memory backing store:
>>>>     - memfile_notifier_ops: callbacks for memory backing store to notify
>>>>       KVM when memory gets invalidated.
>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>       backing store to request memory pages for guest private memory.
>>>>
>>>> The memfile_notifier extension also provides APIs for memory backing
>>>> store to register/unregister itself and to trigger the notifier when the
>>>> bookmarked memory gets invalidated.
>>>>
>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>> prevent double allocation caused by unintentional guest when we only
>>>> have a single side of the shared/private memfds effective.
>>>>
>>>> memslot extension
>>>> -----------------
>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>> that both private/shared guest memory can live in one single memslot.
>>>> A page in the memslot is either private or shared. Whether a guest page
>>>> is private or shared is maintained through reusing existing SEV ioctls
>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>
>>>> Test
>>>> ----
>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>
>>>> -  Regresion test on kvm/queue (this patchset)
>>>>      Most new code are not covered. Code also in below repo:
>>>>      https://github.com/chao-p/linux/tree/privmem-v7
>>>>
>>>> -  New Funational test on latest TDX code
>>>>      The patch is rebased to latest TDX code and tested the new
>>>>      funcationalities. See below repos:
>>>>      Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>      QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
>>>
>>>
>>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>>> index af8fb0c957..e8597ed28d 100644
>>> --- a/backends/hostmem-memfd-private.c
>>> +++ b/backends/hostmem-memfd-private.c
>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>        uint32_t ram_flags;
>>>        char *name;
>>> -    int fd, priv_fd;
>>> +    int fd, priv_fd, ret;
>>>          if (!backend->size) {
>>>            error_setg(errp, "can't create backend with size 0");
>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>                                       backend->size, ram_flags, fd, 0, errp);
>>>        g_free(name);
>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>> +again:
>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>> +    if (ret) {
>>> +           perror("Fallocate failed: \n");
>>> +           if (errno == EINTR)
>>> +                   goto again;
>>> +           else
>>> +                   exit(1);
>>> +    }
>>>
>>> However, fallocate() preallocates full guest memory before starting the guest.
>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>> prevent fallocate() from reserving full guest memory?
>>
>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
> 
> That is correct, but the question is when does the memory allocated, as these flags are set,
> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
> allocated.

I guess so if guest memory is private by default.

Other option would be to allocate memory as shared by default and
handle on demand allocation and RMPUPDATE with page state change event. 
But still that would be done at guest boot time, IIUC.

Might be missing some details on this. So, better to wait for someone 
more familiar to answer.

Thanks,
Pankaj
Nikunj A. Dadhania Aug. 12, 2022, 3:22 a.m. UTC | #15
On 11/08/22 19:02, Chao Peng wrote:
> On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
> 
> QEMU code has not well-tested so it's not strange you met problem. But
> from the man page, there is signal was caught for EINTR, do you know
> the signal number?
> 
> Thanks for you patch but before we change it in QEMU I want to make sure
> it's indeed a QEMU issue (e.g. not a kernel isssue).
> 
>>>
>>>
>>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>>> index af8fb0c957..e8597ed28d 100644
>>> --- a/backends/hostmem-memfd-private.c
>>> +++ b/backends/hostmem-memfd-private.c
>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>       MachineState *machine = MACHINE(qdev_get_machine());
>>>       uint32_t ram_flags;
>>>       char *name;
>>> -    int fd, priv_fd;
>>> +    int fd, priv_fd, ret;
>>>       if (!backend->size) {
>>>           error_setg(errp, "can't create backend with size 0");
>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>                                      backend->size, ram_flags, fd, 0, errp);
>>>       g_free(name);
>>> -    fallocate(priv_fd, 0, 0, backend->size);
>>> +again:
>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>> +    if (ret) {
>>> +           perror("Fallocate failed: \n");
>>> +           if (errno == EINTR)
>>> +                   goto again;
>>> +           else
>>> +                   exit(1);
>>> +    }
>>>
>>> However, fallocate() preallocates full guest memory before starting the guest.
>>> With this behaviour guest memory is *not* demand pinned. 

This is with reference to the SEV demand pinning patches that I was working on. 
The understanding was UPM will not reserve memory for SEV/TDX guest in the beginning 
similar to normal guest. Here is the relevant quote from the discussion with Sean[1]:

	"I think we should abandon this approach in favor of committing all our resources
	to fd-based private memory[*], which (if done right) will provide on-demand pinning
	for "free". "

>>> Is there a way to prevent fallocate() from reserving full guest memory?
Regards
Nikunj
[1] https://lore.kernel.org/kvm/YkIh8zM7XfhsFN8L@google.com/
Gupta, Pankaj Aug. 12, 2022, 6:02 a.m. UTC | #16
>>>>> This is the v7 of this series which tries to implement the fd-based 
>>>>> KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>> and the the memory backing store exchange callbacks when such memslot
>>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>>
>>>>> Comparing to existing hva-based memslot, this new type of memslot 
>>>>> allows
>>>>> guest memory unmapped from host userspace like QEMU and even the 
>>>>> kernel
>>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>>
>>>>> Based on this fd-based memslot, we can build guest private memory that
>>>>> is going to be used in confidential computing environments such as 
>>>>> Intel
>>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>>> more enforcement on the fd and KVM can use a single memslot to hold 
>>>>> both
>>>>> the private and shared part of the guest memory.
>>>>>
>>>>> mm extension
>>>>> ---------------------
>>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>>> created with these flags cannot read(), write() or mmap() etc via 
>>>>> normal
>>>>> MMU operations. The file content can only be used with the newly
>>>>> introduced memfile_notifier extension.
>>>>>
>>>>> The memfile_notifier extension provides two sets of callbacks for 
>>>>> KVM to
>>>>> interact with the memory backing store:
>>>>>     - memfile_notifier_ops: callbacks for memory backing store to 
>>>>> notify
>>>>>       KVM when memory gets invalidated.
>>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>>       backing store to request memory pages for guest private memory.
>>>>>
>>>>> The memfile_notifier extension also provides APIs for memory backing
>>>>> store to register/unregister itself and to trigger the notifier 
>>>>> when the
>>>>> bookmarked memory gets invalidated.
>>>>>
>>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>>> prevent double allocation caused by unintentional guest when we only
>>>>> have a single side of the shared/private memfds effective.
>>>>>
>>>>> memslot extension
>>>>> -----------------
>>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>>> that both private/shared guest memory can live in one single memslot.
>>>>> A page in the memslot is either private or shared. Whether a guest 
>>>>> page
>>>>> is private or shared is maintained through reusing existing SEV ioctls
>>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>>
>>>>> Test
>>>>> ----
>>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>>
>>>>> -  Regresion test on kvm/queue (this patchset)
>>>>>      Most new code are not covered. Code also in below repo:
>>>>>      
>>>>> https://github.com/chao-p/linux/tree/privmem-v7
>>>>>
>>>>>
>>>>> -  New Funational test on latest TDX code
>>>>>      The patch is rebased to latest TDX code and tested the new
>>>>>      funcationalities. See below repos:
>>>>>      Linux: 
>>>>> https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>>
>>>>>      QEMU: 
>>>>> https://github.com/chao-p/qemu/tree/privmem-v7
>>>>>
>>>>
>>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>>> an error in QEMU which is not handled (EINTR). With the below handling
>>>> of EINTR subsequent fallocate() succeeds:
>>>>
>>>>
>>>> diff --git a/backends/hostmem-memfd-private.c 
>>>> b/backends/hostmem-memfd-private.c
>>>> index af8fb0c957..e8597ed28d 100644
>>>> --- a/backends/hostmem-memfd-private.c
>>>> +++ b/backends/hostmem-memfd-private.c
>>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>> *backend, Error **errp)
>>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>>        uint32_t ram_flags;
>>>>        char *name;
>>>> -    int fd, priv_fd;
>>>> +    int fd, priv_fd, ret;
>>>>          if (!backend->size) {
>>>>            error_setg(errp, "can't create backend with size 0");
>>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>> *backend, Error **errp)
>>>>                                       backend->size, ram_flags, fd, 
>>>> 0, errp);
>>>>        g_free(name);
>>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>>> +again:
>>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>>> +    if (ret) {
>>>> +           perror("Fallocate failed: \n");
>>>> +           if (errno == EINTR)
>>>> +                   goto again;
>>>> +           else
>>>> +                   exit(1);
>>>> +    }
>>>>
>>>> However, fallocate() preallocates full guest memory before starting 
>>>> the guest.
>>>> With this behaviour guest memory is *not* demand pinned. Is there a 
>>>> way to
>>>> prevent fallocate() from reserving full guest memory?
>>>
>>> Isn't the pinning being handled by the corresponding host memory 
>>> backend with mmu > notifier and architecture support while doing the 
>>> memory operations e.g page> migration and swapping/reclaim (not 
>>> supported currently AFAIU). But yes, we need> to allocate entire 
>>> guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE 
>>> etc}.
>>
>> That is correct, but the question is when does the memory allocated, 
>> as these flags are set,
>> memory is neither moved nor reclaimed. In current scenario, if I start 
>> a 32GB guest, all 32GB is
>> allocated.
> 
> I guess so if guest memory is private by default.
> 
> Other option would be to allocate memory as shared by default and
> handle on demand allocation and RMPUPDATE with page state change event. 
> But still that would be done at guest boot time, IIUC.

Sorry! Don't want to hijack the other thread so replying here.

I thought the question is for SEV SNP. For SEV, maybe the hypercall with 
the page state information can be used to allocate memory as we use it 
or something like quota based memory allocation (just thinking).

> 
> Might be missing some details on this. So, better to wait for someone 
> more familiar to answer.

Same applies here :)

Thanks,
Pankaj
Gupta, Pankaj Aug. 12, 2022, 7:18 a.m. UTC | #17
>>>>>> This is the v7 of this series which tries to implement the 
>>>>>> fd-based KVM
>>>>>> guest private memory. The patches are based on latest kvm/queue 
>>>>>> branch
>>>>>> commit:
>>>>>>
>>>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>>> split_desc_cache only by default capacity
>>>>>>
>>>>>> Introduction
>>>>>> ------------
>>>>>> In general this patch series introduce fd-based memslot which 
>>>>>> provides
>>>>>> guest memory through memory file descriptor fd[offset,size] 
>>>>>> instead of
>>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>>> and the the memory backing store exchange callbacks when such memslot
>>>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>>>
>>>>>> Comparing to existing hva-based memslot, this new type of memslot 
>>>>>> allows
>>>>>> guest memory unmapped from host userspace like QEMU and even the 
>>>>>> kernel
>>>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>>>
>>>>>> Based on this fd-based memslot, we can build guest private memory 
>>>>>> that
>>>>>> is going to be used in confidential computing environments such as 
>>>>>> Intel
>>>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>>>> more enforcement on the fd and KVM can use a single memslot to 
>>>>>> hold both
>>>>>> the private and shared part of the guest memory.
>>>>>>
>>>>>> mm extension
>>>>>> ---------------------
>>>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>>>> created with these flags cannot read(), write() or mmap() etc via 
>>>>>> normal
>>>>>> MMU operations. The file content can only be used with the newly
>>>>>> introduced memfile_notifier extension.
>>>>>>
>>>>>> The memfile_notifier extension provides two sets of callbacks for 
>>>>>> KVM to
>>>>>> interact with the memory backing store:
>>>>>>     - memfile_notifier_ops: callbacks for memory backing store to 
>>>>>> notify
>>>>>>       KVM when memory gets invalidated.
>>>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>>>       backing store to request memory pages for guest private memory.
>>>>>>
>>>>>> The memfile_notifier extension also provides APIs for memory backing
>>>>>> store to register/unregister itself and to trigger the notifier 
>>>>>> when the
>>>>>> bookmarked memory gets invalidated.
>>>>>>
>>>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>>>> prevent double allocation caused by unintentional guest when we only
>>>>>> have a single side of the shared/private memfds effective.
>>>>>>
>>>>>> memslot extension
>>>>>> -----------------
>>>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>>>> that both private/shared guest memory can live in one single memslot.
>>>>>> A page in the memslot is either private or shared. Whether a guest 
>>>>>> page
>>>>>> is private or shared is maintained through reusing existing SEV 
>>>>>> ioctls
>>>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>>>
>>>>>> Test
>>>>>> ----
>>>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>>>
>>>>>> -  Regresion test on kvm/queue (this patchset)
>>>>>>      Most new code are not covered. Code also in below repo:
>>>>>> https://github.com/chao-p/linux/tree/privmem-v7
>>>>>>
>>>>>>
>>>>>>
>>>>>> -  New Funational test on latest TDX code
>>>>>>      The patch is rebased to latest TDX code and tested the new
>>>>>>      funcationalities. See below repos:
>>>>>>      Linux: 
>>>>>> https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>>>
>>>>>>
>>>>>>      QEMU: 
>>>>>> https://github.com/chao-p/qemu/tree/privmem-v7
>>>>>>
>>>>>>
>>>>>
>>>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>>>> an error in QEMU which is not handled (EINTR). With the below handling
>>>>> of EINTR subsequent fallocate() succeeds:
>>>>>
>>>>>
>>>>> diff --git a/backends/hostmem-memfd-private.c 
>>>>> b/backends/hostmem-memfd-private.c
>>>>> index af8fb0c957..e8597ed28d 100644
>>>>> --- a/backends/hostmem-memfd-private.c
>>>>> +++ b/backends/hostmem-memfd-private.c
>>>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>>> *backend, Error **errp)
>>>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>>>        uint32_t ram_flags;
>>>>>        char *name;
>>>>> -    int fd, priv_fd;
>>>>> +    int fd, priv_fd, ret;
>>>>>          if (!backend->size) {
>>>>>            error_setg(errp, "can't create backend with size 0");
>>>>> @@ -65,7 +65,15 @@ 
>>>>> priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error 
>>>>> **errp)
>>>>>                                       backend->size, ram_flags, fd, 
>>>>> 0, errp);
>>>>>        g_free(name);
>>>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>>>> +again:
>>>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>>>> +    if (ret) {
>>>>> +           perror("Fallocate failed: \n");
>>>>> +           if (errno == EINTR)
>>>>> +                   goto again;
>>>>> +           else
>>>>> +                   exit(1);
>>>>> +    }
>>>>>
>>>>> However, fallocate() preallocates full guest memory before starting 
>>>>> the guest.
>>>>> With this behaviour guest memory is *not* demand pinned. Is there a 
>>>>> way to
>>>>> prevent fallocate() from reserving full guest memory?
>>>>
>>>> Isn't the pinning being handled by the corresponding host memory 
>>>> backend with mmu > notifier and architecture support while doing the 
>>>> memory operations e.g page> migration and swapping/reclaim (not 
>>>> supported currently AFAIU). But yes, we need> to allocate entire 
>>>> guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE 
>>>> etc}.
>>>
>>> That is correct, but the question is when does the memory allocated, 
>>> as these flags are set,
>>> memory is neither moved nor reclaimed. In current scenario, if I 
>>> start a 32GB guest, all 32GB is
>>> allocated.
>>
>> I guess so if guest memory is private by default.
>>
>> Other option would be to allocate memory as shared by default and
>> handle on demand allocation and RMPUPDATE with page state change 
>> event. But still that would be done at guest boot time, IIUC.
> 
> Sorry! Don't want to hijack the other thread so replying here.
> 
> I thought the question is for SEV SNP. For SEV, maybe the hypercall with 
> the page state information can be used to allocate memory as we use it 
> or something like quota based memory allocation (just thinking).

But all this would have considerable performance overhead (if by default 
memory is shared) and used mostly at boot time. So, preallocating memory 
(default memory private) seems better approach for both SEV & SEV SNP 
with later page management (pinning, reclaim) taken care by host memory 
backend & architecture together.

Or maybe later we can think of something like allowing direct page fault 
on host memory access for *SEV* guest as there is no strict requirement 
for memory integrity guarantee and the performance overhead.

Don't know if it is feasible, just sharing my thoughts.

Thanks,
Pankaj

> 
>>
>> Might be missing some details on this. So, better to wait for someone 
>> more familiar to answer.
> 
> Same applies here :)
> 
> Thanks,
> Pankaj
> 
>
Nikunj A. Dadhania Aug. 12, 2022, 8:48 a.m. UTC | #18
On 12/08/22 12:48, Gupta, Pankaj wrote:
> 
>>>>>>
>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>
>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>
>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>> allocated.
>>>
>>> I guess so if guest memory is private by default.
>>>
>>> Other option would be to allocate memory as shared by default and
>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>
>> Sorry! Don't want to hijack the other thread so replying here.
>>
>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
> 
> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 

> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.

I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.

Regards
Nikunj
Gupta, Pankaj Aug. 12, 2022, 9:33 a.m. UTC | #19
>>>>>>>
>>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>>
>>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>>
>>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>>> allocated.
>>>>
>>>> I guess so if guest memory is private by default.
>>>>
>>>> Other option would be to allocate memory as shared by default and
>>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>>
>>> Sorry! Don't want to hijack the other thread so replying here.
>>>
>>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
>>
>> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time.
> 
>> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
> 
> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.

For SEV I am also not very sure what would be the best way.
There could be a tradeoff between memory pinning and performance.
As I was also thinking about "Async page fault" aspect of guest
in my previous reply. Details needs to be figure out.

Quoting my previous reply here:
"Or maybe later we can think of something like allowing direct page 
fault on host memory access for *SEV* guest as there is no strict 
requirement for memory integrity guarantee and the performance overhead."

Thanks,
Pankaj
Chao Peng Aug. 15, 2022, 1:04 p.m. UTC | #20
On Fri, Aug 12, 2022 at 02:18:43PM +0530, Nikunj A. Dadhania wrote:
> 
> 
> On 12/08/22 12:48, Gupta, Pankaj wrote:
> > 
> >>>>>>
> >>>>>> However, fallocate() preallocates full guest memory before starting the guest.
> >>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
> >>>>>> prevent fallocate() from reserving full guest memory?
> >>>>>
> >>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
> >>>>
> >>>> That is correct, but the question is when does the memory allocated, as these flags are set,
> >>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
> >>>> allocated.
> >>>
> >>> I guess so if guest memory is private by default.
> >>>
> >>> Other option would be to allocate memory as shared by default and
> >>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
> >>
> >> Sorry! Don't want to hijack the other thread so replying here.
> >>
> >> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
> > 
> > But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 
> 
> > So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
> 
> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.

Actually the current version allows you to delay the allocation to a
later time (e.g. page fault time) if you don't call fallocate() on the
private fd. fallocate() is necessary in previous versions because we
treat the existense in the fd as 'private' but in this version we track
private/shared info in KVM so we don't rely on that fact from memory
backstores.

Definitely the page will still be pinned once it's allocated, there is
no way to swap it out for example just with the current code. That kind
of support, if desirable, can be extended through MOVABLE flag and some
other callbacks to let feature-specific code to involve.

Chao
> 
> Regards
> Nikunj
Nikunj A. Dadhania Aug. 16, 2022, 4:28 a.m. UTC | #21
On 15/08/22 18:34, Chao Peng wrote:
> On Fri, Aug 12, 2022 at 02:18:43PM +0530, Nikunj A. Dadhania wrote:
>>
>>
>> On 12/08/22 12:48, Gupta, Pankaj wrote:
>>>
>>>>>>>>
>>>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>>>
>>>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>>>
>>>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>>>> allocated.
>>>>>
>>>>> I guess so if guest memory is private by default.
>>>>>
>>>>> Other option would be to allocate memory as shared by default and
>>>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>>>
>>>> Sorry! Don't want to hijack the other thread so replying here.
>>>>
>>>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
>>>
>>> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 
>>
>>> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
>>
>> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.
> 
> Actually the current version allows you to delay the allocation to a
> later time (e.g. page fault time) if you don't call fallocate() on the
> private fd. fallocate() is necessary in previous versions because we
> treat the existense in the fd as 'private' but in this version we track
> private/shared info in KVM so we don't rely on that fact from memory
> backstores.

Thanks for confirming Chao, in that case we can drop fallocate() from qemu 
in both the case
* Once while creating the memfd private object
* During ram_block_convert_range() for shared->private and vice versa.
 
> Definitely the page will still be pinned once it's allocated, there is
> no way to swap it out for example just with the current code. 

Agree, at present once the page is brought in, page will remain till VM shutdown.

> That kind of support, if desirable, can be extended through MOVABLE flag and some
> other callbacks to let feature-specific code to involve.

Sure, that could be future work.

Regards
Nikunj
Gupta, Pankaj Aug. 16, 2022, 11:33 a.m. UTC | #22
Hi Chao,

> 
> Actually the current version allows you to delay the allocation to a
> later time (e.g. page fault time) if you don't call fallocate() on the
> private fd. fallocate() is necessary in previous versions because we
> treat the existense in the fd as 'private' but in this version we track
> private/shared info in KVM so we don't rely on that fact from memory
> backstores.

Does this also mean reservation of guest physical memory with secure 
processor (both for SEV-SNP & TDX) will also happen at page fault time?

Do we plan to keep it this way?

Thanks,
Pankaj
> 
> Definitely the page will still be pinned once it's allocated, there is
> no way to swap it out for example just with the current code. That kind
> of support, if desirable, can be extended through MOVABLE flag and some
> other callbacks to let feature-specific code to involve.
Kirill A. Shutemov Aug. 16, 2022, 12:24 p.m. UTC | #23
On Tue, Aug 16, 2022 at 01:33:00PM +0200, Gupta, Pankaj wrote:
> Hi Chao,
> 
> > 
> > Actually the current version allows you to delay the allocation to a
> > later time (e.g. page fault time) if you don't call fallocate() on the
> > private fd. fallocate() is necessary in previous versions because we
> > treat the existense in the fd as 'private' but in this version we track
> > private/shared info in KVM so we don't rely on that fact from memory
> > backstores.
> 
> Does this also mean reservation of guest physical memory with secure
> processor (both for SEV-SNP & TDX) will also happen at page fault time?
> 
> Do we plan to keep it this way?

If you are talking about accepting memory by the guest, it is initiated by
the guest and has nothing to do with page fault time vs fallocate()
allocation of host memory. I mean acceptance happens after host memory
allocation but they are not in lockstep, acceptance can happen much later.
Gupta, Pankaj Aug. 16, 2022, 1:03 p.m. UTC | #24
>>> Actually the current version allows you to delay the allocation to a
>>> later time (e.g. page fault time) if you don't call fallocate() on the
>>> private fd. fallocate() is necessary in previous versions because we
>>> treat the existense in the fd as 'private' but in this version we track
>>> private/shared info in KVM so we don't rely on that fact from memory
>>> backstores.
>>
>> Does this also mean reservation of guest physical memory with secure
>> processor (both for SEV-SNP & TDX) will also happen at page fault time?
>>
>> Do we plan to keep it this way?
> 
> If you are talking about accepting memory by the guest, it is initiated by
> the guest and has nothing to do with page fault time vs fallocate()
> allocation of host memory. I mean acceptance happens after host memory
> allocation but they are not in lockstep, acceptance can happen much later.

No, I meant reserving guest physical memory range from hypervisor e.g 
with RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).

Thanks,
Pankaj
Sean Christopherson Aug. 16, 2022, 3:38 p.m. UTC | #25
On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
> 
> > > > Actually the current version allows you to delay the allocation to a
> > > > later time (e.g. page fault time) if you don't call fallocate() on the
> > > > private fd. fallocate() is necessary in previous versions because we
> > > > treat the existense in the fd as 'private' but in this version we track
> > > > private/shared info in KVM so we don't rely on that fact from memory
> > > > backstores.
> > > 
> > > Does this also mean reservation of guest physical memory with secure
> > > processor (both for SEV-SNP & TDX) will also happen at page fault time?
> > > 
> > > Do we plan to keep it this way?
> > 
> > If you are talking about accepting memory by the guest, it is initiated by
> > the guest and has nothing to do with page fault time vs fallocate()
> > allocation of host memory. I mean acceptance happens after host memory
> > allocation but they are not in lockstep, acceptance can happen much later.
> 
> No, I meant reserving guest physical memory range from hypervisor e.g with
> RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).

As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
for userspace to pre-map guest memory.

I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
initializing guest private memory with a source page can be implemented via a
flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
flag could be added to say that the dest is also the source.

The TDX and SNP restrictions would then become addition restrictions on when
initializing with a source is allowed (and VMs that don't have guest private
memory wouldn't allow the flag at all).
Michael Roth Aug. 17, 2022, 3:27 p.m. UTC | #26
On Tue, Aug 16, 2022 at 03:38:08PM +0000, Sean Christopherson wrote:
> On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
> > 
> > > > > Actually the current version allows you to delay the allocation to a
> > > > > later time (e.g. page fault time) if you don't call fallocate() on the
> > > > > private fd. fallocate() is necessary in previous versions because we
> > > > > treat the existense in the fd as 'private' but in this version we track
> > > > > private/shared info in KVM so we don't rely on that fact from memory
> > > > > backstores.
> > > > 
> > > > Does this also mean reservation of guest physical memory with secure
> > > > processor (both for SEV-SNP & TDX) will also happen at page fault time?
> > > > 
> > > > Do we plan to keep it this way?
> > > 
> > > If you are talking about accepting memory by the guest, it is initiated by
> > > the guest and has nothing to do with page fault time vs fallocate()
> > > allocation of host memory. I mean acceptance happens after host memory
> > > allocation but they are not in lockstep, acceptance can happen much later.
> > 
> > No, I meant reserving guest physical memory range from hypervisor e.g with
> > RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
> 
> As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
> for userspace to pre-map guest memory.

Hi Sean,

Currently I have the rmpupdate hook in KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION
ioctls, so that when the pages actually get faulted in they are already
in the expected state. I have userspace set up to call
KVM_MEMORY_ENCRYPT_* in response to explicit page state changes issued by
the guest, as well as in response to MEMORY_FAULT exits for implicit page
state changes.

Initially the private backing store may or may not be pre-fallocate()'d
depending on how userspace wants to handle it. If it's not
pre-fallocate()'d, then the pages don't get faulted in until the guest
does explicit page state changes (currently SNP guests will do this for all
memory at boot time, but with unaccepted memory patches for guest/ovmf
this will happen during guest run-time, would still allow us to make
efficient use of lazy-pinning support for shorter boot times).

If userspaces wants to pre-allocate, it can issue the fallocate() for
all the ranges up-front so it doesn't incur the cost during run-time.

Is that compatible with the proposed design?

Of course, for the initial encrypted payload, we would need to to issue
the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION up-front. I'm doing that in
conjunction with the hack to allow pwrite() to memfd to pre-populate the
private pages before the in-place encryption that occurs when
SNP_LAUNCH_UPDATE is issued...

In the past you and Vishal suggested doing the copy from within
SNP_LAUNCH_UPDATE, which seems like a workable solution and something
we've been meaning to implement...

> 
> I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> initializing guest private memory with a source page can be implemented via a
> flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> flag could be added to say that the dest is also the source.

So is this proposed ioctl only intended to handle the initial encrypted
payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would
still be used for conversions post-boot?

If so, that seems reasonable, but I thought there was some consensus that
just handling it per-platform in, e.g., SNP_LAUNCH_UPDATE, was
sufficient for now until some additional need arose for a new interface.
Has something changed in the regard? Just want to understand the
motivations so we can plan accordingly.

Thanks!

-Mike

> 
> The TDX and SNP restrictions would then become addition restrictions on when
> initializing with a source is allowed (and VMs that don't have guest private
> memory wouldn't allow the flag at all).
Hugh Dickins Aug. 18, 2022, 5:40 a.m. UTC | #27
On Wed, 6 Jul 2022, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory.

Here at last are my reluctant thoughts on this patchset.

fd-based approach for supporting KVM guest private memory: fine.

Use or abuse of memfd and shmem.c: mistaken.

memfd_create() was an excellent way to put together the initial prototype.

But since then, TDX in particular has forced an effort into preventing
(by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.

Are any of the shmem.c mods useful to existing users of shmem.c? No.
Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

What use do you have for a filesystem here?  Almost none.
IIUC, what you want is an fd through which QEMU can allocate kernel
memory, selectively free that memory, and communicate fd+offset+length
to KVM.  And perhaps an interface to initialize a little of that memory
from a template (presumably copied from a real file on disk somewhere).

You don't need shmem.c or a filesystem for that!

If your memory could be swapped, that would be enough of a good reason
to make use of shmem.c: but it cannot be swapped; and although there
are some references in the mailthreads to it perhaps being swappable
in future, I get the impression that will not happen soon if ever.

If your memory could be migrated, that would be some reason to use
filesystem page cache (because page migration happens to understand
that type of memory): but it cannot be migrated.

Some of these impressions may come from earlier iterations of the
patchset (v7 looks better in several ways than v5).  I am probably
underestimating the extent to which you have taken on board other
usages beyond TDX and SEV private memory, and rightly want to serve
them all with similar interfaces: perhaps there is enough justification
for shmem there, but I don't see it.  There was mention of userfaultfd
in one link: does that provide the justification for using shmem?

I'm afraid of the special demands you may make of memory allocation
later on - surprised that huge pages are not mentioned already;
gigantic contiguous extents? secretmem removed from direct map?

Here's what I would prefer, and imagine much easier for you to maintain;
but I'm no system designer, and may be misunderstanding throughout.

QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
the fallocate syscall interface itself) to allocate and free the memory,
ioctl for initializing some of it too.  KVM in control of whether that
fd can be read or written or mmap'ed or whatever, no need to prevent it
in shmem.c, no need for flags, seals, notifications to and fro because
KVM is already in control and knows the history.  If shmem actually has
value, call into it underneath - somewhat like SysV SHM, and /dev/zero
mmap, and i915/gem make use of it underneath.  If shmem has nothing to
add, just allocate and free kernel memory directly, recorded in your
own xarray.

With that /dev/kvm_something subject to access controls and LSMs -
which I cannot find for memfd_create().  Full marks for including the
MFD_INACCESSIBLE manpage update, and for Cc'ing linux-api: but I'd
have expected some doubts from that direction already.

Hugh
Kirill A. Shutemov Aug. 18, 2022, 1:24 p.m. UTC | #28
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> On Wed, 6 Jul 2022, Chao Peng wrote:
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory.
> 
> Here at last are my reluctant thoughts on this patchset.
> 
> fd-based approach for supporting KVM guest private memory: fine.
> 
> Use or abuse of memfd and shmem.c: mistaken.
> 
> memfd_create() was an excellent way to put together the initial prototype.
> 
> But since then, TDX in particular has forced an effort into preventing
> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> 
> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> What use do you have for a filesystem here?  Almost none.
> IIUC, what you want is an fd through which QEMU can allocate kernel
> memory, selectively free that memory, and communicate fd+offset+length
> to KVM.  And perhaps an interface to initialize a little of that memory
> from a template (presumably copied from a real file on disk somewhere).
> 
> You don't need shmem.c or a filesystem for that!
> 
> If your memory could be swapped, that would be enough of a good reason
> to make use of shmem.c: but it cannot be swapped; and although there
> are some references in the mailthreads to it perhaps being swappable
> in future, I get the impression that will not happen soon if ever.
> 
> If your memory could be migrated, that would be some reason to use
> filesystem page cache (because page migration happens to understand
> that type of memory): but it cannot be migrated.

Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
theoretically possible, but I'm not aware of any plans as of now.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

> Some of these impressions may come from earlier iterations of the
> patchset (v7 looks better in several ways than v5).  I am probably
> underestimating the extent to which you have taken on board other
> usages beyond TDX and SEV private memory, and rightly want to serve
> them all with similar interfaces: perhaps there is enough justification
> for shmem there, but I don't see it.  There was mention of userfaultfd
> in one link: does that provide the justification for using shmem?
> 
> I'm afraid of the special demands you may make of memory allocation
> later on - surprised that huge pages are not mentioned already;
> gigantic contiguous extents? secretmem removed from direct map?

The design allows for extension to hugetlbfs if needed. Combination of
MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
implications for shmem. It is going to be separate struct memfile_backing_store.

I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
to be movable if platform supports it and secretmem is not migratable by
design (without direct mapping fragmentations).

> Here's what I would prefer, and imagine much easier for you to maintain;
> but I'm no system designer, and may be misunderstanding throughout.
> 
> QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> the fallocate syscall interface itself) to allocate and free the memory,
> ioctl for initializing some of it too.  KVM in control of whether that
> fd can be read or written or mmap'ed or whatever, no need to prevent it
> in shmem.c, no need for flags, seals, notifications to and fro because
> KVM is already in control and knows the history.  If shmem actually has
> value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> add, just allocate and free kernel memory directly, recorded in your
> own xarray.

I guess shim layer on top of shmem *can* work. I don't see immediately why
it would not. But I'm not sure it is right direction. We risk creating yet
another parallel VM with own rules/locking/accounting that opaque to
core-mm.

Note that on machines that run TDX guests such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.
Sean Christopherson Aug. 19, 2022, 12:20 a.m. UTC | #29
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > On Wed, 6 Jul 2022, Chao Peng wrote:
> > But since then, TDX in particular has forced an effort into preventing
> > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > 
> > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
aren't useful for _all_ existing users, but I don't think it's fair to say that
they're not useful for _any_ existing users.

> > What use do you have for a filesystem here?  Almost none.
> > IIUC, what you want is an fd through which QEMU can allocate kernel
> > memory, selectively free that memory, and communicate fd+offset+length
> > to KVM.  And perhaps an interface to initialize a little of that memory
> > from a template (presumably copied from a real file on disk somewhere).
> > 
> > You don't need shmem.c or a filesystem for that!
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. 

And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
without having to take on the overhead of confidential VMs.  At that point,
migration and probably even swap are on the table.

> And swapping theoretically possible, but I'm not aware of any plans as of
> now.

Ya, I highly doubt confidential VMs will ever bother with swap.

> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.
> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).

But secretmem _could_ be a fit.  If a use case wants to unmap guest private memory
from both userspace and the kernel then KVM should absolutely be able to support
that, but at the same time I don't want to have to update KVM to enable secretmem
(and I definitely don't want KVM poking into the directmap itself).

MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace",
any other properties should be completely separate, e.g. the inability to migrate
pages is effective a restriction from KVM (acting on behalf of TDX/SNP), it's not
a fundamental property of MFD_INACCESSIBLE.
Hugh Dickins Aug. 19, 2022, 3 a.m. UTC | #30
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

I always forget, migration means different things to different audiences.
As an mm person, I was meaning page migration, whereas a virtualization
person thinks VM live migration (which that reference appears to be about),
a scheduler person task migration, an ornithologist bird migration, etc.

But you're an mm person too: you may have cited that reference in the
knowledge that TDX 1.5 Live Migration will entail page migration of the
kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

> 
> > Some of these impressions may come from earlier iterations of the
> > patchset (v7 looks better in several ways than v5).  I am probably
> > underestimating the extent to which you have taken on board other
> > usages beyond TDX and SEV private memory, and rightly want to serve
> > them all with similar interfaces: perhaps there is enough justification
> > for shmem there, but I don't see it.  There was mention of userfaultfd
> > in one link: does that provide the justification for using shmem?
> > 
> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.

Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.

But I've yet to see why you want to involve this or that filesystem
(with all its filesystem-icity suppressed) at all.  The backing store
is host memory, and tmpfs and hugetlbfs just impose their own
idiosyncrasies on how that memory is allocated; but I think you would
do better to choose your own idiosyncrasies in allocation directly -
you don't need a different "backing store" to choose between 4k or 2M
or 1G or whatever allocations.

tmpfs and hugetlbfs and page cache are designed around sharing memory:
TDX is designed around absolutely not sharing memory; and the further
uses which Sean foresees appear not to need it as page cache either.

Except perhaps for page migration reasons.  It's somewhat incidental,  
but of course page migration knows how to migrate page cache, so
masquerading as page cache will give a short cut to page migration,
when page migration becomes at all possible.

> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).
> 
> > Here's what I would prefer, and imagine much easier for you to maintain;
> > but I'm no system designer, and may be misunderstanding throughout.
> > 
> > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > the fallocate syscall interface itself) to allocate and free the memory,
> > ioctl for initializing some of it too.  KVM in control of whether that
> > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > in shmem.c, no need for flags, seals, notifications to and fro because
> > KVM is already in control and knows the history.  If shmem actually has
> > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > add, just allocate and free kernel memory directly, recorded in your
> > own xarray.
> 
> I guess shim layer on top of shmem *can* work. I don't see immediately why
> it would not. But I'm not sure it is right direction. We risk creating yet
> another parallel VM with own rules/locking/accounting that opaque to
> core-mm.

You are already proposing a new set of rules, foreign to how tmpfs works
for others.  You're right that KVM allocating large amounts of memory,
opaque to core-mm, carries risk: and you'd be right to say that shmem.c
provides some clues (security_vm_enough_memory checks, memcg charging,
user_shm_lock accounting) on what to remember.

But I'm not up to the job of being the one to police you there,
and you don't want to be waiting on me either.

To take a rather silly example: Ted just added chattr support to tmpfs,
and it fits in well.  But I don't now want to have to decide whether
"chattr +i" FS_IMMUTABLE_FL is or is not compatible with
MEMFILE_F_USER_INACCESSIBLE.  They are from different worlds,
and I'd prefer KVM to carry the weight of imposing INACCESSIBLE:
which seems easily done if it manages the fd, without making the
memory allocated to that fd accessible to those who hold the fd.

> 
> Note that on machines that run TDX guests such memory would likely be the
> bulk of memory use. Treating it as a fringe case may bite us one day.

Yes, I suspected that machines running TDX guests might well consume
most of the memory that way, but glad(?) to hear it confirmed.

I am not suggesting that this memory be treated as a fringe case, rather
the reverse: a different case, not something to hide away inside shmem.c.

Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint
of this usage?  Whether or not it's also included in "Shmem:", I expect
that its different characteristics will deserve its own display.

Hugh
Hugh Dickins Aug. 19, 2022, 3:38 a.m. UTC | #31
On Fri, 19 Aug 2022, Sean Christopherson wrote:
> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > On Wed, 6 Jul 2022, Chao Peng wrote:
> > > But since then, TDX in particular has forced an effort into preventing
> > > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > > 
> > > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> aren't useful for _all_ existing users, but I don't think it's fair to say that
> they're not useful for _any_ existing users.

Okay, I stand corrected: there exist some users of memfd_create()
who will also have use for "INACCESSIBLE" memory.

> 
> > > What use do you have for a filesystem here?  Almost none.
> > > IIUC, what you want is an fd through which QEMU can allocate kernel
> > > memory, selectively free that memory, and communicate fd+offset+length
> > > to KVM.  And perhaps an interface to initialize a little of that memory
> > > from a template (presumably copied from a real file on disk somewhere).
> > > 
> > > You don't need shmem.c or a filesystem for that!
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in pipeline. It is part of TDX 1.5 [1]. 
> 
> And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
> from being able to use UPM for "regular" VMs as a way to provide defense-in-depth

UPM? That's an acronym from your side of the fence, I spy references to
it in the mail threads, but haven't tracked down a definition.  I'll
just take it to mean the fd-based memory we're discussing.

> without having to take on the overhead of confidential VMs.  At that point,
> migration and probably even swap are on the table.

Good, the more "flexible" that memory is, the better for competing users
of memory.  But an fd supplied by KVM gives you freedom to change to a
better implementation of allocation underneath, whenever it suits you.
Maybe shmem beneath is good from the start, maybe not.

Hugh
Sean Christopherson Aug. 19, 2022, 10:53 p.m. UTC | #32
On Thu, Aug 18, 2022, Hugh Dickins wrote:
> On Fri, 19 Aug 2022, Sean Christopherson wrote:
> > On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > If your memory could be migrated, that would be some reason to use
> > > > filesystem page cache (because page migration happens to understand
> > > > that type of memory): but it cannot be migrated.
> > > 
> > > Migration support is in pipeline. It is part of TDX 1.5 [1]. 
> > 
> > And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
> > from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
> 
> UPM? That's an acronym from your side of the fence, I spy references to
> it in the mail threads, but haven't tracked down a definition.  I'll
> just take it to mean the fd-based memory we're discussing.

Ya, sorry, UPM is what we came up with as shorthand for "Unmapping guest Private
Memory".  Your assumption is spot on, it's just a fancy way of saying "guest is
backed with inaccessible fd-based memory".

> > without having to take on the overhead of confidential VMs.  At that point,
> > migration and probably even swap are on the table.
> 
> Good, the more "flexible" that memory is, the better for competing users
> of memory.  But an fd supplied by KVM gives you freedom to change to a
> better implementation of allocation underneath, whenever it suits you.
> Maybe shmem beneath is good from the start, maybe not.

The main flaw with KVM providing the fd is that it forces KVM to get into the
memory management business, which us KVM folks really, really do not want to do.
And based on the types of bugs KVM has had in the past related to memory management,
it's a safe bet to say the mm folks don't want us getting involved either :-)

The combination of gup()/follow_pte() and mmu_notifiers has worked very well.
KVM gets a set of (relatively) simple rules to follow and doesn't have to be taught
new things every time a new backing type comes along.  And from the other side, KVM
has very rarely had to go poke into other subsystems' code to support exposing a
new type of memory to guests.

What we're trying to do with UPM/fd-based memory is establish a similar contract
between mm and KVM, but without requiring mm to also map memory into host userspace.

The only way having KVM provide the fd works out in the long run is if KVM is the
only subsystem that ever wants to make use of memory that isn't accessible from
userspace and isn't tied to a specific backing type, _and_ if the set of backing
types that KVM ever supports is kept to an absolute minimum.
Kirill A. Shutemov Aug. 20, 2022, 12:27 a.m. UTC | #33
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> > theoretically possible, but I'm not aware of any plans as of now.
> > 
> > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 
> I always forget, migration means different things to different audiences.
> As an mm person, I was meaning page migration, whereas a virtualization
> person thinks VM live migration (which that reference appears to be about),
> a scheduler person task migration, an ornithologist bird migration, etc.
> 
> But you're an mm person too: you may have cited that reference in the
> knowledge that TDX 1.5 Live Migration will entail page migration of the
> kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

TDX 1.5 brings both.

In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.

> > > Some of these impressions may come from earlier iterations of the
> > > patchset (v7 looks better in several ways than v5).  I am probably
> > > underestimating the extent to which you have taken on board other
> > > usages beyond TDX and SEV private memory, and rightly want to serve
> > > them all with similar interfaces: perhaps there is enough justification
> > > for shmem there, but I don't see it.  There was mention of userfaultfd
> > > in one link: does that provide the justification for using shmem?
> > > 
> > > I'm afraid of the special demands you may make of memory allocation
> > > later on - surprised that huge pages are not mentioned already;
> > > gigantic contiguous extents? secretmem removed from direct map?
> > 
> > The design allows for extension to hugetlbfs if needed. Combination of
> > MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> > implications for shmem. It is going to be separate struct memfile_backing_store.
> 
> Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
> memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
> the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
> 
> But I've yet to see why you want to involve this or that filesystem
> (with all its filesystem-icity suppressed) at all.  The backing store
> is host memory, and tmpfs and hugetlbfs just impose their own
> idiosyncrasies on how that memory is allocated; but I think you would
> do better to choose your own idiosyncrasies in allocation directly -
> you don't need a different "backing store" to choose between 4k or 2M
> or 1G or whatever allocations.

These idiosyncrasies are well known: user who used hugetlbfs may want to
get direct replacement that would tap into the same hugetlb reserves and
get the same allocation guarantees. Admins know where to look if ENOMEM
happens.

For THP, admin may know how to tweak allocation/defrag policy for his
liking and how to track if they are allocated.

> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.
> 
> > 
> > I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> > to be movable if platform supports it and secretmem is not migratable by
> > design (without direct mapping fragmentations).
> > 
> > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > but I'm no system designer, and may be misunderstanding throughout.
> > > 
> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > the fallocate syscall interface itself) to allocate and free the memory,
> > > ioctl for initializing some of it too.  KVM in control of whether that
> > > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > > in shmem.c, no need for flags, seals, notifications to and fro because
> > > KVM is already in control and knows the history.  If shmem actually has
> > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > > add, just allocate and free kernel memory directly, recorded in your
> > > own xarray.
> > 
> > I guess shim layer on top of shmem *can* work. I don't see immediately why
> > it would not. But I'm not sure it is right direction. We risk creating yet
> > another parallel VM with own rules/locking/accounting that opaque to
> > core-mm.
> 
> You are already proposing a new set of rules, foreign to how tmpfs works
> for others.  You're right that KVM allocating large amounts of memory,
> opaque to core-mm, carries risk: and you'd be right to say that shmem.c
> provides some clues (security_vm_enough_memory checks, memcg charging,
> user_shm_lock accounting) on what to remember.

That's a nice list of clues that would need to be re-implemented somewhere
else to get competent solution.

> But I'm not up to the job of being the one to police you there,
> and you don't want to be waiting on me either.

> To take a rather silly example: Ted just added chattr support to tmpfs,
> and it fits in well.  But I don't now want to have to decide whether
> "chattr +i" FS_IMMUTABLE_FL is or is not compatible with
> MEMFILE_F_USER_INACCESSIBLE.  They are from different worlds,
> and I'd prefer KVM to carry the weight of imposing INACCESSIBLE:
> which seems easily done if it manages the fd, without making the
> memory allocated to that fd accessible to those who hold the fd.

From a quick look, these are orthogonal. But it is not your point.

Yes, INACCESSIBLE is increase of complexity which you do not want to deal
with in shmem.c. It get it.

I will try next week to rework it as shim to top of shmem. Does it work
for you?

But I think it is wrong to throw it over the fence to KVM folks and say it
is your problem. Core MM has to manage it.

> > Note that on machines that run TDX guests such memory would likely be the
> > bulk of memory use. Treating it as a fringe case may bite us one day.
> 
> Yes, I suspected that machines running TDX guests might well consume
> most of the memory that way, but glad(?) to hear it confirmed.
> 
> I am not suggesting that this memory be treated as a fringe case, rather
> the reverse: a different case, not something to hide away inside shmem.c.
> 
> Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint
> of this usage?  Whether or not it's also included in "Shmem:", I expect
> that its different characteristics will deserve its own display.

That's the hint users know about from previous experience.
Hugh Dickins Aug. 21, 2022, 5:15 a.m. UTC | #34
On Sat, 20 Aug 2022, Kirill A. Shutemov wrote:
> 
> Yes, INACCESSIBLE is increase of complexity which you do not want to deal
> with in shmem.c. It get it.

It's not so much that INACCESSIBLE increases the complexity of
memfd/shmem/tmpfs, as that it is completely foreign to it.

And by handling all those foreign needs at the KVM end (where you
can be sure that the mem attached to the fd is INACCESSIBLE because
you have given nobody access to it - no handshaking with 3rd party
required).

> 
> I will try next week to rework it as shim to top of shmem. Does it work
> for you?

Yes, please do, thanks.  It's a compromise between us: the initial TDX
case has no justification to use shmem at all, but doing it that way
will help you with some of the infrastructure, and will probably be
easiest for KVM to extend to other more relaxed fd cases later.

> 
> But I think it is wrong to throw it over the fence to KVM folks and say it
> is your problem. Core MM has to manage it.

We disagree on who is throwing over the fence to whom :)

Core MM should manage the core MM parts and KVM should manage the KVM
parts.  What makes this rather different from most driver usage of MM,
is that KVM seems likely to use a great proportion of memory this way.
With great memory usage comes great responsibility: I don't think
all those flags and seals and notifiers let KVM escape from that.

Hugh
Matthew Wilcox Aug. 21, 2022, 10:27 a.m. UTC | #35
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.

I haven't read the patch series, and I'm not taking a position one way
or the other on whether this is better implemented as a shmem addition
or a shim that asks shmem for memory.  Page migration can be done for
driver memory by using PageMovable.  I just rewrote how it works, so
the details are top of my mind at the moment if anyone wants something
explained.  Commit 68f2736a8583 is the key one to look at.
Isaku Yamahata Aug. 23, 2022, 1:25 a.m. UTC | #36
On Wed, Aug 17, 2022 at 10:27:19AM -0500,
Michael Roth <michael.roth@amd.com> wrote:

> > I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> > vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> > initializing guest private memory with a source page can be implemented via a
> > flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> > flag could be added to say that the dest is also the source.
> 
> So is this proposed ioctl only intended to handle the initial encrypted
> payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would
> still be used for conversions post-boot?

Yes.  It is called before running any vcpu.  At run time (after running vcpus),
KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION is used.
David Hildenbrand Aug. 23, 2022, 7:55 a.m. UTC | #37
On 19.08.22 05:38, Hugh Dickins wrote:
> On Fri, 19 Aug 2022, Sean Christopherson wrote:
>> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>> But since then, TDX in particular has forced an effort into preventing
>>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>>>
>>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>
>> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
>> aren't useful for _all_ existing users, but I don't think it's fair to say that
>> they're not useful for _any_ existing users.
> 
> Okay, I stand corrected: there exist some users of memfd_create()
> who will also have use for "INACCESSIBLE" memory.

As raised in reply to the relevant patch, I'm not sure if we really have
to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
requirement of specific memfd_notifer (memfile_notifier) implementations
-- such as TDX that will convert the memory and MCE-kill the machine on
ordinary write access. We might be able to set/enforce this when
registering a notifier internally instead, and fail notifier
registration if a condition isn't met (e.g., existing mmap).

So I'd be curious, which other users of shmem/memfd would benefit from
(MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
Sean Christopherson Aug. 23, 2022, 4:05 p.m. UTC | #38
On Tue, Aug 23, 2022, David Hildenbrand wrote:
> On 19.08.22 05:38, Hugh Dickins wrote:
> > On Fri, 19 Aug 2022, Sean Christopherson wrote:
> >> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> >>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >>>> On Wed, 6 Jul 2022, Chao Peng wrote:
> >>>> But since then, TDX in particular has forced an effort into preventing
> >>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> >>>>
> >>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> >>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> >>
> >> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> >> aren't useful for _all_ existing users, but I don't think it's fair to say that
> >> they're not useful for _any_ existing users.
> > 
> > Okay, I stand corrected: there exist some users of memfd_create()
> > who will also have use for "INACCESSIBLE" memory.
> 
> As raised in reply to the relevant patch, I'm not sure if we really have
> to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
> requirement of specific memfd_notifer (memfile_notifier) implementations
> -- such as TDX that will convert the memory and MCE-kill the machine on
> ordinary write access. We might be able to set/enforce this when
> registering a notifier internally instead, and fail notifier
> registration if a condition isn't met (e.g., existing mmap).
>
> So I'd be curious, which other users of shmem/memfd would benefit from
> (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?

I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
flags that get added in the future).

AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
any unique functionality.

If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be
set for the initial implementation.  I.e. disallow binding without INACCESSIBLE
until there's a use case.
Gupta, Pankaj Aug. 23, 2022, 5:41 p.m. UTC | #39
>>>>> Actually the current version allows you to delay the allocation to a
>>>>> later time (e.g. page fault time) if you don't call fallocate() on the
>>>>> private fd. fallocate() is necessary in previous versions because we
>>>>> treat the existense in the fd as 'private' but in this version we track
>>>>> private/shared info in KVM so we don't rely on that fact from memory
>>>>> backstores.
>>>>
>>>> Does this also mean reservation of guest physical memory with secure
>>>> processor (both for SEV-SNP & TDX) will also happen at page fault time?
>>>>
>>>> Do we plan to keep it this way?
>>>
>>> If you are talking about accepting memory by the guest, it is initiated by
>>> the guest and has nothing to do with page fault time vs fallocate()
>>> allocation of host memory. I mean acceptance happens after host memory
>>> allocation but they are not in lockstep, acceptance can happen much later.
>>
>> No, I meant reserving guest physical memory range from hypervisor e.g with
>> RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
> 
> As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
> for userspace to pre-map guest memory.
> 
> I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> initializing guest private memory with a source page can be implemented via a
> flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> flag could be added to say that the dest is also the source.

Questions to clarify *my* understanding here:

- Do you suggest to use KVM_TDX_INIT_MEM_REGION into a generic ioctl to
   pre-map guest private memory in addition to initialize the payload
   (in-place encryption or just copy page to guest private memory)?

- Want to clarify "pre-map": Are you suggesting to use the ioctl
   to avoid the RMP/PAMT registration at guest page fault time? instead
   pre-map guest private memory i.e to allocate and do RMP/PAMT
   registration before running the actual guest vCPU's?

Thanks,
Pankaj

> 
> The TDX and SNP restrictions would then become addition restrictions on when
> initializing with a source is allowed (and VMs that don't have guest private
> memory wouldn't allow the flag at all).
>
Chao Peng Aug. 24, 2022, 9:41 a.m. UTC | #40
On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
> On Tue, Aug 23, 2022, David Hildenbrand wrote:
> > On 19.08.22 05:38, Hugh Dickins wrote:
> > > On Fri, 19 Aug 2022, Sean Christopherson wrote:
> > >> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > >>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > >>>> On Wed, 6 Jul 2022, Chao Peng wrote:
> > >>>> But since then, TDX in particular has forced an effort into preventing
> > >>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > >>>>
> > >>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > >>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> > >>
> > >> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> > >> aren't useful for _all_ existing users, but I don't think it's fair to say that
> > >> they're not useful for _any_ existing users.
> > > 
> > > Okay, I stand corrected: there exist some users of memfd_create()
> > > who will also have use for "INACCESSIBLE" memory.
> > 
> > As raised in reply to the relevant patch, I'm not sure if we really have
> > to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
> > requirement of specific memfd_notifer (memfile_notifier) implementations
> > -- such as TDX that will convert the memory and MCE-kill the machine on
> > ordinary write access. We might be able to set/enforce this when
> > registering a notifier internally instead, and fail notifier
> > registration if a condition isn't met (e.g., existing mmap).
> >
> > So I'd be curious, which other users of shmem/memfd would benefit from
> > (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
> 
> I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
> it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
> would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
> flags that get added in the future).
> 
> AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
> any unique functionality.

That's also what I'm thinking. And I don't see problem immediately if
user has populated the fd at the binding time. Actually that looks an
advantage for previously discussed guest payload pre-loading.

> 
> If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be
> set for the initial implementation.  I.e. disallow binding without INACCESSIBLE
> until there's a use case.

I can do that.

Chao
Chao Peng Aug. 24, 2022, 10:27 a.m. UTC | #41
On Sun, Aug 21, 2022 at 11:27:44AM +0100, Matthew Wilcox wrote:
> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> > tmpfs and hugetlbfs and page cache are designed around sharing memory:
> > TDX is designed around absolutely not sharing memory; and the further
> > uses which Sean foresees appear not to need it as page cache either.
> > 
> > Except perhaps for page migration reasons.  It's somewhat incidental,  
> > but of course page migration knows how to migrate page cache, so
> > masquerading as page cache will give a short cut to page migration,
> > when page migration becomes at all possible.
> 
> I haven't read the patch series, and I'm not taking a position one way
> or the other on whether this is better implemented as a shmem addition
> or a shim that asks shmem for memory.  Page migration can be done for
> driver memory by using PageMovable.  I just rewrote how it works, so
> the details are top of my mind at the moment if anyone wants something
> explained.  Commit 68f2736a8583 is the key one to look at.

Thanks Matthew. That is helpful to understand the current code.

Chao
Fuad Tabba Aug. 26, 2022, 3:19 p.m. UTC | #42
Hi,

On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
>
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
>
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
>
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
>
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
>
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets invalidated.
>   - backing store callbacks: callbacks for KVM to call into memory
>     backing store to request memory pages for guest private memory.
>
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
>
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
>
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>

I'm on the Android pKVM team at Google, and we've been looking into
how this approach fits with what we've been doing with pkvm/arm64.
I've had a go at porting your patches, along with some fixes and
additions so it would go on top of our latest pkvm patch series [1] to
see how well this proposal fits with what we’re doing. You can find
the ported code at this link [2].

In general, an fd-based approach fits very well with pKVM for the
reasons you mention. It means that we don't necessarily need to map
the guest memory, and with the new extensions it allows the host
kernel to control whether to restrict migration and swapping.

For pKVM, we would also need the guest private memory not to be
GUP’able by the kernel so that userspace can’t trick the kernel into
accessing guest private memory in a context where it isn’t prepared to
handle the fault injected by the hypervisor. We’re looking at whether
we could use memfd_secret to achieve this, or maybe whether extending
your work might solve the problem.

However, during the porting effort, the main issue we've encountered
is that many of the details of this approach seem to be targeted at
TDX/SEV and don’t readily align with the design of pKVM. My knowledge
on TDX is very rudimentary, so please bear with me if I get things
wrong.

The idea of the memslot having two references to the backing memory,
the (new) private_fd (a file descriptor) as well as the userspace_addr
(a memory address), with the meaning changing depending on whether the
memory is private or shared. Both can potentially be live at the same
time, but only one is used by the guest depending on whether the
memory is shared or private. For pKVM, the memory region is the same,
and whether the underlying physical page is shared or private is
determined by the hypervisor based on the initial configuration of the
VM and also in response to hypercalls from the guest. So at least from
our side, having a private_fd isn't the best fit, but rather just
having an fd instead of a userspace_addr.

Moreover, something which was discussed here before [3], is the
ability to share in-place. For pKVM/arm64, the conversion between
shared and private involves only changes to the stage-2 page tables,
which are controlled by the hypervisor. Android supports this in-place
conversion already, and I think that the cost of copying for many
use-cases that would involve large amounts of data would be big. We
will measure the relative costs in due course, but in the meantime
we’re nervous about adopting a new user ABI which doesn’t appear to
cater for in-place conversion; having just the fd would simplify that
somewhat

In the memfd approach, what is the plan for being able to initialize
guest private memory from the host? In my port of this patch series,
I've added an fcntl() command that allows setting INACCESSIBLE after
the memfd has been created. So the memory can be mapped, initialized,
then unmapped. Of course there is no way to enforce that the memory is
unmapped from userspace before being used as private memory, but the
hypervisor will take care of the stage-2 mapping and so a user access
to the private memory would result in a SEGV regardless of the flag

Now, moving on to implementation-specific issues in this patch series
that I have encountered:

- There are a couple of small issues in porting the patches, some of
which have been mentioned already by others. I will point out the rest
in direct replies to these patches.

- MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
this patch series. MFD_INACCESSIBLE only sets
MEMFILE_F_USER_INACCESSIBLE. Is this intentional?

- Nothing in this patch series enforces that MFD_INACCESSIBLE or that
any of the MEMFILE_F_* flags are set for the file descriptor to be
used as a private_fd. Is this also intentional?

Most of us working on pKVM will be at KVM forum Dublin in September,
so it would be great if we could have a chat (and/or beer!) face to
face sometime during the conference to help us figure out an
upstreamable solution for Android

Cheers,
/fuad

[1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
[2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
[3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/


> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
>
> -  Regresion test on kvm/queue (this patchset)
>    Most new code are not covered. Code also in below repo:
>    https://github.com/chao-p/linux/tree/privmem-v7
>
> -  New Funational test on latest TDX code
>    The patch is rebased to latest TDX code and tested the new
>    funcationalities. See below repos:
>    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
>
> Changelog
> ----------
> v7:
>   - Move the private/shared info from backing store to KVM.
>   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
>   - Rework on the sync mechanism between zap/page fault paths.
>   - Addressed other comments in v6.
> v6:
>   - Re-organzied patch for both mm/KVM parts.
>   - Added flags for memfile_notifier so its consumers can state their
>     features and memory backing store can check against these flags.
>   - Put a backing store reference in the memfile_notifier and move pfn_ops
>     into backing store.
>   - Only support boot time backing store register.
>   - Overall KVM part improvement suggested by Sean and some others.
> v5:
>   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
>     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
>     be created by MFD_INACCESSIBLE.
>   - Introduced new APIs for backing store to register itself to
>     memfile_notifier instead of direct function call.
>   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
>   - Added KVM API doc for new memslot extensions and man page for the new
>     MFD_INACCESSIBLE flag.
>   - Removed the overlap check for mapping the same file+offset into
>     multiple gfns due to perf consideration, warned in document.
>   - Addressed other comments in v4.
> v4:
>   - Decoupled the callbacks between KVM/mm from memfd and use new
>     name 'memfile_notifier'.
>   - Supported register multiple memslots to the same backing store.
>   - Added per-memslot pfn_ops instead of per-system.
>   - Reworked the invalidation part.
>   - Improved new KVM uAPIs (private memslot extension and memory
>     error) per Sean's suggestions.
>   - Addressed many other minor fixes for comments from v3.
> v3:
>   - Added locking protection when calling
>     invalidate_page_range/fallocate callbacks.
>   - Changed memslot structure to keep use useraddr for shared memory.
>   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>   - Commit message improvement.
>   - Many small fixes for comments from the last version.
>
> Links to previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
>
> Chao Peng (12):
>   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
>   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
>   mm: Introduce memfile_notifier
>   mm/memfd: Introduce MFD_INACCESSIBLE flag
>   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
>   KVM: Use gfn instead of hva for mmu_notifier_retry
>   KVM: Rename mmu_notifier_*
>   KVM: Extend the memslot to support fd-based private memory
>   KVM: Add KVM_EXIT_MEMORY_FAULT exit
>   KVM: Register/unregister the guest private memory regions
>   KVM: Handle page fault for private memory
>   KVM: Enable and expose KVM_MEM_PRIVATE
>
> Kirill A. Shutemov (1):
>   mm/shmem: Support memfile_notifier
>
>  Documentation/virt/kvm/api.rst             |  77 +++++-
>  arch/arm64/kvm/mmu.c                       |   8 +-
>  arch/mips/include/asm/kvm_host.h           |   2 +-
>  arch/mips/kvm/mmu.c                        |  10 +-
>  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
>  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
>  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
>  arch/riscv/kvm/mmu.c                       |   4 +-
>  arch/x86/include/asm/kvm_host.h            |   3 +-
>  arch/x86/kvm/Kconfig                       |   3 +
>  arch/x86/kvm/mmu.h                         |   2 -
>  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
>  arch/x86/kvm/mmu/mmutrace.h                |   1 +
>  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
>  arch/x86/kvm/x86.c                         |   2 +-
>  include/linux/kvm_host.h                   | 105 +++++---
>  include/linux/memfile_notifier.h           |  91 +++++++
>  include/linux/shmem_fs.h                   |   2 +
>  include/uapi/linux/fcntl.h                 |   1 +
>  include/uapi/linux/kvm.h                   |  37 +++
>  include/uapi/linux/memfd.h                 |   1 +
>  mm/Kconfig                                 |   4 +
>  mm/Makefile                                |   1 +
>  mm/memfd.c                                 |  18 +-
>  mm/memfile_notifier.c                      | 123 ++++++++++
>  mm/shmem.c                                 | 125 +++++++++-
>  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
>  virt/kvm/Kconfig                           |   3 +
>  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
>  virt/kvm/pfncache.c                        |  14 +-
>  35 files changed, 1074 insertions(+), 127 deletions(-)
>  create mode 100644 include/linux/memfile_notifier.h
>  create mode 100644 mm/memfile_notifier.c
>
> --
> 2.25.1
>
Chao Peng Aug. 29, 2022, 3:17 p.m. UTC | #43
On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > split_desc_cache only by default capacity
> >
> > Introduction
> > ------------
> > In general this patch series introduce fd-based memslot which provides
> > guest memory through memory file descriptor fd[offset,size] instead of
> > hva/size. The fd can be created from a supported memory filesystem
> > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > and the the memory backing store exchange callbacks when such memslot
> > gets created. At runtime KVM will call into callbacks provided by the
> > backing store to get the pfn with the fd+offset. Memory backing store
> > will also call into KVM callbacks when userspace punch hole on the fd
> > to notify KVM to unmap secondary MMU page table entries.
> >
> > Comparing to existing hva-based memslot, this new type of memslot allows
> > guest memory unmapped from host userspace like QEMU and even the kernel
> > itself, therefore reduce attack surface and prevent bugs.
> >
> > Based on this fd-based memslot, we can build guest private memory that
> > is going to be used in confidential computing environments such as Intel
> > TDX and AMD SEV. When supported, the memory backing store can provide
> > more enforcement on the fd and KVM can use a single memslot to hold both
> > the private and shared part of the guest memory.
> >
> > mm extension
> > ---------------------
> > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > created with these flags cannot read(), write() or mmap() etc via normal
> > MMU operations. The file content can only be used with the newly
> > introduced memfile_notifier extension.
> >
> > The memfile_notifier extension provides two sets of callbacks for KVM to
> > interact with the memory backing store:
> >   - memfile_notifier_ops: callbacks for memory backing store to notify
> >     KVM when memory gets invalidated.
> >   - backing store callbacks: callbacks for KVM to call into memory
> >     backing store to request memory pages for guest private memory.
> >
> > The memfile_notifier extension also provides APIs for memory backing
> > store to register/unregister itself and to trigger the notifier when the
> > bookmarked memory gets invalidated.
> >
> > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > prevent double allocation caused by unintentional guest when we only
> > have a single side of the shared/private memfds effective.
> >
> > memslot extension
> > -----------------
> > Add the private fd and the fd offset to existing 'shared' memslot so
> > that both private/shared guest memory can live in one single memslot.
> > A page in the memslot is either private or shared. Whether a guest page
> > is private or shared is maintained through reusing existing SEV ioctls
> > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> >
> 
> I'm on the Android pKVM team at Google, and we've been looking into
> how this approach fits with what we've been doing with pkvm/arm64.
> I've had a go at porting your patches, along with some fixes and
> additions so it would go on top of our latest pkvm patch series [1] to
> see how well this proposal fits with what we’re doing. You can find
> the ported code at this link [2].
> 
> In general, an fd-based approach fits very well with pKVM for the
> reasons you mention. It means that we don't necessarily need to map
> the guest memory, and with the new extensions it allows the host
> kernel to control whether to restrict migration and swapping.

Good to hear that.

> 
> For pKVM, we would also need the guest private memory not to be
> GUP’able by the kernel so that userspace can’t trick the kernel into
> accessing guest private memory in a context where it isn’t prepared to
> handle the fault injected by the hypervisor. We’re looking at whether
> we could use memfd_secret to achieve this, or maybe whether extending
> your work might solve the problem.

This is interesting and can be a valuable addition to this series.

> 
> However, during the porting effort, the main issue we've encountered
> is that many of the details of this approach seem to be targeted at
> TDX/SEV and don’t readily align with the design of pKVM. My knowledge
> on TDX is very rudimentary, so please bear with me if I get things
> wrong.

No doubt this series is initially designed for confidential computing
usages, but pKVM can definitely extend it if it finds useful.

> 
> The idea of the memslot having two references to the backing memory,
> the (new) private_fd (a file descriptor) as well as the userspace_addr
> (a memory address), with the meaning changing depending on whether the
> memory is private or shared. Both can potentially be live at the same
> time, but only one is used by the guest depending on whether the
> memory is shared or private. For pKVM, the memory region is the same,
> and whether the underlying physical page is shared or private is
> determined by the hypervisor based on the initial configuration of the
> VM and also in response to hypercalls from the guest.

For confidential computing usages, this is actually the same. The shared
or private is determined by initial configuration or guest hypercalls.

> So at least from
> our side, having a private_fd isn't the best fit, but rather just
> having an fd instead of a userspace_addr.

Let me understand this a bit: pKVM basically wants to maintain the
shared and private memory in only one fd, and not use userspace_addr at
all, right? Any blocking for pKVM to use private_fd + userspace_addr
instead?

> 
> Moreover, something which was discussed here before [3], is the
> ability to share in-place. For pKVM/arm64, the conversion between
> shared and private involves only changes to the stage-2 page tables,
> which are controlled by the hypervisor. Android supports this in-place
> conversion already, and I think that the cost of copying for many
> use-cases that would involve large amounts of data would be big. We
> will measure the relative costs in due course, but in the meantime
> we’re nervous about adopting a new user ABI which doesn’t appear to
> cater for in-place conversion; having just the fd would simplify that
> somewhat

I understand there is difficulty to achieve that with the current
private_fd + userspace_addr (they basically in two separate fds), but is
it possible for pKVM to extend this? Brainstorming for example, pKVM can
ignore userspace_addr and only use private_fd to cover both shared and
private memory, or pKVM introduce new KVM memslot flag?

> 
> In the memfd approach, what is the plan for being able to initialize
> guest private memory from the host? In my port of this patch series,
> I've added an fcntl() command that allows setting INACCESSIBLE after
> the memfd has been created. So the memory can be mapped, initialized,
> then unmapped. Of course there is no way to enforce that the memory is
> unmapped from userspace before being used as private memory, but the
> hypervisor will take care of the stage-2 mapping and so a user access
> to the private memory would result in a SEGV regardless of the flag

There is discussion on removing MFD_INACCESSIBLE and delaying the
alignment of the flag to the KVM/backing store binding time
(https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).

Creating new API like what you are playing with fcntl() also works if it
turns out the MFD_INACCESSIBLE has to be set at the memfd_create time.

> 
> Now, moving on to implementation-specific issues in this patch series
> that I have encountered:
> 
> - There are a couple of small issues in porting the patches, some of
> which have been mentioned already by others. I will point out the rest
> in direct replies to these patches.

Thanks.

> 
> - MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
> this patch series. MFD_INACCESSIBLE only sets
> MEMFILE_F_USER_INACCESSIBLE. Is this intentional?

It gets set in kvm_private_mem_register() of patch 13, basically those
flags are expected to be set by architecture code.

> 
> - Nothing in this patch series enforces that MFD_INACCESSIBLE or that
> any of the MEMFILE_F_* flags are set for the file descriptor to be
> used as a private_fd. Is this also intentional?

With KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* are enforced by the
architecture code.

> 
> Most of us working on pKVM will be at KVM forum Dublin in September,
> so it would be great if we could have a chat (and/or beer!) face to
> face sometime during the conference to help us figure out an
> upstreamable solution for Android

I would like to, but currently I have no travel plan due to COVID-19 :(
We can have more online discussions anyway.

Thanks,
Chao
> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
> [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
> [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
> 
> 
> > Test
> > ----
> > To test the new functionalities of this patch TDX patchset is needed.
> > Since TDX patchset has not been merged so I did two kinds of test:
> >
> > -  Regresion test on kvm/queue (this patchset)
> >    Most new code are not covered. Code also in below repo:
> >    https://github.com/chao-p/linux/tree/privmem-v7
> >
> > -  New Funational test on latest TDX code
> >    The patch is rebased to latest TDX code and tested the new
> >    funcationalities. See below repos:
> >    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> >
> > An example QEMU command line for TDX test:
> > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > -machine confidential-guest-support=tdx \
> > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > -machine memory-backend=ram1
> >
> > Changelog
> > ----------
> > v7:
> >   - Move the private/shared info from backing store to KVM.
> >   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
> >   - Rework on the sync mechanism between zap/page fault paths.
> >   - Addressed other comments in v6.
> > v6:
> >   - Re-organzied patch for both mm/KVM parts.
> >   - Added flags for memfile_notifier so its consumers can state their
> >     features and memory backing store can check against these flags.
> >   - Put a backing store reference in the memfile_notifier and move pfn_ops
> >     into backing store.
> >   - Only support boot time backing store register.
> >   - Overall KVM part improvement suggested by Sean and some others.
> > v5:
> >   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
> >     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
> >     be created by MFD_INACCESSIBLE.
> >   - Introduced new APIs for backing store to register itself to
> >     memfile_notifier instead of direct function call.
> >   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
> >   - Added KVM API doc for new memslot extensions and man page for the new
> >     MFD_INACCESSIBLE flag.
> >   - Removed the overlap check for mapping the same file+offset into
> >     multiple gfns due to perf consideration, warned in document.
> >   - Addressed other comments in v4.
> > v4:
> >   - Decoupled the callbacks between KVM/mm from memfd and use new
> >     name 'memfile_notifier'.
> >   - Supported register multiple memslots to the same backing store.
> >   - Added per-memslot pfn_ops instead of per-system.
> >   - Reworked the invalidation part.
> >   - Improved new KVM uAPIs (private memslot extension and memory
> >     error) per Sean's suggestions.
> >   - Addressed many other minor fixes for comments from v3.
> > v3:
> >   - Added locking protection when calling
> >     invalidate_page_range/fallocate callbacks.
> >   - Changed memslot structure to keep use useraddr for shared memory.
> >   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> >   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> >   - Commit message improvement.
> >   - Many small fixes for comments from the last version.
> >
> > Links to previous discussions
> > -----------------------------
> > [1] Original design proposal:
> > https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> > [2] Updated proposal and RFC patch v1:
> > https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> > [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> >
> > Chao Peng (12):
> >   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
> >   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
> >   mm: Introduce memfile_notifier
> >   mm/memfd: Introduce MFD_INACCESSIBLE flag
> >   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
> >   KVM: Use gfn instead of hva for mmu_notifier_retry
> >   KVM: Rename mmu_notifier_*
> >   KVM: Extend the memslot to support fd-based private memory
> >   KVM: Add KVM_EXIT_MEMORY_FAULT exit
> >   KVM: Register/unregister the guest private memory regions
> >   KVM: Handle page fault for private memory
> >   KVM: Enable and expose KVM_MEM_PRIVATE
> >
> > Kirill A. Shutemov (1):
> >   mm/shmem: Support memfile_notifier
> >
> >  Documentation/virt/kvm/api.rst             |  77 +++++-
> >  arch/arm64/kvm/mmu.c                       |   8 +-
> >  arch/mips/include/asm/kvm_host.h           |   2 +-
> >  arch/mips/kvm/mmu.c                        |  10 +-
> >  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
> >  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
> >  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
> >  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
> >  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
> >  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
> >  arch/riscv/kvm/mmu.c                       |   4 +-
> >  arch/x86/include/asm/kvm_host.h            |   3 +-
> >  arch/x86/kvm/Kconfig                       |   3 +
> >  arch/x86/kvm/mmu.h                         |   2 -
> >  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
> >  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
> >  arch/x86/kvm/mmu/mmutrace.h                |   1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
> >  arch/x86/kvm/x86.c                         |   2 +-
> >  include/linux/kvm_host.h                   | 105 +++++---
> >  include/linux/memfile_notifier.h           |  91 +++++++
> >  include/linux/shmem_fs.h                   |   2 +
> >  include/uapi/linux/fcntl.h                 |   1 +
> >  include/uapi/linux/kvm.h                   |  37 +++
> >  include/uapi/linux/memfd.h                 |   1 +
> >  mm/Kconfig                                 |   4 +
> >  mm/Makefile                                |   1 +
> >  mm/memfd.c                                 |  18 +-
> >  mm/memfile_notifier.c                      | 123 ++++++++++
> >  mm/shmem.c                                 | 125 +++++++++-
> >  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
> >  virt/kvm/Kconfig                           |   3 +
> >  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
> >  virt/kvm/pfncache.c                        |  14 +-
> >  35 files changed, 1074 insertions(+), 127 deletions(-)
> >  create mode 100644 include/linux/memfile_notifier.h
> >  create mode 100644 mm/memfile_notifier.c
> >
> > --
> > 2.25.1
> >
Fuad Tabba Aug. 31, 2022, 9:12 a.m. UTC | #44
Hi Chao,

Thank you for your reply.

On Mon, Aug 29, 2022 at 4:23 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > This is the v7 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > >
> > >   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > split_desc_cache only by default capacity
> > >
> > > Introduction
> > > ------------
> > > In general this patch series introduce fd-based memslot which provides
> > > guest memory through memory file descriptor fd[offset,size] instead of
> > > hva/size. The fd can be created from a supported memory filesystem
> > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > and the the memory backing store exchange callbacks when such memslot
> > > gets created. At runtime KVM will call into callbacks provided by the
> > > backing store to get the pfn with the fd+offset. Memory backing store
> > > will also call into KVM callbacks when userspace punch hole on the fd
> > > to notify KVM to unmap secondary MMU page table entries.
> > >
> > > Comparing to existing hva-based memslot, this new type of memslot allows
> > > guest memory unmapped from host userspace like QEMU and even the kernel
> > > itself, therefore reduce attack surface and prevent bugs.
> > >
> > > Based on this fd-based memslot, we can build guest private memory that
> > > is going to be used in confidential computing environments such as Intel
> > > TDX and AMD SEV. When supported, the memory backing store can provide
> > > more enforcement on the fd and KVM can use a single memslot to hold both
> > > the private and shared part of the guest memory.
> > >
> > > mm extension
> > > ---------------------
> > > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > > created with these flags cannot read(), write() or mmap() etc via normal
> > > MMU operations. The file content can only be used with the newly
> > > introduced memfile_notifier extension.
> > >
> > > The memfile_notifier extension provides two sets of callbacks for KVM to
> > > interact with the memory backing store:
> > >   - memfile_notifier_ops: callbacks for memory backing store to notify
> > >     KVM when memory gets invalidated.
> > >   - backing store callbacks: callbacks for KVM to call into memory
> > >     backing store to request memory pages for guest private memory.
> > >
> > > The memfile_notifier extension also provides APIs for memory backing
> > > store to register/unregister itself and to trigger the notifier when the
> > > bookmarked memory gets invalidated.
> > >
> > > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > > prevent double allocation caused by unintentional guest when we only
> > > have a single side of the shared/private memfds effective.
> > >
> > > memslot extension
> > > -----------------
> > > Add the private fd and the fd offset to existing 'shared' memslot so
> > > that both private/shared guest memory can live in one single memslot.
> > > A page in the memslot is either private or shared. Whether a guest page
> > > is private or shared is maintained through reusing existing SEV ioctls
> > > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> > >
> >
> > I'm on the Android pKVM team at Google, and we've been looking into
> > how this approach fits with what we've been doing with pkvm/arm64.
> > I've had a go at porting your patches, along with some fixes and
> > additions so it would go on top of our latest pkvm patch series [1] to
> > see how well this proposal fits with what we’re doing. You can find
> > the ported code at this link [2].
> >
> > In general, an fd-based approach fits very well with pKVM for the
> > reasons you mention. It means that we don't necessarily need to map
> > the guest memory, and with the new extensions it allows the host
> > kernel to control whether to restrict migration and swapping.
>
> Good to hear that.
>
> >
> > For pKVM, we would also need the guest private memory not to be
> > GUP’able by the kernel so that userspace can’t trick the kernel into
> > accessing guest private memory in a context where it isn’t prepared to
> > handle the fault injected by the hypervisor. We’re looking at whether
> > we could use memfd_secret to achieve this, or maybe whether extending
> > your work might solve the problem.
>
> This is interesting and can be a valuable addition to this series.

I'll keep you posted as it goes. I think with the work that you've
already put in, it wouldn't require that much more.

> >
> > However, during the porting effort, the main issue we've encountered
> > is that many of the details of this approach seem to be targeted at
> > TDX/SEV and don’t readily align with the design of pKVM. My knowledge
> > on TDX is very rudimentary, so please bear with me if I get things
> > wrong.
>
> No doubt this series is initially designed for confidential computing
> usages, but pKVM can definitely extend it if it finds useful.
>
> >
> > The idea of the memslot having two references to the backing memory,
> > the (new) private_fd (a file descriptor) as well as the userspace_addr
> > (a memory address), with the meaning changing depending on whether the
> > memory is private or shared. Both can potentially be live at the same
> > time, but only one is used by the guest depending on whether the
> > memory is shared or private. For pKVM, the memory region is the same,
> > and whether the underlying physical page is shared or private is
> > determined by the hypervisor based on the initial configuration of the
> > VM and also in response to hypercalls from the guest.
>
> For confidential computing usages, this is actually the same. The shared
> or private is determined by initial configuration or guest hypercalls.
>
> > So at least from
> > our side, having a private_fd isn't the best fit, but rather just
> > having an fd instead of a userspace_addr.
>
> Let me understand this a bit: pKVM basically wants to maintain the
> shared and private memory in only one fd, and not use userspace_addr at
> all, right? Any blocking for pKVM to use private_fd + userspace_addr
> instead?
> >
> > Moreover, something which was discussed here before [3], is the
> > ability to share in-place. For pKVM/arm64, the conversion between
> > shared and private involves only changes to the stage-2 page tables,
> > which are controlled by the hypervisor. Android supports this in-place
> > conversion already, and I think that the cost of copying for many
> > use-cases that would involve large amounts of data would be big. We
> > will measure the relative costs in due course, but in the meantime
> > we’re nervous about adopting a new user ABI which doesn’t appear to
> > cater for in-place conversion; having just the fd would simplify that
> > somewhat
>
> I understand there is difficulty to achieve that with the current
> private_fd + userspace_addr (they basically in two separate fds), but is
> it possible for pKVM to extend this? Brainstorming for example, pKVM can
> ignore userspace_addr and only use private_fd to cover both shared and
> private memory, or pKVM introduce new KVM memslot flag?

It's not that there's anything blocking pKVM from doing that. It's
that the disconnect of using a memory address for the shared memory,
and a file descriptor for the private memory doesn't really make sense
for pKVM. I see how it makes sense for TDX and the Intel-specific
implementation. It just seems that this is baking in an
implementation-specific aspect as a part of the KVM general api, and
the worry is that this might have some unintended consequences in the
future.

> >
> > In the memfd approach, what is the plan for being able to initialize
> > guest private memory from the host? In my port of this patch series,
> > I've added an fcntl() command that allows setting INACCESSIBLE after
> > the memfd has been created. So the memory can be mapped, initialized,
> > then unmapped. Of course there is no way to enforce that the memory is
> > unmapped from userspace before being used as private memory, but the
> > hypervisor will take care of the stage-2 mapping and so a user access
> > to the private memory would result in a SEGV regardless of the flag
>
> There is discussion on removing MFD_INACCESSIBLE and delaying the
> alignment of the flag to the KVM/backing store binding time
> (https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).
>
> Creating new API like what you are playing with fcntl() also works if it
> turns out the MFD_INACCESSIBLE has to be set at the memfd_create time.

That makes sense.

> >
> > Now, moving on to implementation-specific issues in this patch series
> > that I have encountered:
> >
> > - There are a couple of small issues in porting the patches, some of
> > which have been mentioned already by others. I will point out the rest
> > in direct replies to these patches.
>
> Thanks.
>
> >
> > - MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
> > this patch series. MFD_INACCESSIBLE only sets
> > MEMFILE_F_USER_INACCESSIBLE. Is this intentional?
>
> It gets set in kvm_private_mem_register() of patch 13, basically those
> flags are expected to be set by architecture code.
>
> >
> > - Nothing in this patch series enforces that MFD_INACCESSIBLE or that
> > any of the MEMFILE_F_* flags are set for the file descriptor to be
> > used as a private_fd. Is this also intentional?
>
> With KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* are enforced by the
> architecture code.

Right. I was expecting them to be in the mem_fd, but I see now how
they are being set and enforced in patch 13. This makes a lot of sense
now. Thanks!

> >
> > Most of us working on pKVM will be at KVM forum Dublin in September,
> > so it would be great if we could have a chat (and/or beer!) face to
> > face sometime during the conference to help us figure out an
> > upstreamable solution for Android
>
> I would like to, but currently I have no travel plan due to COVID-19 :(
> We can have more online discussions anyway.

Of course! We'll continue this online, and hopefully we will get a
chance to meet in person soon.

Cheers,
/fuad


> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
> > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
> > [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
> >
> >
> > > Test
> > > ----
> > > To test the new functionalities of this patch TDX patchset is needed.
> > > Since TDX patchset has not been merged so I did two kinds of test:
> > >
> > > -  Regresion test on kvm/queue (this patchset)
> > >    Most new code are not covered. Code also in below repo:
> > >    https://github.com/chao-p/linux/tree/privmem-v7
> > >
> > > -  New Funational test on latest TDX code
> > >    The patch is rebased to latest TDX code and tested the new
> > >    funcationalities. See below repos:
> > >    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> > >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> > >
> > > An example QEMU command line for TDX test:
> > > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > > -machine confidential-guest-support=tdx \
> > > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > > -machine memory-backend=ram1
> > >
> > > Changelog
> > > ----------
> > > v7:
> > >   - Move the private/shared info from backing store to KVM.
> > >   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
> > >   - Rework on the sync mechanism between zap/page fault paths.
> > >   - Addressed other comments in v6.
> > > v6:
> > >   - Re-organzied patch for both mm/KVM parts.
> > >   - Added flags for memfile_notifier so its consumers can state their
> > >     features and memory backing store can check against these flags.
> > >   - Put a backing store reference in the memfile_notifier and move pfn_ops
> > >     into backing store.
> > >   - Only support boot time backing store register.
> > >   - Overall KVM part improvement suggested by Sean and some others.
> > > v5:
> > >   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
> > >     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
> > >     be created by MFD_INACCESSIBLE.
> > >   - Introduced new APIs for backing store to register itself to
> > >     memfile_notifier instead of direct function call.
> > >   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
> > >   - Added KVM API doc for new memslot extensions and man page for the new
> > >     MFD_INACCESSIBLE flag.
> > >   - Removed the overlap check for mapping the same file+offset into
> > >     multiple gfns due to perf consideration, warned in document.
> > >   - Addressed other comments in v4.
> > > v4:
> > >   - Decoupled the callbacks between KVM/mm from memfd and use new
> > >     name 'memfile_notifier'.
> > >   - Supported register multiple memslots to the same backing store.
> > >   - Added per-memslot pfn_ops instead of per-system.
> > >   - Reworked the invalidation part.
> > >   - Improved new KVM uAPIs (private memslot extension and memory
> > >     error) per Sean's suggestions.
> > >   - Addressed many other minor fixes for comments from v3.
> > > v3:
> > >   - Added locking protection when calling
> > >     invalidate_page_range/fallocate callbacks.
> > >   - Changed memslot structure to keep use useraddr for shared memory.
> > >   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> > >   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> > >   - Commit message improvement.
> > >   - Many small fixes for comments from the last version.
> > >
> > > Links to previous discussions
> > > -----------------------------
> > > [1] Original design proposal:
> > > https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> > > [2] Updated proposal and RFC patch v1:
> > > https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> > > [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> > >
> > > Chao Peng (12):
> > >   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
> > >   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
> > >   mm: Introduce memfile_notifier
> > >   mm/memfd: Introduce MFD_INACCESSIBLE flag
> > >   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
> > >   KVM: Use gfn instead of hva for mmu_notifier_retry
> > >   KVM: Rename mmu_notifier_*
> > >   KVM: Extend the memslot to support fd-based private memory
> > >   KVM: Add KVM_EXIT_MEMORY_FAULT exit
> > >   KVM: Register/unregister the guest private memory regions
> > >   KVM: Handle page fault for private memory
> > >   KVM: Enable and expose KVM_MEM_PRIVATE
> > >
> > > Kirill A. Shutemov (1):
> > >   mm/shmem: Support memfile_notifier
> > >
> > >  Documentation/virt/kvm/api.rst             |  77 +++++-
> > >  arch/arm64/kvm/mmu.c                       |   8 +-
> > >  arch/mips/include/asm/kvm_host.h           |   2 +-
> > >  arch/mips/kvm/mmu.c                        |  10 +-
> > >  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
> > >  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
> > >  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
> > >  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
> > >  arch/riscv/kvm/mmu.c                       |   4 +-
> > >  arch/x86/include/asm/kvm_host.h            |   3 +-
> > >  arch/x86/kvm/Kconfig                       |   3 +
> > >  arch/x86/kvm/mmu.h                         |   2 -
> > >  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
> > >  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
> > >  arch/x86/kvm/mmu/mmutrace.h                |   1 +
> > >  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
> > >  arch/x86/kvm/x86.c                         |   2 +-
> > >  include/linux/kvm_host.h                   | 105 +++++---
> > >  include/linux/memfile_notifier.h           |  91 +++++++
> > >  include/linux/shmem_fs.h                   |   2 +
> > >  include/uapi/linux/fcntl.h                 |   1 +
> > >  include/uapi/linux/kvm.h                   |  37 +++
> > >  include/uapi/linux/memfd.h                 |   1 +
> > >  mm/Kconfig                                 |   4 +
> > >  mm/Makefile                                |   1 +
> > >  mm/memfd.c                                 |  18 +-
> > >  mm/memfile_notifier.c                      | 123 ++++++++++
> > >  mm/shmem.c                                 | 125 +++++++++-
> > >  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
> > >  virt/kvm/Kconfig                           |   3 +
> > >  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
> > >  virt/kvm/pfncache.c                        |  14 +-
> > >  35 files changed, 1074 insertions(+), 127 deletions(-)
> > >  create mode 100644 include/linux/memfile_notifier.h
> > >  create mode 100644 mm/memfile_notifier.c
> > >
> > > --
> > > 2.25.1
> > >
Kirill A. Shutemov Aug. 31, 2022, 2:24 p.m. UTC | #45
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > I will try next week to rework it as shim to top of shmem. Does it work
> > for you?
> 
> Yes, please do, thanks.  It's a compromise between us: the initial TDX
> case has no justification to use shmem at all, but doing it that way
> will help you with some of the infrastructure, and will probably be
> easiest for KVM to extend to other more relaxed fd cases later.

Okay, below is my take on the shim approach.

I don't hate how it turned out. It is easier to understand without
callback exchange thing.

The only caveat is I had to introduce external lock to protect against
race between lookup and truncate. Otherwise, looks pretty reasonable to me.

I did very limited testing. And it lacks integration with KVM, but API
changed not substantially, any it should be easy to adopt.

Any comments?

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..aec04a0f8b7b 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include <linux/file.h>
+#include <linux/pfn_t.h>
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+int inaccessible_register_notifier(struct file *file,
+				   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..89194438af9c
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,234 @@
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct rw_semaphore lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	lockdep_assert_held(&data->lock);
+	VM_BUG_ON(!rwsem_is_locked(&data->lock));
+
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	/* The lock prevents parallel inaccessible_get/put_pfn() */
+	down_write(&data->lock);
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset, offset + len);
+out:
+	up_write(&data->lock);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size) {
+			ret = -EPERM;
+			goto out;
+		}
+
+		if (!PAGE_ALIGNED(attr->ia_size)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+out:
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	init_rwsem(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+	}
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+int inaccessible_register_notifier(struct file *file,
+			      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	up_write(&data->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_del_rcu(&notifier->list);
+	up_write(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	down_read(&data->lock);
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret) {
+		up_read(&data->lock);
+		return ret;
+	}
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	SetPageUptodate(page);
+	unlock_page(page);
+	put_page(page);
+	up_read(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
Chao Peng Sept. 2, 2022, 10:19 a.m. UTC | #46
On Wed, Aug 31, 2022 at 10:12:12AM +0100, Fuad Tabba wrote:
> > > Moreover, something which was discussed here before [3], is the
> > > ability to share in-place. For pKVM/arm64, the conversion between
> > > shared and private involves only changes to the stage-2 page tables,
> > > which are controlled by the hypervisor. Android supports this in-place
> > > conversion already, and I think that the cost of copying for many
> > > use-cases that would involve large amounts of data would be big. We
> > > will measure the relative costs in due course, but in the meantime
> > > we’re nervous about adopting a new user ABI which doesn’t appear to
> > > cater for in-place conversion; having just the fd would simplify that
> > > somewhat
> >
> > I understand there is difficulty to achieve that with the current
> > private_fd + userspace_addr (they basically in two separate fds), but is
> > it possible for pKVM to extend this? Brainstorming for example, pKVM can
> > ignore userspace_addr and only use private_fd to cover both shared and
> > private memory, or pKVM introduce new KVM memslot flag?
> 
> It's not that there's anything blocking pKVM from doing that. It's
> that the disconnect of using a memory address for the shared memory,
> and a file descriptor for the private memory doesn't really make sense
> for pKVM. I see how it makes sense for TDX and the Intel-specific
> implementation. It just seems that this is baking in an
> implementation-specific aspect as a part of the KVM general api, and
> the worry is that this might have some unintended consequences in the
> future.

It's true this API originates from supporting TDX and probably other
similar confidential computing(CC) technologies. But if we ever get
chance to make it more common to cover more usages like pKVM, I would
also like to. The challenge on this point is pKVM diverges a lot from CC
usages, putting both shared and private memory in the same fd
complicates CC usages. If two things are different enough, I'm also
thinking implementation-specific may not be that bad.

Chao
Chao Peng Sept. 2, 2022, 10:27 a.m. UTC | #47
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> > 
> > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
> 
> Okay, below is my take on the shim approach.
> 
> I don't hate how it turned out. It is easier to understand without
> callback exchange thing.
> 
> The only caveat is I had to introduce external lock to protect against
> race between lookup and truncate. Otherwise, looks pretty reasonable to me.
> 
> I did very limited testing. And it lacks integration with KVM, but API
> changed not substantially, any it should be easy to adopt.

I have integrated this patch with other KVM patches and verified the
functionality works well in TDX environment with a minor fix below.

> 
> Any comments?
> 

...

> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>  
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_INACCESSIBLE)
>  
>  SYSCALL_DEFINE2(memfd_create,
>  		const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
>  			return -EINVAL;
>  	}
>  
> +	/* Disallow sealing when MFD_INACCESSIBLE is set. */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> +		return -EINVAL;
> +
> +	/* TODO: add hugetlb support */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> +		return -EINVAL;
> +
>  	/* length includes terminating zero */
>  	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>  	if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
>  		*file_seals &= ~F_SEAL_SEAL;
>  	}
>  
> +	if (flags & MFD_INACCESSIBLE) {
> +		struct file *inaccessible_file;
> +
> +		inaccessible_file = memfd_mkinaccessible(file);
> +		if (IS_ERR(inaccessible_file)) {
> +			error = PTR_ERR(inaccessible_file);
> +			goto err_file;
> +		}

The new file should alse be marked as O_LARGEFILE otherwise setting the
initial size greater than 2^31 on the fd will be refused by ftruncate().

+               inaccessible_file->f_flags |= O_LARGEFILE;
+

> +
> +		file = inaccessible_file;
> +	}
> +
>  	fd_install(fd, file);
>  	kfree(name);
>  	return fd;
>  
> +err_file:
> +	fput(file);
>  err_fd:
>  	put_unused_fd(fd);
>  err_name:
Kirill A. Shutemov Sept. 2, 2022, 12:30 p.m. UTC | #48
On Fri, Sep 02, 2022 at 06:27:57PM +0800, Chao Peng wrote:
> > +	if (flags & MFD_INACCESSIBLE) {
> > +		struct file *inaccessible_file;
> > +
> > +		inaccessible_file = memfd_mkinaccessible(file);
> > +		if (IS_ERR(inaccessible_file)) {
> > +			error = PTR_ERR(inaccessible_file);
> > +			goto err_file;
> > +		}
> 
> The new file should alse be marked as O_LARGEFILE otherwise setting the
> initial size greater than 2^31 on the fd will be refused by ftruncate().
> 
> +               inaccessible_file->f_flags |= O_LARGEFILE;
> +

Good catch. Thanks.

I will modify memfd_mkinaccessible() to do this.
Kirill A. Shutemov Sept. 8, 2022, 1:10 a.m. UTC | #49
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> > 
> > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
> 
> Okay, below is my take on the shim approach.
> 
> I don't hate how it turned out. It is easier to understand without
> callback exchange thing.
> 
> The only caveat is I had to introduce external lock to protect against
> race between lookup and truncate. Otherwise, looks pretty reasonable to me.
> 
> I did very limited testing. And it lacks integration with KVM, but API
> changed not substantially, any it should be easy to adopt.
> 
> Any comments?

Updated version below. Nothing major. Some simplification and cleanups.

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include <linux/file.h>
+#include <linux/pfn_t.h>
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..dc79988a49d0
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,219 @@
+#include "linux/sbitmap.h"
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
+			return -EINVAL;
+		}
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset, offset + len);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
Andy Lutomirski Sept. 9, 2022, 4:44 a.m. UTC | #50
On 8/18/22 06:24, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory.
>>
>> Here at last are my reluctant thoughts on this patchset.
>>
>> fd-based approach for supporting KVM guest private memory: fine.
>>
>> Use or abuse of memfd and shmem.c: mistaken.
>>
>> memfd_create() was an excellent way to put together the initial prototype.
>>
>> But since then, TDX in particular has forced an effort into preventing
>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>
>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>
>> What use do you have for a filesystem here?  Almost none.
>> IIUC, what you want is an fd through which QEMU can allocate kernel
>> memory, selectively free that memory, and communicate fd+offset+length
>> to KVM.  And perhaps an interface to initialize a little of that memory
>> from a template (presumably copied from a real file on disk somewhere).
>>
>> You don't need shmem.c or a filesystem for that!
>>
>> If your memory could be swapped, that would be enough of a good reason
>> to make use of shmem.c: but it cannot be swapped; and although there
>> are some references in the mailthreads to it perhaps being swappable
>> in future, I get the impression that will not happen soon if ever.
>>
>> If your memory could be migrated, that would be some reason to use
>> filesystem page cache (because page migration happens to understand
>> that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 

This thing?

https://cdrdv2.intel.com/v1/dl/getContent/733578

That looks like migration between computers, not between NUMA nodes.  Or 
am I missing something?
Andy Lutomirski Sept. 9, 2022, 4:48 a.m. UTC | #51
On 8/19/22 17:27, Kirill A. Shutemov wrote:
> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>
>>>> If your memory could be swapped, that would be enough of a good reason
>>>> to make use of shmem.c: but it cannot be swapped; and although there
>>>> are some references in the mailthreads to it perhaps being swappable
>>>> in future, I get the impression that will not happen soon if ever.
>>>>
>>>> If your memory could be migrated, that would be some reason to use
>>>> filesystem page cache (because page migration happens to understand
>>>> that type of memory): but it cannot be migrated.
>>>
>>> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
>>> theoretically possible, but I'm not aware of any plans as of now.
>>>
>>> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>>
>> I always forget, migration means different things to different audiences.
>> As an mm person, I was meaning page migration, whereas a virtualization
>> person thinks VM live migration (which that reference appears to be about),
>> a scheduler person task migration, an ornithologist bird migration, etc.
>>
>> But you're an mm person too: you may have cited that reference in the
>> knowledge that TDX 1.5 Live Migration will entail page migration of the
>> kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> 
> TDX 1.5 brings both.
> 
> In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> 

This seems to be a pretty bad fit for the way that the core mm migrates 
pages.  The core mm unmaps the page, then moves (in software) the 
contents to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE 
doesn't fit into that workflow very well.  I'm not saying it can't be 
done, but it won't just work.
Andy Lutomirski Sept. 9, 2022, 4:55 a.m. UTC | #52
On 8/24/22 02:41, Chao Peng wrote:
> On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
>> On Tue, Aug 23, 2022, David Hildenbrand wrote:
>>> On 19.08.22 05:38, Hugh Dickins wrote:
>>>> On Fri, 19 Aug 2022, Sean Christopherson wrote:
>>>>> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
>>>>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>>>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>>>>> But since then, TDX in particular has forced an effort into preventing
>>>>>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>>>>>>
>>>>>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>>>>>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>>>>
>>>>> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
>>>>> aren't useful for _all_ existing users, but I don't think it's fair to say that
>>>>> they're not useful for _any_ existing users.
>>>>
>>>> Okay, I stand corrected: there exist some users of memfd_create()
>>>> who will also have use for "INACCESSIBLE" memory.
>>>
>>> As raised in reply to the relevant patch, I'm not sure if we really have
>>> to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
>>> requirement of specific memfd_notifer (memfile_notifier) implementations
>>> -- such as TDX that will convert the memory and MCE-kill the machine on
>>> ordinary write access. We might be able to set/enforce this when
>>> registering a notifier internally instead, and fail notifier
>>> registration if a condition isn't met (e.g., existing mmap).
>>>
>>> So I'd be curious, which other users of shmem/memfd would benefit from
>>> (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
>>
>> I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
>> it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
>> would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
>> flags that get added in the future).
>>
>> AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
>> any unique functionality.
> 
> That's also what I'm thinking. And I don't see problem immediately if
> user has populated the fd at the binding time. Actually that looks an
> advantage for previously discussed guest payload pre-loading.

I think this gets awkward. Trying to define sensible semantics for what 
happens if a shmem or similar fd gets used as secret guest memory and 
that fd isn't utterly and completely empty can get quite nasty.  For 
example:

If there are already mmaps, then TDX (much more so than SEV) really 
doesn't want to also use it as guest memory.

If there is already data in the fd, then maybe some technologies can use 
this for pre-population, but TDX needs explicit instructions in order to 
get the guest's hash right.


In general, it seems like it will be much more likely to actually work 
well if the user (uAPI) is required to declare to the kernel exactly 
what the fd is for (e.g. TDX secret memory, software-only secret memory, 
etc) before doing anything at all with it other than binding it to KVM.

INACCESSIBLE is a way to achieve this.  Maybe it's not the prettiest in 
the world -- I personally would rather see an explicit request for, say, 
TDX or SEV memory or maybe the memory that works for a particular KVM 
instance instead of something generic like INACCESSIBLE, but this is a 
pretty weak preference.  But I think that just starting with a plain 
memfd is a can of worms.
Kirill A. Shutemov Sept. 9, 2022, 2:32 p.m. UTC | #53
On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > If your memory could be swapped, that would be enough of a good reason
> > > > > to make use of shmem.c: but it cannot be swapped; and although there
> > > > > are some references in the mailthreads to it perhaps being swappable
> > > > > in future, I get the impression that will not happen soon if ever.
> > > > > 
> > > > > If your memory could be migrated, that would be some reason to use
> > > > > filesystem page cache (because page migration happens to understand
> > > > > that type of memory): but it cannot be migrated.
> > > > 
> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> > > > theoretically possible, but I'm not aware of any plans as of now.
> > > > 
> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > > 
> > > I always forget, migration means different things to different audiences.
> > > As an mm person, I was meaning page migration, whereas a virtualization
> > > person thinks VM live migration (which that reference appears to be about),
> > > a scheduler person task migration, an ornithologist bird migration, etc.
> > > 
> > > But you're an mm person too: you may have cited that reference in the
> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> > 
> > TDX 1.5 brings both.
> > 
> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> > 
> 
> This seems to be a pretty bad fit for the way that the core mm migrates
> pages.  The core mm unmaps the page, then moves (in software) the contents
> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
> that workflow very well.  I'm not saying it can't be done, but it won't just
> work.

Hm. From what I see we have all necessary infrastructure in place.

Unmaping is NOP for inaccessible pages as it is never mapped and we have
mapping->a_ops->migrate_folio() callback that allows to replace software
copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.

What do I miss?
Michael Roth Sept. 9, 2022, 3:35 p.m. UTC | #54
On Wed, Jul 06, 2022 at 04:20:02PM +0800, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 

Hi everyone,

Just wanted to let you all know that I reserved a slot at the LPC
Confidential Computing Microconference to discuss some topics related
to unmapped/inaccessible private memory support:

  "Unmapped Private Memory for Confidential Guests"
  Tuesday, Sep 13th, 10:00am (Dublin time)
  https://lpc.events/event/16/sessions/133/#20220913

The discussion agenda is still a bit in flux, but one topic I really
wanted to cover is how we intend to deal with the kernel directmap
for TDX/SNP, where there is a need to either remove or split mappings
so that KVM or other kernel threads writing to non-private pages
don't run into issues due mappings overlapping with private pages.[1]

Other possible discussion topics:

  - guarding against shared->private conversions while KVM is
    attempting to access a shared page (separate PFN pools for
    shared/private seems to resolve this nicely, but may not be
    compatible with things like pKVM where the underlying PFN
    is the same for shared/private)[2]

  - extending KVM_EXIT_MEMORY_FAULT to handle batched requests to
    better handle things like explicit batched conversions initiated
    by the guest

It's a short session so not sure how much time we'll actually have
to discuss things in detail, but maybe this can at least be a good
jumping off point for other discussions.

Thanks, and hope to see you there!

[1] https://lore.kernel.org/all/YWb8WG6Ravbs1nbx@google.com/
[2] https://lore.kernel.org/lkml/CA+EHjTy6NF=BkCqK0vhXLdtKZMahp55JUMSfxN96-NT3YiMXYQ@mail.gmail.com/
Andy Lutomirski Sept. 9, 2022, 7:11 p.m. UTC | #55
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
>> On 8/19/22 17:27, Kirill A. Shutemov wrote:
>> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> > > > > 
>> > > > > If your memory could be swapped, that would be enough of a good reason
>> > > > > to make use of shmem.c: but it cannot be swapped; and although there
>> > > > > are some references in the mailthreads to it perhaps being swappable
>> > > > > in future, I get the impression that will not happen soon if ever.
>> > > > > 
>> > > > > If your memory could be migrated, that would be some reason to use
>> > > > > filesystem page cache (because page migration happens to understand
>> > > > > that type of memory): but it cannot be migrated.
>> > > > 
>> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
>> > > > theoretically possible, but I'm not aware of any plans as of now.
>> > > > 
>> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>> > > 
>> > > I always forget, migration means different things to different audiences.
>> > > As an mm person, I was meaning page migration, whereas a virtualization
>> > > person thinks VM live migration (which that reference appears to be about),
>> > > a scheduler person task migration, an ornithologist bird migration, etc.
>> > > 
>> > > But you're an mm person too: you may have cited that reference in the
>> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
>> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
>> > 
>> > TDX 1.5 brings both.
>> > 
>> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
>> > 
>> 
>> This seems to be a pretty bad fit for the way that the core mm migrates
>> pages.  The core mm unmaps the page, then moves (in software) the contents
>> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
>> that workflow very well.  I'm not saying it can't be done, but it won't just
>> work.
>
> Hm. From what I see we have all necessary infrastructure in place.
>
> Unmaping is NOP for inaccessible pages as it is never mapped and we have
> mapping->a_ops->migrate_folio() callback that allows to replace software
> copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
>
> What do I miss?

Hmm, maybe this isn't as bad as I thought.

Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio(), and then make new mappings.  With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE.

But maybe it's actually okay with some care or maybe mild modifications with the fd-based model.  We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory.  So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH-MEM.PAGE.RELOCATE from inside migrate_folio().  And there will be nothing to fault back in.  From the core code's perspective, it's like migrating a memfd that doesn't happen to have my mappings at the time.

--Andy
Kirill A. Shutemov Sept. 9, 2022, 11:02 p.m. UTC | #56
On Fri, Sep 09, 2022 at 12:11:05PM -0700, Andy Lutomirski wrote:
> 
> 
> On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> > On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> >> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> >> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> >> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> >> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >> > > > > 
> >> > > > > If your memory could be swapped, that would be enough of a good reason
> >> > > > > to make use of shmem.c: but it cannot be swapped; and although there
> >> > > > > are some references in the mailthreads to it perhaps being swappable
> >> > > > > in future, I get the impression that will not happen soon if ever.
> >> > > > > 
> >> > > > > If your memory could be migrated, that would be some reason to use
> >> > > > > filesystem page cache (because page migration happens to understand
> >> > > > > that type of memory): but it cannot be migrated.
> >> > > > 
> >> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> >> > > > theoretically possible, but I'm not aware of any plans as of now.
> >> > > > 
> >> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> >> > > 
> >> > > I always forget, migration means different things to different audiences.
> >> > > As an mm person, I was meaning page migration, whereas a virtualization
> >> > > person thinks VM live migration (which that reference appears to be about),
> >> > > a scheduler person task migration, an ornithologist bird migration, etc.
> >> > > 
> >> > > But you're an mm person too: you may have cited that reference in the
> >> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> >> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> >> > 
> >> > TDX 1.5 brings both.
> >> > 
> >> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> >> > 
> >> 
> >> This seems to be a pretty bad fit for the way that the core mm migrates
> >> pages.  The core mm unmaps the page, then moves (in software) the contents
> >> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
> >> that workflow very well.  I'm not saying it can't be done, but it won't just
> >> work.
> >
> > Hm. From what I see we have all necessary infrastructure in place.
> >
> > Unmaping is NOP for inaccessible pages as it is never mapped and we have
> > mapping->a_ops->migrate_folio() callback that allows to replace software
> > copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
> >
> > What do I miss?
> 
> Hmm, maybe this isn't as bad as I thought.
> 
> Right now, unless I've missed something, the migration workflow is to
> unmap (via try_to_migrate) all mappings, then migrate the backing store
> (with ->migrate_folio(), although it seems like most callers expect the
> actual copy to happen outside of ->migrate_folio(),

Most? I guess you are talking about MIGRATE_SYNC_NO_COPY, right? AFAICS,
it is HMM thing and not a common thing.

> and then make new
> mappings.  With the *current* (vma-based, not fd-based) model for KVM
> memory, this won't work -- we can't unmap before calling
> TDH.MEM.PAGE.RELOCATE.

We don't need to unmap. The page is not mapped from core-mm PoV.

> But maybe it's actually okay with some care or maybe mild modifications
> with the fd-based model.  We don't have any mmaps, per se, to unmap for
> secret / INACCESSIBLE memory.  So maybe we can get all the way to
> ->migrate_folio() without zapping anything in the secure EPT and just
> call TDH-MEM.PAGE.RELOCATE from inside migrate_folio().  And there will
> be nothing to fault back in.  From the core code's perspective, it's
> like migrating a memfd that doesn't happen to have my mappings at the
> time.

Modifications needed if we want to initiate migation from userspace. IIRC,
we don't have any API that can initiate page migration for file ranges,
without mapping the file.

But kernel can do it fine for own housekeeping, like compaction doesn't
need any VMA. And we need compaction working for long term stability of
the system.
Sean Christopherson Sept. 13, 2022, 9:44 a.m. UTC | #57
On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > for you?
> > > 
> > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > case has no justification to use shmem at all, but doing it that way
> > > will help you with some of the infrastructure, and will probably be
> > > easiest for KVM to extend to other more relaxed fd cases later.
> > 
> > Okay, below is my take on the shim approach.
> > 
> > I don't hate how it turned out. It is easier to understand without
> > callback exchange thing.
> > 
> > The only caveat is I had to introduce external lock to protect against
> > race between lookup and truncate.

As before, I think this lock is unnecessary.  Or at least it's unnessary to hold
the lock across get/put.  The ->invalidate() call will ensure that the pfn is
never actually used if get() races with truncation.

Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
but I don't know these areas well enough.
Kirill A. Shutemov Sept. 13, 2022, 1:28 p.m. UTC | #58
On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
> On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > > for you?
> > > > 
> > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > case has no justification to use shmem at all, but doing it that way
> > > > will help you with some of the infrastructure, and will probably be
> > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > 
> > > Okay, below is my take on the shim approach.
> > > 
> > > I don't hate how it turned out. It is easier to understand without
> > > callback exchange thing.
> > > 
> > > The only caveat is I had to introduce external lock to protect against
> > > race between lookup and truncate.
> 
> As before, I think this lock is unnecessary.  Or at least it's unnessary to hold
> the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> never actually used if get() races with truncation.

The updated version you replying to does not use the lock to protect
against truncation anymore. The lock protect notifier list.

> Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> but I don't know these areas well enough.

It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
(I did not read the switch statement correctly at first. Note there are
two 'fallthrough' there.)
Sean Christopherson Sept. 13, 2022, 2:53 p.m. UTC | #59
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
> > On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > > > for you?
> > > > > 
> > > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > > case has no justification to use shmem at all, but doing it that way
> > > > > will help you with some of the infrastructure, and will probably be
> > > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > > 
> > > > Okay, below is my take on the shim approach.
> > > > 
> > > > I don't hate how it turned out. It is easier to understand without
> > > > callback exchange thing.
> > > > 
> > > > The only caveat is I had to introduce external lock to protect against
> > > > race between lookup and truncate.
> > 
> > As before, I think this lock is unnecessary.  Or at least it's unnessary to hold
> > the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> > never actually used if get() races with truncation.
> 
> The updated version you replying to does not use the lock to protect
> against truncation anymore. The lock protect notifier list.

Gah, grabbed the patch when applying.

> > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > but I don't know these areas well enough.
> 
> It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> (I did not read the switch statement correctly at first. Note there are
> two 'fallthrough' there.)

Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
couldn't find the actual check.

Thanks much!
Kirill A. Shutemov Sept. 13, 2022, 4 p.m. UTC | #60
On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
> > > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > > but I don't know these areas well enough.
> > 
> > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > (I did not read the switch statement correctly at first. Note there are
> > two 'fallthrough' there.)
> 
> Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
> couldn't find the actual check.

validate_mmap_request() is in mm/nommu.c which is not relevant for real
computers.

I was talking about this check:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495
Sean Christopherson Sept. 13, 2022, 4:12 p.m. UTC | #61
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
> > > > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > > > but I don't know these areas well enough.
> > > 
> > > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > > (I did not read the switch statement correctly at first. Note there are
> > > two 'fallthrough' there.)
> > 
> > Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
> > couldn't find the actual check.
> 
> validate_mmap_request() is in mm/nommu.c which is not relevant for real
> computers.
> 
> I was talking about this check:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495

Hence the comment about 'fallthrough'.  Thanks again!