[v8,1/8] mm/memfd: Introduce userspace inaccessible memfd

Message ID	20220915142913.2213336-2-chao.p.peng@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Chao Peng <chao.p.peng@linux.intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Sean Christopherson <seanjc@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>, "J . Bruce Fields" <bfields@fieldses.org>, Andrew Morton <akpm@linux-foundation.org>, Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>, Steven Price <steven.price@arm.com>, "Maciej S . Szmigiero" <mail@maciej.szmigiero.name>, Vlastimil Babka <vbabka@suse.cz>, Vishal Annapurve <vannapurve@google.com>, Yu Zhang <yu.c.zhang@linux.intel.com>, Chao Peng <chao.p.peng@linux.intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret <qperret@google.com>, Michael Roth <michael.roth@amd.com>, mhocko@suse.com, Muchun Song <songmuchun@bytedance.com>, wei.w.wang@intel.com Subject: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Date: Thu, 15 Sep 2022 22:29:06 +0800 Message-Id: <20220915142913.2213336-2-chao.p.peng@linux.intel.com> In-Reply-To: <20220915142913.2213336-1-chao.p.peng@linux.intel.com> References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: mm: fd-based approach for supporting KVM \| expand [v8,0/8] KVM: mm: fd-based approach for supporting KVM [v8,1/8] mm/memfd: Introduce userspace inaccessible memfd [v8,2/8] KVM: Extend the memslot to support fd-based private memory [v8,3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit [v8,4/8] KVM: Use gfn instead of hva for mmu_notifier_retry [v8,5/8] KVM: Register/unregister the guest private memory regions [v8,6/8] KVM: Update lpage info when private/shared memory are mixed [v8,7/8] KVM: Handle page fault for private memory [v8,8/8] KVM: Enable and expose KVM_MEM_PRIVATE

Chao Peng Sept. 15, 2022, 2:29 p.m. UTC

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
setup the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with special key for special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. Precisely, userspace access to such encrypted
memory may lead to host crash so it should be prevented.

This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via
in-kernel interface so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.

It provides semantics required for KVM guest private(encrypted) memory
support that a file descriptor with this flag set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM can get notified through
inaccessible_notifier it then gets chance to remove any mapped entries
of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs but this patch
only implemented tmpfs. The allocated memory is currently marked as
unmovable and unevictable, this is required for current confidential
usage. But in future this might be changed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfd.h      |  24 ++++
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/memfd.h |   1 +
 mm/Makefile                |   2 +-
 mm/memfd.c                 |  25 ++++-
 mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
 6 files changed, 270 insertions(+), 2 deletions(-)
 create mode 100644 mm/memfd_inaccessible.c

David Hildenbrand Sept. 19, 2022, 9:12 a.m. UTC | #1

On 15.09.22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
> 
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.

Initially my thaught was that this whole inaccessible thing is TDX 
specific and there is no need to force that on other mechanisms. That's 
why I suggested to not expose this to user space but handle the notifier 
requirements internally.

IIUC now, protected KVM has similar demands. Either access (read/write) 
of guest RAM would result in a fault and possibly crash the hypervisor 
(at least not the whole machine IIUC).

> 
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.

With secretmem we decided to not add such "concept switch" flags and 
instead use a dedicated syscall.

What about memfd_inaccessible()? Especially, sealing and hugetlb are not 
even supported and it might take a while to support either.


> 
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
> 
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
> 
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
> 
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   include/linux/memfd.h      |  24 ++++
>   include/uapi/linux/magic.h |   1 +
>   include/uapi/linux/memfd.h |   1 +
>   mm/Makefile                |   2 +-
>   mm/memfd.c                 |  25 ++++-
>   mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
>   6 files changed, 270 insertions(+), 2 deletions(-)
>   create mode 100644 mm/memfd_inaccessible.c
> 
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>   #define __LINUX_MEMFD_H
>   
>   #include <linux/file.h>
> +#include <linux/pfn_t.h>
>   
>   #ifdef CONFIG_MEMFD_CREATE
>   extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>   }
>   #endif
>   
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +	void (*invalidate)(struct inaccessible_notifier *notifier,
> +			   pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +	struct list_head list;
> +	const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +				    struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> +				      struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>   #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>   #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>   #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>   #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */


[...]

> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)
> +{
> +	struct inaccessible_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	*pfn = page_to_pfn_t(page);
> +	*order = thp_order(compound_head(page));
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +	struct page *page = pfn_t_to_page(pfn);
> +
> +	if (WARN_ON_ONCE(!page))
> +		return;
> +
> +	put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);

Sorry, I missed your reply regarding get/put interface.

https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/

"We have a design assumption that somedays this can even support 
non-page based backing stores."

As long as there is no such user in sight (especially how to get the 
memfd from even allocating such memory which will require bigger 
changes), I prefer to keep it simple here and work on pages/folios. No 
need to over-complicate it for now.

Sean Christopherson Sept. 19, 2022, 7:10 p.m. UTC | #2

+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> 
> Initially my thaught was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms. That's why I
> suggested to not expose this to user space but handle the notifier
> requirements internally.
> 
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).

Yep.  The missing piece for pKVM is the ability to convert from shared to private
while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
for processing in the protected VM.  Thoughts on this at the bottom.

> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> 
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>

I have no personal preference whatsoever between a flag and a dedicated syscall,
but a dedicated syscall does seem like it would give the kernel a bit more
flexibility.

> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.

Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
come sooner than later.  "inaccessible" in quotes because we might want to choose
a less binary name, e.g. "restricted"?.

Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
tightly control usage without taking on too much complexity in the kernel, but
working through things, routing the behavior through the shim itself might not be
all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private" fd because
things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
     mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
     the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
     one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
     can't be guaranteed by userspace, maybe add some way for userspace to wait
     until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
     to do an expensive check every time.
     
  static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
  {
	if (vma->vm_pgoff)
		return -EINVAL;

	if ((vma->vm_end - vma->vm_start) != <file size>)
		return -EINVAL;

	mutex_lock(&data->lock);

	if (data->has_mapping) {
		r = -EINVAL;
		goto err;
	}
	list_for_each_entry(notifier, &data->notifiers, list) {
		r = notifier->ops->mmap_start(notifier, ...);
		if (r)
			goto abort;
	}

	notifier->ops->mmap_end(notifier, ...);
	mutex_unlock(&data->lock);
	return 0;

  abort:
	list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
		notifier->ops->mmap_abort(notifier, ...);
  err:
	mutex_unlock(&data->lock);
	return r;
  }

  static void memfd_restricted_close(struct vm_area_struct *vma)
  {
	mutex_lock(...);

	/*
	 * Destroy the memfd and disable all future accesses if there are
	 * outstanding refcounts (or other unsatisfied restrictions?).
	 */
	if (<outstanding references> || ???)
		memfd_restricted_destroy(...);
	else
		data->has_mapping = false;

	mutex_unlock(...);
  }

  static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
  {
	return -EINVAL;
  }

  static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
  {
	return -EINVAL;
  }

Then on the KVM side, its mmap_start() + mmap_end() sequence would:

  1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
     memory into the guest (after pre-boot phase).

  2. Be mutually exclusive with shared<=>private conversions, and is allowed if
     and only if the entire gfn range of the associated memslot is shared.

Andy Lutomirski Sept. 21, 2022, 9:10 p.m. UTC | #3

(please excuse any formatting disasters.  my internet went out as I was composing this, and i did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> > 
>> > KVM can use memfd-provided memory for guest memory. For normal userspace
>> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>> > virtual address space and then tells KVM to use the virtual address to
>> > setup the mapping in the secondary page table (e.g. EPT).
>> > 
>> > With confidential computing technologies like Intel TDX, the
>> > memfd-provided memory may be encrypted with special key for special
>> > software domain (e.g. KVM guest) and is not expected to be directly
>> > accessed by userspace. Precisely, userspace access to such encrypted
>> > memory may lead to host crash so it should be prevented.
>> 
>> Initially my thaught was that this whole inaccessible thing is TDX specific
>> and there is no need to force that on other mechanisms. That's why I
>> suggested to not expose this to user space but handle the notifier
>> requirements internally.
>> 
>> IIUC now, protected KVM has similar demands. Either access (read/write) of
>> guest RAM would result in a fault and possibly crash the hypervisor (at
>> least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared 
> to private
> while preserving the contents, e.g. to hand off a large buffer 
> (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.
>
>> > This patch introduces userspace inaccessible memfd (created with
>> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>> > in-kernel interface so KVM can directly interact with core-mm without
>> > the need to map the memory into KVM userspace.
>> 
>> With secretmem we decided to not add such "concept switch" flags and instead
>> use a dedicated syscall.
>>
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.

The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar.  But if we need flags or other details in the future, maybe this isn't ideal.

>
>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
>> even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory 
> needs to
> come sooner than later.  "inaccessible" in quotes because we might want 
> to choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM 
> could tightly
> tightly control usage without taking on too much complexity in the 
> kernel, but
> working through things, routing the behavior through the shim itself 
> might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" 
> fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.

What's the exact use case?  Is it just to pre-populate the memory?

>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>      mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
>      the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
>      one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>      can't be guaranteed by userspace, maybe add some way for userspace to wait
>      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>      to do an expensive check every time.

Hmm.  I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning).  Then have an ioctl to switch to to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.

Wang, Wei W Sept. 22, 2022, 1:23 p.m. UTC | #4

On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
> To: Christopherson,, Sean <seanjc@google.com>; David Hildenbrand
> <david@redhat.com>
> Cc: Chao Peng <chao.p.peng@linux.intel.com>; kvm list
> <kvm@vger.kernel.org>; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-mm@kvack.org;
> linux-fsdevel@vger.kernel.org; Linux API <linux-api@vger.kernel.org>;
> linux-doc@vger.kernel.org; qemu-devel@nongnu.org; Paolo Bonzini
> <pbonzini@redhat.com>; Jonathan Corbet <corbet@lwn.net>; Vitaly
> Kuznetsov <vkuznets@redhat.com>; Wanpeng Li <wanpengli@tencent.com>;
> Jim Mattson <jmattson@google.com>; Joerg Roedel <joro@8bytes.org>;
> Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> Borislav Petkov <bp@alien8.de>; the arch/x86 maintainers <x86@kernel.org>;
> H. Peter Anvin <hpa@zytor.com>; Hugh Dickins <hughd@google.com>; Jeff
> Layton <jlayton@kernel.org>; J . Bruce Fields <bfields@fieldses.org>; Andrew
> Morton <akpm@linux-foundation.org>; Shuah Khan <shuah@kernel.org>;
> Mike Rapoport <rppt@kernel.org>; Steven Price <steven.price@arm.com>;
> Maciej S . Szmigiero <mail@maciej.szmigiero.name>; Vlastimil Babka
> <vbabka@suse.cz>; Vishal Annapurve <vannapurve@google.com>; Yu Zhang
> <yu.c.zhang@linux.intel.com>; Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com>; Nakajima, Jun <jun.nakajima@intel.com>;
> Hansen, Dave <dave.hansen@intel.com>; Andi Kleen <ak@linux.intel.com>;
> aarcange@redhat.com; ddutile@redhat.com; dhildenb@redhat.com; Quentin
> Perret <qperret@google.com>; Michael Roth <michael.roth@amd.com>;
> Hocko, Michal <mhocko@suse.com>; Muchun Song
> <songmuchun@bytedance.com>; Wang, Wei W <wei.w.wang@intel.com>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Fuad Tabba
> <tabba@google.com>
> Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible
> memfd
> 
> (please excuse any formatting disasters.  my internet went out as I was
> composing this, and i did my best to rescue it.)
> 
> On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> > +Will, Marc and Fuad (apologies if I missed other pKVM folks)
> >
> > On Mon, Sep 19, 2022, David Hildenbrand wrote:
> >> On 15.09.22 16:29, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >> >
> >> > KVM can use memfd-provided memory for guest memory. For normal
> >> > userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
> >> > memfd into its virtual address space and then tells KVM to use the
> >> > virtual address to setup the mapping in the secondary page table (e.g.
> EPT).
> >> >
> >> > With confidential computing technologies like Intel TDX, the
> >> > memfd-provided memory may be encrypted with special key for special
> >> > software domain (e.g. KVM guest) and is not expected to be directly
> >> > accessed by userspace. Precisely, userspace access to such
> >> > encrypted memory may lead to host crash so it should be prevented.
> >>
> >> Initially my thaught was that this whole inaccessible thing is TDX
> >> specific and there is no need to force that on other mechanisms.
> >> That's why I suggested to not expose this to user space but handle
> >> the notifier requirements internally.
> >>
> >> IIUC now, protected KVM has similar demands. Either access
> >> (read/write) of guest RAM would result in a fault and possibly crash
> >> the hypervisor (at least not the whole machine IIUC).
> >
> > Yep.  The missing piece for pKVM is the ability to convert from shared
> > to private while preserving the contents, e.g. to hand off a large
> > buffer (hundreds of MiB) for processing in the protected VM.  Thoughts
> > on this at the bottom.
> >
> >> > This patch introduces userspace inaccessible memfd (created with
> >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace
> >> > through ordinary MMU access (e.g. read/write/mmap) but can be
> >> > accessed via in-kernel interface so KVM can directly interact with
> >> > core-mm without the need to map the memory into KVM userspace.
> >>
> >> With secretmem we decided to not add such "concept switch" flags and
> >> instead use a dedicated syscall.
> >>
> >
> > I have no personal preference whatsoever between a flag and a
> > dedicated syscall, but a dedicated syscall does seem like it would
> > give the kernel a bit more flexibility.
> 
> The third option is a device node, e.g. /dev/kvm_secretmem or
> /dev/kvm_tdxmem or similar.  But if we need flags or other details in the
> future, maybe this isn't ideal.
> 
> >
> >> What about memfd_inaccessible()? Especially, sealing and hugetlb are
> >> not even supported and it might take a while to support either.
> >
> > Don't know about sealing, but hugetlb support for "inaccessible"
> > memory needs to come sooner than later.  "inaccessible" in quotes
> > because we might want to choose a less binary name, e.g.
> > "restricted"?.
> >
> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace mmap() the "hidden" memfd, but with a
> > ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly tightly control usage without taking on too much
> > complexity in the kernel, but working through things, routing the
> > behavior through the shim itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> 
> What's the exact use case?  Is it just to pre-populate the memory?

Add one more use case here. For TDX live migration support, on the destination side,
we map the private fd during migration to store the encrypted private memory data sent
from source, and at the end of migration, we unmap it and make it inaccessible before
resuming the TD to run.

Wang, Wei W Sept. 22, 2022, 1:26 p.m. UTC | #5

On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)

Better to remove "order" from this interface?
Some callers only need to get pfn, and no need to bother with
defining and inputting something unused. For callers who need the "order",
can easily get it via thp_order(pfn_to_page(pfn)) on their own.

Sean Christopherson Sept. 22, 2022, 7:49 p.m. UTC | #6

On Thu, Sep 22, 2022, Wang, Wei W wrote:
> On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +			 int *order)
> 
> Better to remove "order" from this interface?

Hard 'no'.

> Some callers only need to get pfn, and no need to bother with
> defining and inputting something unused. For callers who need the "order",
> can easily get it via thp_order(pfn_to_page(pfn)) on their own.

That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
struct page is a transparent huge page.  That might be true for the current
implementation, but it most certainly will not always be true.

KVM originally did things like this, where there was dedicated code for THP vs.
HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
KVM to rediscover the info on its own.

Kirill A. Shutemov Sept. 23, 2022, 12:53 a.m. UTC | #7

On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > +			 int *order)
> > 
> > Better to remove "order" from this interface?
> 
> Hard 'no'.
> 
> > Some callers only need to get pfn, and no need to bother with
> > defining and inputting something unused. For callers who need the "order",
> > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> 
> That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> struct page is a transparent huge page.  That might be true for the current
> implementation, but it most certainly will not always be true.
> 
> KVM originally did things like this, where there was dedicated code for THP vs.
> HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
> mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
> KVM to rediscover the info on its own.

I guess we can allow order pointer to be NULL to cover caller that don't
need to know the order. Is it useful?

Kirill A. Shutemov Sept. 23, 2022, 12:58 a.m. UTC | #8

On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..9d066be3d7e8 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >   #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> >   #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> >   #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
> 
> 
> [...]
> 
> > +
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +			 int *order)
> > +{
> > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +	if (ret)
> > +		return ret;
> > +
> > +	*pfn = page_to_pfn_t(page);
> > +	*order = thp_order(compound_head(page));
> > +	SetPageUptodate(page);
> > +	unlock_page(page);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > +
> > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > +{
> > +	struct page *page = pfn_t_to_page(pfn);
> > +
> > +	if (WARN_ON_ONCE(!page))
> > +		return;
> > +
> > +	put_page(page);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> 
> Sorry, I missed your reply regarding get/put interface.
> 
> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
> 
> "We have a design assumption that somedays this can even support non-page
> based backing stores."
> 
> As long as there is no such user in sight (especially how to get the memfd
> from even allocating such memory which will require bigger changes), I
> prefer to keep it simple here and work on pages/folios. No need to
> over-complicate it for now.

Sean, Paolo , what is your take on this? Do you have conrete use case of
pageless backend for the mechanism in sight? Maybe DAX?

Fuad Tabba Sept. 23, 2022, 3:19 p.m. UTC | #9

Hi,

On Mon, Sep 19, 2022 at 8:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
> > On 15.09.22 16:29, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > >
> > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > virtual address space and then tells KVM to use the virtual address to
> > > setup the mapping in the secondary page table (e.g. EPT).
> > >
> > > With confidential computing technologies like Intel TDX, the
> > > memfd-provided memory may be encrypted with special key for special
> > > software domain (e.g. KVM guest) and is not expected to be directly
> > > accessed by userspace. Precisely, userspace access to such encrypted
> > > memory may lead to host crash so it should be prevented.
> >
> > Initially my thaught was that this whole inaccessible thing is TDX specific
> > and there is no need to force that on other mechanisms. That's why I
> > suggested to not expose this to user space but handle the notifier
> > requirements internally.
> >
> > IIUC now, protected KVM has similar demands. Either access (read/write) of
> > guest RAM would result in a fault and possibly crash the hypervisor (at
> > least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared to private
> while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.

Just wanted to mention that for pKVM (arm64), this wouldn't crash the
hypervisor. A userspace access would crash the userspace process since
the hypervisor would inject a fault back. Because of that making it
inaccessible from userspace is good to have, but not really vital for
pKVM. What is important for pKVM is that the guest private memory is
not GUP'able by the host. This is because if it were, it might be
possible for a malicious userspace process (e.g., a malicious vmm) to
trick the host kernel into accessing guest private memory in a context
where it isn’t prepared to handle the fault injected by the
hypervisor. This of course might crash the host.

> > > This patch introduces userspace inaccessible memfd (created with
> > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > in-kernel interface so KVM can directly interact with core-mm without
> > > the need to map the memory into KVM userspace.
> >
> > With secretmem we decided to not add such "concept switch" flags and instead
> > use a dedicated syscall.
> >
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.
>
> > What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> > even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
> come sooner than later.  "inaccessible" in quotes because we might want to choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> tightly control usage without taking on too much complexity in the kernel, but
> working through things, routing the behavior through the shim itself might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.
>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>      mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
>      the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
>      one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>      can't be guaranteed by userspace, maybe add some way for userspace to wait
>      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>      to do an expensive check every time.
>
>   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
>   {
>         if (vma->vm_pgoff)
>                 return -EINVAL;
>
>         if ((vma->vm_end - vma->vm_start) != <file size>)
>                 return -EINVAL;
>
>         mutex_lock(&data->lock);
>
>         if (data->has_mapping) {
>                 r = -EINVAL;
>                 goto err;
>         }
>         list_for_each_entry(notifier, &data->notifiers, list) {
>                 r = notifier->ops->mmap_start(notifier, ...);
>                 if (r)
>                         goto abort;
>         }
>
>         notifier->ops->mmap_end(notifier, ...);
>         mutex_unlock(&data->lock);
>         return 0;
>
>   abort:
>         list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
>                 notifier->ops->mmap_abort(notifier, ...);
>   err:
>         mutex_unlock(&data->lock);
>         return r;
>   }
>
>   static void memfd_restricted_close(struct vm_area_struct *vma)
>   {
>         mutex_lock(...);
>
>         /*
>          * Destroy the memfd and disable all future accesses if there are
>          * outstanding refcounts (or other unsatisfied restrictions?).
>          */
>         if (<outstanding references> || ???)
>                 memfd_restricted_destroy(...);
>         else
>                 data->has_mapping = false;
>
>         mutex_unlock(...);
>   }
>
>   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
>   {
>         return -EINVAL;
>   }
>
>   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
>   {
>         return -EINVAL;
>   }
>
> Then on the KVM side, its mmap_start() + mmap_end() sequence would:
>
>   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
>      memory into the guest (after pre-boot phase).
>
>   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
>      and only if the entire gfn range of the associated memslot is shared.

In general I think that this would work with pKVM. However, limiting
private<->shared conversions to the granularity of a whole memslot
might be difficult to handle in pKVM, since the guest doesn't have the
concept of memslots. For example, in pKVM right now, when a guest
shares back its restricted DMA pool with the host it does so at the
page-level. pKVM would also need a way to make an fd accessible again
when shared back, which I think isn't possible with this patch.

You were initially considering a KVM ioctl for mapping, which might be
better suited for this since KVM knows which pages are shared and
which ones are private. So routing things through KVM might simplify
things and allow it to enforce all the necessary restrictions (e.g.,
private memory cannot be mapped). What do you think?

Thanks,
/fuad

Fuad Tabba Sept. 23, 2022, 3:20 p.m. UTC | #10

Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly
> > tightly control usage without taking on too much complexity in the
> > kernel, but
> > working through things, routing the behavior through the shim itself
> > might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
>
> What's the exact use case?  Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad



> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> >      mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> >      the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> >      one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if this
> >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> >      to do an expensive check every time.
>
> Hmm.  I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning).  Then have an ioctl to switch to to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.

Fuad Tabba Sept. 23, 2022, 3:20 p.m. UTC | #11

Hi,

On Fri, Sep 23, 2022 at 1:53 AM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> > On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +                  int *order)
> > >
> > > Better to remove "order" from this interface?
> >
> > Hard 'no'.
> >
> > > Some callers only need to get pfn, and no need to bother with
> > > defining and inputting something unused. For callers who need the "order",
> > > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> >
> > That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> > struct page is a transparent huge page.  That might be true for the current
> > implementation, but it most certainly will not always be true.
> >
> > KVM originally did things like this, where there was dedicated code for THP vs.
> > HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
> > mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
> > KVM to rediscover the info on its own.
>
> I guess we can allow order pointer to be NULL to cover caller that don't
> need to know the order. Is it useful?

I think that would be useful. In pKVM we don't need to know the order,
and I had to use a dummy variable when porting V7.

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

David Hildenbrand Sept. 26, 2022, 10:35 a.m. UTC | #12

On 23.09.22 02:58, Kirill A . Shutemov wrote:
> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>> --- a/include/uapi/linux/magic.h
>>> +++ b/include/uapi/linux/magic.h
>>> @@ -101,5 +101,6 @@
>>>    #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>>>    #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>>>    #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
>>> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
>>
>>
>> [...]
>>
>>> +
>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>> +			 int *order)
>>> +{
>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>> +	struct file *memfd = data->memfd;
>>> +	struct page *page;
>>> +	int ret;
>>> +
>>> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	*pfn = page_to_pfn_t(page);
>>> +	*order = thp_order(compound_head(page));
>>> +	SetPageUptodate(page);
>>> +	unlock_page(page);
>>> +
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>> +
>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>> +{
>>> +	struct page *page = pfn_t_to_page(pfn);
>>> +
>>> +	if (WARN_ON_ONCE(!page))
>>> +		return;
>>> +
>>> +	put_page(page);
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>
>> Sorry, I missed your reply regarding get/put interface.
>>
>> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
>>
>> "We have a design assumption that somedays this can even support non-page
>> based backing stores."
>>
>> As long as there is no such user in sight (especially how to get the memfd
>> from even allocating such memory which will require bigger changes), I
>> prefer to keep it simple here and work on pages/folios. No need to
>> over-complicate it for now.
> 
> Sean, Paolo , what is your take on this? Do you have conrete use case of
> pageless backend for the mechanism in sight? Maybe DAX?

The problem I'm having with this is how to actually get such memory into 
the memory backend (that triggers notifiers) and what the semantics are 
at all with memory that is not managed by the buddy.

memfd with fixed PFNs doesn't make too much sense.

When using DAX, what happens with the shared <->private conversion? 
Which "type" is supposed to use dax, which not?

In other word, I'm missing too many details on the bigger picture of how 
this would work at all to see why it makes sense right now to prepare 
for that.

Chao Peng Sept. 26, 2022, 2:23 p.m. UTC | #13

On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> > tightly control usage without taking on too much complexity in the kernel, but
> > working through things, routing the behavior through the shim itself might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> >      mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> >      the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> >      one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if this
> >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> >      to do an expensive check every time.
> >
> >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> >   {
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> >         if ((vma->vm_end - vma->vm_start) != <file size>)
> >                 return -EINVAL;
> >
> >         mutex_lock(&data->lock);
> >
> >         if (data->has_mapping) {
> >                 r = -EINVAL;
> >                 goto err;
> >         }
> >         list_for_each_entry(notifier, &data->notifiers, list) {
> >                 r = notifier->ops->mmap_start(notifier, ...);
> >                 if (r)
> >                         goto abort;
> >         }
> >
> >         notifier->ops->mmap_end(notifier, ...);
> >         mutex_unlock(&data->lock);
> >         return 0;
> >
> >   abort:
> >         list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
> >                 notifier->ops->mmap_abort(notifier, ...);
> >   err:
> >         mutex_unlock(&data->lock);
> >         return r;
> >   }
> >
> >   static void memfd_restricted_close(struct vm_area_struct *vma)
> >   {
> >         mutex_lock(...);
> >
> >         /*
> >          * Destroy the memfd and disable all future accesses if there are
> >          * outstanding refcounts (or other unsatisfied restrictions?).
> >          */
> >         if (<outstanding references> || ???)
> >                 memfd_restricted_destroy(...);
> >         else
> >                 data->has_mapping = false;
> >
> >         mutex_unlock(...);
> >   }
> >
> >   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> >   {
> >         return -EINVAL;
> >   }
> >
> >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> >   {
> >         return -EINVAL;
> >   }
> >
> > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> >
> >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> >      memory into the guest (after pre-boot phase).
> >
> >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> >      and only if the entire gfn range of the associated memslot is shared.
> 
> In general I think that this would work with pKVM. However, limiting
> private<->shared conversions to the granularity of a whole memslot
> might be difficult to handle in pKVM, since the guest doesn't have the
> concept of memslots. For example, in pKVM right now, when a guest
> shares back its restricted DMA pool with the host it does so at the
> page-level. pKVM would also need a way to make an fd accessible again
> when shared back, which I think isn't possible with this patch.

But does pKVM really want to mmap/munmap a new region at the page-level,
that can cause VMA fragmentation if the conversion is frequent as I see.
Even with a KVM ioctl for mapping as mentioned below, I think there will
be the same issue.

> 
> You were initially considering a KVM ioctl for mapping, which might be
> better suited for this since KVM knows which pages are shared and
> which ones are private. So routing things through KVM might simplify
> things and allow it to enforce all the necessary restrictions (e.g.,
> private memory cannot be mapped). What do you think?
> 
> Thanks,
> /fuad

Kirill A . Shutemov Sept. 26, 2022, 2:48 p.m. UTC | #14

On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> On 23.09.22 02:58, Kirill A . Shutemov wrote:
> > On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > > > index 6325d1d0e90f..9d066be3d7e8 100644
> > > > --- a/include/uapi/linux/magic.h
> > > > +++ b/include/uapi/linux/magic.h
> > > > @@ -101,5 +101,6 @@
> > > >    #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> > > >    #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> > > >    #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > > > +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
> > > 
> > > 
> > > [...]
> > > 
> > > > +
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +			 int *order)
> > > > +{
> > > > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	struct page *page;
> > > > +	int ret;
> > > > +
> > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	*pfn = page_to_pfn_t(page);
> > > > +	*order = thp_order(compound_head(page));
> > > > +	SetPageUptodate(page);
> > > > +	unlock_page(page);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > > > +
> > > > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > > > +{
> > > > +	struct page *page = pfn_t_to_page(pfn);
> > > > +
> > > > +	if (WARN_ON_ONCE(!page))
> > > > +		return;
> > > > +
> > > > +	put_page(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> > > 
> > > Sorry, I missed your reply regarding get/put interface.
> > > 
> > > https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
> > > 
> > > "We have a design assumption that somedays this can even support non-page
> > > based backing stores."
> > > 
> > > As long as there is no such user in sight (especially how to get the memfd
> > > from even allocating such memory which will require bigger changes), I
> > > prefer to keep it simple here and work on pages/folios. No need to
> > > over-complicate it for now.
> > 
> > Sean, Paolo , what is your take on this? Do you have conrete use case of
> > pageless backend for the mechanism in sight? Maybe DAX?
> 
> The problem I'm having with this is how to actually get such memory into the
> memory backend (that triggers notifiers) and what the semantics are at all
> with memory that is not managed by the buddy.
> 
> memfd with fixed PFNs doesn't make too much sense.

What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
PFN covers more possible backends.

> When using DAX, what happens with the shared <->private conversion? Which
> "type" is supposed to use dax, which not?
> 
> In other word, I'm missing too many details on the bigger picture of how
> this would work at all to see why it makes sense right now to prepare for
> that.

IIUC, KVM doesn't really care about pages or folios. They need PFN to
populate SEPT. Returning page/folio would make KVM do additional steps to
extract PFN and one more place to have a bug.

David Hildenbrand Sept. 26, 2022, 2:53 p.m. UTC | #15

On 26.09.22 16:48, Kirill A. Shutemov wrote:
> On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
>> On 23.09.22 02:58, Kirill A . Shutemov wrote:
>>> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>>>> --- a/include/uapi/linux/magic.h
>>>>> +++ b/include/uapi/linux/magic.h
>>>>> @@ -101,5 +101,6 @@
>>>>>     #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>>>>>     #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>>>>>     #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
>>>>> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
>>>>
>>>>
>>>> [...]
>>>>
>>>>> +
>>>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>>>> +			 int *order)
>>>>> +{
>>>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>>>> +	struct file *memfd = data->memfd;
>>>>> +	struct page *page;
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>>>> +	if (ret)
>>>>> +		return ret;
>>>>> +
>>>>> +	*pfn = page_to_pfn_t(page);
>>>>> +	*order = thp_order(compound_head(page));
>>>>> +	SetPageUptodate(page);
>>>>> +	unlock_page(page);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>>>> +
>>>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>>>> +{
>>>>> +	struct page *page = pfn_t_to_page(pfn);
>>>>> +
>>>>> +	if (WARN_ON_ONCE(!page))
>>>>> +		return;
>>>>> +
>>>>> +	put_page(page);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>>>
>>>> Sorry, I missed your reply regarding get/put interface.
>>>>
>>>> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
>>>>
>>>> "We have a design assumption that somedays this can even support non-page
>>>> based backing stores."
>>>>
>>>> As long as there is no such user in sight (especially how to get the memfd
>>>> from even allocating such memory which will require bigger changes), I
>>>> prefer to keep it simple here and work on pages/folios. No need to
>>>> over-complicate it for now.
>>>
>>> Sean, Paolo , what is your take on this? Do you have conrete use case of
>>> pageless backend for the mechanism in sight? Maybe DAX?
>>
>> The problem I'm having with this is how to actually get such memory into the
>> memory backend (that triggers notifiers) and what the semantics are at all
>> with memory that is not managed by the buddy.
>>
>> memfd with fixed PFNs doesn't make too much sense.
> 
> What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
> PFN covers more possible backends.

For DAX, you usually bypass the buddy and map /dev/mem or a devdax. In 
contrast to ordinary memfd that allocates memory via the buddy. That's 
the difference I see -- and I wonder how it could work.

> 
>> When using DAX, what happens with the shared <->private conversion? Which
>> "type" is supposed to use dax, which not?
>>
>> In other word, I'm missing too many details on the bigger picture of how
>> this would work at all to see why it makes sense right now to prepare for
>> that.
> 
> IIUC, KVM doesn't really care about pages or folios. They need PFN to
> populate SEPT. Returning page/folio would make KVM do additional steps to
> extract PFN and one more place to have a bug.

Fair enough. Smells KVM specific, though.

Fuad Tabba Sept. 26, 2022, 3:51 p.m. UTC | #16

Hi,

On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > > piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> > > tightly control usage without taking on too much complexity in the kernel, but
> > > working through things, routing the behavior through the shim itself might not be
> > > all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > > things got too complex, but with the shim it doesn't seem _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > >      mapping is all or nothing.
> > >
> > >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > >      the restricted memfd.
> > >
> > >   3. Add notifier hooks to allow downstream users to further restrict things.
> > >
> > >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > >      one shot.
> > >
> > >   5. Require that there are no outstanding references at munmap().  Or if this
> > >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> > >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> > >      to do an expensive check every time.
> > >
> > >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > >   {
> > >         if (vma->vm_pgoff)
> > >                 return -EINVAL;
> > >
> > >         if ((vma->vm_end - vma->vm_start) != <file size>)
> > >                 return -EINVAL;
> > >
> > >         mutex_lock(&data->lock);
> > >
> > >         if (data->has_mapping) {
> > >                 r = -EINVAL;
> > >                 goto err;
> > >         }
> > >         list_for_each_entry(notifier, &data->notifiers, list) {
> > >                 r = notifier->ops->mmap_start(notifier, ...);
> > >                 if (r)
> > >                         goto abort;
> > >         }
> > >
> > >         notifier->ops->mmap_end(notifier, ...);
> > >         mutex_unlock(&data->lock);
> > >         return 0;
> > >
> > >   abort:
> > >         list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
> > >                 notifier->ops->mmap_abort(notifier, ...);
> > >   err:
> > >         mutex_unlock(&data->lock);
> > >         return r;
> > >   }
> > >
> > >   static void memfd_restricted_close(struct vm_area_struct *vma)
> > >   {
> > >         mutex_lock(...);
> > >
> > >         /*
> > >          * Destroy the memfd and disable all future accesses if there are
> > >          * outstanding refcounts (or other unsatisfied restrictions?).
> > >          */
> > >         if (<outstanding references> || ???)
> > >                 memfd_restricted_destroy(...);
> > >         else
> > >                 data->has_mapping = false;
> > >
> > >         mutex_unlock(...);
> > >   }
> > >
> > >   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > >   {
> > >         return -EINVAL;
> > >   }
> > >
> > >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > >   {
> > >         return -EINVAL;
> > >   }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > >      memory into the guest (after pre-boot phase).
> > >
> > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > >      and only if the entire gfn range of the associated memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host it does so at the
> > page-level. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at the page-level,
> that can cause VMA fragmentation if the conversion is frequent as I see.
> Even with a KVM ioctl for mapping as mentioned below, I think there will
> be the same issue.

pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. Having private memory mapped and
then accessed by a misbehaving/malicious process will reinject a fault
into the misbehaving process.

Cheers,
/fuad

> >
> > You were initially considering a KVM ioctl for mapping, which might be
> > better suited for this since KVM knows which pages are shared and
> > which ones are private. So routing things through KVM might simplify
> > things and allow it to enforce all the necessary restrictions (e.g.,
> > private memory cannot be mapped). What do you think?
> >
> > Thanks,
> > /fuad

Sean Christopherson Sept. 27, 2022, 10:47 p.m. UTC | #17

On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
> 
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > >      memory into the guest (after pre-boot phase).
> > > >
> > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > >      and only if the entire gfn range of the associated memslot is shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.

Y'all are killing me :-)

Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
X doesn't even have to be that high to get reasonable performance, e.g. assuming
the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
work just fine in KVM.

> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page-level,
> > that can cause VMA fragmentation if the conversion is frequent as I see.
> > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > be the same issue.
> 
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.

Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.

The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
the end result is the same.

Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
current approach is to do that only in the stage-2 page tables, i.e. only in the
context of the hypervisor.  Which is also the source of the gup() problems; the
untrusted kernel is blissfully unaware that the memory is inaccessible.

Any approach that moves some of that information into the untrusted kernel so that
the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
all of guest memory becomes unguppable, but that's likely not a viable option.

Sean Christopherson Sept. 27, 2022, 11:23 p.m. UTC | #18

On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > When using DAX, what happens with the shared <->private conversion? Which
> > > "type" is supposed to use dax, which not?
> > > 
> > > In other word, I'm missing too many details on the bigger picture of how
> > > this would work at all to see why it makes sense right now to prepare for
> > > that.
> > 
> > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > populate SEPT. Returning page/folio would make KVM do additional steps to
> > extract PFN and one more place to have a bug.
> 
> Fair enough. Smells KVM specific, though.

TL;DR: I'm good with either approach, though providing a "struct page" might avoid
       refactoring the API in the nearish future.

Playing devil's advocate for a second, the counter argument is that KVM is the
only user for the foreseeable future.

That said, it might make sense to return a "struct page" from the core API and
force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
it's not exactly new code.

More importantly, KVM may actually need/want the "struct page" in the not-too-distant
future to support mapping non-refcounted "struct page" memory into the guest.  The
ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
"struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier integration
is fixed, the remaining issue is that KVM doesn't currently have a way to determine
whether or not it holds a reference to the page.  Instead, KVM assumes that if the
page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().

KVM's current workaround for this is to refuse to map these pages into the guest,
i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
To remove that workaround, the likely solution will be to pass around a tuple of
page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".

At that point, getting handed a "struct page" from the core API would be a good
thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
refcounted page.

Note, I still want the order to be provided by the API so that KVM doesn't need
to run through a bunch of helpers to try and figure out the allowed mapping size.

[*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com

Kirill A . Shutemov Sept. 28, 2022, 1:36 p.m. UTC | #19

On Tue, Sep 27, 2022 at 11:23:24PM +0000, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > > When using DAX, what happens with the shared <->private conversion? Which
> > > > "type" is supposed to use dax, which not?
> > > > 
> > > > In other word, I'm missing too many details on the bigger picture of how
> > > > this would work at all to see why it makes sense right now to prepare for
> > > > that.
> > > 
> > > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > > populate SEPT. Returning page/folio would make KVM do additional steps to
> > > extract PFN and one more place to have a bug.
> > 
> > Fair enough. Smells KVM specific, though.
> 
> TL;DR: I'm good with either approach, though providing a "struct page" might avoid
>        refactoring the API in the nearish future.
> 
> Playing devil's advocate for a second, the counter argument is that KVM is the
> only user for the foreseeable future.
> 
> That said, it might make sense to return a "struct page" from the core API and
> force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
> it's not exactly new code.

Core MM tries to move away from struct page in favour of struct folio. We
can make interface return folio.

But it would require more work on KVM side.

folio_pfn(folio) + offset % folio_nr_pages(folio) would give you PFN for
base-pagesize PFN for given offset. I guess it is not too hard.

It also gives KVM capability to populate multiple EPT entries for non-zero
order folio and save few cycles.

Does it work for you?

> More importantly, KVM may actually need/want the "struct page" in the not-too-distant
> future to support mapping non-refcounted "struct page" memory into the guest.  The
> ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
> "struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier integration
> is fixed, the remaining issue is that KVM doesn't currently have a way to determine
> whether or not it holds a reference to the page.  Instead, KVM assumes that if the
> page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
> 
> KVM's current workaround for this is to refuse to map these pages into the guest,
> i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
> To remove that workaround, the likely solution will be to pass around a tuple of
> page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
> 
> At that point, getting handed a "struct page" from the core API would be a good
> thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
> refcounted page.
> 
> Note, I still want the order to be provided by the API so that KVM doesn't need
> to run through a bunch of helpers to try and figure out the allowed mapping size.
> 
> [*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com

These non-refcounted "struct page" confuses me.

IIUC (probably not), the idea is to share a buffer between host and guest
and avoid double buffering in page cache on the guest ("guest shadow
buffer" means page cache, right?). Don't we already have DAX interfaces to
bypass guest page cache?

And do you think it would need to be handled on inaccessible API lavel or
is it KVM-only thing that uses inaccessible API for some use-cases?

Fuad Tabba Sept. 30, 2022, 4:14 p.m. UTC | #20

Hi,

<...>

> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> +                                pgoff_t start, pgoff_t end)
> +{
> +       struct inaccessible_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> +                                  loff_t offset, loff_t len)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);

I think that shmem_file_operations.fallocate is only set if
CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
initialization that fallocate is set, or maybe a config dependency, or
can we count on it always being enabled?

> +       inaccessible_notifier_invalidate(data, offset, offset + len);
> +       return ret;
> +}
> +

<...>

> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);

If the memfd wasn't marked as inaccessible, or more generally
speaking, if the file isn't a memfd_inaccessible file, this ends up
accessing an uninitialized pointer for the notifier list. Should there
be a check for that here, and have this function return an error if
that's not the case?

Thanks,
/fuad



> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

Fuad Tabba Sept. 30, 2022, 4:19 p.m. UTC | #21

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > >      memory into the guest (after pre-boot phase).
> > > > >
> > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

 :D

> Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.

Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).

>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor.  Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn(). This of
course goes back to what I'd mentioned before in v7; it seems that
representing the memslot memory as a file descriptor should be
orthogonal to whether the memory is shared or private, rather than a
private_fd for private memory and the userspace_addr for shared
memory. The host can then map or unmap the shared/private memory using
the fd, which allows it more freedom in even choosing to unmap shared
memory when not needed, for example.

Cheers,
/fuad

Kirill A. Shutemov Sept. 30, 2022, 4:23 p.m. UTC | #22

On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> > @@ -0,0 +1,219 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/memfd.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +
> > +struct inaccessible_data {
> > +       struct mutex lock;
> > +       struct file *memfd;
> > +       struct list_head notifiers;
> > +};
> > +
> > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > +                                pgoff_t start, pgoff_t end)
> > +{
> > +       struct inaccessible_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static int inaccessible_release(struct inode *inode, struct file *file)
> > +{
> > +       struct inaccessible_data *data = inode->i_mapping->private_data;
> > +
> > +       fput(data->memfd);
> > +       kfree(data);
> > +       return 0;
> > +}
> > +
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +                                  loff_t offset, loff_t len)
> > +{
> > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       int ret;
> > +
> > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +                       return -EINVAL;
> > +       }
> > +
> > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> I think that shmem_file_operations.fallocate is only set if
> CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> initialization that fallocate is set, or maybe a config dependency, or
> can we count on it always being enabled?

It is already there:

	config MEMFD_CREATE
		def_bool TMPFS || HUGETLBFS

And we reject inaccessible memfd_create() for HUGETLBFS.

But if we go with a separate syscall, yes, we need the dependency.

> > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > +       return ret;
> > +}
> > +
> 
> <...>
> 
> > +void inaccessible_register_notifier(struct file *file,
> > +                                   struct inaccessible_notifier *notifier)
> > +{
> > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> 
> If the memfd wasn't marked as inaccessible, or more generally
> speaking, if the file isn't a memfd_inaccessible file, this ends up
> accessing an uninitialized pointer for the notifier list. Should there
> be a check for that here, and have this function return an error if
> that's not the case?

I think it is "don't do that" category. inaccessible_register_notifier()
caller has to know what file it operates on, no?

Fuad Tabba Oct. 3, 2022, 7:33 a.m. UTC | #23

Hi

On Fri, Sep 30, 2022 at 5:23 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > <...>
> >
> > > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > > new file mode 100644
> > > index 000000000000..2d33cbdd9282
> > > --- /dev/null
> > > +++ b/mm/memfd_inaccessible.c
> > > @@ -0,0 +1,219 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "linux/sbitmap.h"
> > > +#include <linux/memfd.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/pseudo_fs.h>
> > > +#include <linux/shmem_fs.h>
> > > +#include <uapi/linux/falloc.h>
> > > +#include <uapi/linux/magic.h>
> > > +
> > > +struct inaccessible_data {
> > > +       struct mutex lock;
> > > +       struct file *memfd;
> > > +       struct list_head notifiers;
> > > +};
> > > +
> > > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > > +                                pgoff_t start, pgoff_t end)
> > > +{
> > > +       struct inaccessible_notifier *notifier;
> > > +
> > > +       mutex_lock(&data->lock);
> > > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > > +               notifier->ops->invalidate(notifier, start, end);
> > > +       }
> > > +       mutex_unlock(&data->lock);
> > > +}
> > > +
> > > +static int inaccessible_release(struct inode *inode, struct file *file)
> > > +{
> > > +       struct inaccessible_data *data = inode->i_mapping->private_data;
> > > +
> > > +       fput(data->memfd);
> > > +       kfree(data);
> > > +       return 0;
> > > +}
> > > +
> > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > +                                  loff_t offset, loff_t len)
> > > +{
> > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > +       struct file *memfd = data->memfd;
> > > +       int ret;
> > > +
> > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +                       return -EINVAL;
> > > +       }
> > > +
> > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > I think that shmem_file_operations.fallocate is only set if
> > CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> > initialization that fallocate is set, or maybe a config dependency, or
> > can we count on it always being enabled?
>
> It is already there:
>
>         config MEMFD_CREATE
>                 def_bool TMPFS || HUGETLBFS
>
> And we reject inaccessible memfd_create() for HUGETLBFS.
>
> But if we go with a separate syscall, yes, we need the dependency.

I missed that, thanks.

>
> > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > +       return ret;
> > > +}
> > > +
> >
> > <...>
> >
> > > +void inaccessible_register_notifier(struct file *file,
> > > +                                   struct inaccessible_notifier *notifier)
> > > +{
> > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > +
> > > +       mutex_lock(&data->lock);
> > > +       list_add(&notifier->list, &data->notifiers);
> > > +       mutex_unlock(&data->lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> >
> > If the memfd wasn't marked as inaccessible, or more generally
> > speaking, if the file isn't a memfd_inaccessible file, this ends up
> > accessing an uninitialized pointer for the notifier list. Should there
> > be a check for that here, and have this function return an error if
> > that's not the case?
>
> I think it is "don't do that" category. inaccessible_register_notifier()
> caller has to know what file it operates on, no?

The thing is, you could oops the kernel from userspace. For that, all
you have to do is a memfd_create without the MFD_INACCESSIBLE,
followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
I ran into this using my port of this patch series to arm64.

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Kirill A . Shutemov Oct. 3, 2022, 11:01 a.m. UTC | #24

On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > I think it is "don't do that" category. inaccessible_register_notifier()
> > caller has to know what file it operates on, no?
> 
> The thing is, you could oops the kernel from userspace. For that, all
> you have to do is a memfd_create without the MFD_INACCESSIBLE,
> followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> I ran into this using my port of this patch series to arm64.

My point is that it has to be handled on a different level. KVM has to
reject private_fd if it is now inaccessible. It should be trivial by
checking file->f_inode->i_sb->s_magic.

Fuad Tabba Oct. 4, 2022, 3:39 p.m. UTC | #25

Hi,

On Mon, Oct 3, 2022 at 12:01 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > > I think it is "don't do that" category. inaccessible_register_notifier()
> > > caller has to know what file it operates on, no?
> >
> > The thing is, you could oops the kernel from userspace. For that, all
> > you have to do is a memfd_create without the MFD_INACCESSIBLE,
> > followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> > I ran into this using my port of this patch series to arm64.
>
> My point is that it has to be handled on a different level. KVM has to
> reject private_fd if it is now inaccessible. It should be trivial by
> checking file->f_inode->i_sb->s_magic.

Yes, that makes sense.

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Fuad Tabba Oct. 6, 2022, 8:50 a.m. UTC | #26

Hi,

<...>


> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c

<...>

> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> +       struct inaccessible_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &inaccessible_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> +                                "[memfd:inaccessible]", O_RDWR,
> +                                &inaccessible_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);

I think this might be missing a return at this point.

> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}

Thanks,
/fuad



> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

Kirill A . Shutemov Oct. 6, 2022, 1:04 p.m. UTC | #27

On Thu, Oct 06, 2022 at 09:50:28AM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> 
> <...>
> 
> > +struct file *memfd_mkinaccessible(struct file *memfd)
> > +{
> > +       struct inaccessible_data *data;
> > +       struct address_space *mapping;
> > +       struct inode *inode;
> > +       struct file *file;
> > +
> > +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       data->memfd = memfd;
> > +       mutex_init(&data->lock);
> > +       INIT_LIST_HEAD(&data->notifiers);
> > +
> > +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> > +       if (IS_ERR(inode)) {
> > +               kfree(data);
> > +               return ERR_CAST(inode);
> > +       }
> > +
> > +       inode->i_mode |= S_IFREG;
> > +       inode->i_op = &inaccessible_iops;
> > +       inode->i_mapping->private_data = data;
> > +
> > +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> > +                                "[memfd:inaccessible]", O_RDWR,
> > +                                &inaccessible_fops);
> > +       if (IS_ERR(file)) {
> > +               iput(inode);
> > +               kfree(data);
> 
> I think this might be missing a return at this point.

Good catch! Thanks!

Chao Peng Oct. 13, 2022, 1:34 p.m. UTC | #28

On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > >      memory into the guest (after pre-boot phase).
> > > > > >
> > > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
> 
>  :D
> 
> > Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> > granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> > work just fine in KVM.
> 
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
> 
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
> 
> >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().

If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
think that is the major concern?), do you see any other gap from
existing API? 

> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.

Using both private_fd and userspace_addr is only needed in TDX and other
confidential computing scenarios, pKVM may only use private_fd if the fd
can also be mmaped as a whole to userspace as Sean suggested.

Thanks,
Chao
> 
> Cheers,
> /fuad

Fuad Tabba Oct. 17, 2022, 10:31 a.m. UTC | #29

Hi,

> >
> > Actually, for pKVM, there is no need for the guest memory to be
> > GUP'able at all if we use the new inaccessible_get_pfn().
>
> If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
> think that is the major concern?), do you see any other gap from
> existing API?

Actually for this part no, there aren't any gaps and
inaccessible_get_pfn() is sufficient.

> > This of
> > course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be
> > orthogonal to whether the memory is shared or private, rather than a
> > private_fd for private memory and the userspace_addr for shared
> > memory. The host can then map or unmap the shared/private memory using
> > the fd, which allows it more freedom in even choosing to unmap shared
> > memory when not needed, for example.
>
> Using both private_fd and userspace_addr is only needed in TDX and other
> confidential computing scenarios, pKVM may only use private_fd if the fd
> can also be mmaped as a whole to userspace as Sean suggested.

That does work in practice, for now at least, and is what I do in my
current port. However, the naming and how the API is defined as
implied by the name and the documentation. By calling the field
private_fd, it does imply that it should not be mapped, which is also
what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
would be mis/ab-using this interface, and that future changes could
cause unforeseen issues for pKVM.

Maybe renaming this to something like "guest_fp", and specifying in
the documentation that it can be restricted, e.g., instead of "the
content of the private memory is invisible to userspace" something
along the lines of  "the content of the guest memory may be restricted
to userspace".

What do you think?

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad

Vlastimil Babka Oct. 17, 2022, 1 p.m. UTC | #30

On 9/15/22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
> 
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
> 
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
> 
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
> 
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
> 
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
> 
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

...

> +static long inaccessible_fallocate(struct file *file, int mode,
> +				   loff_t offset, loff_t len)
> +{
> +	struct inaccessible_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	inaccessible_notifier_invalidate(data, offset, offset + len);

Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?

> +	return ret;
> +}
> +

...

> +
> +static struct file_system_type inaccessible_fs = {
> +	.owner		= THIS_MODULE,
> +	.name		= "[inaccessible]",

Dunno where exactly is this name visible, but shouldn't it better be
"[memfd:inaccessible]"?

> +	.init_fs_context = inaccessible_init_fs_context,
> +	.kill_sb	= kill_anon_super,
> +};
> +

Chao Peng Oct. 17, 2022, 2:58 p.m. UTC | #31

On Mon, Oct 17, 2022 at 11:31:19AM +0100, Fuad Tabba wrote:
> Hi,
> 
> > >
> > > Actually, for pKVM, there is no need for the guest memory to be
> > > GUP'able at all if we use the new inaccessible_get_pfn().
> >
> > If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
> > think that is the major concern?), do you see any other gap from
> > existing API?
> 
> Actually for this part no, there aren't any gaps and
> inaccessible_get_pfn() is sufficient.

Thanks for the confirmation.

> 
> > > This of
> > > course goes back to what I'd mentioned before in v7; it seems that
> > > representing the memslot memory as a file descriptor should be
> > > orthogonal to whether the memory is shared or private, rather than a
> > > private_fd for private memory and the userspace_addr for shared
> > > memory. The host can then map or unmap the shared/private memory using
> > > the fd, which allows it more freedom in even choosing to unmap shared
> > > memory when not needed, for example.
> >
> > Using both private_fd and userspace_addr is only needed in TDX and other
> > confidential computing scenarios, pKVM may only use private_fd if the fd
> > can also be mmaped as a whole to userspace as Sean suggested.
> 
> That does work in practice, for now at least, and is what I do in my
> current port. However, the naming and how the API is defined as
> implied by the name and the documentation. By calling the field
> private_fd, it does imply that it should not be mapped, which is also
> what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> would be mis/ab-using this interface, and that future changes could
> cause unforeseen issues for pKVM.

That is fairly enough. We can change the naming and the documents.

> 
> Maybe renaming this to something like "guest_fp", and specifying in
> the documentation that it can be restricted, e.g., instead of "the
> content of the private memory is invisible to userspace" something
> along the lines of  "the content of the guest memory may be restricted
> to userspace".

Some other candidates in my mind:
- restricted_fd: to pair with the mm side restricted_memfd
- protected_fd: as Sean suggested before
- fd: how it's explained relies on the memslot.flag.

Thanks,
Chao
> 
> What do you think?
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > Cheers,
> > > /fuad

Kirill A. Shutemov Oct. 17, 2022, 4:19 p.m. UTC | #32

On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> On 9/15/22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> > 
> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> > 
> > It provides semantics required for KVM guest private(encrypted) memory
> > support that a file descriptor with this flag set is going to be used as
> > the source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV.
> > 
> > KVM userspace is still in charge of the lifecycle of the memfd. It
> > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > in this patch to obtain the physical memory address and then populate
> > the secondary page table entries.
> > 
> > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > from userspace. When hole-punching happens, KVM can get notified through
> > inaccessible_notifier it then gets chance to remove any mapped entries
> > of the range in the secondary page tables.
> > 
> > The userspace inaccessible memfd itself is implemented as a shim layer
> > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > only implemented tmpfs. The allocated memory is currently marked as
> > unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> ...
> 
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +				   loff_t offset, loff_t len)
> > +{
> > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	inaccessible_notifier_invalidate(data, offset, offset + len);
> 
> Wonder if invalidate should precede the actual hole punch, otherwise we open
> a window where the page tables point to memory no longer valid?

Yes, you are right. Thanks for catching this.

> > +	return ret;
> > +}
> > +
> 
> ...
> 
> > +
> > +static struct file_system_type inaccessible_fs = {
> > +	.owner		= THIS_MODULE,
> > +	.name		= "[inaccessible]",
> 
> Dunno where exactly is this name visible, but shouldn't it better be
> "[memfd:inaccessible]"?

Maybe. And skip brackets.

> 
> > +	.init_fs_context = inaccessible_init_fs_context,
> > +	.kill_sb	= kill_anon_super,
> > +};
> > +
>

Gupta, Pankaj Oct. 17, 2022, 4:39 p.m. UTC | #33

On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
>> On 9/15/22 16:29, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> KVM can use memfd-provided memory for guest memory. For normal userspace
>>> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>>> virtual address space and then tells KVM to use the virtual address to
>>> setup the mapping in the secondary page table (e.g. EPT).
>>>
>>> With confidential computing technologies like Intel TDX, the
>>> memfd-provided memory may be encrypted with special key for special
>>> software domain (e.g. KVM guest) and is not expected to be directly
>>> accessed by userspace. Precisely, userspace access to such encrypted
>>> memory may lead to host crash so it should be prevented.
>>>
>>> This patch introduces userspace inaccessible memfd (created with
>>> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>>> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>>> in-kernel interface so KVM can directly interact with core-mm without
>>> the need to map the memory into KVM userspace.
>>>
>>> It provides semantics required for KVM guest private(encrypted) memory
>>> support that a file descriptor with this flag set is going to be used as
>>> the source of guest memory in confidential computing environments such
>>> as Intel TDX/AMD SEV.
>>>
>>> KVM userspace is still in charge of the lifecycle of the memfd. It
>>> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
>>> in this patch to obtain the physical memory address and then populate
>>> the secondary page table entries.
>>>
>>> The userspace inaccessible memfd can be fallocate-ed and hole-punched
>>> from userspace. When hole-punching happens, KVM can get notified through
>>> inaccessible_notifier it then gets chance to remove any mapped entries
>>> of the range in the secondary page tables.
>>>
>>> The userspace inaccessible memfd itself is implemented as a shim layer
>>> on top of real memory file systems like tmpfs/hugetlbfs but this patch
>>> only implemented tmpfs. The allocated memory is currently marked as
>>> unmovable and unevictable, this is required for current confidential
>>> usage. But in future this might be changed.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>> ---
>>
>> ...
>>
>>> +static long inaccessible_fallocate(struct file *file, int mode,
>>> +				   loff_t offset, loff_t len)
>>> +{
>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>> +	struct file *memfd = data->memfd;
>>> +	int ret;
>>> +
>>> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
>>> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>>> +			return -EINVAL;
>>> +	}
>>> +
>>> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>>> +	inaccessible_notifier_invalidate(data, offset, offset + len);
>>
>> Wonder if invalidate should precede the actual hole punch, otherwise we open
>> a window where the page tables point to memory no longer valid?
> 
> Yes, you are right. Thanks for catching this.

I also noticed this. But then thought the memory would be anyways zeroed 
(hole punched) before this call?

> 
>>> +	return ret;
>>> +}
>>> +
>>
>> ...
>>
>>> +
>>> +static struct file_system_type inaccessible_fs = {
>>> +	.owner		= THIS_MODULE,
>>> +	.name		= "[inaccessible]",
>>
>> Dunno where exactly is this name visible, but shouldn't it better be
>> "[memfd:inaccessible]"?
> 
> Maybe. And skip brackets.
> 
>>
>>> +	.init_fs_context = inaccessible_init_fs_context,
>>> +	.kill_sb	= kill_anon_super,
>>> +};
>>> +
>>
>

Fuad Tabba Oct. 17, 2022, 7:05 p.m. UTC | #34

Hi,

> > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > can also be mmaped as a whole to userspace as Sean suggested.
> >
> > That does work in practice, for now at least, and is what I do in my
> > current port. However, the naming and how the API is defined as
> > implied by the name and the documentation. By calling the field
> > private_fd, it does imply that it should not be mapped, which is also
> > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > would be mis/ab-using this interface, and that future changes could
> > cause unforeseen issues for pKVM.
>
> That is fairly enough. We can change the naming and the documents.
>
> >
> > Maybe renaming this to something like "guest_fp", and specifying in
> > the documentation that it can be restricted, e.g., instead of "the
> > content of the private memory is invisible to userspace" something
> > along the lines of  "the content of the guest memory may be restricted
> > to userspace".
>
> Some other candidates in my mind:
> - restricted_fd: to pair with the mm side restricted_memfd
> - protected_fd: as Sean suggested before
> - fd: how it's explained relies on the memslot.flag.

All these sound good, since they all capture the potential use cases.
Restricted might be the most logical choice if that's going to also
become the name for the mem_fd.

Thanks,
/fuad

> Thanks,
> Chao
> >
> > What do you think?
> >
> > Cheers,
> > /fuad
> >
> > >
> > > Thanks,
> > > Chao
> > > >
> > > > Cheers,
> > > > /fuad

Kirill A. Shutemov Oct. 17, 2022, 9:56 p.m. UTC | #35

On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > On 9/15/22 16:29, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > virtual address space and then tells KVM to use the virtual address to
> > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > 
> > > > With confidential computing technologies like Intel TDX, the
> > > > memfd-provided memory may be encrypted with special key for special
> > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > memory may lead to host crash so it should be prevented.
> > > > 
> > > > This patch introduces userspace inaccessible memfd (created with
> > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > the need to map the memory into KVM userspace.
> > > > 
> > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > support that a file descriptor with this flag set is going to be used as
> > > > the source of guest memory in confidential computing environments such
> > > > as Intel TDX/AMD SEV.
> > > > 
> > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > in this patch to obtain the physical memory address and then populate
> > > > the secondary page table entries.
> > > > 
> > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > of the range in the secondary page tables.
> > > > 
> > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > unmovable and unevictable, this is required for current confidential
> > > > usage. But in future this might be changed.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > 
> > > ...
> > > 
> > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > +				   loff_t offset, loff_t len)
> > > > +{
> > > > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	int ret;
> > > > +
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +	inaccessible_notifier_invalidate(data, offset, offset + len);
> > > 
> > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > a window where the page tables point to memory no longer valid?
> > 
> > Yes, you are right. Thanks for catching this.
> 
> I also noticed this. But then thought the memory would be anyways zeroed
> (hole punched) before this call?

Hole punching can free pages, given that offset/len covers full page.

Sean Christopherson Oct. 18, 2022, 12:33 a.m. UTC | #36

On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().

Ya, I was referring to pKVM without UPM / inaccessible memory.

Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.

I suspect the fast gup() path could even be optimized to avoid the page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected.  I'm guessing pKVM isn't expected to be deployed on massivve NUMA systems
anytime soon, so there should be plenty of page flags to go around.

Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the goal
of those were purely to simplify the KVM implementation, e.g. the "one mapping per
memslot" thing would go away entirely.

> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.

I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.

For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to be
treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
so even conceptually they are two disctint pages that just happen to share (some)
GPA bits.

To allow conversions, i.e. changing which mapping to use, without memslot updates,
KVM needs to let userspace provide both mappings in a single memslot.  So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
KVM would still need a dedicated private handle.

For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
userspace_addr, but since the private_fd is going to be added for x86, I think it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).

[1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
[2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

>  The host can then map or unmap the shared/private memory using the fd, which
>  allows it more freedom in even choosing to unmap shared memory when not
>  needed, for example.

Vishal Annapurve Oct. 18, 2022, 1:42 p.m. UTC | #37

On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > >
> > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > >
> > > > > With confidential computing technologies like Intel TDX, the
> > > > > memfd-provided memory may be encrypted with special key for special
> > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > memory may lead to host crash so it should be prevented.
> > > > >
> > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > the need to map the memory into KVM userspace.
> > > > >
> > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > support that a file descriptor with this flag set is going to be used as
> > > > > the source of guest memory in confidential computing environments such
> > > > > as Intel TDX/AMD SEV.
> > > > >
> > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > in this patch to obtain the physical memory address and then populate
> > > > > the secondary page table entries.
> > > > >
> > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > of the range in the secondary page tables.
> > > > >
> > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > unmovable and unevictable, this is required for current confidential
> > > > > usage. But in future this might be changed.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > +                                  loff_t offset, loff_t len)
> > > > > +{
> > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > +       struct file *memfd = data->memfd;
> > > > > +       int ret;
> > > > > +
> > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +                       return -EINVAL;
> > > > > +       }
> > > > > +
> > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > a window where the page tables point to memory no longer valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then thought the memory would be anyways zeroed
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers full page.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

I think moving this notifier_invalidate before fallocate may not solve
the problem completely. Is it possible that between invalidate and
fallocate, KVM tries to handle the page fault for the guest VM from
another vcpu and uses the pages to be freed to back gpa ranges? Should
hole punching here also update mem_attr first to say that KVM should
consider the corresponding gpa ranges to be no more backed by
inaccessible memfd?

Vishal Annapurve Oct. 19, 2022, 12:23 p.m. UTC | #38

On Thu, Sep 15, 2022 at 8:04 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/memfd.h      |  24 ++++
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   1 +
>  mm/Makefile                |   2 +-
>  mm/memfd.c                 |  25 ++++-
>  mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_MEMFD_H
>
>  #include <linux/file.h>
> +#include <linux/pfn_t.h>
>
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>  }
>  #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +       void (*invalidate)(struct inaccessible_notifier *notifier,
> +                          pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +       struct list_head list;
> +       const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define INACCESSIBLE_MAGIC     0x494e4143      /* "INAC" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC            0x0001U
>  #define MFD_ALLOW_SEALING      0x0002U
>  #define MFD_HUGETLB            0x0004U
> +#define MFD_INACCESSIBLE       0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..f82e5d4b4388 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>  obj-$(CONFIG_ZONE_DEVICE) += memremap.o
>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
> -obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
>  obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
>  obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>  obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +                      MFD_INACCESSIBLE)
>
>  SYSCALL_DEFINE2(memfd_create,
>                 const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
>                         return -EINVAL;
>         }
>
> +       /* Disallow sealing when MFD_INACCESSIBLE is set. */
> +       if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> +               return -EINVAL;
> +
> +       /* TODO: add hugetlb support */
> +       if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> +               return -EINVAL;
> +
>         /* length includes terminating zero */
>         len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>         if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
>                 *file_seals &= ~F_SEAL_SEAL;
>         }
>
> +       if (flags & MFD_INACCESSIBLE) {
> +               struct file *inaccessible_file;
> +
> +               inaccessible_file = memfd_mkinaccessible(file);
> +               if (IS_ERR(inaccessible_file)) {
> +                       error = PTR_ERR(inaccessible_file);
> +                       goto err_file;
> +               }
> +
> +               file = inaccessible_file;
> +       }
> +
>         fd_install(fd, file);
>         kfree(name);
>         return fd;
>
> +err_file:
> +       fput(file);
>  err_fd:
>         put_unused_fd(fd);
>  err_name:
> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> +                                pgoff_t start, pgoff_t end)
> +{
> +       struct inaccessible_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> +                                  loff_t offset, loff_t len)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       inaccessible_notifier_invalidate(data, offset, offset + len);
> +       return ret;
> +}
> +
> +static const struct file_operations inaccessible_fops = {
> +       .release = inaccessible_release,
> +       .fallocate = inaccessible_fallocate,
> +};
> +
> +static int inaccessible_getattr(struct user_namespace *mnt_userns,
> +                               const struct path *path, struct kstat *stat,
> +                               u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int inaccessible_setattr(struct user_namespace *mnt_userns,
> +                               struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations inaccessible_iops = {
> +       .getattr = inaccessible_getattr,
> +       .setattr = inaccessible_setattr,
> +};
> +
> +static int inaccessible_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type inaccessible_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "[inaccessible]",
> +       .init_fs_context = inaccessible_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *inaccessible_mnt;
> +
> +static __init int inaccessible_init(void)
> +{
> +       inaccessible_mnt = kern_mount(&inaccessible_fs);
> +       if (IS_ERR(inaccessible_mnt))
> +               return PTR_ERR(inaccessible_mnt);
> +       return 0;
> +}
> +fs_initcall(inaccessible_init);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> +       struct inaccessible_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &inaccessible_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> +                                "[memfd:inaccessible]", O_RDWR,
> +                                &inaccessible_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

In the context of userspace inaccessible memfd, what would be a
suggested way to enforce NUMA memory policy for physical memory
allocation? mbind[1] won't work here in absence of virtual address
range.

[1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382

Chao Peng Oct. 19, 2022, 1:30 p.m. UTC | #39

On Mon, Oct 17, 2022 at 08:05:10PM +0100, Fuad Tabba wrote:
> Hi,
> 
> > > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > > can also be mmaped as a whole to userspace as Sean suggested.
> > >
> > > That does work in practice, for now at least, and is what I do in my
> > > current port. However, the naming and how the API is defined as
> > > implied by the name and the documentation. By calling the field
> > > private_fd, it does imply that it should not be mapped, which is also
> > > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > > would be mis/ab-using this interface, and that future changes could
> > > cause unforeseen issues for pKVM.
> >
> > That is fairly enough. We can change the naming and the documents.
> >
> > >
> > > Maybe renaming this to something like "guest_fp", and specifying in
> > > the documentation that it can be restricted, e.g., instead of "the
> > > content of the private memory is invisible to userspace" something
> > > along the lines of  "the content of the guest memory may be restricted
> > > to userspace".
> >
> > Some other candidates in my mind:
> > - restricted_fd: to pair with the mm side restricted_memfd
> > - protected_fd: as Sean suggested before
> > - fd: how it's explained relies on the memslot.flag.
> 
> All these sound good, since they all capture the potential use cases.
> Restricted might be the most logical choice if that's going to also
> become the name for the mem_fd.

Thanks, I will use 'restricted' for them. e.g.:
- memfd_restricted() syscall
- restricted_fd
- restricted_offset

The memslot flags will still be KVM_MEM_PRIVATE, since I think pKVM will
create its own one?

Chao
> 
> Thanks,
> /fuad
> 
> > Thanks,
> > Chao
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > /fuad
> > >
> > > >
> > > > Thanks,
> > > > Chao
> > > > >
> > > > > Cheers,
> > > > > /fuad

Fuad Tabba Oct. 19, 2022, 3:04 p.m. UTC | #40

Hi,

On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > > pKVM would also need a way to make an fd accessible again
> > > > > > when shared back, which I think isn't possible with this patch.
> > > > >
> > > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > > be the same issue.
> > > >
> > > > pKVM doesn't really need to unmap the memory. What is really important
> > > > is that the memory is not GUP'able.
> > >
> > > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> > >
> > > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > > the end result is the same.
> > >
> > > Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
> > > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > > context of the hypervisor.  Which is also the source of the gup() problems; the
> > > untrusted kernel is blissfully unaware that the memory is inaccessible.
> > >
> > > Any approach that moves some of that information into the untrusted kernel so that
> > > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > > all of guest memory becomes unguppable, but that's likely not a viable option.
> >
> > Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> > all if we use the new inaccessible_get_pfn().
>
> Ya, I was referring to pKVM without UPM / inaccessible memory.
>
> Jumping back to blocking gup(), what about using the same tricks as secretmem to
> block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
> gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
> need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
> performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
> controversial.
>
> I suspect the fast gup() path could even be optimized to avoid the page_mapping()
> lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
> selected.  I'm guessing pKVM isn't expected to be deployed on massivve NUMA systems
> anytime soon, so there should be plenty of page flags to go around.
>
> Blocking gup() instead of trying to play refcount games when converting back to
> private would eliminate the need to put heavy restrictions on mapping, as the goal
> of those were purely to simplify the KVM implementation, e.g. the "one mapping per
> memslot" thing would go away entirely.

My implementation of mmap for inaccessible_fops was setting VM_PFNMAP.
That said, I realized that that might be adding an unnecessary
restriction, and now have changed it to do it the secretmem way.
That's straightforward and works well.

> > This of course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be orthogonal to
> > whether the memory is shared or private, rather than a private_fd for private
> > memory and the userspace_addr for shared memory.
>
> I also explored the idea of backing any guest memory with an fd, but came to
> the conclusion that private memory needs a separate handle[1], at least on x86.
>
> For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
> TDX steal GPA bits to differentiate private vs. shared), the two types need to be
> treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
> so even conceptually they are two disctint pages that just happen to share (some)
> GPA bits.
>
> To allow conversions, i.e. changing which mapping to use, without memslot updates,
> KVM needs to let userspace provide both mappings in a single memslot.  So while
> fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
> KVM would still need a dedicated private handle.
>
> For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
> userspace_addr, but since the private_fd is going to be added for x86, I think it
> makes sense to use that instead of adding generic fd-based memory for pKVM's use
> case (which is arguably still "private" memory but with special semantics).
>
> [1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
> [2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

As long as the API does not impose this limit, which would imply pKVM
is misusing it, then I agree. I think that's why renaming it to
something like "restricted" might be clearer.

Thanks,
/fuad

Kirill A. Shutemov Oct. 19, 2022, 3:32 p.m. UTC | #41

On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > >
> > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > >
> > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > memory may lead to host crash so it should be prevented.
> > > > > >
> > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > the need to map the memory into KVM userspace.
> > > > > >
> > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > the source of guest memory in confidential computing environments such
> > > > > > as Intel TDX/AMD SEV.
> > > > > >
> > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > the secondary page table entries.
> > > > > >
> > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > > of the range in the secondary page tables.
> > > > > >
> > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > usage. But in future this might be changed.
> > > > > >
> > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > ---
> > > > >
> > > > > ...
> > > > >
> > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > +                                  loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > +       struct file *memfd = data->memfd;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > +                       return -EINVAL;
> > > > > > +       }
> > > > > > +
> > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > >
> > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > a window where the page tables point to memory no longer valid?
> > > >
> > > > Yes, you are right. Thanks for catching this.
> > >
> > > I also noticed this. But then thought the memory would be anyways zeroed
> > > (hole punched) before this call?
> >
> > Hole punching can free pages, given that offset/len covers full page.
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov
> 
> I think moving this notifier_invalidate before fallocate may not solve
> the problem completely. Is it possible that between invalidate and
> fallocate, KVM tries to handle the page fault for the guest VM from
> another vcpu and uses the pages to be freed to back gpa ranges? Should
> hole punching here also update mem_attr first to say that KVM should
> consider the corresponding gpa ranges to be no more backed by
> inaccessible memfd?

We rely on external synchronization to prevent this. See code around
mmu_invalidate_retry_hva().

Vishal Annapurve Oct. 20, 2022, 10:50 a.m. UTC | #42

On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > >
> > > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > >
> > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > >
> > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > > the need to map the memory into KVM userspace.
> > > > > > >
> > > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > > the source of guest memory in confidential computing environments such
> > > > > > > as Intel TDX/AMD SEV.
> > > > > > >
> > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > > the secondary page table entries.
> > > > > > >
> > > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > > > of the range in the secondary page tables.
> > > > > > >
> > > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > > usage. But in future this might be changed.
> > > > > > >
> > > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > > ---
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > +                                  loff_t offset, loff_t len)
> > > > > > > +{
> > > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > +       struct file *memfd = data->memfd;
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > +                       return -EINVAL;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > >
> > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > a window where the page tables point to memory no longer valid?
> > > > >
> > > > > Yes, you are right. Thanks for catching this.
> > > >
> > > > I also noticed this. But then thought the memory would be anyways zeroed
> > > > (hole punched) before this call?
> > >
> > > Hole punching can free pages, given that offset/len covers full page.
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> >
> > I think moving this notifier_invalidate before fallocate may not solve
> > the problem completely. Is it possible that between invalidate and
> > fallocate, KVM tries to handle the page fault for the guest VM from
> > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > hole punching here also update mem_attr first to say that KVM should
> > consider the corresponding gpa ranges to be no more backed by
> > inaccessible memfd?
>
> We rely on external synchronization to prevent this. See code around
> mmu_invalidate_retry_hva().
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
ranges that are being invalidated are retried till invalidation is
complete. In this case, is it possible that KVM tries to serve the
page fault after inaccessible_notifier_invalidate is complete but
before fallocate could punch hole into the files?
e.g.
inaccessible_notifier_invalidate(...)
... (system event preempting this control flow, giving a window for
the guest to retry accessing the gfn range which was invalidated)
fallocate(.., PUNCH_HOLE..)

Chao Peng Oct. 21, 2022, 1:47 p.m. UTC | #43

> 
> In the context of userspace inaccessible memfd, what would be a
> suggested way to enforce NUMA memory policy for physical memory
> allocation? mbind[1] won't work here in absence of virtual address
> range.

How about set_mempolicy():
https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Chao
> 
> [1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382

Chao Peng Oct. 21, 2022, 1:54 p.m. UTC | #44

On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > > >
> > > > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > > >
> > > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > > >
> > > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > > > the need to map the memory into KVM userspace.
> > > > > > > >
> > > > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > > > the source of guest memory in confidential computing environments such
> > > > > > > > as Intel TDX/AMD SEV.
> > > > > > > >
> > > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > > > the secondary page table entries.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > > > > of the range in the secondary page tables.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > > > usage. But in future this might be changed.
> > > > > > > >
> > > > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > > > ---
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > > +                                  loff_t offset, loff_t len)
> > > > > > > > +{
> > > > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > > +       struct file *memfd = data->memfd;
> > > > > > > > +       int ret;
> > > > > > > > +
> > > > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > > +                       return -EINVAL;
> > > > > > > > +       }
> > > > > > > > +
> > > > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > > >
> > > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > > a window where the page tables point to memory no longer valid?
> > > > > >
> > > > > > Yes, you are right. Thanks for catching this.
> > > > >
> > > > > I also noticed this. But then thought the memory would be anyways zeroed
> > > > > (hole punched) before this call?
> > > >
> > > > Hole punching can free pages, given that offset/len covers full page.
> > > >
> > > > --
> > > >   Kiryl Shutsemau / Kirill A. Shutemov
> > >
> > > I think moving this notifier_invalidate before fallocate may not solve
> > > the problem completely. Is it possible that between invalidate and
> > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > hole punching here also update mem_attr first to say that KVM should
> > > consider the corresponding gpa ranges to be no more backed by
> > > inaccessible memfd?
> >
> > We rely on external synchronization to prevent this. See code around
> > mmu_invalidate_retry_hva().
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov
> 
> IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> ranges that are being invalidated are retried till invalidation is
> complete. In this case, is it possible that KVM tries to serve the
> page fault after inaccessible_notifier_invalidate is complete but
> before fallocate could punch hole into the files?
> e.g.
> inaccessible_notifier_invalidate(...)
> ... (system event preempting this control flow, giving a window for
> the guest to retry accessing the gfn range which was invalidated)
> fallocate(.., PUNCH_HOLE..)

Looks this is something can happen. And sounds to me the solution needs
just follow the mmu_notifier's way of using a invalidate_start/end pair.

  invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
                          zap KVM page table entries;
  fallocate()
  invalidate_end()  --> kvm->mmu_invalidate_in_progress--;

Then during invalidate_start/end time window mmu_invalidate_retry_gfn
checks 'mmu_invalidate_in_progress' and prevent repopulating the same
page in KVM page table.

  if(kvm->mmu_invalidate_in_progress)
      return 1; /* retry */

Thanks,
Chao

Sean Christopherson Oct. 21, 2022, 4:18 p.m. UTC | #45

On Fri, Oct 21, 2022, Chao Peng wrote:
> > 
> > In the context of userspace inaccessible memfd, what would be a
> > suggested way to enforce NUMA memory policy for physical memory
> > allocation? mbind[1] won't work here in absence of virtual address
> > range.
> 
> How about set_mempolicy():
> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Andy Lutomirski brought this up in an off-list discussion way back when the whole
private-fd thing was first being proposed.

  : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
  : we want to support them for TDX private memory, we either need TDX private
  : memory to have an HVA or we need file-based equivalents. Arguably we should add
  : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
  : even for tools like numactl.

Sean Christopherson Oct. 21, 2022, 4:53 p.m. UTC | #46

On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> > 
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?

It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the kernel
before they are removed from the guest will lead to use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
> 
> Looks this is something can happen.
> And sounds to me the solution needs
> just follow the mmu_notifier's way of using a invalidate_start/end pair.
> 
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>                           zap KVM page table entries;
>   fallocate()
>   invalidate_end()  --> kvm->mmu_invalidate_in_progress--;
> 
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevent repopulating the same
> page in KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.  The mmu_notifier's unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), but
frees the underlying physical page _after_ invalidate().

And the only reason the unpaired invalidate() exists is because there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, in
which case bookending doesn't work because the secondary MMU can't remove PTEs, it
can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so there
should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.

Kirill A. Shutemov Oct. 24, 2022, 2:59 p.m. UTC | #47

On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Chao Peng wrote:
> > > 
> > > In the context of userspace inaccessible memfd, what would be a
> > > suggested way to enforce NUMA memory policy for physical memory
> > > allocation? mbind[1] won't work here in absence of virtual address
> > > range.
> > 
> > How about set_mempolicy():
> > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> 
> Andy Lutomirski brought this up in an off-list discussion way back when the whole
> private-fd thing was first being proposed.
> 
>   : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
>   : we want to support them for TDX private memory, we either need TDX private
>   : memory to have an HVA or we need file-based equivalents. Arguably we should add
>   : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
>   : even for tools like numactl.

Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
addressed in the initial submission.

BTW, it is not regression comparing to old KVM slots, if the memory is
backed by memfd or other file:

MBIND(2)
       The  specified policy will be ignored for any MAP_SHARED mappings in the
       specified memory range.  Rather the pages will be allocated according to
       the  memory  policy  of the thread that caused the page to be allocated.
       Again, this may not be the thread that called mbind().

It is not clear how to define fbind(2) semantics, considering that multiple
processes may compete for the same region of page cache.

Should it be per-inode or per-fd? Or maybe per-range in inode/fd?

fmove_pages(2) should be relatively straight forward, since it is
best-effort and does not guarantee that the page will note be moved
somewhare else just after return from the syscall.

David Hildenbrand Oct. 24, 2022, 3:26 p.m. UTC | #48

On 24.10.22 16:59, Kirill A . Shutemov wrote:
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
>> On Fri, Oct 21, 2022, Chao Peng wrote:
>>>>
>>>> In the context of userspace inaccessible memfd, what would be a
>>>> suggested way to enforce NUMA memory policy for physical memory
>>>> allocation? mbind[1] won't work here in absence of virtual address
>>>> range.
>>>
>>> How about set_mempolicy():
>>> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
>>
>> Andy Lutomirski brought this up in an off-list discussion way back when the whole
>> private-fd thing was first being proposed.
>>
>>    : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
>>    : we want to support them for TDX private memory, we either need TDX private
>>    : memory to have an HVA or we need file-based equivalents. Arguably we should add
>>    : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
>>    : even for tools like numactl.
> 
> Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
> addressed in the initial submission.
> 
> BTW, it is not regression comparing to old KVM slots, if the memory is
> backed by memfd or other file:
> 
> MBIND(2)
>         The  specified policy will be ignored for any MAP_SHARED mappings in the
>         specified memory range.  Rather the pages will be allocated according to
>         the  memory  policy  of the thread that caused the page to be allocated.
>         Again, this may not be the thread that called mbind().

IIRC, that documentation is imprecise/incorrect especially when it comes 
to memfd. Page faults in shared mappings will similarly obey the set 
mbind() policy when allocating new pages.

QEMU relies on that.

The "fun" begins when we have multiple mappings, and only some have a 
policy set ... or if we already, previously allocated the pages.

Vishal Annapurve Nov. 3, 2022, 4:27 p.m. UTC | #49

On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the whole
> > private-fd thing was first being proposed.
> >
> >   : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
> >   : we want to support them for TDX private memory, we either need TDX private
> >   : memory to have an HVA or we need file-based equivalents. Arguably we should add
> >   : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> >   : even for tools like numactl.
>
> Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
> addressed in the initial submission.
>
> BTW, it is not regression comparing to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
>        The  specified policy will be ignored for any MAP_SHARED mappings in the
>        specified memory range.  Rather the pages will be allocated according to
>        the  memory  policy  of the thread that caused the page to be allocated.
>        Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis on mempolicy with shmem seems to be right. set_policy
on virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.

Maybe having a way to set numa policy per-range in the inode would be
at par with what we can do today via mbind on virtual address ranges.



> fmove_pages(2) should be relatively straight forward, since it is
> best-effort and does not guarantee that the page will note be moved
> somewhare else just after return from the syscall.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

[v8,1/8] mm/memfd: Introduce userspace inaccessible memfd

Commit Message

Comments

Patch