[RFC,0/5] mm/gup: Introduce exclusive GUP pinning

Message ID 20240618-exclusive-gup-v1-0-30472a19c5d1@quicinc.com

Message

Elliot Berman June 19, 2024, 12:05 a.m. UTC
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
grabbing shmem user pages instead of using KVM's guest_memfd. These
hypervisors provide a different isolation model than the CoCo
implementations from x86. KVM's guest_memfd is focused on providing
memory that is more isolated than AVF requires. Some specific examples
include the ability to pre-load data onto guest-private pages, to
dynamically share/isolate guest pages without copying, and (in the
future) to migrate guest-private pages. Given those differences, and
after a discussion in [1] and at PUCK, we want to try to stick with
existing shmem and extend GUP to support the isolation needs of arm64
pKVM and Gunyah. To that end, we introduce the concept of "exclusive
GUP pinning", which enforces that only one pin of any kind is allowed
when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect
FOLL_GET or any other folio refcount operations that don't go through
the FOLL_PIN path.

[1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
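
To make the intended semantics concrete, below is a minimal, self-contained
userspace model of the rule described above (and clarified later in this
thread): an exclusive pin is refused if any pin already exists, and a normal
pin is refused while an exclusive pin is held. FOLL_GET references are not
modeled because they are unaffected. This is an illustration only, not the
mm/gup.c implementation, and all names are invented.

#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the pin state of a single folio. */
struct folio_model {
        int  pins;              /* number of FOLL_PIN references held */
        bool exclusive;         /* true if the (single) pin is exclusive */
};

static bool try_pin(struct folio_model *f, bool exclusive)
{
        if (f->exclusive)
                return false;           /* exclusive pin blocks all other pins */
        if (exclusive && f->pins > 0)
                return false;           /* exclusive requires no existing pin */
        f->pins++;
        f->exclusive = exclusive;
        return true;
}

static void unpin(struct folio_model *f)
{
        if (--f->pins == 0)
                f->exclusive = false;
}

int main(void)
{
        struct folio_model a = { 0 }, b = { 0 };

        assert(try_pin(&a, false));     /* normal pin succeeds */
        assert(!try_pin(&a, true));     /* exclusive refused: already pinned */

        assert(try_pin(&b, true));      /* exclusive pin on an unpinned folio */
        assert(!try_pin(&b, false));    /* ... and it blocks later normal pins */
        unpin(&b);
        assert(try_pin(&b, false));     /* pinnable again once released */
        return 0;
}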

Tree with patches at:
https://git.codelinaro.org/clo/linux-kernel/gunyah-linux/-/tree/sent/exclusive-gup-v1


Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
Elliot Berman (2):
      mm/gup-test: Verify exclusive pinned
      mm/gup_test: Verify GUP grabs same pages twice

Fuad Tabba (3):
      mm/gup: Move GUP_PIN_COUNTING_BIAS to page_ref.h
      mm/gup: Add an option for obtaining an exclusive pin
      mm/gup: Add support for re-pinning a normal pinned page as exclusive

 include/linux/mm.h                    |  57 ++++----
 include/linux/mm_types.h              |   2 +
 include/linux/page_ref.h              |  74 ++++++++++
 mm/Kconfig                            |   5 +
 mm/gup.c                              | 265 ++++++++++++++++++++++++++++++----
 mm/gup_test.c                         | 108 ++++++++++++++
 mm/gup_test.h                         |   1 +
 tools/testing/selftests/mm/gup_test.c |   5 +-
 8 files changed, 457 insertions(+), 60 deletions(-)
---
base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
change-id: 20240509-exclusive-gup-66259138bbff

Best regards,
-- 
Elliot Berman <quic_eberman@quicinc.com>

Comments

Elliot Berman June 19, 2024, 12:11 a.m. UTC | #1
b4 wasn't happy with my copy/paste of the CC list from Fuad's series
[1]. CC'ing them here.

[1]: https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/

On Tue, Jun 18, 2024 at 05:05:06PM -0700, Elliot Berman wrote:
> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> grabbing shmem user pages instead of using KVM's guestmemfd. These
> hypervisors provide a different isolation model than the CoCo
> implementations from x86. KVM's guest_memfd is focused on providing
> memory that is more isolated than AVF requires. Some specific examples
> include ability to pre-load data onto guest-private pages, dynamically
> sharing/isolating guest pages without copy, and (future) migrating
> guest-private pages.  In sum of those differences after a discussion in
> [1] and at PUCK, we want to try to stick with existing shmem and extend
> GUP to support the isolation needs for arm64 pKVM and Gunyah. To that
> end, we introduce the concept of "exclusive GUP pinning", which enforces
> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
> flag is set. This behavior doesn't affect FOLL_GET or any other folio
> refcount operations that don't go through the FOLL_PIN path.
> 
> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
> 
> Tree with patches at:
> https://git.codelinaro.org/clo/linux-kernel/gunyah-linux/-/tree/sent/exclusive-gup-v1
> 
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> ---
> Elliot Berman (2):
>       mm/gup-test: Verify exclusive pinned
>       mm/gup_test: Verify GUP grabs same pages twice
> 
> Fuad Tabba (3):
>       mm/gup: Move GUP_PIN_COUNTING_BIAS to page_ref.h
>       mm/gup: Add an option for obtaining an exclusive pin
>       mm/gup: Add support for re-pinning a normal pinned page as exclusive
> 
>  include/linux/mm.h                    |  57 ++++----
>  include/linux/mm_types.h              |   2 +
>  include/linux/page_ref.h              |  74 ++++++++++
>  mm/Kconfig                            |   5 +
>  mm/gup.c                              | 265 ++++++++++++++++++++++++++++++----
>  mm/gup_test.c                         | 108 ++++++++++++++
>  mm/gup_test.h                         |   1 +
>  tools/testing/selftests/mm/gup_test.c |   5 +-
>  8 files changed, 457 insertions(+), 60 deletions(-)
> ---
> base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
> change-id: 20240509-exclusive-gup-66259138bbff
> 
> Best regards,
> -- 
> Elliot Berman <quic_eberman@quicinc.com>
>
John Hubbard June 19, 2024, 2:44 a.m. UTC | #2
On 6/18/24 5:05 PM, Elliot Berman wrote:
> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> grabbing shmem user pages instead of using KVM's guestmemfd. These
> hypervisors provide a different isolation model than the CoCo
> implementations from x86. KVM's guest_memfd is focused on providing
> memory that is more isolated than AVF requires. Some specific examples
> include ability to pre-load data onto guest-private pages, dynamically
> sharing/isolating guest pages without copy, and (future) migrating
> guest-private pages.  In sum of those differences after a discussion in
> [1] and at PUCK, we want to try to stick with existing shmem and extend
> GUP to support the isolation needs for arm64 pKVM and Gunyah. To that
> end, we introduce the concept of "exclusive GUP pinning", which enforces
> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
> flag is set. This behavior doesn't affect FOLL_GET or any other folio
> refcount operations that don't go through the FOLL_PIN path.
> 
> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
> 

Hi!

Looking through this, I feel that some intangible threshold of "this is
too much overloading of page->_refcount" has been crossed. This is a very
specific feature, and it is using approximately one more bit than is
really actually "available"...
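
For context, the overloading in question works roughly as follows today: each
FOLL_PIN adds GUP_PIN_COUNTING_BIAS (1024, from include/linux/mm.h) to the
folio refcount, so "maybe pinned" can be inferred from the refcount alone
(large folios already keep a separate _pincount). Here is a small
self-contained model of that scheme, with constants mirroring the kernel's;
how the series encodes the exclusive pin on top of this is not modeled:

#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS   (1 << 10)       /* matches include/linux/mm.h */

static int refcount;    /* stand-in for a small folio's _refcount */

static void take_ref(void) { refcount += 1; }                     /* FOLL_GET */
static void take_pin(void) { refcount += GUP_PIN_COUNTING_BIAS; } /* FOLL_PIN */

static int maybe_pinned(void)   /* cf. folio_maybe_dma_pinned() */
{
        return refcount >= GUP_PIN_COUNTING_BIAS;
}

int main(void)
{
        take_ref();
        printf("after FOLL_GET: refcount=%d maybe_pinned=%d\n",
               refcount, maybe_pinned());
        take_pin();
        printf("after FOLL_PIN: refcount=%d maybe_pinned=%d\n",
               refcount, maybe_pinned());
        return 0;
}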

If we need a bit in struct page/folio, is this really the only way? Willy
is working towards getting us an entirely separate folio->pincount, I
suppose that might take too long? Or not?

This feels like force-fitting a very specific feature (KVM/CoCo handling
of shmem pages) into a more general mechanism that is running low on
bits (gup/pup).

Maybe a good topic for LPC!

thanks,
David Hildenbrand June 19, 2024, 7:37 a.m. UTC | #3
Hi,

On 19.06.24 04:44, John Hubbard wrote:
> On 6/18/24 5:05 PM, Elliot Berman wrote:
>> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
>> grabbing shmem user pages instead of using KVM's guestmemfd. These
>> hypervisors provide a different isolation model than the CoCo
>> implementations from x86. KVM's guest_memfd is focused on providing
>> memory that is more isolated than AVF requires. Some specific examples
>> include ability to pre-load data onto guest-private pages, dynamically
>> sharing/isolating guest pages without copy, and (future) migrating
>> guest-private pages.  In sum of those differences after a discussion in
>> [1] and at PUCK, we want to try to stick with existing shmem and extend
>> GUP to support the isolation needs for arm64 pKVM and Gunyah.

The main question really is, into which direction we want and can 
develop guest_memfd. At this point (after talking to Jason at LSF/MM), I 
wonder if guest_memfd should be our new target for guest memory, both 
shared and private. There are a bunch of issues to be sorted out though ...

As there is interest from Red Hat into supporting hugetlb-style huge 
pages in confidential VMs for real-time workloads, and wasting memory is 
not really desired, I'm going to think some more about some of the 
challenges (shared+private in guest_memfd, mmap support, migration of 
!shared folios, hugetlb-like support, in-place shared<->private 
conversion, interaction with page pinning). Tricky.

Ideally, we'd have one way to back guest memory for confidential VMs in 
the future.


Can you comment on the bigger design goal here? In particular:

1) Who would get the exclusive PIN and for which reason? When would we
    pin, when would we unpin?

2) What would happen if there is already another PIN? Can we deal with
    speculative short-term PINs from GUP-fast that could introduce
    errors?

3) How can we be sure we don't need other long-term pins (IOMMUs?) in
    the future?

4) Why are GUP pins special? How would one deal with other folio
    references (e.g., simply mmap the shmem file into a different
    process)?

5) Why do you have to bother about anonymous pages at all (skimming over
    some patches), when you really only want to handle shmem differently?

>> To that
>> end, we introduce the concept of "exclusive GUP pinning", which enforces
>> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
>> flag is set. This behavior doesn't affect FOLL_GET or any other folio
>> refcount operations that don't go through the FOLL_PIN path.

So, FOLL_EXCLUSIVE would fail if there already is a PIN, but 
!FOLL_EXCLUSIVE would succeed even if there is a single PIN via 
FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins 
that don't have FOLL_EXCLUSIVE set fail as well?

>>
>> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
>>
> 
> Hi!
> 
> Looking through this, I feel that some intangible threshold of "this is
> too much overloading of page->_refcount" has been crossed. This is a very
> specific feature, and it is using approximately one more bit than is
> really actually "available"...

Agreed.

> 
> If we need a bit in struct page/folio, is this really the only way? Willy
> is working towards getting us an entirely separate folio->pincount, I
> suppose that might take too long? Or not?

Before talking about how to implement it, I think we first have to learn 
whether that approach is what we want at all, and how it fits into the 
bigger picture of that use case.

> 
> This feels like force-fitting a very specific feature (KVM/CoCo handling
> of shmem pages) into a more general mechanism that is running low on
> bits (gup/pup).

Agreed.

> 
> Maybe a good topic for LPC!

The KVM track has plenty of guest_memfd topics, might be a good fit 
there. (or in the MM track, of course)
Fuad Tabba June 19, 2024, 9:11 a.m. UTC | #4
Hi John and David,

Thank you for your comments.

On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> Hi,
>
> On 19.06.24 04:44, John Hubbard wrote:
> > On 6/18/24 5:05 PM, Elliot Berman wrote:
> >> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> >> grabbing shmem user pages instead of using KVM's guestmemfd. These
> >> hypervisors provide a different isolation model than the CoCo
> >> implementations from x86. KVM's guest_memfd is focused on providing
> >> memory that is more isolated than AVF requires. Some specific examples
> >> include ability to pre-load data onto guest-private pages, dynamically
> >> sharing/isolating guest pages without copy, and (future) migrating
> >> guest-private pages.  In sum of those differences after a discussion in
> >> [1] and at PUCK, we want to try to stick with existing shmem and extend
> >> GUP to support the isolation needs for arm64 pKVM and Gunyah.
>
> The main question really is, into which direction we want and can
> develop guest_memfd. At this point (after talking to Jason at LSF/MM), I
> wonder if guest_memfd should be our new target for guest memory, both
> shared and private. There are a bunch of issues to be sorted out though ...
>
> As there is interest from Red Hat into supporting hugetlb-style huge
> pages in confidential VMs for real-time workloads, and wasting memory is
> not really desired, I'm going to think some more about some of the
> challenges (shared+private in guest_memfd, mmap support, migration of
> !shared folios, hugetlb-like support, in-place shared<->private
> conversion, interaction with page pinning). Tricky.
>
> Ideally, we'd have one way to back guest memory for confidential VMs in
> the future.

As you know, initially we went down the route of guest memory and
invested a lot of time on it, including presenting our proposal at LPC
last year. But there was resistance to expanding it to support more
than what was initially envisioned, e.g., sharing guest memory in
place, migration, and maybe even huge pages, and their implications,
such as being able to conditionally mmap guest memory.

To be honest, personally (speaking only for myself, not necessarily
for Elliot and not for anyone else in the pKVM team), I still would
prefer to use guest_memfd(). I think that having one solution for
confidential computing that rules them all would be best. But we do
need to be able to share memory in place, have a plan for supporting
huge pages in the near future, and migration in the not-too-distant
future.

We are currently shipping pKVM in Android as it is, warts and all.
We're also working on upstreaming the rest of it. Currently, this is
the main blocker for us to be able to upstream the rest (same probably
applies to Gunyah).

> Can you comment on the bigger design goal here? In particular:

At a high level: We want to prevent a misbehaving host process from
crashing the system when attempting to access (deliberately or
accidentally) protected guest memory. As it currently stands in pKVM
and Gunyah, the hypervisor does prevent the host from accessing
(private) guest memory. In certain cases though, if the host attempts
to access that memory and is prevented by the hypervisor (either out
of ignorance or out of malice), the host kernel wouldn't be able to
recover, causing the whole system to crash.

guest_memfd() prevents such accesses by not allowing confidential
memory to be mapped at the host to begin with. This works fine for us,
but there's the issue of being able to share memory in place, which
implies mapping it conditionally (among others that I've mentioned).

The approach we're taking with this proposal is to instead restrict
the pinning of protected memory. If the host kernel can't pin the
memory, then a misbehaving process can't trick the host into accessing
it.

>
> 1) Who would get the exclusive PIN and for which reason? When would we
>     pin, when would we unpin?

The exclusive pin would be acquired for private guest pages, in
addition to a normal pin. It would be released when the private memory
is released, or if the guest shares that memory.
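
A rough sketch of that lifecycle (every helper name below is invented, and
whether FOLL_EXCLUSIVE would be passed through pin_user_pages() like this is
an assumption rather than what the patches do; the additional normal pin and
the series' support for re-pinning an already-pinned page as exclusive are
not shown):

static int make_page_private(unsigned long uaddr, struct page **pagep)
{
        long ret;

        mmap_read_lock(current->mm);
        ret = pin_user_pages(uaddr, 1, FOLL_EXCLUSIVE, pagep);
        mmap_read_unlock(current->mm);
        if (ret != 1)
                return -EBUSY;          /* some other pin already exists */

        return hyp_donate_to_guest(*pagep);     /* hypothetical hypercall */
}

static void make_page_shared(struct page *page)
{
        hyp_share_with_host(page);      /* hypothetical hypercall */
        unpin_user_page(page);          /* the exclusive pin is dropped here */
}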

> 2) What would happen if there is already another PIN? Can we deal with
>     speculative short-term PINs from GUP-fast that could introduce
>     errors?

The exclusive pin would be rejected if there's any other pin
(exclusive or normal). Normal pins would be rejected if there's an
exclusive pin.

> 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
>     the future?

I can't :)

> 4) Why are GUP pins special? How one would deal with other folio
>     references (e.g., simply mmap the shmem file into a different
>     process).

Other references would crash the userspace process, but the host
kernel can handle them, and shouldn't cause the system to crash. The
way things are now in Android/pKVM, a userspace process can crash the
system as a whole.

> 5) Why you have to bother about anonymous pages at all (skimming over s
>     some patches), when you really want to handle shmem differently only?

I'm not sure I understand the question. We use anonymous memory for pKVM.

> >> To that
> >> end, we introduce the concept of "exclusive GUP pinning", which enforces
> >> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
> >> flag is set. This behavior doesn't affect FOLL_GET or any other folio
> >> refcount operations that don't go through the FOLL_PIN path.
>
> So, FOLL_EXCLUSIVE would fail if there already is a PIN, but
> !FOLL_EXCLUSIVE would succeed even if there is a single PIN via
> FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins
> that don't have FOLL_EXCLUSIVE set fail as well?

A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin
(!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the
PIN to end all pins!

> >>
> >> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
> >>
> >
> > Hi!
> >
> > Looking through this, I feel that some intangible threshold of "this is
> > too much overloading of page->_refcount" has been crossed. This is a very
> > specific feature, and it is using approximately one more bit than is
> > really actually "available"...
>
> Agreed.

We are gating it behind a CONFIG flag :)

Also, since pin is already overloading the refcount, having the
exclusive pin there helps in ensuring atomic accesses and avoiding
races.

> >
> > If we need a bit in struct page/folio, is this really the only way? Willy
> > is working towards getting us an entirely separate folio->pincount, I
> > suppose that might take too long? Or not?
>
> Before talking about how to implement it, I think we first have to learn
> whether that approach is what we want at all, and how it fits into the
> bigger picture of that use case.
>
> >
> > This feels like force-fitting a very specific feature (KVM/CoCo handling
> > of shmem pages) into a more general mechanism that is running low on
> > bits (gup/pup).
>
> Agreed.
>
> >
> > Maybe a good topic for LPC!
>
> The KVM track has plenty of guest_memfd topics, might be a good fit
> there. (or in the MM track, of course)

We are planning on submitting a proposal for LPC (see you in Vienna!) :)

Thanks again!
/fuad (and elliot*)

* Mistakes, errors, and unclear statements in this email are mine alone though.

> --
> Cheers,
>
> David / dhildenb
>
Jason Gunthorpe June 19, 2024, 11:51 a.m. UTC | #5
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:

> To be honest, personally (speaking only for myself, not necessarily
> for Elliot and not for anyone else in the pKVM team), I still would
> prefer to use guest_memfd(). I think that having one solution for
> confidential computing that rules them all would be best. But we do
> need to be able to share memory in place, have a plan for supporting
> huge pages in the near future, and migration in the not-too-distant
> future.

I think using a FD to control this special lifetime stuff is
dramatically better than trying to force the MM to do it with struct
page hacks.

If you can't agree with the guest_memfd people on how to get there
then maybe you need a guest_memfd2 for this slightly different special
stuff instead of intruding on the core mm so much. (though that would
be sad)

We really need to be thinking more about containing these special
things and not just sprinkling them everywhere.

> The approach we're taking with this proposal is to instead restrict
> the pinning of protected memory. If the host kernel can't pin the
> memory, then a misbehaving process can't trick the host into accessing
> it.

If the memory can't be accessed by the CPU then it shouldn't be mapped
into a PTE in the first place. The fact you made userspace faults
(only) work is nifty but still an ugly hack to get around the fact you
shouldn't be mapping in the first place.

We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this
scenario. "memory" that cannot be touched by the CPU but can still be
specially accessed by enlightened components.

guest_memfd, and more broadly memfd based instead of VMA based, memory
mapping in KVM is a similar outcome to DEVICE_PRIVATE.

I think you need to stay in the world of not mapping the memory, one
way or another.

> > 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
> >     the future?
> 
> I can't :)

AFAICT in the pKVM model the IOMMU has to be managed by the
hypervisor..

> We are gating it behind a CONFIG flag :)
> 
> Also, since pin is already overloading the refcount, having the
> exclusive pin there helps in ensuring atomic accesses and avoiding
> races.

Yeah, but every time someone does this and then links it to a uAPI it
becomes utterly baked in concrete for the MM forever.

Jason
Fuad Tabba June 19, 2024, 12:01 p.m. UTC | #6
Hi Jason,

On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
>
> > To be honest, personally (speaking only for myself, not necessarily
> > for Elliot and not for anyone else in the pKVM team), I still would
> > prefer to use guest_memfd(). I think that having one solution for
> > confidential computing that rules them all would be best. But we do
> > need to be able to share memory in place, have a plan for supporting
> > huge pages in the near future, and migration in the not-too-distant
> > future.
>
> I think using a FD to control this special lifetime stuff is
> dramatically better than trying to force the MM to do it with struct
> page hacks.
>
> If you can't agree with the guest_memfd people on how to get there
> then maybe you need a guest_memfd2 for this slightly different special
> stuff instead of intruding on the core mm so much. (though that would
> be sad)
>
> We really need to be thinking more about containing these special
> things and not just sprinkling them everywhere.

I agree that we need to agree :) This discussion has been going on
since before LPC last year, and the consensus from the guest_memfd()
folks (if I understood it correctly) is that guest_memfd() is what it
is: designed for a specific type of confidential computing, in the
style of TDX and CCA perhaps, and that it cannot (or will not) perform
the role of being a general solution for all confidential computing.

> > The approach we're taking with this proposal is to instead restrict
> > the pinning of protected memory. If the host kernel can't pin the
> > memory, then a misbehaving process can't trick the host into accessing
> > it.
>
> If the memory can't be accessed by the CPU then it shouldn't be mapped
> into a PTE in the first place. The fact you made userspace faults
> (only) work is nifty but still an ugly hack to get around the fact you
> shouldn't be mapping in the first place.
>
> We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this
> scenario. "memory" that cannot be touched by the CPU but can still be
> specially accessed by enlightened components.
>
> guest_memfd, and more broadly memfd based instead of VMA based, memory
> mapping in KVM is a similar outcome to DEVICE_PRIVATE.
>
> I think you need to stay in the world of not mapping the memory, one
> way or another.

As I mentioned earlier, that's my personal preferred option.

> > > 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
> > >     the future?
> >
> > I can't :)
>
> AFAICT in the pKVM model the IOMMU has to be managed by the
> hypervisor..

I realized that I misunderstood this. At least speaking for pKVM, we
don't need other long term pins as long as the memory is private. The
exclusive pin is dropped when the memory is shared.

> > We are gating it behind a CONFIG flag :)
> >
> > Also, since pin is already overloading the refcount, having the
> > exclusive pin there helps in ensuring atomic accesses and avoiding
> > races.
>
> Yeah, but every time someone does this and then links it to a uAPI it
> becomes utterly baked in concrete for the MM forever.

I agree. But if we can't modify guest_memfd() to fit our needs (pKVM,
Gunyah), then we don't really have that many other options.

Thanks!
/fuad

> Jason
David Hildenbrand June 19, 2024, 12:16 p.m. UTC | #7
On 19.06.24 11:11, Fuad Tabba wrote:
> Hi John and David,
> 
> Thank you for your comments.
> 
> On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Hi,
>>
>> On 19.06.24 04:44, John Hubbard wrote:
>>> On 6/18/24 5:05 PM, Elliot Berman wrote:
>>>> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
>>>> grabbing shmem user pages instead of using KVM's guestmemfd. These
>>>> hypervisors provide a different isolation model than the CoCo
>>>> implementations from x86. KVM's guest_memfd is focused on providing
>>>> memory that is more isolated than AVF requires. Some specific examples
>>>> include ability to pre-load data onto guest-private pages, dynamically
>>>> sharing/isolating guest pages without copy, and (future) migrating
>>>> guest-private pages.  In sum of those differences after a discussion in
>>>> [1] and at PUCK, we want to try to stick with existing shmem and extend
>>>> GUP to support the isolation needs for arm64 pKVM and Gunyah.
>>
>> The main question really is, into which direction we want and can
>> develop guest_memfd. At this point (after talking to Jason at LSF/MM), I
>> wonder if guest_memfd should be our new target for guest memory, both
>> shared and private. There are a bunch of issues to be sorted out though ...
>>
>> As there is interest from Red Hat into supporting hugetlb-style huge
>> pages in confidential VMs for real-time workloads, and wasting memory is
>> not really desired, I'm going to think some more about some of the
>> challenges (shared+private in guest_memfd, mmap support, migration of
>> !shared folios, hugetlb-like support, in-place shared<->private
>> conversion, interaction with page pinning). Tricky.
>>
>> Ideally, we'd have one way to back guest memory for confidential VMs in
>> the future.
> 
> As you know, initially we went down the route of guest memory and
> invested a lot of time on it, including presenting our proposal at LPC
> last year. But there was resistance to expanding it to support more
> than what was initially envisioned, e.g., sharing guest memory in
> place migration, and maybe even huge pages, and its implications such
> as being able to conditionally mmap guest memory.

Yes, and I think we might have to revive that discussion, unfortunately. 
I started thinking about this, but did not reach a conclusion. Sharing 
my thoughts.

The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not 
just for private memory should be:

(1) Have private + shared parts backed by guest_memfd. Either the same,
     or a fd pair.
(2) Allow to mmap only the "shared" parts.
(3) Allow in-place conversion between "shared" and "private" parts.
(4) Allow migration of the "shared" parts.

A) Convert shared -> private?
* Must not be GUP-pinned
* Must not be mapped
* Must not reside on ZONE_MOVABLE/MIGRATE_CMA
* (must rule out any other problematic folio references that could
    read/write memory, might be feasible for guest_memfd)

B) Convert private -> shared?
* Nothing to consider

C) Map something?
* Must not be private
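
A kernel-style sketch of the shared -> private checks in (A), using existing
folio helpers (the function itself is hypothetical, not something this series
or guest_memfd provides today):

static int gmem_try_convert_to_private(struct folio *folio)
{
        if (folio_maybe_dma_pinned(folio))      /* must not be GUP-pinned */
                return -EBUSY;
        if (folio_mapped(folio))                /* must not be mapped anywhere */
                return -EBUSY;
        /* rules out ZONE_MOVABLE / MIGRATE_CMA placement */
        if (!folio_is_longterm_pinnable(folio))
                return -EINVAL;
        /* ... still need to rule out other problematic folio references ... */
        return 0;
}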

For ordinary (small) pages, that might be feasible. 
(ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not 
support them initially)

The real fun begins once we want to support huge pages/large folios and 
can end up having a mixture of "private" and "shared" per huge page. But 
really, that's what we want in the end I think.

Unless we can teach the VM to not convert arbitrary physical memory 
ranges on a 4k basis to a mixture of private/shared ... but I've been 
told we don't want that. Hm.


There are two big problems with that that I can see:

1) References/GUP-pins are per folio

What if some shared part of the folio is pinned but another shared part 
that we want to convert to private is not? Core-mm will not provide the 
answer to that: the folio may be pinned, that's it. *Disallowing* at 
least long-term GUP-pins might be an option.

To get stuff into an IOMMU, maybe a per-fd interface could work, and 
guest_memfd would track itself which parts are currently "handed out", 
and with which "semantics" (shared vs. private).

[IOMMU + private parts might require that either way? Because, if we 
disallow mmap, how should that ever work with an IOMMU otherwise].
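
One possible shape for that per-fd tracking (purely illustrative, every name
below is invented, locking omitted):

enum gmem_handout_state { GMEM_SHARED, GMEM_PRIVATE };

struct gmem_handout {
        enum gmem_handout_state state;
        unsigned int            handed_out;     /* e.g. mapped into an IOMMU */
};

struct gmem_inode_info {
        struct xarray           handouts;       /* pgoff -> struct gmem_handout */
};

/* Hand a page out with the requested semantics, or refuse on a mismatch. */
static int gmem_hand_out(struct gmem_inode_info *info, pgoff_t index,
                         enum gmem_handout_state state)
{
        struct gmem_handout *h = xa_load(&info->handouts, index);

        if (!h || h->state != state)
                return -EPERM;
        h->handed_out++;
        return 0;
}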

2) Tracking of mappings will likely soon be per folio.

page_mapped() / folio_mapped() only tell us if any part of the folio is 
mapped. Of course, what always works is unmapping the whole thing, or 
walking the rmap to detect if a specific part is currently mapped.


Then, there is the problem of getting huge pages into guest_memfd (using 
hugetlb reserves, but not using hugetlb), but that should be solvable.


As raised in previous discussions, I think we should then allow the 
whole guest_memfd to be mapped, but simply SIGBUS/... when trying to 
access a private part. We would track private/shared internally, and 
track "handed out" pages to IOMMUs internally. FOLL_LONGTERM would be 
disallowed.
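
In code, the fault side of that idea might look roughly like this
(gmem_range_is_private() is an invented helper, and the shared-page path is
elided):

static vm_fault_t gmem_fault(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);

        /* The whole guest_memfd may be mmap'ed ... */
        if (gmem_range_is_private(inode, vmf->pgoff))
                return VM_FAULT_SIGBUS; /* ... but private parts refuse to map */

        /*
         * ... otherwise look up the shared folio, take a reference, set
         * vmf->page, and return as a normal fault handler would ...
         */
        return 0;
}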

But that's only the high-level idea I had so far ... likely ignoring way 
too many details.

Is there broader interest in discussing that, and would there be value in 
setting up a meeting to finally make progress on it?

I recall quite some details with memory renting or so on pKVM ... and I 
have to refresh my memory on that.

> 
> To be honest, personally (speaking only for myself, not necessarily
> for Elliot and not for anyone else in the pKVM team), I still would
> prefer to use guest_memfd(). I think that having one solution for
> confidential computing that rules them all would be best. But we do
> need to be able to share memory in place, have a plan for supporting
> huge pages in the near future, and migration in the not-too-distant
> future.

Yes, huge pages are also of interest for RH. And memory-overconsumption 
due to having partially used huge pages in private/shared memory is not 
desired.

> 
> We are currently shipping pKVM in Android as it is, warts and all.
> We're also working on upstreaming the rest of it. Currently, this is
> the main blocker for us to be able to upstream the rest (same probably
> applies to Gunyah).
> 
>> Can you comment on the bigger design goal here? In particular:
> 
> At a high level: We want to prevent a misbehaving host process from
> crashing the system when attempting to access (deliberately or
> accidentally) protected guest memory. As it currently stands in pKVM
> and Gunyah, the hypervisor does prevent the host from accessing
> (private) guest memory. In certain cases though, if the host attempts
> to access that memory and is prevented by the hypervisor (either out
> of ignorance or out of malice), the host kernel wouldn't be able to
> recover, causing the whole system to crash.
> 
> guest_memfd() prevents such accesses by not allowing confidential
> memory to be mapped at the host to begin with. This works fine for us,
> but there's the issue of being able to share memory in place, which
> implies mapping it conditionally (among others that I've mentioned).
> 
> The approach we're taking with this proposal is to instead restrict
> the pinning of protected memory. If the host kernel can't pin the
> memory, then a misbehaving process can't trick the host into accessing
> it.

Got it, thanks. So once we pinned it, nobody else can pin it. But we can 
still map it?

> 
>>
>> 1) Who would get the exclusive PIN and for which reason? When would we
>>      pin, when would we unpin?
> 
> The exclusive pin would be acquired for private guest pages, in
> addition to a normal pin. It would be released when the private memory
> is released, or if the guest shares that memory.

Understood.

> 
>> 2) What would happen if there is already another PIN? Can we deal with
>>      speculative short-term PINs from GUP-fast that could introduce
>>      errors?
> 
> The exclusive pin would be rejected if there's any other pin
> (exclusive or normal). Normal pins would be rejected if there's an
> exclusive pin.

Makes sense, thanks.

> 
>> 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
>>      the future?
> 
> I can't :)

:)

> 
>> 4) Why are GUP pins special? How one would deal with other folio
>>      references (e.g., simply mmap the shmem file into a different
>>      process).
> 
> Other references would crash the userspace process, but the host
> kernel can handle them, and shouldn't cause the system to crash. The
> way things are now in Android/pKVM, a userspace process can crash the
> system as a whole.

Okay, so very Android/pKVM specific :/

> 
>> 5) Why you have to bother about anonymous pages at all (skimming over s
>>      some patches), when you really want to handle shmem differently only?
> 
> I'm not sure I understand the question. We use anonymous memory for pKVM.
> 

"we want to support grabbing shmem user pages instead of using KVM's 
guestmemfd" indicated to me that you primarily care about shmem with 
FOLL_EXCLUSIVE?

>>>> To that
>>>> end, we introduce the concept of "exclusive GUP pinning", which enforces
>>>> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
>>>> flag is set. This behavior doesn't affect FOLL_GET or any other folio
>>>> refcount operations that don't go through the FOLL_PIN path.
>>
>> So, FOLL_EXCLUSIVE would fail if there already is a PIN, but
>> !FOLL_EXCLUSIVE would succeed even if there is a single PIN via
>> FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins
>> that don't have FOLL_EXCLUSIVE set fail as well?
> 
> A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin
> (!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the
> PIN to end all pins!
> 
>>>>
>>>> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
>>>>
>>>
>>> Hi!
>>>
>>> Looking through this, I feel that some intangible threshold of "this is
>>> too much overloading of page->_refcount" has been crossed. This is a very
>>> specific feature, and it is using approximately one more bit than is
>>> really actually "available"...
>>
>> Agreed.
> 
> We are gating it behind a CONFIG flag :)

;)

> 
> Also, since pin is already overloading the refcount, having the
> exclusive pin there helps in ensuring atomic accesses and avoiding
> races.
> 
>>>
>>> If we need a bit in struct page/folio, is this really the only way? Willy
>>> is working towards getting us an entirely separate folio->pincount, I
>>> suppose that might take too long? Or not?
>>
>> Before talking about how to implement it, I think we first have to learn
>> whether that approach is what we want at all, and how it fits into the
>> bigger picture of that use case.
>>
>>>
>>> This feels like force-fitting a very specific feature (KVM/CoCo handling
>>> of shmem pages) into a more general mechanism that is running low on
>>> bits (gup/pup).
>>
>> Agreed.
>>
>>>
>>> Maybe a good topic for LPC!
>>
>> The KVM track has plenty of guest_memfd topics, might be a good fit
>> there. (or in the MM track, of course)
> 
> We are planning on submitting a proposal for LPC (see you in Vienna!) :)

Great!
David Hildenbrand June 19, 2024, 12:17 p.m. UTC | #8
> If the memory can't be accessed by the CPU then it shouldn't be mapped
> into a PTE in the first place. The fact you made userspace faults
> (only) work is nifty but still an ugly hack to get around the fact you
> shouldn't be mapping in the first place.
> 
> We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this
> scenario. "memory" that cannot be touched by the CPU but can still be
> specially accessed by enlightened components.
> 
> guest_memfd, and more broadly memfd based instead of VMA based, memory
> mapping in KVM is a similar outcome to DEVICE_PRIVATE.
> 
> I think you need to stay in the world of not mapping the memory, one
> way or another.

Fully agreed. Private memory shall not be mapped.
Jason Gunthorpe June 19, 2024, 12:42 p.m. UTC | #9
On Wed, Jun 19, 2024 at 01:01:14PM +0100, Fuad Tabba wrote:
> Hi Jason,
> 
> On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
> >
> > > To be honest, personally (speaking only for myself, not necessarily
> > > for Elliot and not for anyone else in the pKVM team), I still would
> > > prefer to use guest_memfd(). I think that having one solution for
> > > confidential computing that rules them all would be best. But we do
> > > need to be able to share memory in place, have a plan for supporting
> > > huge pages in the near future, and migration in the not-too-distant
> > > future.
> >
> > I think using a FD to control this special lifetime stuff is
> > dramatically better than trying to force the MM to do it with struct
> > page hacks.
> >
> > If you can't agree with the guest_memfd people on how to get there
> > then maybe you need a guest_memfd2 for this slightly different special
> > stuff instead of intruding on the core mm so much. (though that would
> > be sad)
> >
> > We really need to be thinking more about containing these special
> > things and not just sprinkling them everywhere.
> 
> I agree that we need to agree :) This discussion has been going on
> since before LPC last year, and the consensus from the guest_memfd()
> folks (if I understood it correctly) is that guest_memfd() is what it
> is: designed for a specific type of confidential computing, in the
> style of TDX and CCA perhaps, and that it cannot (or will not) perform
> the role of being a general solution for all confidential computing.

If you can't agree with guest_memfd, that just says you need Yet
Another FD, not mm hacks.

IMHO there is nothing intrinsically wrong with having the various FD
types being narrowly tailored to their use case. Not to say sharing
wouldn't be nice too.

Jason
Christoph Hellwig June 20, 2024, 4:11 a.m. UTC | #10
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
> If you can't agree with the guest_memfd people on how to get there
> then maybe you need a guest_memfd2 for this slightly different special
> stuff instead of intruding on the core mm so much. (though that would
> be sad)

Or we're just not going to support it at all.  It's not like supporting
this weird usage model is a must-have for Linux to start with.
Fuad Tabba June 20, 2024, 8:32 a.m. UTC | #11
Hi,

On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
> > If you can't agree with the guest_memfd people on how to get there
> > then maybe you need a guest_memfd2 for this slightly different special
> > stuff instead of intruding on the core mm so much. (though that would
> > be sad)
>
> Or we're just not going to support it at all.  It's not like supporting
> this weird usage model is a must-have for Linux to start with.

Sorry, but could you please clarify to me what usage model you're
referring to exactly, and why you think it's weird? It's just that we
have covered a few things in this thread, and to me it's not clear if
you're referring to protected VMs sharing memory, or being able to
(conditionally) map a VM's memory that's backed by guest_memfd(), or
if it's the Exclusive pin.

Thank you,
/fuad
Fuad Tabba June 20, 2024, 8:47 a.m. UTC | #12
Hi David,

On Wed, Jun 19, 2024 at 1:16 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.06.24 11:11, Fuad Tabba wrote:
> > Hi John and David,
> >
> > Thank you for your comments.
> >
> > On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> On 19.06.24 04:44, John Hubbard wrote:
> >>> On 6/18/24 5:05 PM, Elliot Berman wrote:
> >>>> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> >>>> grabbing shmem user pages instead of using KVM's guestmemfd. These
> >>>> hypervisors provide a different isolation model than the CoCo
> >>>> implementations from x86. KVM's guest_memfd is focused on providing
> >>>> memory that is more isolated than AVF requires. Some specific examples
> >>>> include ability to pre-load data onto guest-private pages, dynamically
> >>>> sharing/isolating guest pages without copy, and (future) migrating
> >>>> guest-private pages.  In sum of those differences after a discussion in
> >>>> [1] and at PUCK, we want to try to stick with existing shmem and extend
> >>>> GUP to support the isolation needs for arm64 pKVM and Gunyah.
> >>
> >> The main question really is, into which direction we want and can
> >> develop guest_memfd. At this point (after talking to Jason at LSF/MM), I
> >> wonder if guest_memfd should be our new target for guest memory, both
> >> shared and private. There are a bunch of issues to be sorted out though ...
> >>
> >> As there is interest from Red Hat into supporting hugetlb-style huge
> >> pages in confidential VMs for real-time workloads, and wasting memory is
> >> not really desired, I'm going to think some more about some of the
> >> challenges (shared+private in guest_memfd, mmap support, migration of
> >> !shared folios, hugetlb-like support, in-place shared<->private
> >> conversion, interaction with page pinning). Tricky.
> >>
> >> Ideally, we'd have one way to back guest memory for confidential VMs in
> >> the future.
> >
> > As you know, initially we went down the route of guest memory and
> > invested a lot of time on it, including presenting our proposal at LPC
> > last year. But there was resistance to expanding it to support more
> > than what was initially envisioned, e.g., sharing guest memory in
> > place migration, and maybe even huge pages, and its implications such
> > as being able to conditionally mmap guest memory.
>
> Yes, and I think we might have to revive that discussion, unfortunately.
> I started thinking about this, but did not reach a conclusion. Sharing
> my thoughts.
>
> The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not
> just for private memory should be:
>
> (1) Have private + shared parts backed by guest_memfd. Either the same,
>      or a fd pair.
> (2) Allow to mmap only the "shared" parts.
> (3) Allow in-place conversion between "shared" and "private" parts.

These three were covered (modulo bugs) in the guest_memfd() RFC I'd
sent a while back:

https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/

> (4) Allow migration of the "shared" parts.

We would really like that too, if they allow us :)

> A) Convert shared -> private?
> * Must not be GUP-pinned
> * Must not be mapped
> * Must not reside on ZONE_MOVABLE/MIGRATE_CMA
> * (must rule out any other problematic folio references that could
>     read/write memory, might be feasible for guest_memfd)
>
> B) Convert private -> shared?
> * Nothing to consider
>
> C) Map something?
> * Must not be private

A,B and C were covered (again, modulo bugs) in the RFC.

> For ordinary (small) pages, that might be feasible.
> (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not
> support them initially)
>
> The real fun begins once we want to support huge pages/large folios and
> can end up having a mixture of "private" and "shared" per huge page. But
> really, that's what we want in the end I think.

I agree.

> Unless we can teach the VM to not convert arbitrary physical memory
> ranges on a 4k basis to a mixture of private/shared ... but I've been
> told we don't want that. Hm.
>
>
> There are two big problems with that that I can see:
>
> 1) References/GUP-pins are per folio
>
> What if some shared part of the folio is pinned but another shared part
> that we want to convert to private is not? Core-mm will not provide the
> answer to that: the folio maybe pinned, that's it. *Disallowing* at
> least long-term GUP-pins might be an option.

Right.

> To get stuff into an IOMMU, maybe a per-fd interface could work, and
> guest_memfd would track itself which parts are currently "handed out",
> and with which "semantics" (shared vs. private).
>
> [IOMMU + private parts might require that either way? Because, if we
> dissallow mmap, how should that ever work with an IOMMU otherwise].

Not sure if IOMMU + private makes that much sense really, but I think
I might not really understand what you mean by this.

> 2) Tracking of mappings will likely soon be per folio.
>
> page_mapped() / folio_mapped() only tell us if any part of the folio is
> mapped. Of course, what always works is unmapping the whole thing, or
> walking the rmap to detect if a specific part is currently mapped.

This might complicate things a bit, but we could be conservative, at
least initially, in what we allow to be mapped.

>
> Then, there is the problem of getting huge pages into guest_memfd (using
> hugetlb reserves, but not using hugetlb), but that should be solvable.
>
>
> As raised in previous discussions, I think we should then allow the
> whole guest_memfd to be mapped, but simply SIGBUS/... when trying to
> access a private part. We would track private/shared internally, and
> track "handed out" pages to IOMMUs internally. FOLL_LONGTERM would be
> disallowed.
>
> But that's only the high level idea I had so far ... likely ignore way
> too many details.
>
> Is there broader interest to discuss that and there would be value in
> setting up a meeting and finally make progress with that?
>
> I recall quite some details with memory renting or so on pKVM ... and I
> have to refresh my memory on that.

I really would like to get to a place where we could investigate and
sort out all of these issues. It would be good to know, though, what we
might, in principle (and not due to any technical limitations), be
allowed to do and to expand guest_memfd() to do, and what is off the
table as a matter of principle.

> >
> > To be honest, personally (speaking only for myself, not necessarily
> > for Elliot and not for anyone else in the pKVM team), I still would
> > prefer to use guest_memfd(). I think that having one solution for
> > confidential computing that rules them all would be best. But we do
> > need to be able to share memory in place, have a plan for supporting
> > huge pages in the near future, and migration in the not-too-distant
> > future.
>
> Yes, huge pages are also of interest for RH. And memory-overconsumption
> due to having partially used huge pages in private/shared memory is not
> desired.
>
> >
> > We are currently shipping pKVM in Android as it is, warts and all.
> > We're also working on upstreaming the rest of it. Currently, this is
> > the main blocker for us to be able to upstream the rest (same probably
> > applies to Gunyah).
> >
> >> Can you comment on the bigger design goal here? In particular:
> >
> > At a high level: We want to prevent a misbehaving host process from
> > crashing the system when attempting to access (deliberately or
> > accidentally) protected guest memory. As it currently stands in pKVM
> > and Gunyah, the hypervisor does prevent the host from accessing
> > (private) guest memory. In certain cases though, if the host attempts
> > to access that memory and is prevented by the hypervisor (either out
> > of ignorance or out of malice), the host kernel wouldn't be able to
> > recover, causing the whole system to crash.
> >
> > guest_memfd() prevents such accesses by not allowing confidential
> > memory to be mapped at the host to begin with. This works fine for us,
> > but there's the issue of being able to share memory in place, which
> > implies mapping it conditionally (among others that I've mentioned).
> >
> > The approach we're taking with this proposal is to instead restrict
> > the pinning of protected memory. If the host kernel can't pin the
> > memory, then a misbehaving process can't trick the host into accessing
> > it.
>
> Got it, thanks. So once we pinned it, nobody else can pin it. But we can
> still map it?

This proposal (the exclusive gup) places no limitations on mapping,
only on pinning. If private memory is mapped and then accessed, then
the worst thing that could happen is the userspace process gets
killed, potentially taking down the guest with it (if that process
happens to be the VMM for example).

The reason why we care about pinning is to ensure that the host kernel
doesn't access protected memory, which would crash the system.

> >
> >>
> >> 1) Who would get the exclusive PIN and for which reason? When would we
> >>      pin, when would we unpin?
> >
> > The exclusive pin would be acquired for private guest pages, in
> > addition to a normal pin. It would be released when the private memory
> > is released, or if the guest shares that memory.
>
> Understood.
>
> >
> >> 2) What would happen if there is already another PIN? Can we deal with
> >>      speculative short-term PINs from GUP-fast that could introduce
> >>      errors?
> >
> > The exclusive pin would be rejected if there's any other pin
> > (exclusive or normal). Normal pins would be rejected if there's an
> > exclusive pin.
>
> Makes sense, thanks.
>
> >
> >> 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
> >>      the future?
> >
> > I can't :)
>
> :)
>
> >
> >> 4) Why are GUP pins special? How one would deal with other folio
> >>      references (e.g., simply mmap the shmem file into a different
> >>      process).
> >
> > Other references would crash the userspace process, but the host
> > kernel can handle them, and shouldn't cause the system to crash. The
> > way things are now in Android/pKVM, a userspace process can crash the
> > system as a whole.
>
> Okay, so very Android/pKVM specific :/

Gunyah too.

> >
> >> 5) Why you have to bother about anonymous pages at all (skimming over s
> >>      some patches), when you really want to handle shmem differently only?
> >
> > I'm not sure I understand the question. We use anonymous memory for pKVM.
> >
>
> "we want to support grabbing shmem user pages instead of using KVM's
> guestmemfd" indicated to me that you primarily care about shmem with
> FOLL_EXCLUSIVE?

Right, maybe we should have clarified this better when we sent out this series.

This patch series is meant as an alternative to guest_memfd(), and not
as something to be used in conjunction with it. This came about from
the discussions we had with you and others back when Elliot and I sent
our respective RFCs, and found that there was resistance to adding
guest_memfd() support that would make it practical to use with pKVM or
Gunyah.

https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/

Thanks again for your ideas and comments!
/fuad

> >>>> To that
> >>>> end, we introduce the concept of "exclusive GUP pinning", which enforces
> >>>> that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
> >>>> flag is set. This behavior doesn't affect FOLL_GET or any other folio
> >>>> refcount operations that don't go through the FOLL_PIN path.
> >>
> >> So, FOLL_EXCLUSIVE would fail if there already is a PIN, but
> >> !FOLL_EXCLUSIVE would succeed even if there is a single PIN via
> >> FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins
> >> that don't have FOLL_EXCLUSIVE set fail as well?
> >
> > A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin
> > (!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the
> > PIN to end all pins!
> >
> >>>>
> >>>> [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
> >>>>
> >>>
> >>> Hi!
> >>>
> >>> Looking through this, I feel that some intangible threshold of "this is
> >>> too much overloading of page->_refcount" has been crossed. This is a very
> >>> specific feature, and it is using approximately one more bit than is
> >>> really actually "available"...
> >>
> >> Agreed.
> >
> > We are gating it behind a CONFIG flag :)
>
> ;)
>
> >
> > Also, since pin is already overloading the refcount, having the
> > exclusive pin there helps in ensuring atomic accesses and avoiding
> > races.
> >
> >>>
> >>> If we need a bit in struct page/folio, is this really the only way? Willy
> >>> is working towards getting us an entirely separate folio->pincount, I
> >>> suppose that might take too long? Or not?
> >>
> >> Before talking about how to implement it, I think we first have to learn
> >> whether that approach is what we want at all, and how it fits into the
> >> bigger picture of that use case.
> >>
> >>>
> >>> This feels like force-fitting a very specific feature (KVM/CoCo handling
> >>> of shmem pages) into a more general mechanism that is running low on
> >>> bits (gup/pup).
> >>
> >> Agreed.
> >>
> >>>
> >>> Maybe a good topic for LPC!
> >>
> >> The KVM track has plenty of guest_memfd topics, might be a good fit
> >> there. (or in the MM track, of course)
> >
> > We are planning on submitting a proposal for LPC (see you in Vienna!) :)
>
> Great!
>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand June 20, 2024, 9 a.m. UTC | #13
>> Yes, and I think we might have to revive that discussion, unfortunately.
>> I started thinking about this, but did not reach a conclusion. Sharing
>> my thoughts.
>>
>> The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not
>> just for private memory should be:
>>
>> (1) Have private + shared parts backed by guest_memfd. Either the same,
>>       or a fd pair.
>> (2) Allow to mmap only the "shared" parts.
>> (3) Allow in-place conversion between "shared" and "private" parts.
> 
> These three were covered (modulo bugs) in the guest_memfd() RFC I'd
> sent a while back:
> 
> https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/

I remember there was a catch to it (either around mmap or pinning 
detection -- or around support for huge pages in the future; maybe these 
count as BUGs :) ).

I should probably go back and revisit the whole thing, I was only CCed 
on some part of it back then.

> 
>> (4) Allow migration of the "shared" parts.
> 
> We would really like that too, if they allow us :)
> 
>> A) Convert shared -> private?
>> * Must not be GUP-pinned
>> * Must not be mapped
>> * Must not reside on ZONE_MOVABLE/MIGRATE_CMA
>> * (must rule out any other problematic folio references that could
>>      read/write memory, might be feasible for guest_memfd)
>>
>> B) Convert private -> shared?
>> * Nothing to consider
>>
>> C) Map something?
>> * Must not be private
> 
> A,B and C were covered (again, modulo bugs) in the RFC.
> 
>> For ordinary (small) pages, that might be feasible.
>> (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not
>> support them initially)
>>
>> The real fun begins once we want to support huge pages/large folios and
>> can end up having a mixture of "private" and "shared" per huge page. But
>> really, that's what we want in the end I think.
> 
> I agree.
> 
>> Unless we can teach the VM to not convert arbitrary physical memory
>> ranges on a 4k basis to a mixture of private/shared ... but I've been
>> told we don't want that. Hm.
>>
>>
>> There are two big problems with that that I can see:
>>
>> 1) References/GUP-pins are per folio
>>
>> What if some shared part of the folio is pinned but another shared part
>> that we want to convert to private is not? Core-mm will not provide the
>> answer to that: the folio maybe pinned, that's it. *Disallowing* at
>> least long-term GUP-pins might be an option.
> 
> Right.
> 
>> To get stuff into an IOMMU, maybe a per-fd interface could work, and
>> guest_memfd would track itself which parts are currently "handed out",
>> and with which "semantics" (shared vs. private).
>>
>> [IOMMU + private parts might require that either way? Because, if we
>> dissallow mmap, how should that ever work with an IOMMU otherwise].
> 
> Not sure if IOMMU + private makes that much sense really, but I think
> I might not really understand what you mean by this.

A device might be able to access private memory. In the TDX world, this 
would mean that a device "speaks" encrypted memory.

At the same time, a device might be able to access shared memory. Maybe 
devices can do both?

What to do when converting between private and shared? I think it 
depends on various factors (e.g., device capabilities).

[...]

>> I recall quite some details with memory renting or so on pKVM ... and I
>> have to refresh my memory on that.
> 
> I really would like to get to a place where we could investigate and
> sort out all of these issues. It would be good to know though, what,
> in principle (and not due to any technical limitations), we might be
> allowed to do and expand guest_memfd() to do, and what out of
> principle is off the table.

As Jason said, maybe we need a revised model that can handle
[...] private+shared properly.
Mostafa Saleh June 20, 2024, 1:08 p.m. UTC | #14
Hi David,

On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
> Hi,
> 
> On 19.06.24 04:44, John Hubbard wrote:
> > On 6/18/24 5:05 PM, Elliot Berman wrote:
> > > In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> > > grabbing shmem user pages instead of using KVM's guestmemfd. These
> > > hypervisors provide a different isolation model than the CoCo
> > > implementations from x86. KVM's guest_memfd is focused on providing
> > > memory that is more isolated than AVF requires. Some specific examples
> > > include ability to pre-load data onto guest-private pages, dynamically
> > > sharing/isolating guest pages without copy, and (future) migrating
> > > guest-private pages.  In sum of those differences after a discussion in
> > > [1] and at PUCK, we want to try to stick with existing shmem and extend
> > > GUP to support the isolation needs for arm64 pKVM and Gunyah.
> 
> The main question really is, into which direction we want and can develop
> guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if
> guest_memfd should be our new target for guest memory, both shared and
> private. There are a bunch of issues to be sorted out though ...
> 
> As there is interest from Red Hat into supporting hugetlb-style huge pages
> in confidential VMs for real-time workloads, and wasting memory is not
> really desired, I'm going to think some more about some of the challenges
> (shared+private in guest_memfd, mmap support, migration of !shared folios,
> hugetlb-like support, in-place shared<->private conversion, interaction with
> page pinning). Tricky.
> 
> Ideally, we'd have one way to back guest memory for confidential VMs in the
> future.
> 
> 
> Can you comment on the bigger design goal here? In particular:
> 
> 1) Who would get the exclusive PIN and for which reason? When would we
>    pin, when would we unpin?
> 
> 2) What would happen if there is already another PIN? Can we deal with
>    speculative short-term PINs from GUP-fast that could introduce
>    errors?
> 
> 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
>    the future?

Can you please clarify more about the IOMMU case?

pKVM has no merged upstream IOMMU support at the moment, although
there was an RFC a while ago [1], and a v2 should follow soon.

In the patches, KVM (running in EL2) will manage the IOMMUs, including
the page tables, and all pages used for that are allocated from the
kernel.

These patches don't support IOMMUs for guests. However, I don't see
why that would be different from the CPU: once a page is pinned,
it can be owned by a guest, and that would be reflected in the
hypervisor's tracking, the CPU stage-2, and the IOMMU page tables as well.

[1] https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/

Thanks,
Mostafa

> 
> 4) Why are GUP pins special? How one would deal with other folio
>    references (e.g., simply mmap the shmem file into a different
>    process).
> 
> 5) Why you have to bother about anonymous pages at all (skimming over
>    some patches), when you really want to handle shmem differently only?
> 
> > > To that
> > > end, we introduce the concept of "exclusive GUP pinning", which enforces
> > > that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE
> > > flag is set. This behavior doesn't affect FOLL_GET or any other folio
> > > refcount operations that don't go through the FOLL_PIN path.
> 
> So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE
> would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the
> single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set
> fail as well?
> 
> > > 
> > > [1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
> > > 
> > 
> > Hi!
> > 
> > Looking through this, I feel that some intangible threshold of "this is
> > too much overloading of page->_refcount" has been crossed. This is a very
> > specific feature, and it is using approximately one more bit than is
> > really actually "available"...
> 
> Agreed.
> 
> > 
> > If we need a bit in struct page/folio, is this really the only way? Willy
> > is working towards getting us an entirely separate folio->pincount, I
> > suppose that might take too long? Or not?
> 
> Before talking about how to implement it, I think we first have to learn
> whether that approach is what we want at all, and how it fits into the
> bigger picture of that use case.
> 
> > 
> > This feels like force-fitting a very specific feature (KVM/CoCo handling
> > of shmem pages) into a more general mechanism that is running low on
> > bits (gup/pup).
> 
> Agreed.
> 
> > 
> > Maybe a good topic for LPC!
> 
> The KVM track has plenty of guest_memfd topics, might be a good fit there.
> (or in the MM track, of course)
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Jason Gunthorpe June 20, 2024, 1:55 p.m. UTC | #15
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
> > > If you can't agree with the guest_memfd people on how to get there
> > > then maybe you need a guest_memfd2 for this slightly different special
> > > stuff instead of intruding on the core mm so much. (though that would
> > > be sad)
> >
> > Or we're just not going to support it at all.  It's not like supporting
> > this weird usage model is a must-have for Linux to start with.
> 
> Sorry, but could you please clarify to me what usage model you're
> referring to exactly, and why you think it's weird? It's just that we
> have covered a few things in this thread, and to me it's not clear if
> you're referring to protected VMs sharing memory, or being able to
> (conditionally) map a VM's memory that's backed by guest_memfd(), or
> if it's the Exclusive pin.

Personally I think mapping memory under guest_memfd is pretty weird.

I don't really understand why you end up with something different than
normal CC. Normal CC has memory that the VMM can access and memory it
cannot access. guest_memory is supposed to hold the memory the VMM cannot
reach, right?

So how does normal CC handle memory switching between private and
shared and why doesn't that work for pKVM? I think the normal CC path
effectively discards the memory content on these switches and is
slow. Are you trying to make the switch content preserving and faster?

If yes, why? What is wrong with the normal CC model of slow and
non-preserving shared memory? Are you trying to speed up IO in these
VMs by dynamically sharing pages instead of SWIOTLB?

Maybe this was all explained, but I reviewed your presentation and the
cover letter for the guest_memfd patches and I still don't see the why
in all of this.

Jason
David Hildenbrand June 20, 2024, 2:01 p.m. UTC | #16
On 20.06.24 15:55, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
>> Hi,
>>
>> On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
>>>> If you can't agree with the guest_memfd people on how to get there
>>>> then maybe you need a guest_memfd2 for this slightly different special
>>>> stuff instead of intruding on the core mm so much. (though that would
>>>> be sad)
>>>
>>> Or we're just not going to support it at all.  It's not like supporting
>>> this weird usage model is a must-have for Linux to start with.
>>
>> Sorry, but could you please clarify to me what usage model you're
>> referring to exactly, and why you think it's weird? It's just that we
>> have covered a few things in this thread, and to me it's not clear if
>> you're referring to protected VMs sharing memory, or being able to
>> (conditionally) map a VM's memory that's backed by guest_memfd(), or
>> if it's the Exclusive pin.
> 
> Personally I think mapping memory under guest_memfd is pretty weird.
> 
> I don't really understand why you end up with something different than
> normal CC. Normal CC has memory that the VMM can access and memory it
> cannot access. guest_memory is supposed to hold the memory the VMM cannot
> reach, right?
> 
> So how does normal CC handle memory switching between private and
> shared and why doesn't that work for pKVM? I think the normal CC path
> effectively discards the memory content on these switches and is
> slow. Are you trying to make the switch content preserving and faster?
> 
> If yes, why? What is wrong with the normal CC model of slow and
> non-preserving shared memory?

I'll leave the !huge page part to Fuad.

Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is 
shared, now the VM requests to make one subpage private. How to handle 
that without eventually running into a double memory-allocation? (in the 
worst case, allocating a 1GiB huge page for shared and for private memory).

In the world of RT, you want your VM to be consistently backed by 
huge/gigantic mappings, not some weird mixture -- so I've been told by 
our RT team.

(there are more issues with huge pages in the style of hugetlb, where we 
actually want to preallocate all pages and not rely on dynamic 
allocation at runtime when we convert back and forth between shared and 
private)
Jason Gunthorpe June 20, 2024, 2:01 p.m. UTC | #17
On Thu, Jun 20, 2024 at 11:00:45AM +0200, David Hildenbrand wrote:
> > Not sure if IOMMU + private makes that much sense really, but I think
> > I might not really understand what you mean by this.
> 
> A device might be able to access private memory. In the TDX world, this
> would mean that a device "speaks" encrypted memory.
> 
> At the same time, a device might be able to access shared memory. Maybe
> devices can do both?
> 
> What do do when converting between private and shared? I think it depends on
> various factors (e.g., device capabilities).

The whole thing is complicated once you put the pages into the VMA. We
have hmm_range_fault and IOMMU SVA paths that both obtain the pfns
without any of the checks here.

(and I suspect many of the target HWs for pKVM have or will have
SVA-capable GPUs, so SVA is an attack vector worth considering)

What happens if someone does DMA to these PFNs? It seems like nothing
good in either scenario..

Really the only way to do it properly is to keep the memory unmapped,
that must be the starting point to any solution. Denying GUP is just
an ugly hack.

Jason
David Hildenbrand June 20, 2024, 2:14 p.m. UTC | #18
On 20.06.24 15:08, Mostafa Saleh wrote:
> Hi David,
> 
> On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> On 19.06.24 04:44, John Hubbard wrote:
>>> On 6/18/24 5:05 PM, Elliot Berman wrote:
>>>> In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
>>>> grabbing shmem user pages instead of using KVM's guestmemfd. These
>>>> hypervisors provide a different isolation model than the CoCo
>>>> implementations from x86. KVM's guest_memfd is focused on providing
>>>> memory that is more isolated than AVF requires. Some specific examples
>>>> include ability to pre-load data onto guest-private pages, dynamically
>>>> sharing/isolating guest pages without copy, and (future) migrating
>>>> guest-private pages.  In sum of those differences after a discussion in
>>>> [1] and at PUCK, we want to try to stick with existing shmem and extend
>>>> GUP to support the isolation needs for arm64 pKVM and Gunyah.
>>
>> The main question really is, into which direction we want and can develop
>> guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if
>> guest_memfd should be our new target for guest memory, both shared and
>> private. There are a bunch of issues to be sorted out though ...
>>
>> As there is interest from Red Hat into supporting hugetlb-style huge pages
>> in confidential VMs for real-time workloads, and wasting memory is not
>> really desired, I'm going to think some more about some of the challenges
>> (shared+private in guest_memfd, mmap support, migration of !shared folios,
>> hugetlb-like support, in-place shared<->private conversion, interaction with
>> page pinning). Tricky.
>>
>> Ideally, we'd have one way to back guest memory for confidential VMs in the
>> future.
>>
>>
>> Can you comment on the bigger design goal here? In particular:
>>
>> 1) Who would get the exclusive PIN and for which reason? When would we
>>     pin, when would we unpin?
>>
>> 2) What would happen if there is already another PIN? Can we deal with
>>     speculative short-term PINs from GUP-fast that could introduce
>>     errors?
>>
>> 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
>>     the future?
> 
> Can you please clarify more about the IOMMU case?
> 
> pKVM has no merged upstream IOMMU support at the moment, although
> there was an RFC a while a go [1], also there would be a v2 soon.
> 
> In the patches KVM (running in EL2) will manage the IOMMUs including
> the page tables and all pages used in that are allocated from the
> kernel.
> 
> These patches don't support IOMMUs for guests. However, I don't see
> why would that be different from the CPU? as once the page is pinned
> it can be owned by a guest and that would be reflected in the
> hypervisor tracking, the CPU stage-2 and IOMMU page tables as well.

So this is my thinking, it might be flawed:

In the "normal" world (e.g., vfio), we FOLL_PIN|FOLL_LONGTERM the pages 
to be accessible by a dedicated device. We look them up in the page 
tables to pin them, then we can map them into the IOMMU.
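
As a rough sketch of that flow (simplified, not the actual VFIO type1 code;
signatures differ between kernel versions, e.g. iommu_map() only recently
gained its gfp argument):

#include <linux/mm.h>
#include <linux/iommu.h>

/*
 * Sketch only: long-term pin one user page via GUP, then map it into an
 * IOMMU domain the way a VFIO-like driver would. Batching, accounting
 * and teardown are omitted.
 */
static int pin_and_map_one(struct iommu_domain *domain,
			   unsigned long uaddr, unsigned long iova)
{
	struct page *page;
	int pinned, ret;

	/* FOLL_LONGTERM: the pin may outlive any syscall (device DMA). */
	pinned = pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM,
				     &page);
	if (pinned != 1)
		return pinned < 0 ? pinned : -EFAULT;

	ret = iommu_map(domain, iova, page_to_phys(page), PAGE_SIZE,
			IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	if (ret)
		unpin_user_pages(&page, 1);

	return ret;
}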

Devices that cannot speak "private memory" should only access shared 
memory. So we must not have "private memory" mapped into their IOMMU.

Devices that can speak "private memory" may either access shared or 
private memory. So we may have "private memory" mapped into their IOMMU.


What I see (again, I might be just wrong):

1) How would the device be able to grab/access "private memory", if not
    via the user page tables?

2) How would we be able to convert shared -> private, if there is a
    longterm pin from that IOMMU? We must dynamically unmap it from the
    IOMMU.

I assume when you're saying "In the patches KVM (running in EL2) will 
manage the IOMMUs  including the page tables", this is easily solved by 
not relying on pinning: KVM just knows what to update and where. (which 
is a very different model than what VFIO does)

Thanks!
Jason Gunthorpe June 20, 2024, 2:29 p.m. UTC | #19
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> On 20.06.24 15:55, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
> > > Hi,
> > > 
> > > On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > 
> > > > On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
> > > > > If you can't agree with the guest_memfd people on how to get there
> > > > > then maybe you need a guest_memfd2 for this slightly different special
> > > > > stuff instead of intruding on the core mm so much. (though that would
> > > > > be sad)
> > > > 
> > > > Or we're just not going to support it at all.  It's not like supporting
> > > > this weird usage model is a must-have for Linux to start with.
> > > 
> > > Sorry, but could you please clarify to me what usage model you're
> > > referring to exactly, and why you think it's weird? It's just that we
> > > have covered a few things in this thread, and to me it's not clear if
> > > you're referring to protected VMs sharing memory, or being able to
> > > (conditionally) map a VM's memory that's backed by guest_memfd(), or
> > > if it's the Exclusive pin.
> > 
> > Personally I think mapping memory under guest_memfd is pretty weird.
> > 
> > I don't really understand why you end up with something different than
> > normal CC. Normal CC has memory that the VMM can access and memory it
> > cannot access. guest_memory is supposed to hold the memory the VMM cannot
> > reach, right?
> > 
> > So how does normal CC handle memory switching between private and
> > shared and why doesn't that work for pKVM? I think the normal CC path
> > effectively discards the memory content on these switches and is
> > slow. Are you trying to make the switch content preserving and faster?
> > 
> > If yes, why? What is wrong with the normal CC model of slow and
> > non-preserving shared memory?
> 
> I'll leave the !huge page part to Fuad.
> 
> Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
> now the VM requests to make one subpage private. 

I think the general CC model has the shared/private setup earlier in
the VM lifecycle, with large runs of contiguous pages. It would only
become a problem if you intend to do high-rate, fine-granularity
shared/private switching. Which is why I am asking what the actual
"why" is here.

> How to handle that without eventually running into a double
> memory-allocation? (in the worst case, allocating a 1GiB huge page
> for shared and for private memory).

I expect you'd take the linear range of 1G of PFNs and fragment it
into three ranges private/shared/private that span the same 1G.

When you construct a page table (i.e. an S2) that holds these three
ranges and has permission to access all the memory, you want the page
table to automatically join them back together into a 1GB entry.

When you construct a page table that has only access to the shared,
then you'd only install the shared hole at its natural best size.

So, I think there are two challenges - how to build an allocator and
uAPI to manage this sort of stuff so you can keep track of any
fractured pfns and ensure things remain in physical order.

Then how to re-consolidate this for the KVM side of the world.

guest_memfd, or something like it, is just really a good answer. You
have it obtain the huge folio, and keep track on its own which sub
pages can be mapped to a VMA because they are shared. KVM will obtain
the PFNs directly from the fd and KVM will not see the shared
holes. This means your S2's can be trivially constructed correctly.

No need to double allocate..
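
As a toy illustration of "install at the natural best size" (purely
schematic, not KVM's actual stage-2 walker; it just assumes a run of
physically contiguous, same-attribute memory):

#include <linux/sizes.h>
#include <linux/kernel.h>

/*
 * Schematic only: given run_len bytes of contiguous, same-attribute
 * memory at guest address ipa backed by host physical address phys,
 * return the largest block size a stage-2 mapper could install there.
 */
static unsigned long best_block_size(unsigned long ipa, phys_addr_t phys,
				     unsigned long run_len)
{
	static const unsigned long sizes[] = { SZ_1G, SZ_2M, SZ_4K };
	int i;

	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		if (IS_ALIGNED(ipa, sizes[i]) &&
		    IS_ALIGNED(phys, sizes[i]) &&
		    run_len >= sizes[i])
			return sizes[i];
	}
	return 0;
}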

I'm kind of surprised the CC folks don't want the same thing for
exactly the same reason. It is much easier to recover the huge
mappings for the S2 in the presence of shared holes if you track it
this way. Even CC will have this problem, to some degree, too.

> In the world of RT, you want your VM to be consistently backed by
> huge/gigantic mappings, not some weird mixture -- so I've been told by our
> RT team.

Yes, even outside RT, if you want good IO performance for DMA you must
also have high IOTLB hit rates, especially with nesting.

Jason
Jason Gunthorpe June 20, 2024, 2:34 p.m. UTC | #20
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:

> 1) How would the device be able to grab/access "private memory", if not
>    via the user page tables?

The approaches I'm aware of require the secure world to own the IOMMU
and generate the IOMMU page tables. So we will not use a GUP approach
with VFIO today as the kernel will not have any reason to generate a
page table in the first place. Instead we will say "this PCI device
translates through the secure world" and walk away.

The page table population would have to be done through the KVM path.

> I assume when you're saying "In the patches KVM (running in EL2) will manage
> the IOMMUs  including the page tables", this is easily solved by not relying
> on pinning: KVM just knows what to update and where. (which is a very
> different model than what VFIO does)

This is my read as well for pKVM.

IMHO pKVM is just a version of CC that doesn't require some of the HW
features that make the isolation stronger, and that ignores the
attestation/strong confidentiality part.

Jason
David Hildenbrand June 20, 2024, 2:45 p.m. UTC | #21
On 20.06.24 16:29, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
>> On 20.06.24 15:55, Jason Gunthorpe wrote:
>>> On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
>>>> Hi,
>>>>
>>>> On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>>>>>
>>>>> On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
>>>>>> If you can't agree with the guest_memfd people on how to get there
>>>>>> then maybe you need a guest_memfd2 for this slightly different special
>>>>>> stuff instead of intruding on the core mm so much. (though that would
>>>>>> be sad)
>>>>>
>>>>> Or we're just not going to support it at all.  It's not like supporting
>>>>> this weird usage model is a must-have for Linux to start with.
>>>>
>>>> Sorry, but could you please clarify to me what usage model you're
>>>> referring to exactly, and why you think it's weird? It's just that we
>>>> have covered a few things in this thread, and to me it's not clear if
>>>> you're referring to protected VMs sharing memory, or being able to
>>>> (conditionally) map a VM's memory that's backed by guest_memfd(), or
>>>> if it's the Exclusive pin.
>>>
>>> Personally I think mapping memory under guest_memfd is pretty weird.
>>>
>>> I don't really understand why you end up with something different than
>>> normal CC. Normal CC has memory that the VMM can access and memory it
>>> cannot access. guest_memory is supposed to hold the memory the VMM cannot
>>> reach, right?
>>>
>>> So how does normal CC handle memory switching between private and
>>> shared and why doesn't that work for pKVM? I think the normal CC path
>>> effectively discards the memory content on these switches and is
>>> slow. Are you trying to make the switch content preserving and faster?
>>>
>>> If yes, why? What is wrong with the normal CC model of slow and
>>> non-preserving shared memory?
>>
>> I'll leave the !huge page part to Fuad.
>>
>> Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
>> now the VM requests to make one subpage private.
> 
> I think the general CC model has the shared/private setup earlier on
> the VM lifecycle with large runs of contiguous pages. It would only
> become a problem if you intend to to high rate fine granual
> shared/private switching. Which is why I am asking what the actual
> "why" is here.

I am not an expert on that, but I remember that the way memory 
shared<->private conversion happens can heavily depend on the VM use 
case, and that under pKVM we might see more frequent conversion, without 
even going to user space.

> 
>> How to handle that without eventually running into a double
>> memory-allocation? (in the worst case, allocating a 1GiB huge page
>> for shared and for private memory).
> 
> I expect you'd take the linear range of 1G of PFNs and fragment it
> into three ranges private/shared/private that span the same 1G.
> 
> When you construct a page table (ie a S2) that holds these three
> ranges and has permission to access all the memory you want the page
> table to automatically join them back together into 1GB entry.
> 
> When you construct a page table that has only access to the shared,
> then you'd only install the shared hole at its natural best size.
> 
> So, I think there are two challenges - how to build an allocator and
> uAPI to manage this sort of stuff so you can keep track of any
> fractured pfns and ensure things remain in physical order.
> 
> Then how to re-consolidate this for the KVM side of the world.

Exactly!

> 
> guest_memfd, or something like it, is just really a good answer. You
> have it obtain the huge folio, and keep track on its own which sub
> pages can be mapped to a VMA because they are shared. KVM will obtain
> the PFNs directly from the fd and KVM will not see the shared
> holes. This means your S2's can be trivially constructed correctly.
> 
> No need to double allocate..

Yes, that's why my thinking so far was:

Let guest_memfd (or something like that) consume huge pages (somehow, 
let it access the hugetlb reserves). Preallocate that memory once, as 
the VM starts up: just like we do with hugetlb in VMs.

Let KVM track which parts are shared/private, and if required, let it 
map only the shared parts to user space. KVM has all information to make 
these decisions.

If we could disallow pinning any shared pages, that would make life a 
lot easier, but I think there were reasons why we might require it. To 
convert shared->private, simply unmap that folio (only the shared 
parts could possibly be mapped) from all user page tables.

Of course, there might be alternatives, and I'll be happy to learn about 
them. The allocator part would be fairly easy, and the uAPI part would 
be comparably easy. So far the theory :)

> 
> I'm kind of surprised the CC folks don't want the same thing for
> exactly the same reason. It is much easier to recover the huge
> mappings for the S2 in the presence of shared holes if you track it
> this way. Even CC will have this problem, to some degree, too.

Precisely! RH (and therefore, me) is primarily interested in existing 
guest_memfd users at this point ("CC"), and I don't see an easy way to 
get that running with huge pages in the existing model reasonably well ...
Sean Christopherson June 20, 2024, 3:37 p.m. UTC | #22
On Wed, Jun 19, 2024, Fuad Tabba wrote:
> Hi Jason,
> 
> On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
> >
> > > To be honest, personally (speaking only for myself, not necessarily
> > > for Elliot and not for anyone else in the pKVM team), I still would
> > > prefer to use guest_memfd(). I think that having one solution for
> > > confidential computing that rules them all would be best. But we do
> > > need to be able to share memory in place, have a plan for supporting
> > > huge pages in the near future, and migration in the not-too-distant
> > > future.
> >
> > I think using a FD to control this special lifetime stuff is
> > dramatically better than trying to force the MM to do it with struct
> > page hacks.
> >
> > If you can't agree with the guest_memfd people on how to get there
> > then maybe you need a guest_memfd2 for this slightly different special
> > stuff instead of intruding on the core mm so much. (though that would
> > be sad)
> >
> > We really need to be thinking more about containing these special
> > things and not just sprinkling them everywhere.
> 
> I agree that we need to agree :) This discussion has been going on
> since before LPC last year, and the consensus from the guest_memfd()
> folks (if I understood it correctly) is that guest_memfd() is what it
> is: designed for a specific type of confidential computing, in the
> style of TDX and CCA perhaps, and that it cannot (or will not) perform
> the role of being a general solution for all confidential computing.

That isn't remotely accurate.  I have stated multiple times that I want guest_memfd
to be a vehicle for all VM types, i.e. not just CoCo VMs, and most definitely not
just TDX/SNP/CCA VMs.

What I am staunchly against is piling features onto guest_memfd that will cause
it to eventually become virtually indistinguishable from any other file-based
backing store.  I.e. while I want to make guest_memfd usable for all VM *types*,
making guest_memfd the preferred backing store for all *VMs* and use cases is
very much a non-goal.

From an earlier conversation[1]:

 : In other words, ditch the complexity for features that are well served by existing
 : general purpose solutions, so that guest_memfd can take on a bit of complexity to
 : serve use cases that are unique to KVM guests, without becoming an unmaintainble
 : mess due to cross-products.

> > > Also, since pin is already overloading the refcount, having the
> > > exclusive pin there helps in ensuring atomic accesses and avoiding
> > > races.
> >
> > Yeah, but every time someone does this and then links it to a uAPI it
> > becomes utterly baked in concrete for the MM forever.
> 
> I agree. But if we can't modify guest_memfd() to fit our needs (pKVM,
> Gunyah), then we don't really have that many other options.

What _are_ your needs?  There are multiple unanswered questions from our last
conversation[2].  And by "needs" I don't mean "what changes do you want to make
to guest_memfd?", I mean "what are the use cases, patterns, and scenarios that
you want to support?".

 : What's "hypervisor-assisted page migration"?  More specifically, what's the
 : mechanism that drives it?

 : Do you happen to have a list of exactly what you mean by "normal mm stuff"?  I
 : am not at all opposed to supporting .mmap(), because long term I also want to
 : use guest_memfd for non-CoCo VMs.  But I want to be very conservative with respect
 : to what is allowed for guest_memfd.   E.g. host userspace can map guest_memfd,
 : and do operations that are directly related to its mapping, but that's about it.

That distinction matters, because as I have stated in that thread, I am not
opposed to page migration itself:

 : I am not opposed to page migration itself, what I am opposed to is adding deep
 : integration with core MM to do some of the fancy/complex things that lead to page
 : migration.

I am generally aware of the core pKVM use cases, but AFAIK I haven't seen a
complete picture of everything you want to do, and _why_.

E.g. if one of your requirements is that guest memory is managed by core-mm the
same as all other memory in the system, then yeah, guest_memfd isn't for you.
Integrating guest_memfd deeply into core-mm simply isn't realistic, at least not
without *massive* changes to core-mm, as the whole point of guest_memfd is that
it is guest-first memory, i.e. it is NOT memory that is managed by core-mm (primary
MMU) and optionally mapped into KVM (secondary MMU).

Again from that thread, one of the most important aspects of guest_memfd is that
VMAs are not required.  Stating the obvious, the lack of VMAs makes it really hard to drive
swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.

 : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
 : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
 : it's not subject to VMA protections, isn't restricted to host mapping size, etc.

[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
[2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
Sean Christopherson June 20, 2024, 4:04 p.m. UTC | #23
On Thu, Jun 20, 2024, David Hildenbrand wrote:
> On 20.06.24 16:29, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> > > On 20.06.24 15:55, Jason Gunthorpe wrote:
> > > > On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
> > > Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
> > > now the VM requests to make one subpage private.
> > 
> > I think the general CC model has the shared/private setup earlier on
> > the VM lifecycle with large runs of contiguous pages. It would only
> > become a problem if you intend to to high rate fine granual
> > shared/private switching. Which is why I am asking what the actual
> > "why" is here.
> 
> I am not an expert on that, but I remember that the way memory
> shared<->private conversion happens can heavily depend on the VM use case,

Yeah, I forget the details, but there are scenarios where the guest will share
(and unshare) memory at 4KiB (give or take) granularity, at runtime.  There's an
RFC[*] for making SWIOTLB operate at 2MiB that is driven by the same underlying problems.

But even if Linux-as-a-guest were better behaved, we (the host) can't prevent the
guest from doing suboptimal conversions.  In practice, killing the guest or
refusing to convert memory isn't an option, i.e. we can't completely push the
problem into the guest.

https://lore.kernel.org/all/20240112055251.36101-1-vannapurve@google.com

> and that under pKVM we might see more frequent conversion, without even
> going to user space.
> 
> > 
> > > How to handle that without eventually running into a double
> > > memory-allocation? (in the worst case, allocating a 1GiB huge page
> > > for shared and for private memory).
> > 
> > I expect you'd take the linear range of 1G of PFNs and fragment it
> > into three ranges private/shared/private that span the same 1G.
> > 
> > When you construct a page table (ie a S2) that holds these three
> > ranges and has permission to access all the memory you want the page
> > table to automatically join them back together into 1GB entry.
> > 
> > When you construct a page table that has only access to the shared,
> > then you'd only install the shared hole at its natural best size.
> > 
> > So, I think there are two challenges - how to build an allocator and
> > uAPI to manage this sort of stuff so you can keep track of any
> > fractured pfns and ensure things remain in physical order.
> > 
> > Then how to re-consolidate this for the KVM side of the world.
> 
> Exactly!
> 
> > 
> > guest_memfd, or something like it, is just really a good answer. You
> > have it obtain the huge folio, and keep track on its own which sub
> > pages can be mapped to a VMA because they are shared. KVM will obtain
> > the PFNs directly from the fd and KVM will not see the shared
> > holes. This means your S2's can be trivially constructed correctly.
> > 
> > No need to double allocate..
> 
> Yes, that's why my thinking so far was:
> 
> Let guest_memfd (or something like that) consume huge pages (somehow, let it
> access the hugetlb reserves). Preallocate that memory once, as the VM starts
> up: just like we do with hugetlb in VMs.
> 
> Let KVM track which parts are shared/private, and if required, let it map
> only the shared parts to user space. KVM has all information to make these
> decisions.
> 
> If we could disallow pinning any shared pages, that would make life a lot
> easier, but I think there were reasons for why we might require it. To
> convert shared->private, simply unmap that folio (only the shared parts
> could possibly be mapped) from all user page tables.
> 
> Of course, there might be alternatives, and I'll be happy to learn about
> them. The allcoator part would be fairly easy, and the uAPI part would
> similarly be comparably easy. So far the theory :)
> 
> > 
> > I'm kind of surprised the CC folks don't want the same thing for
> > exactly the same reason. It is much easier to recover the huge
> > mappings for the S2 in the presence of shared holes if you track it
> > this way. Even CC will have this problem, to some degree, too.
>
> Precisely! RH (and therefore, me) is primarily interested in existing
> guest_memfd users at this point ("CC"), and I don't see an easy way to get
> that running with huge pages in the existing model reasonably well ...

This is the general direction guest_memfd is headed, but getting there is easier
said than done.  E.g. as alluded to above, "simply unmap that folio" is quite
difficult, bordering on infeasible if the kernel is allowed to gup() shared
guest_memfd memory.
Mostafa Saleh June 20, 2024, 4:33 p.m. UTC | #24
Hi David,

On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
> On 20.06.24 15:08, Mostafa Saleh wrote:
> > Hi David,
> > 
> > On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
> > > Hi,
> > > 
> > > On 19.06.24 04:44, John Hubbard wrote:
> > > > On 6/18/24 5:05 PM, Elliot Berman wrote:
> > > > > In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support
> > > > > grabbing shmem user pages instead of using KVM's guestmemfd. These
> > > > > hypervisors provide a different isolation model than the CoCo
> > > > > implementations from x86. KVM's guest_memfd is focused on providing
> > > > > memory that is more isolated than AVF requires. Some specific examples
> > > > > include ability to pre-load data onto guest-private pages, dynamically
> > > > > sharing/isolating guest pages without copy, and (future) migrating
> > > > > guest-private pages.  In sum of those differences after a discussion in
> > > > > [1] and at PUCK, we want to try to stick with existing shmem and extend
> > > > > GUP to support the isolation needs for arm64 pKVM and Gunyah.
> > > 
> > > The main question really is, into which direction we want and can develop
> > > guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if
> > > guest_memfd should be our new target for guest memory, both shared and
> > > private. There are a bunch of issues to be sorted out though ...
> > > 
> > > As there is interest from Red Hat into supporting hugetlb-style huge pages
> > > in confidential VMs for real-time workloads, and wasting memory is not
> > > really desired, I'm going to think some more about some of the challenges
> > > (shared+private in guest_memfd, mmap support, migration of !shared folios,
> > > hugetlb-like support, in-place shared<->private conversion, interaction with
> > > page pinning). Tricky.
> > > 
> > > Ideally, we'd have one way to back guest memory for confidential VMs in the
> > > future.
> > > 
> > > 
> > > Can you comment on the bigger design goal here? In particular:
> > > 
> > > 1) Who would get the exclusive PIN and for which reason? When would we
> > >     pin, when would we unpin?
> > > 
> > > 2) What would happen if there is already another PIN? Can we deal with
> > >     speculative short-term PINs from GUP-fast that could introduce
> > >     errors?
> > > 
> > > 3) How can we be sure we don't need other long-term pins (IOMMUs?) in
> > >     the future?
> > 
> > Can you please clarify more about the IOMMU case?
> > 
> > pKVM has no merged upstream IOMMU support at the moment, although
> > there was an RFC a while a go [1], also there would be a v2 soon.
> > 
> > In the patches KVM (running in EL2) will manage the IOMMUs including
> > the page tables and all pages used in that are allocated from the
> > kernel.
> > 
> > These patches don't support IOMMUs for guests. However, I don't see
> > why would that be different from the CPU? as once the page is pinned
> > it can be owned by a guest and that would be reflected in the
> > hypervisor tracking, the CPU stage-2 and IOMMU page tables as well.
> 
> So this is my thinking, it might be flawed:
> 
> In the "normal" world (e.g., vfio), we FOLL_PIN|FOLL_LONGTERM the pages to
> be accessible by a dedicated device. We look them up in the page tables to
> pin them, then we can map them into the IOMMU.
> 
> Devices that cannot speak "private memory" should only access shared memory.
> So we must not have "private memory" mapped into their IOMMU.
> 
> Devices that can speak "private memory" may either access shared or private
> memory. So we may have"private memory" mapped into their IOMMU.
> 

Private pages must not be accessible to devices owned by the
host, and for that we have the same rules as for the CPU:
A) The hypervisor doesn’t trust the host, and must enforce that using the CPU
   stage-2 MMU.
B) It’s preferable that userspace doesn’t either, hence these patches (or guest_memfd...)

We need the same rules for DMA, otherwise it is "simple" to mount a DMA attack,
so we need protection by the IOMMU. pKVM at the moment provides 2 ways of
establishing that (each has its own trade-offs, which are not relevant here):

1) pKVM manages the IOMMUs and provides a hypercall interface to map/unmap in
   the IOMMU, applying the rules above (see the sketch after option 2 below).

   For A), pKVM has its own per-page metadata that tracks page state; this
   prevents mapping private pages in the IOMMU, and prevents transitioning
   pages to private while they are mapped in the IOMMU.

   For B), userspace won’t be able to map private pages (through VFIO/IOMMUFD), as
   the hypercall interface would fail if the pages are private.

   This proposal is the one on the list.

2) pKVM manages a second stage of the IOMMU (such as SMMUv3), lets the kernel
   map what it wants in stage-1, and uses a mirror of the CPU MMU stage-2 as
   the IOMMU stage-2.

   For A), similar to the CPU, the stage-2 IOMMU will protect the private pages.

   For B), userspace can map private pages in the first-stage IOMMU, but that
   would result in a stage-2 fault. AFAIK, SMMUv3 is the only Arm implementation
   that supports nesting in Linux; there the driver would only print a page
   fault, and ideally the kernel wouldn’t crash, although how faults are handled
   is really hardware dependent, and I guess assigning a device to userspace
   through VFIO comes with similar risks already (bogus MMIO access can
   crash the system).

   This proposal only exists in Android at the moment (however, I am working on
   getting an SMMUv3-compliant implementation that can be posted upstream).
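
A hypothetical sketch of the invariants option 1) relies on; all names
are invented for illustration and do not match the pKVM RFC:

#include <linux/errno.h>

enum hyp_page_state {
	HYP_PAGE_HOST_OWNED,
	HYP_PAGE_SHARED,
	HYP_PAGE_GUEST_PRIVATE,
};

/* Invented per-page metadata kept by the hypervisor. */
struct hyp_page_meta {
	enum hyp_page_state state;
	unsigned int iommu_refs;	/* live host IOMMU mappings of this page */
};

/* Map hypercall: never map a guest-private page into a host IOMMU. */
static int hyp_iommu_map_page(struct hyp_page_meta *p)
{
	if (p->state == HYP_PAGE_GUEST_PRIVATE)
		return -EPERM;
	p->iommu_refs++;
	return 0;
}

/* Donation path: refuse to make a page private while it is IOMMU-mapped. */
static int hyp_make_private(struct hyp_page_meta *p)
{
	if (p->iommu_refs)
		return -EBUSY;
	p->state = HYP_PAGE_GUEST_PRIVATE;
	return 0;
}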

> 
> What I see (again, I might be just wrong):
> 
> 1) How would the device be able to grab/access "private memory", if not
>    via the user page tables?

I hope the above answers the question, but just to confirm: a device owned by
the host shouldn’t access the memory, as the host kernel is not trusted and
can mount DMA attacks. Device assignment (passthrough) is another story.

> 2) How would we be able to convert shared -> private, if there is a
>    longterm pin from that IOMMU? We must dynamically unmap it from the
>    IOMMU.

Depending on which of the solutions above is used:
1) The transition from shared -> private would fail.
2) The private page would be unmapped from the stage-2 IOMMU (similar to the
   stage-2 CPU MMU).

> 
> I assume when you're saying "In the patches KVM (running in EL2) will manage
> the IOMMUs  including the page tables", this is easily solved by not relying
> on pinning: KVM just knows what to update and where. (which is a very
> different model than what VFIO does)
> 

Yes, that is not required to protect private memory.

Thanks,
Mostafa

> Thanks!
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Jason Gunthorpe June 20, 2024, 4:36 p.m. UTC | #25
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:

> If we could disallow pinning any shared pages, that would make life a lot
> easier, but I think there were reasons for why we might require it. To
> convert shared->private, simply unmap that folio (only the shared parts
> could possibly be mapped) from all user page tables.

IMHO it should be reasonable to make it work like ZONE_MOVABLE and
FOLL_LONGTERM. Making a shared page private is really no different
from moving it.

And if you have built a VMM that uses VMA mapped shared pages and
short-term pinning then you should really also ensure that the VM is
aware when the pins go away. For instance if you are doing some virtio
thing with O_DIRECT pinning then the guest will know the pins are gone
when it observes virtio completions.

In this way making private is just like moving, we unmap the page and
then drive the refcount to zero, then move it.
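
A rough sketch of that, in the spirit of how migration waits for
references to drain (illustrative only; expected_refs is an assumption
the caller would have to compute, e.g. the reference held by the owning
guest_memfd/filemap):

#include <linux/mm.h>
#include <linux/page_ref.h>

/*
 * Sketch only: after unmapping the folio, decide whether it is safe to
 * hand it to the private side. Transient references (GUP-fast,
 * O_DIRECT still in flight) make this return false; the caller retries.
 */
static bool can_make_private(struct folio *folio, int expected_refs)
{
	/* Heuristic for FOLL_PIN users: may false-positive, never misses. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* Any reference beyond the ones we expect means "not yet". */
	return folio_ref_count(folio) == expected_refs;
}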

> > I'm kind of surprised the CC folks don't want the same thing for
> > exactly the same reason. It is much easier to recover the huge
> > mappings for the S2 in the presence of shared holes if you track it
> > this way. Even CC will have this problem, to some degree, too.
> 
> Precisely! RH (and therefore, me) is primarily interested in existing
> guest_memfd users at this point ("CC"), and I don't see an easy way to get
> that running with huge pages in the existing model reasonably well ...

IMHO it is an important topic so I'm glad you are thinking about it.

There is definitely some overlap here: if you do teach
guest_memfd about huge pages then you must also provide a way to map
the fragments of them that have become shared. I think there is little
option here unless you double allocate and/or destroy the performance
properties of the huge pages.

It is just the nature of our system that shared pages must be in VMAs
and must be copy_to/from_user/GUP'able/etc.

Jason
David Hildenbrand June 20, 2024, 6:53 p.m. UTC | #26
On 20.06.24 18:36, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
> 
>> If we could disallow pinning any shared pages, that would make life a lot
>> easier, but I think there were reasons for why we might require it. To
>> convert shared->private, simply unmap that folio (only the shared parts
>> could possibly be mapped) from all user page tables.
> 
> IMHO it should be reasonable to make it work like ZONE_MOVABLE and
> FOLL_LONGTERM. Making a shared page private is really no different
> from moving it.
> 
> And if you have built a VMM that uses VMA mapped shared pages and
> short-term pinning then you should really also ensure that the VM is
> aware when the pins go away. For instance if you are doing some virtio
> thing with O_DIRECT pinning then the guest will know the pins are gone
> when it observes virtio completions.
> 
> In this way making private is just like moving, we unmap the page and
> then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large 
folio is (validly) longterm pinned and you want to convert another 
shared subpage to private?

Sure, we can unmap the whole large folio (including all shared parts) 
before the conversion, just like we would do for migration. But we 
cannot detect that nobody pinned that subpage that we want to convert to 
private.

Core-mm is not, and will not, track pins per subpage.

So I only see two options:

a) Disallow long-term pinning. That means we can, with a bit of waiting,
    always convert subpages shared->private after unmapping them and
    waiting for the short-term pins to go away. Not too bad, and we
    already have other mechanisms that disallow long-term pinning (especially
    writable fs ones!).

b) Expose the large folio as multiple 4k folios to the core-mm.


b) would look as follows: we allocate a gigantic page from the (hugetlb) 
reserve into guest_memfd. Then, we break it down into individual 4k 
folios by splitting/demoting the folio. We make sure that all 4k folios 
are unmovable (raised refcount). We keep tracking internally that these 
4k folios comprise a single large gigantic page.

Core-mm can now track GUP pins and page table mappings for us, per 
(previously subpage, now) small folio, without any modifications.

Once we unmap the gigantic page from guest_memfd, we reconstruct the 
gigantic page and hand it back to the reserve (only possible once all 
pins are gone).

We can still map the whole thing into the KVM guest+iommu using a single 
large unit, because guest_memfd knows the origin/relationship of these 
pages. But we would only map individual pages into user page tables 
(unless we use large VM_PFNMAP mappings, but then also pinning would not 
work, so that's likely also not what we want).

The downside is that we won't benefit from vmemmap optimizations for 
large folios from hugetlb, and have more tracking overhead when mapping 
individual pages into user page tables.

OTOH, maybe we really *need* per-page tracking and this might be the 
simplest way forward, making GUP and friends just work naturally with it.
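
A hypothetical data structure for the bookkeeping described in b); the
names are invented for illustration only:

#include <linux/bitops.h>
#include <linux/types.h>

/*
 * guest_memfd-side bookkeeping (invented): remember that a run of small
 * folios originated from one gigantic allocation, and which of them are
 * currently shared (and therefore allowed to be mapped into VMAs).
 */
struct gmem_huge_region {
	unsigned long first_pfn;	/* first PFN of the original gigantic page */
	unsigned long nr_pages;		/* e.g. 262144 for 1 GiB of 4 KiB pages */
	unsigned long *shared;		/* bitmap: bit set => subpage is shared */
};

/* Only shared subpages may ever be mapped into user page tables. */
static bool gmem_subpage_mappable(const struct gmem_huge_region *r,
				  unsigned long idx)
{
	return idx < r->nr_pages && test_bit(idx, r->shared);
}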

> 
>>> I'm kind of surprised the CC folks don't want the same thing for
>>> exactly the same reason. It is much easier to recover the huge
>>> mappings for the S2 in the presence of shared holes if you track it
>>> this way. Even CC will have this problem, to some degree, too.
>>
>> Precisely! RH (and therefore, me) is primarily interested in existing
>> guest_memfd users at this point ("CC"), and I don't see an easy way to get
>> that running with huge pages in the existing model reasonably well ...
> 
> IMHO it is an important topic so I'm glad you are thinking about it.

Thank my manager ;)

> 
> There is definately some overlap here where if you do teach
> guest_memfd about huge pages then you must also provide a away to map
> the fragments of them that have become shared. I think there is little
> option here unless you double allocate and/or destroy the performance
> properties of the huge pages.

Right, and that's not what we want.

> 
> It is just the nature of our system that shared pages must be in VMAs
> and must be copy_to/from_user/GUP'able/etc.

Right. Longterm GUP is not a real requirement.
David Hildenbrand June 20, 2024, 6:56 p.m. UTC | #27
On 20.06.24 18:04, Sean Christopherson wrote:
> On Thu, Jun 20, 2024, David Hildenbrand wrote:
>> On 20.06.24 16:29, Jason Gunthorpe wrote:
>>> On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
>>>> On 20.06.24 15:55, Jason Gunthorpe wrote:
>>>>> On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
>>>> Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
>>>> now the VM requests to make one subpage private.
>>>
>>> I think the general CC model has the shared/private setup earlier on
>>> the VM lifecycle with large runs of contiguous pages. It would only
>>> become a problem if you intend to to high rate fine granual
>>> shared/private switching. Which is why I am asking what the actual
>>> "why" is here.
>>
>> I am not an expert on that, but I remember that the way memory
>> shared<->private conversion happens can heavily depend on the VM use case,
> 
> Yeah, I forget the details, but there are scenarios where the guest will share
> (and unshare) memory at 4KiB (give or take) granularity, at runtime.  There's an
> RFC[*] for making SWIOTLB operate at 2MiB is driven by the same underlying problems.
> 
> But even if Linux-as-a-guest were better behaved, we (the host) can't prevent the
> guest from doing suboptimal conversions.  In practice, killing the guest or
> refusing to convert memory isn't an option, i.e. we can't completely push the
> problem into the guest

Agreed!

> 
> https://lore.kernel.org/all/20240112055251.36101-1-vannapurve@google.com
> 
>> and that under pKVM we might see more frequent conversion, without even
>> going to user space.
>>
>>>
>>>> How to handle that without eventually running into a double
>>>> memory-allocation? (in the worst case, allocating a 1GiB huge page
>>>> for shared and for private memory).
>>>
>>> I expect you'd take the linear range of 1G of PFNs and fragment it
>>> into three ranges private/shared/private that span the same 1G.
>>>
>>> When you construct a page table (ie a S2) that holds these three
>>> ranges and has permission to access all the memory you want the page
>>> table to automatically join them back together into 1GB entry.
>>>
>>> When you construct a page table that has only access to the shared,
>>> then you'd only install the shared hole at its natural best size.
>>>
>>> So, I think there are two challenges - how to build an allocator and
>>> uAPI to manage this sort of stuff so you can keep track of any
>>> fractured pfns and ensure things remain in physical order.
>>>
>>> Then how to re-consolidate this for the KVM side of the world.
>>
>> Exactly!
>>
>>>
>>> guest_memfd, or something like it, is just really a good answer. You
>>> have it obtain the huge folio, and keep track on its own which sub
>>> pages can be mapped to a VMA because they are shared. KVM will obtain
>>> the PFNs directly from the fd and KVM will not see the shared
>>> holes. This means your S2's can be trivially constructed correctly.
>>>
>>> No need to double allocate..
>>
>> Yes, that's why my thinking so far was:
>>
>> Let guest_memfd (or something like that) consume huge pages (somehow, let it
>> access the hugetlb reserves). Preallocate that memory once, as the VM starts
>> up: just like we do with hugetlb in VMs.
>>
>> Let KVM track which parts are shared/private, and if required, let it map
>> only the shared parts to user space. KVM has all information to make these
>> decisions.
>>
>> If we could disallow pinning any shared pages, that would make life a lot
>> easier, but I think there were reasons for why we might require it. To
>> convert shared->private, simply unmap that folio (only the shared parts
>> could possibly be mapped) from all user page tables.
>>
>> Of course, there might be alternatives, and I'll be happy to learn about
>> them. The allcoator part would be fairly easy, and the uAPI part would
>> similarly be comparably easy. So far the theory :)
>>
>>>
>>> I'm kind of surprised the CC folks don't want the same thing for
>>> exactly the same reason. It is much easier to recover the huge
>>> mappings for the S2 in the presence of shared holes if you track it
>>> this way. Even CC will have this problem, to some degree, too.
>>
>> Precisely! RH (and therefore, me) is primarily interested in existing
>> guest_memfd users at this point ("CC"), and I don't see an easy way to get
>> that running with huge pages in the existing model reasonably well ...
> 
> This is the general direction guest_memfd is headed, but getting there is easier
> said than done.  E.g. as alluded to above, "simply unmap that folio" is quite
> difficult, bordering on infeasible if the kernel is allowed to gup() shared
> guest_memfd memory.

Right. I think the ways forward are the ones stated in my mail to Jason: 
disallow long-term GUP, or expose the huge page as unmovable small folios 
to core-mm.

Maybe there are other alternatives, but it all feels like we want the MM 
to track at the granularity of small pages, but map the memory into the 
KVM/IOMMU page tables in large pages.
Sean Christopherson June 20, 2024, 8:30 p.m. UTC | #28
On Thu, Jun 20, 2024, David Hildenbrand wrote:
> On 20.06.24 18:36, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
> > 
> > > If we could disallow pinning any shared pages, that would make life a lot
> > > easier, but I think there were reasons for why we might require it. To
> > > convert shared->private, simply unmap that folio (only the shared parts
> > > could possibly be mapped) from all user page tables.
> > 
> > IMHO it should be reasonable to make it work like ZONE_MOVABLE and
> > FOLL_LONGTERM. Making a shared page private is really no different
> > from moving it.
> > 
> > And if you have built a VMM that uses VMA mapped shared pages and
> > short-term pinning then you should really also ensure that the VM is
> > aware when the pins go away. For instance if you are doing some virtio
> > thing with O_DIRECT pinning then the guest will know the pins are gone
> > when it observes virtio completions.
> > 
> > In this way making private is just like moving, we unmap the page and
> > then drive the refcount to zero, then move it.
> Yes, but here is the catch: what if a single shared subpage of a large folio
> is (validly) longterm pinned and you want to convert another shared subpage
> to private?
> 
> Sure, we can unmap the whole large folio (including all shared parts) before
> the conversion, just like we would do for migration. But we cannot detect
> that nobody pinned that subpage that we want to convert to private.
> 
> Core-mm is not, and will not, track pins per subpage.
> 
> So I only see two options:
> 
> a) Disallow long-term pinning. That means, we can, with a bit of wait,
>    always convert subpages shared->private after unmapping them and
>    waiting for the short-term pin to go away. Not too bad, and we
>    already have other mechanisms disallow long-term pinnings (especially
>    writable fs ones!).

I don't think disallowing _just_ long-term GUP will suffice; if we go the "disallow
GUP" route then I think it needs to disallow GUP, period.  As with the whole "GUP
writes to file-backed memory" issue[*], which I think you're alluding to, short-term
GUP is also problematic.  But unlike file-backed memory, for TDX and SNP (and I
think pKVM), a single rogue access has a high probability of being fatal to the
entire system.

I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
with 100% accuracy that there are no outstanding mappings when converting a page
from shared=>private.  Crossing our fingers and hoping that short-term GUP will
have gone away isn't enough.

[*] https://lore.kernel.org/all/cover.1683235180.git.lstoakes@gmail.com

> b) Expose the large folio as multiple 4k folios to the core-mm.
> 
> 
> b) would look as follows: we allocate a gigantic page from the (hugetlb)
> reserve into guest_memfd. Then, we break it down into individual 4k folios
> by splitting/demoting the folio. We make sure that all 4k folios are
> unmovable (raised refcount). We keep tracking internally that these 4k
> folios comprise a single large gigantic page.
> 
> Core-mm can track for us now without any modifications per (previously
> subpage,) now small folios GUP pins and page table mappings without
> modifications.
> 
> Once we unmap the gigantic page from guest_memfd, we recronstruct the
> gigantic page and hand it back to the reserve (only possible once all pins
> are gone).
> 
> We can still map the whole thing into the KVM guest+iommu using a single
> large unit, because guest_memfd knows the origin/relationship of these
> pages. But we would only map individual pages into user page tables (unless
> we use large VM_PFNMAP mappings, but then also pinning would not work, so
> that's likely also not what we want).

Not being able to map guest_memfd into userspace with 1GiB mappings should be ok, at
least for CoCo VMs.  If the guest shares an entire 1GiB chunk, e.g. for DMA or
whatever, then userspace can simply punch a hole in guest_memfd and allocate 1GiB
of memory from regular memory.  Even losing 2MiB mappings should be ok.

For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
they'll be a complete non-issue because there is no shared vs. private to worry
about.  We can simply allow any and all userspace mappings for guest_memfd that is
attached to a "regular" VM, because a misbehaving userspace only loses whatever
hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
kernel and system at-large isn't at risk.

> The downside is that we won't benefit from vmemmap optimizations for large
> folios from hugetlb, and have more tracking overhead when mapping individual
> pages into user page tables.

Hmm, I suspect losing the vmemmap optimizations would be acceptable, especially
if we could defer the shattering until the guest actually tried to partially
convert a 1GiB/2MiB region, and restore the optimizations when the memory is
converted back.

> OTOH, maybe we really *need* per-page tracking and this might be the
> simplest way forward, making GUP and friends just work naturally with it.
David Hildenbrand June 20, 2024, 8:47 p.m. UTC | #29
On 20.06.24 22:30, Sean Christopherson wrote:
> On Thu, Jun 20, 2024, David Hildenbrand wrote:
>> On 20.06.24 18:36, Jason Gunthorpe wrote:
>>> On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
>>>
>>>> If we could disallow pinning any shared pages, that would make life a lot
>>>> easier, but I think there were reasons for why we might require it. To
>>>> convert shared->private, simply unmap that folio (only the shared parts
>>>> could possibly be mapped) from all user page tables.
>>>
>>> IMHO it should be reasonable to make it work like ZONE_MOVABLE and
>>> FOLL_LONGTERM. Making a shared page private is really no different
>>> from moving it.
>>>
>>> And if you have built a VMM that uses VMA mapped shared pages and
>>> short-term pinning then you should really also ensure that the VM is
>>> aware when the pins go away. For instance if you are doing some virtio
>>> thing with O_DIRECT pinning then the guest will know the pins are gone
>>> when it observes virtio completions.
>>>
>>> In this way making private is just like moving, we unmap the page and
>>> then drive the refcount to zero, then move it.
>> Yes, but here is the catch: what if a single shared subpage of a large folio
>> is (validly) longterm pinned and you want to convert another shared subpage
>> to private?
>>
>> Sure, we can unmap the whole large folio (including all shared parts) before
>> the conversion, just like we would do for migration. But we cannot detect
>> that nobody pinned that subpage that we want to convert to private.
>>
>> Core-mm is not, and will not, track pins per subpage.
>>
>> So I only see two options:
>>
>> a) Disallow long-term pinning. That means, we can, with a bit of wait,
>>     always convert subpages shared->private after unmapping them and
>>     waiting for the short-term pin to go away. Not too bad, and we
>>     already have other mechanisms disallow long-term pinnings (especially
>>     writable fs ones!).
> 
> I don't think disallowing _just_ long-term GUP will suffice, if we go the "disallow
> GUP" route than I think it needs to disallow GUP, period.  Like the whole "GUP
> writes to file-back memory" issue[*], which I think you're alluding to, short-term
> GUP is also problematic.  But unlike file-backed memory, for TDX and SNP (and I
> think pKVM), a single rogue access has a high probability of being fatal to the
> entire system.

Disallowing short-term should work, in theory, because the
writes-to-file-backed-memory case has different issues (the pin is not the
problem there, the dirtying is).

It's more closely related to us not allowing long-term pins for FSDAX pages,
because the lifetime of these pages is determined by the FS.

What we would do is

1) Unmap the large folio completely and make any refaults block.
-> No new pins can pop up

2) If the folio is pinned, busy-wait until all the short-term pins are
    gone.

3) Safely convert the relevant subpage from shared -> private

Not saying it's the best approach, but it should be doable.

> 
> I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
> with 100% accuracy that there are no outstanding mappings when converting a page
> from shared=>private.  Crossing our fingers and hoping that short-term GUP will
> have gone away isn't enough.

We do have the mapcount and the refcount that will be completely 
reliable for our cases.

folio_mapcount() == 0: not mapped

folio_ref_count() == 1: we hold the single folio reference (-> no mapping,
no GUP, no unexpected references)

(folio_maybe_dma_pinned() could be used as well, but things like 
vmsplice() and some O_DIRECT might still take references. 
folio_ref_count() is more reliable in that regard)
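
Putting that together, a very rough sketch of the flow (steps 1-3 above plus
the reference check) could look like the following. Only unmap_mapping_range()
and the folio_* helpers are the real APIs; the mapping/folio arguments, how
refaults are blocked, and the actual conversion step are placeholders:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* "mapping" is the guest_memfd address_space, "folio" the affected large
 * folio, "index" its first page offset in the file -- all placeholders. */
static void convert_subpage_to_private(struct address_space *mapping,
                                       struct folio *folio, pgoff_t index)
{
        /* 1) Unmap all (shared) user mappings; refaults must block elsewhere. */
        unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
                            folio_size(folio), 0);

        /* 2) Wait until we hold the single remaining reference:
         *    no mapping, no GUP, no unexpected references. */
        while (folio_mapcount(folio) != 0 || folio_ref_count(folio) != 1)
                schedule_timeout_uninterruptible(1);

        /* 3) Now the relevant subpage can safely be converted to private. */
        /* ... hypervisor/architecture specific conversion goes here ... */
}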

> 
> [*] https://lore.kernel.org/all/cover.1683235180.git.lstoakes@gmail.com
> 
>> b) Expose the large folio as multiple 4k folios to the core-mm.
>>
>>
>> b) would look as follows: we allocate a gigantic page from the (hugetlb)
>> reserve into guest_memfd. Then, we break it down into individual 4k folios
>> by splitting/demoting the folio. We make sure that all 4k folios are
>> unmovable (raised refcount). We keep tracking internally that these 4k
>> folios comprise a single large gigantic page.
>>
>> Core-mm can track for us now without any modifications per (previously
>> subpage,) now small folios GUP pins and page table mappings without
>> modifications.
>>
>> Once we unmap the gigantic page from guest_memfd, we recronstruct the
>> gigantic page and hand it back to the reserve (only possible once all pins
>> are gone).
>>
>> We can still map the whole thing into the KVM guest+iommu using a single
>> large unit, because guest_memfd knows the origin/relationship of these
>> pages. But we would only map individual pages into user page tables (unless
>> we use large VM_PFNMAP mappings, but then also pinning would not work, so
>> that's likely also not what we want).
> 
> Not being to map guest_memfd into userspace with 1GiB mappings should be ok, at
> least for CoCo VMs.  If the guest shares an entire 1GiB chunk, e.g. for DMA or
> whatever, then userspace can simply punch a hole in guest_memfd and allocate 1GiB
> of memory from regular memory.  Even losing 2MiB mappings should be ok.
> 
> For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
> they'll be a complete non-issue because there is no shared vs. private to worry
> about.  We can simply allow any and all userspace mappings for guest_memfd that is
> attached to a "regular" VM, because a misbehaving userspace only loses whatever
> hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
> kernel and system at-large isn't at risk.
> 
>> The downside is that we won't benefit from vmemmap optimizations for large
>> folios from hugetlb, and have more tracking overhead when mapping individual
>> pages into user page tables.
> 
> Hmm, I suspect losing the vmemmap optimizations would be acceptable, especially
> if we could defer the shattering until the guest actually tried to partially
> convert a 1GiB/2MiB region, and restore the optimizations when the memory is
> converted back.

We can only shatter/collapse if there are no unexpected folio 
references. So GUP would have to be handled as well ... so that is
certainly problematic.
Sean Christopherson June 20, 2024, 10:32 p.m. UTC | #30
On Thu, Jun 20, 2024, David Hildenbrand wrote:
> On 20.06.24 22:30, Sean Christopherson wrote:
> > On Thu, Jun 20, 2024, David Hildenbrand wrote:
> > > On 20.06.24 18:36, Jason Gunthorpe wrote:
> > > > On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
> > > > 
> > > > > If we could disallow pinning any shared pages, that would make life a lot
> > > > > easier, but I think there were reasons for why we might require it. To
> > > > > convert shared->private, simply unmap that folio (only the shared parts
> > > > > could possibly be mapped) from all user page tables.
> > > > 
> > > > IMHO it should be reasonable to make it work like ZONE_MOVABLE and
> > > > FOLL_LONGTERM. Making a shared page private is really no different
> > > > from moving it.
> > > > 
> > > > And if you have built a VMM that uses VMA mapped shared pages and
> > > > short-term pinning then you should really also ensure that the VM is
> > > > aware when the pins go away. For instance if you are doing some virtio
> > > > thing with O_DIRECT pinning then the guest will know the pins are gone
> > > > when it observes virtio completions.
> > > > 
> > > > In this way making private is just like moving, we unmap the page and
> > > > then drive the refcount to zero, then move it.
> > > Yes, but here is the catch: what if a single shared subpage of a large folio
> > > is (validly) longterm pinned and you want to convert another shared subpage
> > > to private?
> > > 
> > > Sure, we can unmap the whole large folio (including all shared parts) before
> > > the conversion, just like we would do for migration. But we cannot detect
> > > that nobody pinned that subpage that we want to convert to private.
> > > 
> > > Core-mm is not, and will not, track pins per subpage.
> > > 
> > > So I only see two options:
> > > 
> > > a) Disallow long-term pinning. That means, we can, with a bit of wait,
> > >     always convert subpages shared->private after unmapping them and
> > >     waiting for the short-term pin to go away. Not too bad, and we
> > >     already have other mechanisms disallow long-term pinnings (especially
> > >     writable fs ones!).
> > 
> > I don't think disallowing _just_ long-term GUP will suffice, if we go the "disallow
> > GUP" route than I think it needs to disallow GUP, period.  Like the whole "GUP
> > writes to file-back memory" issue[*], which I think you're alluding to, short-term
> > GUP is also problematic.  But unlike file-backed memory, for TDX and SNP (and I
> > think pKVM), a single rogue access has a high probability of being fatal to the
> > entire system.
> 
> Disallowing short-term should work, in theory, because the

By "short-term", I assume you mean "long-term"?  Or am I more lost than I realize?

> writes-to-fileback has different issues (the PIN is not the problem but the
> dirtying).
>
> It's more related us not allowing long-term pins for FSDAX pages, because
> the lifetime of these pages is determined by the FS.
> 
> What we would do is
> 
> 1) Unmap the large folio completely and make any refaults block.
> -> No new pins can pop up
> 
> 2) If the folio is pinned, busy-wait until all the short-term pins are
>    gone.

This is the step that concerns me.  "Relatively short time" is, well, relative.
Hmm, though I suppose if userspace managed to map a shared page into something
that pins the page, and can't force an unpin (e.g. by stopping I/O?), then either
there's a host userspace bug or a guest bug, and so effectively hanging the vCPU
that is waiting for the conversion to complete is ok.

> 3) Safely convert the relevant subpage from shared -> private
> 
> Not saying it's the best approach, but it should be doable.
Elliot Berman June 20, 2024, 10:47 p.m. UTC | #31
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> > Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
> > now the VM requests to make one subpage private. 
> 
> I think the general CC model has the shared/private setup earlier on
> the VM lifecycle with large runs of contiguous pages. It would only
> become a problem if you intend to to high rate fine granual
> shared/private switching. Which is why I am asking what the actual
> "why" is here.
> 

I'd let Fuad comment if he's aware of any specific/concrete Android
use cases for converting between shared and private. One use case I can
think of is the host providing large multimedia blobs (e.g. video) to the
guest. Rather than using swiotlb, the CC guest can share pages back with
the host so the host can copy the blob in, possibly using H/W accel. I
mention this example because we may not need to support shared/private
conversions at granularity finer than huge pages. The host and guest can
negotiate the minimum size that can be converted, and you never run into
the issue where subpages of a folio are differently shared. I can't think of
a use case where we need such granularity for converting private/shared.

Jason, do you have a scenario in mind? I couldn't tell if we now have a
use case or are brainstorming a solution to have a solution.

Thanks,
Elliot
Jason Gunthorpe June 20, 2024, 11 p.m. UTC | #32
> This is the step that concerns me.   "Relatively short time" is, well, relative.
> Hmm, though I suppose if userspace managed to map a shared page into something
> that pins the page, and can't force an unpin, e.g. by stopping I/O?, then either
> there's a host userspace bug or a guest bug, and so effectively hanging the vCPU
> that is waiting for the conversion to complete is ok.

The whole point of FOLL_LONGTERM is to interact with
ZONE_MOVABLE stuff such that only FOLL_LONGTERM users will cause
unlimited refcount elevation.

Blocking FOLL_LONGTERM is supposed to result in pins that go to
zero on their own in some entirely kernel-controlled time
frame. Userspace is not supposed to be able to do anything to prevent
this.

This is not necessarily guaranteed "fast", but it is certainly largely
under the control of the hypervisor kernel and VMM, i.e. if you do O_DIRECT
to the shared memory then the memory will remain pinned until the
storage completes. Which might be ms or it might be a xx-second
storage timeout.

But putting it in the full context, if the guest tries to make a page
private that is actively undergoing IO while shared, then I think it
is misbehaving and it is quite reasonable to stall its call for
private until the page refs drop to zero. If guests want shared to
private to be fast then guests need to ensure there is no outstanding
IO.

In other words the page ref scheme would only be protective against
hostile guests and in real workloads we'd never expect to have to
wait. The same as ZONE_MOVABLE.
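
To make the short-term vs FOLL_LONGTERM distinction concrete, a hand-wavy
sketch (the GUP calls and flags are the real API, the surrounding context
is made up):

#include <linux/mm.h>

/* Short-term pin, e.g. what an O_DIRECT read effectively does: the pin is
 * dropped once the I/O completes, so the elevated refcount goes away in
 * kernel-controlled time. */
static int shortterm_io_sketch(unsigned long uaddr, struct page **pages,
                               int nr_pages)
{
        int pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

        if (pinned <= 0)
                return pinned;
        /* ... submit the I/O and wait for completion ... */
        unpin_user_pages(pages, pinned);
        return 0;
}

/* Long-term pin, VFIO/RDMA style: the pin lives for an unbounded,
 * userspace-controlled time, which is why FOLL_LONGTERM is refused for
 * ZONE_MOVABLE (and would be refused here as well). */
static int longterm_pin_sketch(unsigned long uaddr, struct page **pages,
                               int nr_pages)
{
        return pin_user_pages_fast(uaddr, nr_pages,
                                   FOLL_WRITE | FOLL_LONGTERM, pages);
}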

Jason
Jason Gunthorpe June 20, 2024, 11:08 p.m. UTC | #33
On Thu, Jun 20, 2024 at 08:53:07PM +0200, David Hildenbrand wrote:
> On 20.06.24 18:36, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
> > 
> > > If we could disallow pinning any shared pages, that would make life a lot
> > > easier, but I think there were reasons for why we might require it. To
> > > convert shared->private, simply unmap that folio (only the shared parts
> > > could possibly be mapped) from all user page tables.
> > 
> > IMHO it should be reasonable to make it work like ZONE_MOVABLE and
> > FOLL_LONGTERM. Making a shared page private is really no different
> > from moving it.
> > 
> > And if you have built a VMM that uses VMA mapped shared pages and
> > short-term pinning then you should really also ensure that the VM is
> > aware when the pins go away. For instance if you are doing some virtio
> > thing with O_DIRECT pinning then the guest will know the pins are gone
> > when it observes virtio completions.
> > 
> > In this way making private is just like moving, we unmap the page and
> > then drive the refcount to zero, then move it.
> Yes, but here is the catch: what if a single shared subpage of a large folio
> is (validly) longterm pinned and you want to convert another shared subpage
> to private?

When I wrote the above I was assuming option b was the choice.

> a) Disallow long-term pinning. That means, we can, with a bit of wait,
>    always convert subpages shared->private after unmapping them and
>    waiting for the short-term pin to go away. Not too bad, and we
>    already have other mechanisms disallow long-term pinnings (especially
>    writable fs ones!).

This seems reasonable, but you are trading off a big hit to IO
performance while doing shared/private operations.

> b) Expose the large folio as multiple 4k folios to the core-mm.

And this trades off more VMM memory usage and micro-slower
copy_to/from_user. I think this is probably the better choice.

IMHO the VMA does not need to use large mappings for these
cases. The IO path on these VM types is already disastrously slow, and
optimizing with 1GB huge pages in the VMM to make copy_to/from_user
very slightly faster doesn't seem worthwhile.

> b) would look as follows: we allocate a gigantic page from the (hugetlb)
> reserve into guest_memfd. Then, we break it down into individual 4k folios
> by splitting/demoting the folio. We make sure that all 4k folios are
> unmovable (raised refcount). We keep tracking internally that these 4k
> folios comprise a single large gigantic page.

Yes, something like this. Or maybe they get converted to ZONE_DEVICE
pages so that freeing them goes back to a pgmap callback in the
guest_memfd, or something simple like that.

> The downside is that we won't benefit from vmemmap optimizations for large
> folios from hugetlb, and have more tracking overhead when mapping individual
> pages into user page tables.

Yes, that too, but you are going to have some kind of per 4k tracking
overhead anyhow in guest_memfd no matter what you do. It would
probably be less than the struct pages though.

There is also the interesting option to use a PFNMAP VMA so there is
no refcounting and we don't need to mess with the struct pages. The
downside is that you totally lose GUP. So no O_DIRECT..
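
For reference, that is the kind of mapping a driver would set up in its
->mmap() handler with remap_pfn_range(); everything below except the API call
and the VM_PFNMAP semantics is made up:

#include <linux/mm.h>

/* Hypothetical ->mmap() handler for a guest_memfd-like file. */
static int pfnmap_mmap_sketch(struct file *file, struct vm_area_struct *vma)
{
        unsigned long pfn = 0x100000;   /* placeholder base PFN */
        unsigned long size = vma->vm_end - vma->vm_start;

        /*
         * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP, so there are no
         * struct-page refcounts to manage -- but GUP refuses such VMAs,
         * which is exactly why O_DIRECT into this mapping cannot work.
         */
        return remap_pfn_range(vma, vma->vm_start, pfn + vma->vm_pgoff,
                               size, vma->vm_page_prot);
}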

Jason
Jason Gunthorpe June 20, 2024, 11:11 p.m. UTC | #34
On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
> I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
> with 100% accuracy that there are no outstanding mappings when converting a page
> from shared=>private.  Crossing our fingers and hoping that short-term GUP will
> have gone away isn't enough.

To be clear it is not crossing fingers. If the page refcount is 0 then
there are no references to that memory anywhere at all. It is 100%
certain.

It may take time to reach zero, but when it does it is safe.

Many things rely on this property, including FSDAX.

> For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
> they'll be a complete non-issue because there is no shared vs. private to worry
> about.  We can simply allow any and all userspace mappings for guest_memfd that is
> attached to a "regular" VM, because a misbehaving userspace only loses whatever
> hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
> kernel and system at-large isn't at risk.

It does seem to me like guest_memfd should really focus on the private
aspect.

If we need normal memfd enhancements of some kind to work better with
KVM then that may be a better option than turning guest_memfd into
memfd.

Jason
Jason Gunthorpe June 20, 2024, 11:18 p.m. UTC | #35
On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
> On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> > > Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
> > > now the VM requests to make one subpage private. 
> > 
> > I think the general CC model has the shared/private setup earlier on
> > the VM lifecycle with large runs of contiguous pages. It would only
> > become a problem if you intend to to high rate fine granual
> > shared/private switching. Which is why I am asking what the actual
> > "why" is here.
> > 
> 
> I'd let Fuad comment if he's aware of any specific/concrete Anrdoid
> usecases about converting between shared and private. One usecase I can
> think about is host providing large multimedia blobs (e.g. video) to the
> guest. Rather than using swiotlb, the CC guest can share pages back with
> the host so host can copy the blob in, possibly using H/W accel. I
> mention this example because we may not need to support shared/private
> conversions at granularity finer than huge pages. 

I suspect the more useful thing would be to be able to allocate actual
shared memory and use that to shuffle data without a copy, setup much
less frequently. Ie you could allocate a large shared buffer for video
sharing and stream the video frames through that memory without copy.

This is slightly different from converting arbitary memory in-place
into shared memory. The VM may be able to do a better job at
clustering the shared memory allocation requests, ie locate them all
within a 1GB region to further optimize the host side.

> Jason, do you have scenario in mind? I couldn't tell if we now had a
> usecase or are brainstorming a solution to have a solution.

No, I'm interested in what pKVM is doing that needs this to be so much
different than the CC case..

Jason
Sean Christopherson June 20, 2024, 11:54 p.m. UTC | #36
On Thu, Jun 20, 2024, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
> > I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
> > with 100% accuracy that there are no outstanding mappings when converting a page
> > from shared=>private.  Crossing our fingers and hoping that short-term GUP will
> > have gone away isn't enough.
> 
> To be clear it is not crossing fingers. If the page refcount is 0 then
> there are no references to that memory anywhere at all. It is 100%
> certain.
> 
> It may take time to reach zero, but when it does it is safe.

Yeah, we're on the same page, I just didn't catch the implicit (or maybe it was
explicitly stated earlier) "wait for the refcount to hit zero" part that David
already clarified.
 
> Many things rely on this property, including FSDAX.
> 
> > For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
> > they'll be a complete non-issue because there is no shared vs. private to worry
> > about.  We can simply allow any and all userspace mappings for guest_memfd that is
> > attached to a "regular" VM, because a misbehaving userspace only loses whatever
> > hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
> > kernel and system at-large isn't at risk.
> 
> It does seem to me like guest_memfd should really focus on the private
> aspect.
> 
> If we need normal memfd enhancements of some kind to work better with
> KVM then that may be a better option than turning guest_memfd into
> memfd.

Heh, and then we'd end up turning memfd into guest_memfd.  As I see it, being
able to safely map TDX/SNP/pKVM private memory is a happy side effect that is
possible because guest_memfd isn't subordinate to the primary MMU, but private
memory isn't the core identity of guest_memfd.

The thing that makes guest_memfd tick is that it's guest-first, i.e. allows mapping
memory into the guest with more permissions/capabilities than the host.  E.g. access
to private memory, hugepage mappings when the host is forced to use small pages,
RWX mappings when the host is limited to RO, etc.

We could do a subset of those for memfd, but I don't see the point, assuming we
allow mmap() on shared guest_memfd memory.  Solving mmap() for VMs that do
private<=>shared conversions is the hard problem to solve.  Once that's done,
we'll get support for regular VMs along with the other benefits of guest_memfd
for free (or very close to free).
Quentin Perret June 21, 2024, 7:32 a.m. UTC | #37
On Thursday 20 Jun 2024 at 20:18:14 (-0300), Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
> > On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
> > > > Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
> > > > now the VM requests to make one subpage private. 
> > > 
> > > I think the general CC model has the shared/private setup earlier on
> > > the VM lifecycle with large runs of contiguous pages. It would only
> > > become a problem if you intend to to high rate fine granual
> > > shared/private switching. Which is why I am asking what the actual
> > > "why" is here.
> > > 
> > 
> > I'd let Fuad comment if he's aware of any specific/concrete Anrdoid
> > usecases about converting between shared and private. One usecase I can
> > think about is host providing large multimedia blobs (e.g. video) to the
> > guest. Rather than using swiotlb, the CC guest can share pages back with
> > the host so host can copy the blob in, possibly using H/W accel. I
> > mention this example because we may not need to support shared/private
> > conversions at granularity finer than huge pages. 
> 
> I suspect the more useful thing would be to be able to allocate actual
> shared memory and use that to shuffle data without a copy, setup much
> less frequently. Ie you could allocate a large shared buffer for video
> sharing and stream the video frames through that memory without copy.
> 
> This is slightly different from converting arbitary memory in-place
> into shared memory. The VM may be able to do a better job at
> clustering the shared memory allocation requests, ie locate them all
> within a 1GB region to further optimize the host side.
> 
> > Jason, do you have scenario in mind? I couldn't tell if we now had a
> > usecase or are brainstorming a solution to have a solution.
> 
> No, I'm interested in what pKVM is doing that needs this to be so much
> different than the CC case..

The underlying technology for implementing CC is obviously very
different (MMU-based for pKVM, encryption-based for the others + some
extra bits but let's keep it simple). In-place conversion is inherently
painful with encryption-based schemes, so it's not a surprise the
approach taken in these cases is built around destructive conversions as
a core construct. But as Elliot highlighted, the MMU-based approach
allows for pretty flexible and efficient zero-copy, which we're not
ready to sacrifice purely to shoehorn pKVM into a model that was
designed for a technology that has a very different set of constraints.
A private->shared conversion in the pKVM case is nothing more than
setting a PTE in the recipient's stage-2 page-table.

I'm not at all against starting with something simple and bouncing via
swiotlb; that is totally fine. What is _not_ fine, however, would be to
bake into the userspace API that conversions are not in-place and are
destructive (which in my mind equates to 'you can't mmap guest_memfd
pages'). But I think that isn't really a point of disagreement these
days, so hopefully we're aligned.

And to clarify some things I've also read in the thread, pKVM can
handle the vast majority of faults caused by accesses to protected
memory just fine. Userspace accesses protected guest memory? Fine,
we'll SEGV the userspace process. The kernel accesses via uaccess
macros? Also fine, we'll fail the syscall (or whatever it is we're
doing) cleanly -- the whole extable machinery works OK, which also
means that things like load_unaligned_zeropad() keep working as-is.
The only thing pKVM does is re-inject the fault back into the kernel
with some extra syndrome information so that it can figure out what to do
by itself.
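
In other words, the usual uaccess pattern on the kernel side keeps working
unchanged; a trivial (illustrative-only) example:

#include <linux/types.h>
#include <linux/uaccess.h>

static long read_from_guest_shared_sketch(const void __user *uptr, u64 *out)
{
        /*
         * If uptr happens to point at now-protected guest memory, the
         * stage-2 fault is fixed up through the extable and
         * copy_from_user() just reports uncopied bytes, so we fail the
         * syscall cleanly with -EFAULT instead of crashing the host.
         */
        if (copy_from_user(out, uptr, sizeof(*out)))
                return -EFAULT;

        return 0;
}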

It's really only accesses via e.g. the linear map that are problematic,
hence the exclusive GUP approach proposed in the series that tries to
avoid that by construction. That has the benefit of leaving
guest_memfd to other CC solutions that have more things in common. I
think it's good for that discussion to happen, no matter what we end up
doing in the end.

I hope that helps!

Thanks,
Quentin
David Hildenbrand June 21, 2024, 7:43 a.m. UTC | #38
On 21.06.24 01:54, Sean Christopherson wrote:
> On Thu, Jun 20, 2024, Jason Gunthorpe wrote:
>> On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
>>> I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
>>> with 100% accuracy that there are no outstanding mappings when converting a page
>>> from shared=>private.  Crossing our fingers and hoping that short-term GUP will
>>> have gone away isn't enough.
>>
>> To be clear it is not crossing fingers. If the page refcount is 0 then
>> there are no references to that memory anywhere at all. It is 100%
>> certain.
>>
>> It may take time to reach zero, but when it does it is safe.
> 
> Yeah, we're on the same page, I just didn't catch the implicit (or maybe it was
> explicitly stated earlier) "wait for the refcount to hit zero" part that David
> already clarified.
>   
>> Many things rely on this property, including FSDAX.
>>
>>> For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
>>> they'll be a complete non-issue because there is no shared vs. private to worry
>>> about.  We can simply allow any and all userspace mappings for guest_memfd that is
>>> attached to a "regular" VM, because a misbehaving userspace only loses whatever
>>> hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
>>> kernel and system at-large isn't at risk.
>>
>> It does seem to me like guest_memfd should really focus on the private
>> aspect.

We'll likely have to enter that domain for clean huge page support 
and/or pKVM here either way.

Likely the future will see a mixture of things: some will use 
guest_memfd only for the "private" parts and anon/shmem for the "shared" 
parts, others will use guest_memfd for both.

>>
>> If we need normal memfd enhancements of some kind to work better with
>> KVM then that may be a better option than turning guest_memfd into
>> memfd.
> 
> Heh, and then we'd end up turning memfd into guest_memfd.  As I see it, being
> able to safely map TDX/SNP/pKVM private memory is a happy side effect that is
> possible because guest_memfd isn't subordinate to the primary MMU, but private
> memory isn't the core idenity of guest_memfd.

Right.

> 
> The thing that makes guest_memfd tick is that it's guest-first, i.e. allows mapping
> memory into the guest with more permissions/capabilities than the host.  E.g. access
> to private memory, hugepage mappings when the host is forced to use small pages,
> RWX mappings when the host is limited to RO, etc.
> 
> We could do a subset of those for memfd, but I don't see the point, assuming we
> allow mmap() on shared guest_memfd memory.  Solving mmap() for VMs that do
> private<=>shared conversions is the hard problem to solve.  Once that's done,
> we'll get support for regular VMs along with the other benefits of guest_memfd
> for free (or very close to free).

I suspect there would be pushback from Hugh if we tried to teach memfd things
it really shouldn't be doing.

I once shared the idea of having a guest_memfd+memfd pair (managed by 
KVM or whatever more generic virt infrastructure), whereby we could move
folios back and forth and only the memfd pages can be mapped and 
consequently pinned. Of course, we could only move full folios, which 
implies some kind of option b) for handling larger memory chunks 
(gigantic pages).

But I'm not sure if that is really required, or whether it wouldn't just be
easier to let the guest_memfd be mapped but only hand out shared pages.
David Hildenbrand June 21, 2024, 8:02 a.m. UTC | #39
On 21.06.24 09:32, Quentin Perret wrote:
> On Thursday 20 Jun 2024 at 20:18:14 (-0300), Jason Gunthorpe wrote:
>> On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
>>> On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
>>>> On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
>>>>> Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
>>>>> now the VM requests to make one subpage private.
>>>>
>>>> I think the general CC model has the shared/private setup earlier on
>>>> the VM lifecycle with large runs of contiguous pages. It would only
>>>> become a problem if you intend to to high rate fine granual
>>>> shared/private switching. Which is why I am asking what the actual
>>>> "why" is here.
>>>>
>>>
>>> I'd let Fuad comment if he's aware of any specific/concrete Anrdoid
>>> usecases about converting between shared and private. One usecase I can
>>> think about is host providing large multimedia blobs (e.g. video) to the
>>> guest. Rather than using swiotlb, the CC guest can share pages back with
>>> the host so host can copy the blob in, possibly using H/W accel. I
>>> mention this example because we may not need to support shared/private
>>> conversions at granularity finer than huge pages.
>>
>> I suspect the more useful thing would be to be able to allocate actual
>> shared memory and use that to shuffle data without a copy, setup much
>> less frequently. Ie you could allocate a large shared buffer for video
>> sharing and stream the video frames through that memory without copy.
>>
>> This is slightly different from converting arbitary memory in-place
>> into shared memory. The VM may be able to do a better job at
>> clustering the shared memory allocation requests, ie locate them all
>> within a 1GB region to further optimize the host side.
>>
>>> Jason, do you have scenario in mind? I couldn't tell if we now had a
>>> usecase or are brainstorming a solution to have a solution.
>>
>> No, I'm interested in what pKVM is doing that needs this to be so much
>> different than the CC case..
> 
> The underlying technology for implementing CC is obviously very
> different (MMU-based for pKVM, encryption-based for the others + some
> extra bits but let's keep it simple). In-place conversion is inherently
> painful with encryption-based schemes, so it's not a surprise the
> approach taken in these cases is built around destructive conversions as
> a core construct. But as Elliot highlighted, the MMU-based approach
> allows for pretty flexible and efficient zero-copy, which we're not
> ready to sacrifice purely to shoehorn pKVM into a model that was
> designed for a technology that has very different set of constraints.
> A private->shared conversion in the pKVM case is nothing more than
> setting a PTE in the recipient's stage-2 page-table.
> 
> I'm not at all against starting with something simple and bouncing via
> swiotlb, that is totally fine. What is _not_ fine however would be to
> bake into the userspace API that conversions are not in-place and
> destructive (which in my mind equates to 'you can't mmap guest_memfd
> pages'). But I think that isn't really a point of disagreement these
> days, so hopefully we're aligned.
> 
> And to clarify some things I've also read in the thread, pKVM can
> handle the vast majority of faults caused by accesses to protected
> memory just fine. Userspace accesses protected guest memory? Fine,
> we'll SEGV the userspace process. The kernel accesses via uaccess
> macros? Also fine, we'll fail the syscall (or whatever it is we're
> doing) cleanly -- the whole extable machinery works OK, which also
> means that things like load_unaligned_zeropad() keep working as-is.
> The only thing pKVM does is re-inject the fault back into the kernel
> with some extra syndrome information it can figure out what to do by
> itself.
> 
> It's really only accesses via e.g. the linear map that are problematic,
> hence the exclusive GUP approach proposed in the series that tries to
> avoid that by construction. That has the benefit of leaving
> guest_memfd to other CC solutions that have more things in common. I
> think it's good for that discussion to happen, no matter what we end up
> doing in the end.

Thanks for the information. IMHO we really should try to find a common 
ground here, and FOLL_EXCLUSIVE is likely not it :)

Thanks for reviving this discussion with your patch set!

pKVM is interested in in-place conversion, I believe there are valid use 
cases for in-place conversion for TDX and friends as well (as discussed, 
I think that might be a clean way to get huge/gigantic page support in).

This implies the option to:

1) Have shared+private memory in guest_memfd
2) Be able to mmap shared parts
3) Be able to convert shared<->private in place

and later in my interest

4) Have huge/gigantic page support in guest_memfd with the option of
    converting individual subpages

We might not want to make use of that model for all of CC -- as you 
state, sometimes the destructive approach might be better performance 
wise -- but having that option doesn't sound crazy to me (and maybe 
would solve real issues as well).

After all, the common requirement here is that "private" pages are not 
mapped/pinned/accessible.

Sure, there might be cases like "pKVM can handle access to private pages 
in user page mappings", "AMD-SNP will not crash the host if writing to 
private pages" but there are not factors that really make a difference 
for a common solution.

private memory: not mapped, not pinned
shared memory: maybe mapped, maybe pinned
granularity of conversion: single pages

Anything I am missing?
Fuad Tabba June 21, 2024, 8:23 a.m. UTC | #40
Hi Sean,

On Thu, Jun 20, 2024 at 4:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jun 19, 2024, Fuad Tabba wrote:
> > Hi Jason,
> >
> > On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
> > >
> > > > To be honest, personally (speaking only for myself, not necessarily
> > > > for Elliot and not for anyone else in the pKVM team), I still would
> > > > prefer to use guest_memfd(). I think that having one solution for
> > > > confidential computing that rules them all would be best. But we do
> > > > need to be able to share memory in place, have a plan for supporting
> > > > huge pages in the near future, and migration in the not-too-distant
> > > > future.
> > >
> > > I think using a FD to control this special lifetime stuff is
> > > dramatically better than trying to force the MM to do it with struct
> > > page hacks.
> > >
> > > If you can't agree with the guest_memfd people on how to get there
> > > then maybe you need a guest_memfd2 for this slightly different special
> > > stuff instead of intruding on the core mm so much. (though that would
> > > be sad)
> > >
> > > We really need to be thinking more about containing these special
> > > things and not just sprinkling them everywhere.
> >
> > I agree that we need to agree :) This discussion has been going on
> > since before LPC last year, and the consensus from the guest_memfd()
> > folks (if I understood it correctly) is that guest_memfd() is what it
> > is: designed for a specific type of confidential computing, in the
> > style of TDX and CCA perhaps, and that it cannot (or will not) perform
> > the role of being a general solution for all confidential computing.
>
> That isn't remotely accurate.  I have stated multiple times that I want guest_memfd
> to be a vehicle for all VM types, i.e. not just CoCo VMs, and most definitely not
> just TDX/SNP/CCA VMs.

I think that there might have been a slight misunderstanding between
us. I just thought that that's what you meant by:

: And I'm saying say we should stand firm in what guest_memfd _won't_ support,
: e.g. swap/reclaim and probably page migration should get a hard "no".

https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com/

> What I am staunchly against is piling features onto guest_memfd that will cause
> it to eventually become virtually indistinguishable from any other file-based
> backing store.  I.e. while I want to make guest_memfd usable for all VM *types*,
> making guest_memfd the preferred backing store for all *VMs* and use cases is
> very much a non-goal.
>
> From an earlier conversation[1]:
>
>  : In other words, ditch the complexity for features that are well served by existing
>  : general purpose solutions, so that guest_memfd can take on a bit of complexity to
>  : serve use cases that are unique to KVM guests, without becoming an unmaintainble
>  : mess due to cross-products.
> > > > Also, since pin is already overloading the refcount, having the
> > > > exclusive pin there helps in ensuring atomic accesses and avoiding
> > > > races.
> > >
> > > Yeah, but every time someone does this and then links it to a uAPI it
> > > becomes utterly baked in concrete for the MM forever.
> >
> > I agree. But if we can't modify guest_memfd() to fit our needs (pKVM,
> > Gunyah), then we don't really have that many other options.
>
> What _are_ your needs?  There are multiple unanswered questions from our last
> conversation[2].  And by "needs" I don't mean "what changes do you want to make
> to guest_memfd?", I mean "what are the use cases, patterns, and scenarios that
> you want to support?".

I think Quentin's reply in this thread outlines what it is pKVM would
like to do, and why it's different from, e.g., TDX:
https://lore.kernel.org/all/ZnUsmFFslBWZxGIq@google.com/

To summarize, our requirements are the same as other CC
implementations, except that we don't want to pay a penalty for
operations that pKVM (and Gunyah) can do more efficiently than
encryption-based CC, e.g., in-place conversion of private -> shared.

Apart from that, we are happy to use an interface that can support our
needs, or at least that we can extend in the (near) future to do that.
Whether it's guest_memfd() or something else.

>  : What's "hypervisor-assisted page migration"?  More specifically, what's the
>  : mechanism that drives it?

I believe what Will specifically meant by this is that we can add
hypervisor support for migration in pKVM for the stage 2 page tables.

We don't have a detailed implementation for this yet, of course, since
there's no point yet until we know whether we're going with
guest_memfd(), or another alternative.

>  : Do you happen to have a list of exactly what you mean by "normal mm stuff"?  I
>  : am not at all opposed to supporting .mmap(), because long term I also want to
>  : use guest_memfd for non-CoCo VMs.  But I want to be very conservative with respect
>  : to what is allowed for guest_memfd.   E.g. host userspace can map guest_memfd,
>  : and do operations that are directly related to its mapping, but that's about it.
>
> That distinction matters, because as I have stated in that thread, I am not
> opposed to page migration itself:
>
>  : I am not opposed to page migration itself, what I am opposed to is adding deep
>  : integration with core MM to do some of the fancy/complex things that lead to page
>  : migration.

So it's not a "hard no"? :)

> I am generally aware of the core pKVM use cases, but I AFAIK I haven't seen a
> complete picture of everything you want to do, and _why_.
> E.g. if one of your requirements is that guest memory is managed by core-mm the
> same as all other memory in the system, then yeah, guest_memfd isn't for you.
> Integrating guest_memfd deeply into core-mm simply isn't realistic, at least not
> without *massive* changes to core-mm, as the whole point of guest_memfd is that
> it is guest-first memory, i.e. it is NOT memory that is managed by core-mm (primary
> MMU) and optionally mapped into KVM (secondary MMU).

It's not a requirement that guest memory is managed by the core-mm.
But, like we mentioned, support for in-place conversion from
shared->private, huge pages, and eventually migration is.

> Again from that thread, one of most important aspects guest_memfd is that VMAs
> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>
>  : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
>  : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
>  : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>
> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com

I wonder if it might be more productive to also discuss this in one of
the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.

Cheers,
/fuad
David Hildenbrand June 21, 2024, 8:43 a.m. UTC | #41
>> Again from that thread, one of most important aspects guest_memfd is that VMAs
>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>>
>>   : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
>>   : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
>>   : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>>
>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> 
> I wonder if it might be more productive to also discuss this in one of
> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.

I don't know in  which context you usually discuss that, but I could 
propose that as a topic in the bi-weekly MM meeting.

This would, of course, be focused on the bigger MM picture: how to mmap, 
how to support huge pages, interaction with page pinning, ... So
obviously more MM focused once we are in agreement that we want to 
support shared memory in guest_memfd and how to make that work with core-mm.

Discussing if we want shared memory in guest_memfd might be better
suited for a different, more CC/KVM specific meeting (likely the "PUCKs" 
mentioned here?).
Fuad Tabba June 21, 2024, 8:54 a.m. UTC | #42
Hi David,

On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Again from that thread, one of most important aspects guest_memfd is that VMAs
> >> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> >> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> >>
> >>   : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> >>   : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> >>   : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> >>
> >> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> >> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> >
> > I wonder if it might be more productive to also discuss this in one of
> > the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
>
> I don't know in  which context you usually discuss that, but I could
> propose that as a topic in the bi-weekly MM meeting.
>
> This would, of course, be focused on the bigger MM picture: how to mmap,
> how how to support huge pages, interaction with page pinning, ... So
> obviously more MM focused once we are in agreement that we want to
> support shared memory in guest_memfd and how to make that work with core-mm.
>
> Discussing if we want shared memory in guest_memfd might be betetr
> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> mentioned here?).

Sorry, I should have given more context on what a PUCK* is :) It's a
periodic (almost weekly) upstream call for KVM.

[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/

But yes, having a discussion in one of the mm meetings ahead of LPC
would also be great. When do these meetings usually take place, so we can try
to coordinate across timezones?

Cheers,
/fuad

> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand June 21, 2024, 9:10 a.m. UTC | #43
On 21.06.24 10:54, Fuad Tabba wrote:
> Hi David,
> 
> On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
>>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
>>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>>>>
>>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
>>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
>>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>>>>
>>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
>>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
>>>
>>> I wonder if it might be more productive to also discuss this in one of
>>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
>>
>> I don't know in  which context you usually discuss that, but I could
>> propose that as a topic in the bi-weekly MM meeting.
>>
>> This would, of course, be focused on the bigger MM picture: how to mmap,
>> how how to support huge pages, interaction with page pinning, ... So
>> obviously more MM focused once we are in agreement that we want to
>> support shared memory in guest_memfd and how to make that work with core-mm.
>>
>> Discussing if we want shared memory in guest_memfd might be betetr
>> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
>> mentioned here?).
> 
> Sorry, I should have given more context on what a PUCK* is :) It's a
> periodic (almost weekly) upstream call for KVM.
> 
> [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> 
> But yes, having a discussion in one of the mm meetings ahead of LPC
> would also be great. When do these meetings usually take place, to try
> to coordinate across timezones.

It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.

If we're in agreement, we could (assuming there are no other planned 
topics) either use the slot next week (June 26) or the following one 
(July 10).

Selfish as I am, I would prefer July 10, because I'll be on vacation 
next week and there would be little time to prepare.

@David R., heads up that this might become a topic ("shared and private 
memory in guest_memfd: mmap, pinning and huge pages"), if people here 
agree that this is a direction worth heading in.
Quentin Perret June 21, 2024, 9:25 a.m. UTC | #44
On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
> Thanks for the information. IMHO we really should try to find a common
> ground here, and FOLL_EXCLUSIVE is likely not it :)

That's OK, IMO at least :-).

> Thanks for reviving this discussion with your patch set!
> 
> pKVM is interested in in-place conversion, I believe there are valid use
> cases for in-place conversion for TDX and friends as well (as discussed, I
> think that might be a clean way to get huge/gigantic page support in).
> 
> This implies the option to:
> 
> 1) Have shared+private memory in guest_memfd
> 2) Be able to mmap shared parts
> 3) Be able to convert shared<->private in place
> 
> and later in my interest
> 
> 4) Have huge/gigantic page support in guest_memfd with the option of
>    converting individual subpages
> 
> We might not want to make use of that model for all of CC -- as you state,
> sometimes the destructive approach might be better performance wise -- but
> having that option doesn't sound crazy to me (and maybe would solve real
> issues as well).

Cool.

> After all, the common requirement here is that "private" pages are not
> mapped/pinned/accessible.
> 
> Sure, there might be cases like "pKVM can handle access to private pages in
> user page mappings", "AMD-SNP will not crash the host if writing to private
> pages" but there are not factors that really make a difference for a common
> solution.

Sure, there isn't much value in differentiating on these things. One
might argue that we could save one mmap() on the private->shared
conversion path by keeping all of guest_memfd mapped in userspace
including private memory, but that's most probably not worth the
effort of re-designing the whole thing just for that, so let's forget
that.

The ability to handle stage-2 faults in the kernel has implications in
other places however. It means we don't need to punch holes in the
kernel linear map when donating memory to a guest for example, even with
'crazy' access patterns like load_unaligned_zeropad(). So that's good.

> private memory: not mapped, not pinned
> shared memory: maybe mapped, maybe pinned
> granularity of conversion: single pages
> 
> Anything I am missing?

That looks good to me. And as discussed in previous threads, we have the
ambition of getting page-migration to work, including for private memory,
mostly to get kcompactd to work better when pVMs are running. Android
makes extensive use of compaction, and pVMs currently stick out like a
sore thumb.

We can trivially implement a hypercall to have pKVM swap a private
page with another without the guest having to know. The difficulty is
obviously to hook that in Linux, and I've personally not looked into it
properly, so that is clearly longer term. We don't want to take anybody
by surprise if there is a need for some added complexity in guest_memfd
to support this use-case though. I don't expect folks on the receiving
end of that to agree to it blindly without knowing _what_ this
complexity is FWIW. But at least our intentions are clear :-)

Thanks,
Quentin
David Hildenbrand June 21, 2024, 9:37 a.m. UTC | #45
On 21.06.24 11:25, Quentin Perret wrote:
> On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
>> Thanks for the information. IMHO we really should try to find a common
>> ground here, and FOLL_EXCLUSIVE is likely not it :)
> 
> That's OK, IMO at least :-).
> 
>> Thanks for reviving this discussion with your patch set!
>>
>> pKVM is interested in in-place conversion, I believe there are valid use
>> cases for in-place conversion for TDX and friends as well (as discussed, I
>> think that might be a clean way to get huge/gigantic page support in).
>>
>> This implies the option to:
>>
>> 1) Have shared+private memory in guest_memfd
>> 2) Be able to mmap shared parts
>> 3) Be able to convert shared<->private in place
>>
>> and later in my interest
>>
>> 4) Have huge/gigantic page support in guest_memfd with the option of
>>     converting individual subpages
>>
>> We might not want to make use of that model for all of CC -- as you state,
>> sometimes the destructive approach might be better performance wise -- but
>> having that option doesn't sound crazy to me (and maybe would solve real
>> issues as well).
> 
> Cool.
> 
>> After all, the common requirement here is that "private" pages are not
>> mapped/pinned/accessible.
>>
>> Sure, there might be cases like "pKVM can handle access to private pages in
>> user page mappings", "AMD-SNP will not crash the host if writing to private
>> pages" but there are not factors that really make a difference for a common
>> solution.
> 
> Sure, there isn't much value in differentiating on these things. One
> might argue that we could save one mmap() on the private->shared
> conversion path by keeping all of guest_memfd mapped in userspace
> including private memory, but that's most probably not worth the
> effort of re-designing the whole thing just for that, so let's forget
> that.

In a world where we can mmap() the whole (sparse "shared") thing and
dynamically map/unmap only the shared parts, it would save a page
fault on private->shared conversion, correct.

But that sounds more like a CC-specific optimization for frequent
conversions, which we should just ignore initially.

> 
> The ability to handle stage-2 faults in the kernel has implications in
> other places however. It means we don't need to punch holes in the
> kernel linear map when donating memory to a guest for example, even with
> 'crazy' access patterns like load_unaligned_zeropad(). So that's good.
> 
>> private memory: not mapped, not pinned
>> shared memory: maybe mapped, maybe pinned
>> granularity of conversion: single pages
>>
>> Anything I am missing?
> 
> That looks good to me. And as discussed in previous threads, we have the
> ambition of getting page-migration to work, including for private memory,
> mostly to get kcompactd to work better when pVMs are running. Android
> makes extensive use of compaction, and pVMs currently stick out like a
> sore thumb.

Yes, I think migration for compaction has to be supported at some point 
(at least for small pages that can be either private or shared, not a 
mixture), and I suspect we should be able to integrate it with core-mm 
in a not-too-horrible fashion. For example, we do have a non-lru page 
migration infrastructure in place already if the LRU-based one is not a 
good fit.
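
For context, the non-LRU migration hooks mentioned above look roughly
like the sketch below; the gmem_* names are made up and exact signatures
vary between kernel versions, so treat this as a shape rather than an
implementation:

/*
 * Sketch only: placeholder bodies, not a real implementation.
 */
#include <linux/migrate.h>

static bool gmem_isolate_page(struct page *page, isolate_mode_t mode)
{
        /* Pin owner state so the page cannot be freed concurrently. */
        return true;
}

static int gmem_migrate_page(struct page *dst, struct page *src,
                             enum migrate_mode mode)
{
        /*
         * E.g. ask the hypervisor to swap the backing page, then update
         * owner metadata to point at dst.
         */
        return 0; /* MIGRATEPAGE_SUCCESS */
}

static void gmem_putback_page(struct page *page)
{
        /* Undo the isolation. */
}

static const struct movable_operations gmem_movable_ops = {
        .isolate_page   = gmem_isolate_page,
        .migrate_page   = gmem_migrate_page,
        .putback_page   = gmem_putback_page,
};

/* A page would be opted in with something like:
 *      __SetPageMovable(page, &gmem_movable_ops);
 */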

Memory swapping and all other currently-strictly LRU-based mechanisms 
should be out of scope for now: as Sean says, we don't want to go down 
that path.

> 
> We can trivially implement a hypercall to have pKVM swap a private
> page with another without the guest having to know. The difficulty is
> obviously to hook that in Linux, and I've personally not looked into it
> properly, so that is clearly longer term. We don't want to take anybody
> by surprise if there is a need for some added complexity in guest_memfd
> to support this use-case though. I don't expect folks on the receiving
> end of that to agree to it blindly without knowing _what_ this
> complexity is FWIW. But at least our intentions are clear :-)

Agreed.
Fuad Tabba June 21, 2024, 10:16 a.m. UTC | #46
Hi David,

On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.06.24 10:54, Fuad Tabba wrote:
> > Hi David,
> >
> > On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
> >>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> >>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> >>>>
> >>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> >>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> >>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> >>>>
> >>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> >>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> >>>
> >>> I wonder if it might be more productive to also discuss this in one of
> >>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
> >>
> >> I don't know in  which context you usually discuss that, but I could
> >> propose that as a topic in the bi-weekly MM meeting.
> >>
> >> This would, of course, be focused on the bigger MM picture: how to mmap,
> >> how how to support huge pages, interaction with page pinning, ... So
> >> obviously more MM focused once we are in agreement that we want to
> >> support shared memory in guest_memfd and how to make that work with core-mm.
> >>
> >> Discussing if we want shared memory in guest_memfd might be betetr
> >> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> >> mentioned here?).
> >
> > Sorry, I should have given more context on what a PUCK* is :) It's a
> > periodic (almost weekly) upstream call for KVM.
> >
> > [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> >
> > But yes, having a discussion in one of the mm meetings ahead of LPC
> > would also be great. When do these meetings usually take place, to try
> > to coordinate across timezones.
>
> It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
>
> If we're in agreement, we could (assuming there are no other planned
> topics) either use the slot next week (June 26) or the following one
> (July 10).
>
> Selfish as I am, I would prefer July 10, because I'll be on vacation
> next week and there would be little time to prepare.
>
> @David R., heads up that this might become a topic ("shared and private
> memory in guest_memfd: mmap, pinning and huge pages"), if people here
> agree that this is a direction worth heading.

Thanks for the invite! Tentatively July 10th works for me, but I'd
like to talk to the others who might be interested (pKVM, Gunyah, and
others) to see if that works for them. I'll get back to you shortly.

Cheers,
/fuad

> --
> Cheers,
>
> David / dhildenb
>
Jason Gunthorpe June 21, 2024, 12:26 p.m. UTC | #47
On Fri, Jun 21, 2024 at 07:32:40AM +0000, Quentin Perret wrote:
> > No, I'm interested in what pKVM is doing that needs this to be so much
> > different than the CC case..
> 
> The underlying technology for implementing CC is obviously very
> different (MMU-based for pKVM, encryption-based for the others + some
> extra bits but let's keep it simple). In-place conversion is inherently
> painful with encryption-based schemes, so it's not a surprise the
> approach taken in these cases is built around destructive conversions as
> a core construct.

I'm not sure I fully agree with this. CC can do non-destructive
conversions too (though the proprietary secure worlds may choose not to
implement them). Even implementations like ARM's CC are much closer to
how pKVM works, with no encryption and just page table updates.

The only question that matters at all is how fast the private->shared
conversion is. Is it fast enough that it can be used on the IO path
instead of swiotlb?

TBH I'm willing to believe numbers showing that pKVM is fast enough,
but I would like to see them before we consider major changes to the
kernel :)
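
For the x86 CoCo flow, the conversion in question is driven from
userspace via KVM_SET_MEMORY_ATTRIBUTES, so a first-order measurement
could be as simple as the sketch below. It assumes a kernel and VM type
that support guest_memfd private memory, elides all VM/guest_memfd
setup, and only times the host-side attribute flip rather than the full
guest-visible conversion cost:

/*
 * Hedged sketch: time the host-side shared->private flip. vm_fd, gpa
 * and size are assumed to come from setup that is elided here.
 */
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>

static void time_to_private(int vm_fd, uint64_t gpa, uint64_t size)
{
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
                perror("KVM_SET_MEMORY_ATTRIBUTES");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("conversion took %lld ns\n",
               (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
               (t1.tv_nsec - t0.tv_nsec));
}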

> I'm not at all against starting with something simple and bouncing via
> swiotlb, that is totally fine. What is _not_ fine however would be to
> bake into the userspace API that conversions are not in-place and
> destructive (which in my mind equates to 'you can't mmap guest_memfd
> pages'). But I think that isn't really a point of disagreement these
> days, so hopefully we're aligned.

IMHO CC and pKVM should align here and provide a way for optional
non-destructive private->shared conversion.

> It's really only accesses via e.g. the linear map that are problematic,
> hence the exclusive GUP approach proposed in the series that tries to
> avoid that by construction. 

I think, as others have said, this is just too weird. Memory that is
inaccessible and always faults when the kernel touches it doesn't make
any sense. It shouldn't be mapped into VMAs.

If you really, really want to do this, then use your own FD and a PFN
map. Copy-to-user will still work fine and you don't need to disrupt
the mm.
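
For reference, "your own FD and a PFN map" amounts to a file whose mmap
handler installs raw PFNs instead of struct-page-backed memory. A
minimal sketch, with made-up gmem_pfnmap_* names and a hypothetical
base_pfn:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Hypothetical: base_pfn would come from whatever pool backs the guest. */
static unsigned long base_pfn;

static int gmem_pfnmap_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        /*
         * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP, so there is
         * no struct page behind the mapping: no GUP, no rmap, but plain
         * copy_to_user()/copy_from_user() through the user address works.
         */
        return remap_pfn_range(vma, vma->vm_start, base_pfn + vma->vm_pgoff,
                               size, vma->vm_page_prot);
}

static const struct file_operations gmem_pfnmap_fops = {
        .owner  = THIS_MODULE,
        .mmap   = gmem_pfnmap_mmap,
};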

Jason
Jason Gunthorpe June 21, 2024, 12:39 p.m. UTC | #48
On Thu, Jun 20, 2024 at 04:54:00PM -0700, Sean Christopherson wrote:

> Heh, and then we'd end up turning memfd into guest_memfd.  As I see it, being
> able to safely map TDX/SNP/pKVM private memory is a happy side effect that is
> possible because guest_memfd isn't subordinate to the primary MMU, but private
> memory isn't the core idenity of guest_memfd.

IMHO guest memfd still has a very bright line between it and normal
memfd.

guest memfd is holding all the memory and making it unmovable because
it has donated it to some secure world. Unmovable means the mm can't do
anything with it in normal ways. For things like David's 'b', where we
fragment the pages, it also requires guest memfd to act as an allocator
and completely own the PFNs, including handling free callbacks like
ZONE_DEVICE does.
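
The ZONE_DEVICE-style free callback mentioned above is the
dev_pagemap_ops hook; a heavily abridged sketch, with made-up gmem_*
names and the pagemap range/owner setup elided:

#include <linux/memremap.h>

/*
 * Sketch only: an owner that fully controls its PFNs is told when the
 * last reference to one of its pages goes away and can recycle it into
 * a private allocator.
 */
static void gmem_page_free(struct page *page)
{
        /* Return the page to the owner's free list. */
}

static const struct dev_pagemap_ops gmem_pgmap_ops = {
        .page_free = gmem_page_free,
};

static struct dev_pagemap gmem_pgmap = {
        .type = MEMORY_DEVICE_PRIVATE,  /* one of several possible types */
        .ops  = &gmem_pgmap_ops,
        /* .range/.nr_range/.owner setup elided; registered via memremap_pages() */
};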

memfd, on the other hand, should always be normally allocated, movable
kernel memory with full normal folios, and it shouldn't act as an
allocator.

Teaching memfd to hold a huge folio is probably going to be a
different approach than teaching guest memfd; I suspect "a" would be a
more suitable choice there. You give up the KVM-side contiguity, but
get full mm integration of the memory.

The user gets to choose which is more important.

It is not that different than today where VMMs are using hugetlbfs to
get unmovable memory.

> We could do a subset of those for memfd, but I don't see the point, assuming we
> allow mmap() on shared guest_memfd memory.  Solving mmap() for VMs that do
> private<=>shared conversions is the hard problem to solve.  Once that's done,
> we'll get support for regular VMs along with the other benefits of guest_memfd
> for free (or very close to free).

Yes, but I get the feeling that even in the best case for guest memfd
you still end up with non-movable memory and fewer mm features
available.

If we do movability in a guest memfd space, it would have to be with
some op callback to move the memory via the secure world, and guest
memfd would still be pinning all the memory. That's quite a different
flow from what memfd should do.

There may still be merit in teaching memfd how to do huge pages too,
though I don't really know.

Jason
Elliot Berman June 21, 2024, 4:48 p.m. UTC | #49
On Fri, Jun 21, 2024 at 09:25:10AM +0000, Quentin Perret wrote:
> On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
> > Sure, there might be cases like "pKVM can handle access to private pages in
> > user page mappings", "AMD-SNP will not crash the host if writing to private
> > pages" but there are not factors that really make a difference for a common
> > solution.
> 
> Sure, there isn't much value in differentiating on these things. One
> might argue that we could save one mmap() on the private->shared
> conversion path by keeping all of guest_memfd mapped in userspace
> including private memory, but that's most probably not worth the
> effort of re-designing the whole thing just for that, so let's forget
> that.
> 
> The ability to handle stage-2 faults in the kernel has implications in
> other places however. It means we don't need to punch holes in the
> kernel linear map when donating memory to a guest for example, even with
> 'crazy' access patterns like load_unaligned_zeropad(). So that's good.
> 

The ability to handle stage-2 faults in the kernel is something that's
specific to arm64 pKVM though. We do want to punch holes in the linear
map for the Gunyah case. I don't think this is a blocking issue; I only
want to point out that we can't totally ignore the linear map.
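
For reference, punching and restoring linear-map holes is already done
elsewhere in the kernel (mm/secretmem.c, for example); a rough sketch of
the primitives involved, with made-up gunyah_* names and error handling
trimmed:

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/*
 * Rough sketch in the spirit of what mm/secretmem.c does today; arch
 * details and error handling are trimmed.
 */
static int gunyah_unmap_from_linear_map(struct page *page)
{
        unsigned long addr = (unsigned long)page_address(page);
        int err = set_direct_map_invalid_noflush(page);

        if (!err)
                flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return err;
}

static void gunyah_remap_into_linear_map(struct page *page)
{
        set_direct_map_default_noflush(page);
}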

Thanks,
Elliot
Elliot Berman June 21, 2024, 4:54 p.m. UTC | #50
On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
> Hi David,
> 
> On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 21.06.24 10:54, Fuad Tabba wrote:
> > > Hi David,
> > >
> > > On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
> > >>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> > >>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> > >>>>
> > >>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> > >>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> > >>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> > >>>>
> > >>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> > >>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> > >>>
> > >>> I wonder if it might be more productive to also discuss this in one of
> > >>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
> > >>
> > >> I don't know in  which context you usually discuss that, but I could
> > >> propose that as a topic in the bi-weekly MM meeting.
> > >>
> > >> This would, of course, be focused on the bigger MM picture: how to mmap,
> > >> how how to support huge pages, interaction with page pinning, ... So
> > >> obviously more MM focused once we are in agreement that we want to
> > >> support shared memory in guest_memfd and how to make that work with core-mm.
> > >>
> > >> Discussing if we want shared memory in guest_memfd might be betetr
> > >> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> > >> mentioned here?).
> > >
> > > Sorry, I should have given more context on what a PUCK* is :) It's a
> > > periodic (almost weekly) upstream call for KVM.
> > >
> > > [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> > >
> > > But yes, having a discussion in one of the mm meetings ahead of LPC
> > > would also be great. When do these meetings usually take place, to try
> > > to coordinate across timezones.
> >
> > It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
> >
> > If we're in agreement, we could (assuming there are no other planned
> > topics) either use the slot next week (June 26) or the following one
> > (July 10).
> >
> > Selfish as I am, I would prefer July 10, because I'll be on vacation
> > next week and there would be little time to prepare.
> >
> > @David R., heads up that this might become a topic ("shared and private
> > memory in guest_memfd: mmap, pinning and huge pages"), if people here
> > agree that this is a direction worth heading.
> 
> Thanks for the invite! Tentatively July 10th works for me, but I'd
> like to talk to the others who might be interested (pKVM, Gunyah, and
> others) to see if that works for them. I'll get back to you shortly.
> 

I'd like to join too; July 10th at that time works for me.

- Elliot
Sean Christopherson June 24, 2024, 7:03 p.m. UTC | #51
On Fri, Jun 21, 2024, Elliot Berman wrote:
> On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
> > On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand <david@redhat.com> wrote:
> > > On 21.06.24 10:54, Fuad Tabba wrote:
> > > > On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
> > > >>
> > > >>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
> > > >>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> > > >>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> > > >>>>
> > > >>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> > > >>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> > > >>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> > > >>>>
> > > >>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> > > >>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> > > >>>
> > > >>> I wonder if it might be more productive to also discuss this in one of
> > > >>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
> > > >>
> > > >> I don't know in  which context you usually discuss that, but I could
> > > >> propose that as a topic in the bi-weekly MM meeting.
> > > >>
> > > >> This would, of course, be focused on the bigger MM picture: how to mmap,
> > > >> how how to support huge pages, interaction with page pinning, ... So
> > > >> obviously more MM focused once we are in agreement that we want to
> > > >> support shared memory in guest_memfd and how to make that work with core-mm.
> > > >>
> > > >> Discussing if we want shared memory in guest_memfd might be betetr
> > > >> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> > > >> mentioned here?).
> > > >
> > > > Sorry, I should have given more context on what a PUCK* is :) It's a
> > > > periodic (almost weekly) upstream call for KVM.
> > > >
> > > > [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> > > >
> > > > But yes, having a discussion in one of the mm meetings ahead of LPC
> > > > would also be great. When do these meetings usually take place, to try
> > > > to coordinate across timezones.

Let's do the MM meeting. As evidenced by the responses, it'll be easier to get
KVM folks to join the MM meeting than the other way around.

> > > It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
> > >
> > > If we're in agreement, we could (assuming there are no other planned
> > > topics) either use the slot next week (June 26) or the following one
> > > (July 10).
> > >
> > > Selfish as I am, I would prefer July 10, because I'll be on vacation
> > > next week and there would be little time to prepare.
> > >
> > > @David R., heads up that this might become a topic ("shared and private
> > > memory in guest_memfd: mmap, pinning and huge pages"), if people here
> > > agree that this is a direction worth heading.
> > 
> > Thanks for the invite! Tentatively July 10th works for me, but I'd
> > like to talk to the others who might be interested (pKVM, Gunyah, and
> > others) to see if that works for them. I'll get back to you shortly.
> > 
> 
> I'd like to join too, July 10th at that time works for me.

July 10th works for me too.
David Rientjes June 24, 2024, 9:50 p.m. UTC | #52
On Mon, 24 Jun 2024, Sean Christopherson wrote:

> On Fri, Jun 21, 2024, Elliot Berman wrote:
> > On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
> > > On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand <david@redhat.com> wrote:
> > > > On 21.06.24 10:54, Fuad Tabba wrote:
> > > > > On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
> > > > >>
> > > > >>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
> > > > >>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> > > > >>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> > > > >>>>
> > > > >>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> > > > >>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> > > > >>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> > > > >>>>
> > > > >>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> > > > >>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> > > > >>>
> > > > >>> I wonder if it might be more productive to also discuss this in one of
> > > > >>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
> > > > >>
> > > > >> I don't know in  which context you usually discuss that, but I could
> > > > >> propose that as a topic in the bi-weekly MM meeting.
> > > > >>
> > > > >> This would, of course, be focused on the bigger MM picture: how to mmap,
> > > > >> how how to support huge pages, interaction with page pinning, ... So
> > > > >> obviously more MM focused once we are in agreement that we want to
> > > > >> support shared memory in guest_memfd and how to make that work with core-mm.
> > > > >>
> > > > >> Discussing if we want shared memory in guest_memfd might be betetr
> > > > >> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> > > > >> mentioned here?).
> > > > >
> > > > > Sorry, I should have given more context on what a PUCK* is :) It's a
> > > > > periodic (almost weekly) upstream call for KVM.
> > > > >
> > > > > [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> > > > >
> > > > > But yes, having a discussion in one of the mm meetings ahead of LPC
> > > > > would also be great. When do these meetings usually take place, to try
> > > > > to coordinate across timezones.
> 
> Let's do the MM meeting.  As evidenced by the responses, it'll be easier to get
> KVM folks to join the MM meeting as opposed to other way around.
> 
> > > > It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
> > > >
> > > > If we're in agreement, we could (assuming there are no other planned
> > > > topics) either use the slot next week (June 26) or the following one
> > > > (July 10).
> > > >
> > > > Selfish as I am, I would prefer July 10, because I'll be on vacation
> > > > next week and there would be little time to prepare.
> > > >
> > > > @David R., heads up that this might become a topic ("shared and private
> > > > memory in guest_memfd: mmap, pinning and huge pages"), if people here
> > > > agree that this is a direction worth heading.
> > > 
> > > Thanks for the invite! Tentatively July 10th works for me, but I'd
> > > like to talk to the others who might be interested (pKVM, Gunyah, and
> > > others) to see if that works for them. I'll get back to you shortly.
> > > 
> > 
> > I'd like to join too, July 10th at that time works for me.
> 
> July 10th works for me too.
> 

Thanks all, and David H for the topic suggestion. Let's tentatively
pencil this in for the Wednesday, July 10th instance at 9am PDT, and I'll
follow up offlist with those who will be needed to lead the discussion to
make sure we're on track.
Vishal Annapurve June 26, 2024, 3:19 a.m. UTC | #53
On Mon, Jun 24, 2024 at 2:50 PM David Rientjes <rientjes@google.com> wrote:
>
> On Mon, 24 Jun 2024, Sean Christopherson wrote:
>
> > On Fri, Jun 21, 2024, Elliot Berman wrote:
> > > On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
> > > > On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand <david@redhat.com> wrote:
> > > > > On 21.06.24 10:54, Fuad Tabba wrote:
> > > > > > On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand <david@redhat.com> wrote:
> > > > > >>
> > > > > >>>> Again from that thread, one of most important aspects guest_memfd is that VMAs
> > > > > >>>> are not required.  Stating the obvious, lack of VMAs makes it really hard to drive
> > > > > >>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
> > > > > >>>>
> > > > > >>>>    : More broadly, no VMAs are required.  The lack of stage-1 page tables are nice to
> > > > > >>>>    : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> > > > > >>>>    : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
> > > > > >>>>
> > > > > >>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> > > > > >>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
> > > > > >>>
> > > > > >>> I wonder if it might be more productive to also discuss this in one of
> > > > > >>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
> > > > > >>
> > > > > >> I don't know in  which context you usually discuss that, but I could
> > > > > >> propose that as a topic in the bi-weekly MM meeting.
> > > > > >>
> > > > > >> This would, of course, be focused on the bigger MM picture: how to mmap,
> > > > > >> how how to support huge pages, interaction with page pinning, ... So
> > > > > >> obviously more MM focused once we are in agreement that we want to
> > > > > >> support shared memory in guest_memfd and how to make that work with core-mm.
> > > > > >>
> > > > > >> Discussing if we want shared memory in guest_memfd might be betetr
> > > > > >> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> > > > > >> mentioned here?).
> > > > > >
> > > > > > Sorry, I should have given more context on what a PUCK* is :) It's a
> > > > > > periodic (almost weekly) upstream call for KVM.
> > > > > >
> > > > > > [*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
> > > > > >
> > > > > > But yes, having a discussion in one of the mm meetings ahead of LPC
> > > > > > would also be great. When do these meetings usually take place, to try
> > > > > > to coordinate across timezones.
> >
> > Let's do the MM meeting.  As evidenced by the responses, it'll be easier to get
> > KVM folks to join the MM meeting as opposed to other way around.
> >
> > > > > It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
> > > > >
> > > > > If we're in agreement, we could (assuming there are no other planned
> > > > > topics) either use the slot next week (June 26) or the following one
> > > > > (July 10).
> > > > >
> > > > > Selfish as I am, I would prefer July 10, because I'll be on vacation
> > > > > next week and there would be little time to prepare.
> > > > >
> > > > > @David R., heads up that this might become a topic ("shared and private
> > > > > memory in guest_memfd: mmap, pinning and huge pages"), if people here
> > > > > agree that this is a direction worth heading.
> > > >
> > > > Thanks for the invite! Tentatively July 10th works for me, but I'd
> > > > like to talk to the others who might be interested (pKVM, Gunyah, and
> > > > others) to see if that works for them. I'll get back to you shortly.
> > > >
> > >
> > > I'd like to join too, July 10th at that time works for me.
> >
> > July 10th works for me too.
> >
>
> Thanks all, and David H for the topic suggestion.  Let's tentatively
> pencil this in for the Wednesday, July 10th instance at 9am PDT and I'll
> follow-up offlist with those will be needed to lead the discussion to make
> sure we're on track.

I would like to join the call too.

Regards,
Vishal
Pankaj Gupta June 26, 2024, 5:20 a.m. UTC | #54
I am also interested and would like to join the discussion.

Best regards,
Pankaj Gupta
