
[RFC,8/8] kvm: gmem: Allow restricted userspace mappings

Message ID 20240709132041.3625501-9-roypat@amazon.co.uk (mailing list archive)
State New, archived
Series Unmapping guest_memfd from Direct Map

Commit Message

Patrick Roy July 9, 2024, 1:20 p.m. UTC
Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
on the underlying address_space struct, no GUP of guest_memfd will be
possible.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 virt/kvm/guest_memfd.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

Comments

Fuad Tabba July 9, 2024, 2:48 p.m. UTC | #1
Hi Patrick,

On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
> on the underlying address_space struct, no GUP of guest_memfd will be
> possible.

This patch allows mapping guest_memfd() unconditionally. Even if it's
not guppable, there are other reasons why you wouldn't want to allow
this. Maybe a config flag to gate it? e.g.,

https://lore.kernel.org/all/20240222161047.402609-4-tabba@google.com/
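
(For illustration, a minimal sketch of the kind of gate being suggested,
using a made-up CONFIG_KVM_GMEM_MAPPABLE symbol; the name and mechanism
are placeholders, not taken from the linked series:)

static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Hypothetical gate; CONFIG_KVM_GMEM_MAPPABLE is a placeholder. */
	if (!IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE))
		return -ENODEV;

	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
		return -EINVAL;

	vm_flags_set(vma, VM_DONTDUMP);
	vma->vm_ops = &kvm_gmem_vm_ops;

	return 0;
}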

>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>  virt/kvm/guest_memfd.c | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index dc9b0c2d0b0e..101ec2b248bf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -319,7 +319,37 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
>         return get_file_active(&slot->gmem.file);
>  }
>
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +       struct folio *folio;
> +
> +       folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff, true);
> +
> +       if (!folio)
> +               return VM_FAULT_SIGBUS;
> +
> +       vmf->page = folio_file_page(folio, vmf->pgoff);
> +
> +       return VM_FAULT_LOCKED;
> +}
> +
> +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> +       .fault = kvm_gmem_fault
> +};
> +
> +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> +               return -EINVAL;
> +
> +       vm_flags_set(vma, VM_DONTDUMP);
> +       vma->vm_ops = &kvm_gmem_vm_ops;
> +
> +       return 0;
> +}
> +
>  static struct file_operations kvm_gmem_fops = {
> +       .mmap           = kvm_gmem_mmap,
>         .open           = generic_file_open,
>         .release        = kvm_gmem_release,
>         .fallocate      = kvm_gmem_fallocate,
> @@ -594,7 +624,6 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>                 return -EFAULT;
>         }
>
> -       gmem = file->private_data;

Is this intentional?

Cheers,
/fuad

>         if (xa_load(&gmem->bindings, index) != slot) {
>                 WARN_ON_ONCE(xa_load(&gmem->bindings, index));
>                 return -EIO;
> --
> 2.45.2
>
David Hildenbrand July 9, 2024, 9:13 p.m. UTC | #2
On 09.07.24 16:48, Fuad Tabba wrote:
> Hi Patrick,
> 
> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>
>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>> on the underlying address_space struct, no GUP of guest_memfd will be
>> possible.
> 
> This patch allows mapping guest_memfd() unconditionally. Even if it's
> not guppable, there are other reasons why you wouldn't want to allow
> this. Maybe a config flag to gate it? e.g.,


As discussed with Jason, maybe not the direction we want to take with 
guest_memfd.
If it's private memory, it shall not be mapped. Also not via magic 
config options.

We'll likely discuss some of that in the MM meeting tomorrow, I guess 
(having both shared and private memory in guest_memfd).

Note that just from staring at this commit, I don't understand the 
motivation *why* we would want to do that.
Patrick Roy July 10, 2024, 9:51 a.m. UTC | #3
On 7/9/24 22:13, David Hildenbrand wrote:
> On 09.07.24 16:48, Fuad Tabba wrote:
>> Hi Patrick,
>>
>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>
>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>> possible.
>>
>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>> not guppable, there are other reasons why you wouldn't want to allow
>> this. Maybe a config flag to gate it? e.g.,
> 
> 
> As discussed with Jason, maybe not the direction we want to take with
> guest_memfd.
> If it's private memory, it shall not be mapped. Also not via magic
> config options.
> 
> We'll likely discuss some of that in the meeting MM tomorrow I guess
> (having both shared and private memory in guest_memfd).

Oh, nice. I'm assuming you mean this meeting:
https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
Would it be okay if I also attend? I see it also mentions huge pages,
which is another thing we are interested in, actually :)

> Note that just from staring at this commit, I don't understand the
> motivation *why* we would want to do that.

Fair - I admittedly didn't get into that as much as I probably should
have. In our use case, we do not have anything that pKVM would (I think)
call "guest-private" memory. I think our memory is better described
as guest-owned: always shared with the VMM (i.e. userspace), but
ideally never shared with the host kernel. This model lets us make a lot
of simplifying assumptions: things like I/O can be handled in userspace
without the guest explicitly sharing I/O buffers (which is not exactly
what we would want long-term anyway, as sharing in the guest_memfd
context means sharing with the host kernel), we can easily do VM
snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.

> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand July 10, 2024, 9:12 p.m. UTC | #4
On 10.07.24 11:51, Patrick Roy wrote:
> 
> 
> On 7/9/24 22:13, David Hildenbrand wrote:
>> On 09.07.24 16:48, Fuad Tabba wrote:
>>> Hi Patrick,
>>>
>>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>>
>>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>>> possible.
>>>
>>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>>> not guppable, there are other reasons why you wouldn't want to allow
>>> this. Maybe a config flag to gate it? e.g.,
>>
>>
>> As discussed with Jason, maybe not the direction we want to take with
>> guest_memfd.
>> If it's private memory, it shall not be mapped. Also not via magic
>> config options.
>>
>> We'll likely discuss some of that in the meeting MM tomorrow I guess
>> (having both shared and private memory in guest_memfd).
> 
> Oh, nice. I'm assuming you mean this meeting:
> https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
> Would it be okay if I also attend? I see it also mentions huge pages,
> which is another thing we are interested in, actually :)

Hi,

sorry for the late reply. Yes, you could have joined ... too late. 
There will be a summary posted soon. So far the agreement is that we're 
planning on allowing shared memory as part of guest_memfd, and will allow 
that to get mapped and pinned. Private memory is not going to get mapped 
and pinned.

If we have to disallow pinning of shared memory on top for some use 
cases (i.e., no directmap), I assume that could be added.

> 
>> Note that just from staring at this commit, I don't understand the
>> motivation *why* we would want to do that.
> 
> Fair - I admittedly didn't get into that as much as I probably should
> have. In our usecase, we do not have anything that pKVM would (I think)
> call "guest-private" memory. I think our memory can be better described
> as guest-owned, but always shared with the VMM (e.g. userspace), but
> ideally never shared with the host kernel. This model lets us do a lot
> of simplifying assumptions: Things like I/O can be handled in userspace
> without the guest explicitly sharing I/O buffers (which is not exactly
> what we would want long-term anyway, as sharing in the guest_memfd
> context means sharing with the host kernel), we can easily do VM
> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.

Okay, so essentially you would want to use guest_memfd to only contain 
shared memory and disallow any pinning, like for secretmem.

If so, I wonder if it wouldn't be better to simply add KVM support to 
consume *real* secretmem memory? IIRC so far there was only demand to 
probably remove the directmap of private memory in guest_memfd, not of 
shared memory.
Sean Christopherson July 10, 2024, 9:53 p.m. UTC | #5
On Wed, Jul 10, 2024, David Hildenbrand wrote:
> On 10.07.24 11:51, Patrick Roy wrote:
> > On 7/9/24 22:13, David Hildenbrand wrote:
> > > Note that just from staring at this commit, I don't understand the
> > > motivation *why* we would want to do that.
> > 
> > Fair - I admittedly didn't get into that as much as I probably should
> > have. In our usecase, we do not have anything that pKVM would (I think)
> > call "guest-private" memory. I think our memory can be better described
> > as guest-owned, but always shared with the VMM (e.g. userspace), but
> > ideally never shared with the host kernel. This model lets us do a lot
> > of simplifying assumptions: Things like I/O can be handled in userspace
> > without the guest explicitly sharing I/O buffers (which is not exactly
> > what we would want long-term anyway, as sharing in the guest_memfd
> > context means sharing with the host kernel), we can easily do VM
> > snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
> 
> Okay, so essentially you would want to use guest_memfd to only contain shard
> memory and disallow any pinning like for secretmem.
> 
> If so, I wonder if it wouldn't be better to simply add KVM support to
> consume *real* secretmem memory? IIRC so far there was only demand to
> probably remove the directmap of private memory in guest_memfd, not of
> shared memory.

It's also desirable to remove shared memory from the directmap, e.g. to prevent
using the directmap in a cross-VM attack.

I don't think we want to allow KVM to consume secretmem.  That would require
letting KVM gup() secretmem, which AIUI defeats the entire purpose of secretmem,
and I don't think KVM should be special.
David Hildenbrand July 10, 2024, 9:56 p.m. UTC | #6
On 10.07.24 23:53, Sean Christopherson wrote:
> On Wed, Jul 10, 2024, David Hildenbrand wrote:
>> On 10.07.24 11:51, Patrick Roy wrote:
>>> On 7/9/24 22:13, David Hildenbrand wrote:
>>>> Note that just from staring at this commit, I don't understand the
>>>> motivation *why* we would want to do that.
>>>
>>> Fair - I admittedly didn't get into that as much as I probably should
>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>> call "guest-private" memory. I think our memory can be better described
>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>> ideally never shared with the host kernel. This model lets us do a lot
>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>> what we would want long-term anyway, as sharing in the guest_memfd
>>> context means sharing with the host kernel), we can easily do VM
>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>
>> Okay, so essentially you would want to use guest_memfd to only contain shard
>> memory and disallow any pinning like for secretmem.
>>
>> If so, I wonder if it wouldn't be better to simply add KVM support to
>> consume *real* secretmem memory? IIRC so far there was only demand to
>> probably remove the directmap of private memory in guest_memfd, not of
>> shared memory.
> 
> It's also desirable to remove shared memory from the directmap, e.g. to prevent
> using the directmap in a cross-VM attack.
> 
> I don't think we want to allow KVM to consume secretmem.  That would require
> letting KVM gup() secretmem, which AIUI defeats the entire purpose of secretmem,
> and I don't think KVM should be special.

I would mean consuming secretmem via the fd, *not* via page tables / gup.

But if we also want to have the option of directmap modifications for 
shared memory in guest_memfd, we could make that indeed a guest_memfd 
feature.
Patrick Roy July 12, 2024, 3:59 p.m. UTC | #7
Hey,

On Wed, 2024-07-10 at 22:12 +0100, David Hildenbrand wrote:
> On 10.07.24 11:51, Patrick Roy wrote:
>>
>>
>> On 7/9/24 22:13, David Hildenbrand wrote:
>>> On 09.07.24 16:48, Fuad Tabba wrote:
>>>> Hi Patrick,
>>>>
>>>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>>>
>>>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>>>> possible.
>>>>
>>>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>>>> not guppable, there are other reasons why you wouldn't want to allow
>>>> this. Maybe a config flag to gate it? e.g.,
>>>
>>>
>>> As discussed with Jason, maybe not the direction we want to take with
>>> guest_memfd.
>>> If it's private memory, it shall not be mapped. Also not via magic
>>> config options.
>>>
>>> We'll likely discuss some of that in the meeting MM tomorrow I guess
>>> (having both shared and private memory in guest_memfd).
>>
>> Oh, nice. I'm assuming you mean this meeting:
>> https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
>> Would it be okay if I also attend? I see it also mentions huge pages,
>> which is another thing we are interested in, actually :)
> 
> Hi,
> 
> sorry for the late reply. Yes, you could have joined .... too late.

No worries, I did end up joining to listen in to y'all's discussion
anyway :)

> There will be a summary posted soon. So far the agreement is that we're
> planning on allowing shared memory as part guest_memfd, and will allow
> that to get mapped and pinned. Private memory is not going to get mapped
> and pinned.
> 
> If we have to disallow pinning of shared memory on top for some use
> cases (i.e., no directmap), I assume that could be added.
> 
>>
>>> Note that just from staring at this commit, I don't understand the
>>> motivation *why* we would want to do that.
>>
>> Fair - I admittedly didn't get into that as much as I probably should
>> have. In our usecase, we do not have anything that pKVM would (I think)
>> call "guest-private" memory. I think our memory can be better described
>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>> ideally never shared with the host kernel. This model lets us do a lot
>> of simplifying assumptions: Things like I/O can be handled in userspace
>> without the guest explicitly sharing I/O buffers (which is not exactly
>> what we would want long-term anyway, as sharing in the guest_memfd
>> context means sharing with the host kernel), we can easily do VM
>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
> 
> Okay, so essentially you would want to use guest_memfd to only contain
> shard memory and disallow any pinning like for secretmem.

Yeah, this is pretty much what I thought we wanted before listening in
on Wednesday.

I've actually been thinking about this some more since then, though. With
hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
map insertion approach runs into the same problem that CoCo VMs have
when they're backed by hugepages: How to deal with the guest only
sharing a 4K range in a hugepage? If we want to restore the direct map
for e.g. the page containing kvm-clock data, then we can't simply go
ahead and restore the direct map for the entire 2M page, because there
very well might be stuff in the other 511 small guest pages that we
really do not want in the direct map. And we can't even take the
approach of letting the guest deal with the problem, because here
"sharing" is driven by the host, not the guest, so the guest cannot
possibly know that it maybe should avoid putting stuff it doesn't want
shared into those remaining 511 pages! To me that sounds a lot like the
whole "breaking down huge folios to allow GUP to only some parts of it"
thing mentioned on Wednesday.
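
(For concreteness, a rough sketch of what restoring the direct map for
only the touched 4K subpage of a 2M folio could look like; the helper
name is invented and not part of this series, and it assumes the
existing single-page set_direct_map_default_noflush() primitive:)

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/set_memory.h>

/*
 * Invented helper: give only the 4K subpage at page-cache index 'index'
 * its direct map entry back; the folio's other 511 subpages stay
 * unmapped.
 */
static int gmem_share_subpage(struct folio *folio, pgoff_t index)
{
	struct page *page = folio_page(folio, index - folio_index(folio));

	return set_direct_map_default_noflush(page);
}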

Now, if we instead treat "guest memory without direct map entries" as
"private", and "guest memory with direct map entries" as "shared", then
the above will be solved by whatever mechanism allows gupping/mapping of
only the "shared" parts of huge folios, IIUC. The fact that GUP is then
also allowed for the "shared" parts is not actually a problem for us -
we went down the route of disabling GUP altogether here because based on
[1] it sounded like GUP for anything gmem related would never happen.
But after something is re-inserted into the direct map, we don't very
much care if it can be GUP-ed or not. In fact, allowing GUP for the
shared parts probably makes some things easier for us, as we can then do
I/O without bounce buffers by just in-place converting I/O-buffers to
shared, and then treating that shared slice of guest_memfd the same way
we treat traditional guest memory today. In a very far-off future, we'd
like to be able to do I/O without ever reinserting pages into the direct
map, but I don't think adopting this private/shared model for gmem would
block us from doing that?

Although all of this does hinge on us being able to do the in-place
shared/private conversion without any guest involvement. Do you envision
that to be possible?

> If so, I wonder if it wouldn't be better to simply add KVM support to
> consume *real* secretmem memory? IIRC so far there was only demand to
> probably remove the directmap of private memory in guest_memfd, not of
> shared memory.
> -- 
> Cheers,
> 
> David / dhildenb

Best, 
Patrick

[1]: https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/
David Hildenbrand July 30, 2024, 10:15 a.m. UTC | #8
>> Hi,
>>
>> sorry for the late reply. Yes, you could have joined .... too late.
> 
> No worries, I did end up joining to listen in to y'all's discussion
> anyway :)

Sorry for the late reply :(

> 
>> There will be a summary posted soon. So far the agreement is that we're
>> planning on allowing shared memory as part guest_memfd, and will allow
>> that to get mapped and pinned. Private memory is not going to get mapped
>> and pinned.
>>
>> If we have to disallow pinning of shared memory on top for some use
>> cases (i.e., no directmap), I assume that could be added.
>>
>>>
>>>> Note that just from staring at this commit, I don't understand the
>>>> motivation *why* we would want to do that.
>>>
>>> Fair - I admittedly didn't get into that as much as I probably should
>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>> call "guest-private" memory. I think our memory can be better described
>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>> ideally never shared with the host kernel. This model lets us do a lot
>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>> what we would want long-term anyway, as sharing in the guest_memfd
>>> context means sharing with the host kernel), we can easily do VM
>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>
>> Okay, so essentially you would want to use guest_memfd to only contain
>> shard memory and disallow any pinning like for secretmem.
> 
> Yeah, this is pretty much what I thought we wanted before listening in
> on Wednesday.
> 
> I've actually be thinking about this some more since then though. With
> hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
> map insertion approach runs into the same problem that CoCo VMs have
> when they're backed by hugepages: How to deal with the guest only
> sharing a 4K range in a hugepage? If we want to restore the direct map
> for e.g. the page containing kvm-clock data, then we can't simply go
> ahead and restore the direct map for the entire 2M page, because there
> very well might be stuff in the other 511 small guest pages that we
> really do not want in the direct map. And we can't even take the

Right, you'd only want to restore the direct map for a fragment. Or 
dynamically map that fragment using kmap where required (as raised by 
Vlastimil).
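
(A sketch of the "dynamically map that fragment" option. It is written
with vmap() rather than kmap, since on 64-bit !HIGHMEM configs
kmap_local_page() simply returns the direct-map address, which is
exactly what is missing here; names are illustrative only:)

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Temporarily map a single guest page into the kernel without giving
 * it a direct map entry back. */
static void *gmem_map_fragment(struct page *page)
{
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
}

static void gmem_unmap_fragment(void *addr)
{
	vunmap(addr);
}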

> approach of letting the guest deal with the problem, because here
> "sharing" is driven by the host, not the guest, so the guest cannot
> possibly know that it maybe should avoid putting stuff it doesn't want
> shared into those remaining 511 pages! To me that sounds a lot like the
> whole "breaking down huge folios to allow GUP to only some parts of it"
> thing mentioned on Wednesday.

Yes. While it would be one logical huge page, it would be exposed to the 
remainder of the kernel as 512 individual pages.

> 
> Now, if we instead treat "guest memory without direct map entries" as
> "private", and "guest memory with direct map entries" as "shared", then
> the above will be solved by whatever mechanism allows gupping/mapping of
> only the "shared" parts of huge folios, IIUC. The fact that GUP is then
> also allowed for the "shared" parts is not actually a problem for us -
> we went down the route of disabling GUP altogether here because based on
> [1] it sounded like GUP for anything gmem related would never happen.

Right. Might there also be a case for removing the directmap for shared 
memory or is that not really a requirement so far?

> But after something is re-inserted into the direct map, we don't very
> much care if it can be GUP-ed or not. In fact, allowing GUP for the
> shared parts probably makes some things easier for us, as we can then do
> I/O without bounce buffers by just in-place converting I/O-buffers to
> shared, and then treating that shared slice of guest_memfd the same way
> we treat traditional guest memory today. 

Yes.

> In a very far-off future, we'd
> like to be able to do I/O without ever reinserting pages into the direct
> map, but I don't think adopting this private/shared model for gmem would
> block us from doing that?

How would that I/O get triggered? GUP would require the directmap.

> 
> Although all of this does hinge on us being able to do the in-place
> shared/private conversion without any guest involvement. Do you envision
> that to be possible?

Who would trigger the conversion and how? I don't see a reason why -- 
for your use case -- user space shouldn't be able to trigger conversion 
private <-> shared. At least nothing fundamental comes to mind that 
would prohibit that.
Patrick Roy Aug. 1, 2024, 10:30 a.m. UTC | #9
On Tue, 2024-07-30 at 11:15 +0100, David Hildenbrand wrote:
>>> Hi,
>>>
>>> sorry for the late reply. Yes, you could have joined .... too late.
>>
>> No worries, I did end up joining to listen in to y'all's discussion
>> anyway :)
> 
> Sorry for the late reply :(

No worries :)

>>
>>> There will be a summary posted soon. So far the agreement is that we're
>>> planning on allowing shared memory as part guest_memfd, and will allow
>>> that to get mapped and pinned. Private memory is not going to get mapped
>>> and pinned.
>>>
>>> If we have to disallow pinning of shared memory on top for some use
>>> cases (i.e., no directmap), I assume that could be added.
>>>
>>>>
>>>>> Note that just from staring at this commit, I don't understand the
>>>>> motivation *why* we would want to do that.
>>>>
>>>> Fair - I admittedly didn't get into that as much as I probably should
>>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>>> call "guest-private" memory. I think our memory can be better described
>>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>>> ideally never shared with the host kernel. This model lets us do a lot
>>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>>> what we would want long-term anyway, as sharing in the guest_memfd
>>>> context means sharing with the host kernel), we can easily do VM
>>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>>
>>> Okay, so essentially you would want to use guest_memfd to only contain
>>> shard memory and disallow any pinning like for secretmem.
>>
>> Yeah, this is pretty much what I thought we wanted before listening in
>> on Wednesday.
>>
>> I've actually be thinking about this some more since then though. With
>> hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
>> map insertion approach runs into the same problem that CoCo VMs have
>> when they're backed by hugepages: How to deal with the guest only
>> sharing a 4K range in a hugepage? If we want to restore the direct map
>> for e.g. the page containing kvm-clock data, then we can't simply go
>> ahead and restore the direct map for the entire 2M page, because there
>> very well might be stuff in the other 511 small guest pages that we
>> really do not want in the direct map. And we can't even take the
> 
> Right, you'd only want to restore the direct map for a fragment. Or
> dynamically map that fragment using kmap where required (as raised by
> Vlastimil).

Can the kmap approach work if the memory is supposed to be GUP-able?

>> approach of letting the guest deal with the problem, because here
>> "sharing" is driven by the host, not the guest, so the guest cannot
>> possibly know that it maybe should avoid putting stuff it doesn't want
>> shared into those remaining 511 pages! To me that sounds a lot like the
>> whole "breaking down huge folios to allow GUP to only some parts of it"
>> thing mentioned on Wednesday.
> 
> Yes. While it would be one logical huge page, it would be exposed to the
> remainder of the kernel as 512 individual pages.
> 
>>
>> Now, if we instead treat "guest memory without direct map entries" as
>> "private", and "guest memory with direct map entries" as "shared", then
>> the above will be solved by whatever mechanism allows gupping/mapping of
>> only the "shared" parts of huge folios, IIUC. The fact that GUP is then
>> also allowed for the "shared" parts is not actually a problem for us -
>> we went down the route of disabling GUP altogether here because based on
>> [1] it sounded like GUP for anything gmem related would never happen.
> 
> Right. Might there also be a case for removing the directmap for shared
> memory or is that not really a requirement so far?

No, not really - we would only mark as "shared" memory that _needs_ to
be in the direct map for functional reasons (e.g. MMIO instruction
emulation, etc.).

>> But after something is re-inserted into the direct map, we don't very
>> much care if it can be GUP-ed or not. In fact, allowing GUP for the
>> shared parts probably makes some things easier for us, as we can then do
>> I/O without bounce buffers by just in-place converting I/O-buffers to
>> shared, and then treating that shared slice of guest_memfd the same way
>> we treat traditional guest memory today.
> 
> Yes.
> 
>> In a very far-off future, we'd
>> like to be able to do I/O without ever reinserting pages into the direct
>> map, but I don't think adopting this private/shared model for gmem would
>> block us from doing that?
> 
> How would that I/O get triggered? GUP would require the directmap.

I was hoping that this "phyr" thing Matthew has been talking about [1]
would allow somehow doing I/O without direct map entries/GUP, but maybe
I am misunderstanding something.

>>
>> Although all of this does hinge on us being able to do the in-place
>> shared/private conversion without any guest involvement. Do you envision
>> that to be possible?
> 
> Who would trigger the conversion and how? I don't see a reason why --
> for your use case -- user space shouldn't be able to trigger conversion
> private <-> shared. At least nothing fundamental comes to mind that
> would prohibit that.

Either KVM itself would trigger the conversions whenever it wants to
access gmem (e.g., at each place in this series where there is a
set_direct_map_{invalid,default} call, it would do a shared/private
conversion), or userspace would do it via some syscall/ioctl (the one
place I can think of right now is I/O, where the VMM receives a virtio
buffer from the guest and converts it from private to shared in-place.
Although I guess 2 syscalls for each I/O operation aren't great
perf-wise, so maybe swiotlb still wins out here?).
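
(To make the "2 syscalls per I/O" flow concrete, a userspace sketch
using today's KVM_SET_MEMORY_ATTRIBUTES ioctl as a stand-in; the
in-place guest_memfd conversion discussed in this thread does not exist
yet, so this only illustrates the round trip, not a final API:)

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_range_shared(int vm_fd, __u64 gpa, __u64 size, bool shared)
{
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		/* Clearing the PRIVATE attribute marks the range shared. */
		.attributes = shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}

/*
 * I/O path: flip the virtio buffer to shared, let the VMM access it,
 * then flip it back to private, i.e. two ioctls per I/O operation.
 */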

I actually see that Fuad just posted an RFC series that implements the
basic shared/private handling [2], so I will probably also comment on
this over there once I've had a closer look :)

> -- 
> Cheers,
> 
> David / dhildenb

Best, 
Patrick

[1]: https://lore.kernel.org/netdev/Yd0IeK5s%2FE0fuWqn@casper.infradead.org/T/
[2]: https://lore.kernel.org/kvm/20240801090117.3841080-1-tabba@google.com/T/#t

Patch

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index dc9b0c2d0b0e..101ec2b248bf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -319,7 +319,37 @@  static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff, true);
+
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap           = kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
@@ -594,7 +624,6 @@  static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 		return -EFAULT;
 	}
 
-	gmem = file->private_data;
 	if (xa_load(&gmem->bindings, index) != slot) {
 		WARN_ON_ONCE(xa_load(&gmem->bindings, index));
 		return -EIO;
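
For completeness, a rough userspace sketch of how a VMM could exercise
the new ->mmap path. KVM_CREATE_GUEST_MEMFD is the existing guest_memfd
creation ioctl; actually mapping and touching the memory requires this
RFC (or a successor) applied. Error handling is trimmed; treat it as a
sketch, not a tested program.

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static void *map_gmem(int vm_fd, uint64_t size, int *gmem_fd)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	void *mem;

	*gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (*gmem_fd < 0)
		return MAP_FAILED;

	/* MAP_SHARED is mandatory: kvm_gmem_mmap() rejects non-shared VMAs. */
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, *gmem_fd, 0);

	/* A later access through 'mem' faults via kvm_gmem_fault() above. */
	return mem;
}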