
[v11,0/8] KVM: allow mapping non-refcounted pages

Message ID 20240229025759.1187910-1-stevensd@google.com

Message

David Stevens Feb. 29, 2024, 2:57 a.m. UTC
From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping into the guest VM_IO and
VM_PFNMAP memory that is backed by struct pages which aren't currently
refcounted (e.g. tail pages of non-compound higher order allocations).

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page().

First, this series replaces the gfn_to_pfn_memslot() API with a more
extensible kvm_follow_pfn() API. The updated API rearranges
gfn_to_pfn_memslot()'s args into a struct and where possible packs the
bool arguments into a FOLL_ flags argument. The refactoring does not
change any behavior.
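
As a rough illustration of the converted API (a sketch only: aside from
the kvm_follow_pfn() name and the use of FOLL_ flags mentioned above,
the field names below are assumptions, not the exact code in the
series):

  /* Illustrative sketch; field names are assumptions. */
  struct kvm_follow_pfn {
          /* Inputs. */
          const struct kvm_memory_slot *slot;
          gfn_t gfn;
          /*
           * FOLL_WRITE/FOLL_NOWAIT/FOLL_INTERRUPTIBLE replace the old
           * write_fault/async/interruptible bool arguments.
           */
          unsigned int flags;

          /* Outputs. */
          bool writable;
  };

  kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp);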

From there, this series extends the kvm_follow_pfn() API so that
non-refcounted pages can be safely handled. This involves adding an
input parameter to indicate whether the caller can safely use
non-refcounted pfns and an output parameter to tell the caller whether
or not the returned page is refcounted. This includes a breaking
change: non-refcounted pfn mappings are now disallowed by default,
since such mappings are unsafe. To allow systems that depend on them to
continue to function, an opt-in module parameter is added that permits
the unsafe behavior.
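
A sketch of how a caller might use the new in/out parameters
(allow_non_refcounted_struct_page and refcounted_page are the names
used in this cover letter and changelog; everything else is
illustrative, not the exact code in the series):

  /* Illustrative sketch of a kvm_follow_pfn() caller. */
  struct kvm_follow_pfn kfp = {
          .slot = slot,
          .gfn = gfn,
          .flags = write_fault ? FOLL_WRITE : 0,
          /*
           * Input: this caller's mapping is torn down via mmu_notifiers,
           * so it can safely consume pfns whose pages are not refcounted.
           */
          .allow_non_refcounted_struct_page = true,
  };
  kvm_pfn_t pfn = kvm_follow_pfn(&kfp);

  /* ... install pfn into the stage-2 page tables ... */

  /*
   * Output: a reference was taken only if refcounted_page is non-NULL,
   * so only that case needs a put_page().  Callers that cannot handle
   * non-refcounted pfns leave allow_non_refcounted_struct_page false
   * and, unless the opt-in module parameter is set, get an error
   * instead of an unsafe mapping.
   */
  if (kfp.refcounted_page)
          put_page(kfp.refcounted_page);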

This series only adds support for non-refcounted pages to x86. Other
MMUs can likely be updated without too much difficulty, but it is not
needed at this point. Updating other parts of KVM (e.g. pfncache) is not
straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

v10 -> v11:
 - Switch to u64 __read_mostly shadow_refcounted_mask.
 - Update comments about allow_non_refcounted_struct_page.
v9 -> v10:
 - Re-add FOLL_GET changes.
 - Split x86/mmu spte+non-refcount-page patch into two patches.
 - Rename 'foll' variables to 'kfp'.
 - Properly gate usage of refcount spte bit when it's not available.
 - Replace kvm_follow_pfn's is_refcounted_page output parameter with
   a struct page *refcounted_page pointing to the page in question.
 - Add patch downgrading BUG_ON to WARN_ON_ONCE.
v8 -> v9:
 - Make paying attention to is_refcounted_page mandatory. This means
   that FOLL_GET is no longer necessary. For compatibility with
   un-migrated callers, add a temporary parameter to sidestep
   ref-counting issues.
 - Add allow_unsafe_mappings, which is a breaking change.
 - Migrate kvm_vcpu_map and other callsites used by x86 to the new API.
 - Drop arm and ppc changes.
v7 -> v8:
 - Set access bits before releasing mmu_lock.
 - Pass FOLL_GET on 32-bit x86 or !tdp_enabled.
 - Refactor FOLL_GET handling, add kvm_follow_refcounted_pfn helper.
 - Set refcounted bit on >4k pages.
 - Add comments and apply formatting suggestions.
 - Rebase on the kvm next branch.
v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_faultin_pfn,
   and extend that API to support non-refcounted pages (complete
   rewrite).

David Stevens (7):
  KVM: Relax BUG_ON argument validation
  KVM: mmu: Introduce kvm_follow_pfn()
  KVM: mmu: Improve handling of non-refcounted pfns
  KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn()
  KVM: x86: Migrate to kvm_follow_pfn()
  KVM: x86/mmu: Track if sptes refer to refcounted pages
  KVM: x86/mmu: Handle non-refcounted pages

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
    accessed/dirty

 arch/x86/kvm/mmu/mmu.c          | 108 +++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   2 +
 arch/x86/kvm/mmu/paging_tmpl.h  |   7 +-
 arch/x86/kvm/mmu/spte.c         |   5 +-
 arch/x86/kvm/mmu/spte.h         |  16 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  22 +-
 arch/x86/kvm/x86.c              |  11 +-
 include/linux/kvm_host.h        |  58 +++++-
 virt/kvm/guest_memfd.c          |   8 +-
 virt/kvm/kvm_main.c             | 345 +++++++++++++++++++-------------
 virt/kvm/kvm_mm.h               |   3 +-
 virt/kvm/pfncache.c             |  11 +-
 12 files changed, 399 insertions(+), 197 deletions(-)


base-commit: 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478

Comments

Christoph Hellwig Feb. 29, 2024, 1:36 p.m. UTC | #1
On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page().

.. and just as last time around that is still the problem that needs
to be fixed instead of creating a monster like this to map
non-refcounted pages.
David Stevens March 13, 2024, 4:55 a.m. UTC | #2
On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > Our use case is virtio-gpu blob resources [1], which directly map host
> > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > This feature currently does not work on systems using the amdgpu driver,
> > as that driver allocates non-compound higher order pages via
> > ttm_pool_alloc_page().
>
> .. and just as last time around that is still the problem that needs
> to be fixed instead of creating a monster like this to map
> non-refcounted pages.
>

Patches to amdgpu have been NAKed [1] with the justification that
using non-refcounted pages is working as intended and KVM is in the
wrong for wanting to take references to pages mapped with VM_PFNMAP
[2].

The existence of VM_PFNMAP implies that non-refcounted pages are
working as designed. We can argue about whether or not VM_PFNMAP
should exist, but until VM_PFNMAP is removed,
KVM should be able to handle it. Also note that this is not adding a
new source of non-refcounted pages, so it doesn't make removing
non-refcounted pages more difficult, if the kernel does decide to go
in that direction.

-David

[1] https://lore.kernel.org/lkml/8230a356-be38-f228-4a8e-95124e8e8db6@amd.com/
[2] https://lore.kernel.org/lkml/594f1013-b925-3c75-be61-2d649f5ca54e@amd.com/
Christian König March 13, 2024, 9:55 a.m. UTC | #3
Am 13.03.24 um 05:55 schrieb David Stevens:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>> This feature currently does not work on systems using the amdgpu driver,
>>> as that driver allocates non-compound higher order pages via
>>> ttm_pool_alloc_page().
>> .. and just as last time around that is still the problem that needs
>> to be fixed instead of creating a monster like this to map
>> non-refcounted pages.
>>
> Patches to amdgpu to have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].
>
> The existence of the VM_PFNMAP implies that the existence of
> non-refcounted pages is working as designed. We can argue about
> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> KVM should be able to handle it. Also note that this is not adding a
> new source of non-refcounted pages, so it doesn't make removing
> non-refcounted pages more difficult, if the kernel does decide to go
> in that direction.

Well, the meaning of VM_PFNMAP is that you should not touch the 
underlying struct page the PTE is pointing to. As far as I can see this 
includes grabbing a reference count.

But that isn't really the problem here. The issue is rather that KVM 
assumes that by grabbing a reference count to the page, the driver 
won't change the PTE to point somewhere else. And that is simply not true.

So what KVM needs to do is to either have an MMU notifier installed so 
that updates to the PTEs on the host side are reflected immediately to 
the PTEs on the guest side.

Or (even better) you use hardware functionality like nested page tables 
so that we don't actually need to update the guest PTEs when the host 
PTEs change.

And when you have either of those two functionalities the requirement to 
add a long term reference to the struct page goes away completely. So 
when this is done right you don't need to grab a reference in the first 
place.

Regards,
Christian.

>
> -David
>
> [1] https://lore.kernel.org/lkml/8230a356-be38-f228-4a8e-95124e8e8db6@amd.com/
> [2] https://lore.kernel.org/lkml/594f1013-b925-3c75-be61-2d649f5ca54e@amd.com/
Christoph Hellwig March 13, 2024, 1:33 p.m. UTC | #4
On Wed, Mar 13, 2024 at 01:55:20PM +0900, David Stevens wrote:
> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > This feature currently does not work on systems using the amdgpu driver,
> > > as that driver allocates non-compound higher order pages via
> > > ttm_pool_alloc_page().
> >
> > .. and just as last time around that is still the problem that needs
> > to be fixed instead of creating a monster like this to map
> > non-refcounted pages.
> >
> 
> Patches to amdgpu to have been NAKed [1] with the justification that
> using non-refcounted pages is working as intended and KVM is in the
> wrong for wanting to take references to pages mapped with VM_PFNMAP
> [2].

So make them not work for KVM as they should to kick the amdgpu
maintainers a***es.
Sean Christopherson March 13, 2024, 1:34 p.m. UTC | #5
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 05:55 schrieb David Stevens:
> > On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
> > > > Our use case is virtio-gpu blob resources [1], which directly map host
> > > > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > > > This feature currently does not work on systems using the amdgpu driver,
> > > > as that driver allocates non-compound higher order pages via
> > > > ttm_pool_alloc_page().
> > > .. and just as last time around that is still the problem that needs
> > > to be fixed instead of creating a monster like this to map
> > > non-refcounted pages.
> > > 
> > Patches to amdgpu to have been NAKed [1] with the justification that
> > using non-refcounted pages is working as intended and KVM is in the
> > wrong for wanting to take references to pages mapped with VM_PFNMAP
> > [2].
> > 
> > The existence of the VM_PFNMAP implies that the existence of
> > non-refcounted pages is working as designed. We can argue about
> > whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
> > KVM should be able to handle it. Also note that this is not adding a
> > new source of non-refcounted pages, so it doesn't make removing
> > non-refcounted pages more difficult, if the kernel does decide to go
> > in that direction.
> 
> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
> struct page the PTE is pointing to. As far as I can see this includes
> grabbing a reference count.
> 
> But that isn't really the problem here. The issue is rather that KVM assumes
> that by grabbing a reference count to the page that the driver won't change
> the PTE to point somewhere else.. And that is simply not true.

No, KVM doesn't assume that.

> So what KVM needs to do is to either have an MMU notifier installed so that
> updates to the PTEs on the host side are reflected immediately to the PTEs
> on the guest side.

KVM already has an MMU notifier and reacts accordingly.

> Or (even better) you use hardware functionality like nested page tables so
> that we don't actually need to update the guest PTEs when the host PTEs
> change.

That's not how stage-2 page tables work. 

> And when you have either of those two functionalities the requirement to add
> a long term reference to the struct page goes away completely. So when this
> is done right you don't need to grab a reference in the first place.

The KVM issue that this series is solving isn't that KVM grabs a reference, it's
that KVM assumes that any non-reserved pfn that is backed by "struct page" is
refcounted.

What Christoph is objecting to is that, in this series, KVM is explicitly adding
support for mapping non-compound (huge)pages into KVM guests.  David is arguing
that Christoph's objection to _KVM_ adding support is unfair, because the real
problem is that the kernel already maps such pages into host userspace.  I.e. if
the userspace mapping ceases to exist, then there are no mappings for KVM to follow
and propagate to KVM's stage-2 page tables.
Christian König March 13, 2024, 2:37 p.m. UTC | #6
Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> On Wed, Mar 13, 2024, Christian König wrote:
>> Am 13.03.24 um 05:55 schrieb David Stevens:
>>> On Thu, Feb 29, 2024 at 10:36 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>> On Thu, Feb 29, 2024 at 11:57:51AM +0900, David Stevens wrote:
>>>>> Our use case is virtio-gpu blob resources [1], which directly map host
>>>>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>>>>> This feature currently does not work on systems using the amdgpu driver,
>>>>> as that driver allocates non-compound higher order pages via
>>>>> ttm_pool_alloc_page().
>>>> .. and just as last time around that is still the problem that needs
>>>> to be fixed instead of creating a monster like this to map
>>>> non-refcounted pages.
>>>>
>>> Patches to amdgpu to have been NAKed [1] with the justification that
>>> using non-refcounted pages is working as intended and KVM is in the
>>> wrong for wanting to take references to pages mapped with VM_PFNMAP
>>> [2].
>>>
>>> The existence of the VM_PFNMAP implies that the existence of
>>> non-refcounted pages is working as designed. We can argue about
>>> whether or not VM_PFNMAP should exist, but until VM_PFNMAP is removed,
>>> KVM should be able to handle it. Also note that this is not adding a
>>> new source of non-refcounted pages, so it doesn't make removing
>>> non-refcounted pages more difficult, if the kernel does decide to go
>>> in that direction.
>> Well, the meaning of VM_PFNMAP is that you should not touch the underlying
>> struct page the PTE is pointing to. As far as I can see this includes
>> grabbing a reference count.
>>
>> But that isn't really the problem here. The issue is rather that KVM assumes
>> that by grabbing a reference count to the page that the driver won't change
>> the PTE to point somewhere else.. And that is simply not true.
> No, KVM doesn't assume that.
>
>> So what KVM needs to do is to either have an MMU notifier installed so that
>> updates to the PTEs on the host side are reflected immediately to the PTEs
>> on the guest side.
> KVM already has an MMU notifier and reacts accordingly.
>
>> Or (even better) you use hardware functionality like nested page tables so
>> that we don't actually need to update the guest PTEs when the host PTEs
>> change.
> That's not how stage-2 page tables work.
>
>> And when you have either of those two functionalities the requirement to add
>> a long term reference to the struct page goes away completely. So when this
>> is done right you don't need to grab a reference in the first place.
> The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> refcounted.

Well, why does it assume that? When you have an MMU notifier that seems 
unnecessary.

> What Christoph is objecting to is that, in this series, KVM is explicitly adding
> support for mapping non-compound (huge)pages into KVM guests.  David is arguing
> that Christoph's objection to _KVM_ adding support is unfair, because the real
> problem is that the kernel already maps such pages into host userspace.  I.e. if
> the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> and propagate to KVM's stage-2 page tables.

And I have to agree with Christoph that this doesn't make much sense. 
KVM should *never* map (huge) pages from VMAs marked with VM_PFNMAP into 
KVM guests in the first place.

What it should do instead is to mirror the PFN from the host page tables 
into the guest page tables. Whether or not there is a page behind that 
*must* be completely irrelevant to KVM.

The background here is that drivers are modifying the page table on the 
fly to point to either MMIO or real memory; this also includes switching 
the caching attributes.

The real question is why is KVM trying to grab a page reference when 
there is an MMU notifier installed.

Regards,
Christian.
Sean Christopherson March 13, 2024, 2:48 p.m. UTC | #7
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > And when you have either of those two functionalities the requirement to add
> > > a long term reference to the struct page goes away completely. So when this
> > > is done right you don't need to grab a reference in the first place.
> > The KVM issue that this series is solving isn't that KVM grabs a reference, it's
> > that KVM assumes that any non-reserved pfn that is backed by "struct page" is
> > refcounted.
> 
> Well why does it assumes that? When you have a MMU notifier that seems
> unnecessary.

Indeed, it's legacy code that we're trying to clean up.  It's the bulk of this
series.

> > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > support for mapping non-compound (huge)pages into KVM guests.  David is arguing
> > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > problem is that the kernel already maps such pages into host userspace.  I.e. if
> > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > and propagate to KVM's stage-2 page tables.
> 
> And I have to agree with Christoph that this doesn't make much sense. KVM
> should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> guests in the first place.
> 
> What it should do instead is to mirror the PFN from the host page tables
> into the guest page tables.

That's exactly what this series does.  Christoph is objecting to KVM playing nice
with non-compound hugepages, as he feels that such mappings should not exist
*anywhere*.

I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
we should instead fix the TTM allocations.  And David pointed out that that was
tried and got NAK'd.
Christian König March 13, 2024, 3:09 p.m. UTC | #8
Sending that once more as text only since AMD eMail has messed that up 
once more.

Regards,
Christian.

Am 13.03.24 um 16:07 schrieb Christian König:
> Am 13.03.24 um 15:48 schrieb Sean Christopherson:
>> On Wed, Mar 13, 2024, Christian König wrote:
>>> Am 13.03.24 um 14:34 schrieb Sean Christopherson:
>>>> On Wed, Mar 13, 2024, Christian König wrote:
>>>>> And when you have either of those two functionalities the requirement to add
>>>>> a long term reference to the struct page goes away completely. So when this
>>>>> is done right you don't need to grab a reference in the first place.
>>>> The KVM issue that this series is solving isn't that KVM grabs a reference, it's
>>>> that KVM assumes that any non-reserved pfn that is backed by "struct page" is
>>>> refcounted.
>>> Well why does it assumes that? When you have a MMU notifier that seems
>>> unnecessary.
>> Indeed, it's legacy code that we're trying to clean up.  It's the bulk of this
>> series.
>
> Yeah, that is the right approach as far as I can see.
>
>>>> What Christoph is objecting to is that, in this series, KVM is explicitly adding
>>>> support for mapping non-compound (huge)pages into KVM guests.  David is arguing
>>>> that Christoph's objection to _KVM_ adding support is unfair, because the real
>>>> problem is that the kernel already maps such pages into host userspace.  I.e. if
>>>> the userspace mapping ceases to exist, then there are no mappings for KVM to follow
>>>> and propagate to KVM's stage-2 page tables.
>>> And I have to agree with Christoph that this doesn't make much sense. KVM
>>> should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
>>> guests in the first place.
>>>
>>> What it should do instead is to mirror the PFN from the host page tables
>>> into the guest page tables.
>> That's exactly what this series does.  Christoph is objecting to KVM playing nice
>> with non-compound hugepages, as he feels that such mappings should not exist
>> *anywhere*.
>
> Well Christoph is right, those mappings shouldn't exist and they also 
> don't exist.
>
> What happens here is that a driver has allocated some contiguous 
> memory to do DMA with. And then some page table is pointing to a PFN 
> inside that memory because userspace needs to provide parameters for 
> the DMA transfer.
>
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE 
> pointing to some PFN. It can trivially be that userspace only maps 
> 4KiB of some 2MiB piece of memory the driver has allocated.
>
>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>> we should instead fix the TTM allocations.  And David pointed out that that was
>> tried and got NAK'd.
>
> Well as far as I can see Christoph rejects the complexity coming with 
> the approach of sometimes grabbing the reference and sometimes not. 
> And I have to agree that this is extremely odd.
>
> What the KVM code should do instead is to completely drop grabbing 
> references to struct pages, no matter what the VMA flags are.
>
> Regards,
> Christian.
Sean Christopherson March 13, 2024, 3:47 p.m. UTC | #9
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 15:48 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > > > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > > > support for mapping non-compound (huge)pages into KVM guests.  David is arguing
> > > > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > > > problem is that the kernel already maps such pages into host userspace.  I.e. if
> > > > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > > > and propagate to KVM's stage-2 page tables.
> > > And I have to agree with Christoph that this doesn't make much sense. KVM
> > > should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> > > guests in the first place.
> > > 
> > > What it should do instead is to mirror the PFN from the host page tables
> > > into the guest page tables.
> > That's exactly what this series does.  Christoph is objecting to KVM playing nice
> > with non-compound hugepages, as he feels that such mappings should not exist
> > *anywhere*.
> 
> Well Christoph is right those mappings shouldn't exists and they also don't
> exists.
>
> What happens here is that a driver has allocated some contiguous memory to
> do DMA with. And then some page table is pointing to a PFN inside that
> memory because userspace needs to provide parameters for the DMA transfer.
> 
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE
> pointing to some PFN. 

Yes, I know.  And David knows.  By "such mappings" I did not mean "huge PMD mappings
that point at non-compound pages", I meant "any mapping in the host userspace
VMAs and page tables that points at memory that is backed by a larger-than-order-0,
non-compound allocation".

And even then, the whole larger-than-order-0 mapping is not something we on the
KVM side care about, at all.  The _only_ new thing KVM is trying to do in this
series is to allow mapping non-refcounted struct page memory into KVM guest.
Those details were brought up purely because they provide context on how/why such
non-refcounted pages exist.

> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> memory the driver has allocate.
>
> > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > we should instead fix the TTM allocations.  And David pointed out that that was
> > tried and got NAK'd.
> 
> Well as far as I can see Christoph rejects the complexity coming with the
> approach of sometimes grabbing the reference and sometimes not.

Unless I've wildly misread multiple threads, that is not Christoph's objection.
From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):

  On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig <hch@infradead.org> wrote:
  >
  > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
  > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
  > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
  > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
  > > remove that code and assume every driver in existence will do the right thing.
  >
  > Agreed.
  >
  > >
  > > With the cleanups done, playing nice with non-refcounted paged instead of outright
  > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
  > > maintenance cost.
  >
  > I tend to strongly disagree with that, though.  We can't just let these
  > non-refcounted pages spread everywhere and instead need to fix their
  > usage.

> And I have to agree that this is extremely odd.

Yes, it's odd and not ideal.  But with nested virtualization, KVM _must_ "map"
pfns directly into the guest via fields in the control structures that are
consumed by hardware.  I.e. pfns are exposed to the guest in an "out-of-band"
structure that is NOT part of the stage-2 page tables.  And wiring those up to
the MMU notifiers is extremely difficult for a variety of reasons[*].

Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
to grab a reference to the struct page while the out-of-band mapping exists, i.e.
to pin the page to prevent use-after-free.  And KVM's historical ABI is to support
any refcounted page for these out-of-band mappings, regardless of whether the
page was obtained by gup() or follow_pte().

Thus, to support non-refcounted VM_PFNMAP pages without breaking existing userspace,
KVM resorts to conditionally grabbing references and disallowing non-refcounted
pages from being inserted into the out-of-band mappings.

But again, I don't think these details are relevant to Christoph's objection.

[*] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com
Sean Christopherson March 13, 2024, 5:26 p.m. UTC | #10
On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
> > [SNIP]
> > > It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> > > memory the driver has allocate.
> > > 
> > > > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > > > we should instead fix the TTM allocations.  And David pointed out that that was
> > > > tried and got NAK'd.
> > > Well as far as I can see Christoph rejects the complexity coming with the
> > > approach of sometimes grabbing the reference and sometimes not.
> > Unless I've wildly misread multiple threads, that is not Christoph's objection.
> >  From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
> > 
> >    On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>  wrote:
> >    >
> >    > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> >    > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> >    > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> >    > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> >    > > remove that code and assume every driver in existence will do the right thing.
> >    >
> >    > Agreed.
> >    >
> >    > >
> >    > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> >    > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> >    > > maintenance cost.
> >    >
> >    > I tend to strongly disagree with that, though.  We can't just let these
> >    > non-refcounted pages spread everywhere and instead need to fix their
> >    > usage.
> 
> And I can only repeat myself that I completely agree with Christoph here.

I am so confused.  If you agree with Christoph, why not fix the TTM allocations?

> > > And I have to agree that this is extremely odd.
> > Yes, it's odd and not ideal.  But with nested virtualization, KVM _must_ "map"
> > pfns directly into the guest via fields in the control structures that are
> > consumed by hardware.  I.e. pfns are exposed to the guest in an "out-of-band"
> > structure that is NOT part of the stage-2 page tables.  And wiring those up to
> > the MMU notifiers is extremely difficult for a variety of reasons[*].
> > 
> > Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
> > to grab a reference to the struct page while the out-of-band mapping exists, i.e.
> > to pin the page to prevent use-after-free.
> 
> Wait a second, as far as I know this approach doesn't work correctly in all
> cases. See here as well: https://lwn.net/Articles/930667/
>
> The MMU notifier is not only to make sure that pages don't go away and
> prevent use after free, but also that PTE updates correctly wait for ongoing
> DMA transfers.
> 
> Otherwise quite a bunch of other semantics doesn't work correctly either.
> 
> > And KVM's historical ABI is to support
> > any refcounted page for these out-of-band mappings, regardless of whether the
> > page was obtained by gup() or follow_pte().
> 
> Well see the discussion from last year the LWN article summarizes.

Oof.  I suspect that in practice, no one has observed issues because the pages
in question are heavily used by the guest and don't get evicted from the page cache.

> I'm not an expert for KVM but as far as I can see what you guys do here is
> illegal and can lead to corruption.
>
> Basically you can't create a second channel to modify page content like
> that.

Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
robust.

That said, while this might be motivation to revisit tying the out-of-band mappings
to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
refcounted pages from being mapped in the out-of-band structures (see below).

> > Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
> > KVM resorts to conditionally grabbing references and disllowing non-refcounted
> > pages from being inserted into the out-of-band mappings.
> 
> Why not only refcount the pages when out of band mappings are requested?

That's also what this series effectively does.  By default, KVM will disallow
inserting *any* non-refcounted memory into the out-of-band mappings.

"By default" because there are use cases where the guest memory pool is hidden
from the kernel at boot, and is fully managed by userspace.  I.e. userspace is
effectively responsible/trusted to not free/change the mappings for an active
guest.
Christian König March 14, 2024, 9:20 a.m. UTC | #11
Sending that out once more since the AMD email servers have converted it 
to HTML mail once more :(

Grrr,
Christian.

Am 14.03.24 um 10:18 schrieb Christian König:
> Am 13.03.24 um 18:26 schrieb Sean Christopherson:
>> On Wed, Mar 13, 2024, Christian König wrote:
>>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
>>>> [SNIP]
>>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
>>>>> memory the driver has allocate.
>>>>>
>>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>>>>>> we should instead fix the TTM allocations.  And David pointed out that that was
>>>>>> tried and got NAK'd.
>>>>> Well as far as I can see Christoph rejects the complexity coming with the
>>>>> approach of sometimes grabbing the reference and sometimes not.
>>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
>>>>   From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
>>>>
>>>>     On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>   wrote:
>>>>     >
>>>>     > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
>>>>     > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
>>>>     > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
>>>>     > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
>>>>     > > remove that code and assume every driver in existence will do the right thing.
>>>>     >
>>>>     > Agreed.
>>>>     >
>>>>     > >
>>>>     > > With the cleanups done, playing nice with non-refcounted paged instead of outright
>>>>     > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
>>>>     > > maintenance cost.
>>>>     >
>>>>     > I tend to strongly disagree with that, though.  We can't just let these
>>>>     > non-refcounted pages spread everywhere and instead need to fix their
>>>>     > usage.
>>> And I can only repeat myself that I completely agree with Christoph here.
>> I am so confused.  If you agree with Christoph, why not fix the TTM allocations?
>
> Because the TTM allocation isn't broken in any way.
>
> See in some configurations TTM even uses the DMA API for those 
> allocations and that is actually something Christoph coded.
>
> What Christoph is really pointing out is that absolutely nobody should 
> put non-refcounted pages into a VMA, but again this isn't something 
> TTM does. What TTM does instead is to work with the PFN and puts that 
> into a VMA.
>
> It's just that then KVM comes along and converts the PFN back into a 
> struct page again and that is essentially what causes all the 
> problems, including CVE-2021-22543.
>
>>>>> And I have to agree that this is extremely odd.
>>>> Yes, it's odd and not ideal.  But with nested virtualization, KVM _must_ "map"
>>>> pfns directly into the guest via fields in the control structures that are
>>>> consumed by hardware.  I.e. pfns are exposed to the guest in an "out-of-band"
>>>> structure that is NOT part of the stage-2 page tables.  And wiring those up to
>>>> the MMU notifiers is extremely difficult for a variety of reasons[*].
>>>>
>>>> Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
>>>> to grab a reference to the struct page while the out-of-band mapping exists, i.e.
>>>> to pin the page to prevent use-after-free.
>>> Wait a second, as far as I know this approach doesn't work correctly in all
>>> cases. See here as well:https://lwn.net/Articles/930667/
>>>
>>> The MMU notifier is not only to make sure that pages don't go away and
>>> prevent use after free, but also that PTE updates correctly wait for ongoing
>>> DMA transfers.
>>>
>>> Otherwise quite a bunch of other semantics doesn't work correctly either.
>>>
>>>> And KVM's historical ABI is to support
>>>> any refcounted page for these out-of-band mappings, regardless of whether the
>>>> page was obtained by gup() or follow_pte().
>>> Well see the discussion from last year the LWN article summarizes.
>> Oof.  I suspect that in practice, no one has observed issues because the pages
>> in question are heavily used by the guest and don't get evicted from the page cache.
>
> Well in this case I strongly suggest to block the problematic cases.
>
> It just sounds like KVM never ran into problems because nobody is 
> doing any of those problematic cases. If that's true it should be 
> doable to change the UAPI and say this specific combination is 
> forbidden because it could result in data corruption.
>
>>> I'm not an expert for KVM but as far as I can see what you guys do here is
>>> illegal and can lead to corruption.
>>>
>>> Basically you can't create a second channel to modify page content like
>>> that.
>> Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
>> robust.
>
> Yeah, completely agree.
>
>> That said, while this might be motivation to revisit tying the out-of-band mappings
>> to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
>> Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
>> memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
>> refcounted pages from being mapped in the out-of-band structures (see below).
>>
>>>> Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
>>>> KVM resorts to conditionally grabbing references and disllowing non-refcounted
>>>> pages from being inserted into the out-of-band mappings.
>>> Why not only refcount the pages when out of band mappings are requested?
>> That's also what this series effectively does.  By default, KVM will disallow
>> inserting *any* non-refcounted memory into the out-of-band mappings.
>
> Ok than that's basically my inconvenient. I can't really criticize the 
> KVM patch because I don't really understand all the details.
>
> I'm only pointing out from a very high level view how memory 
> management works and that I see some conflict with what KVM does.
>
> As far as I can tell the cleanest approach for KVM would be to have 
> two completely separate handlings:
>
> 1. Using GUP to handle the out-of-band mappings, this one grabs 
> references and honors all the GUP restrictions with the appropriate 
> flags. When VM_PFNMAP is set then those mappings will be rejected. 
> That should also avoid any trouble with file backed mappings.
>
> 2. Using follow_pte() plus an MMU notifier and this one can even 
> handle VMAs with the VM_PFNMAP and VM_IO flag set.
>
>> "By default" because there are use cases where the guest memory pool is hidden
>> from the kernel at boot, and is fully managed by userspace.  I.e. userspace is
>> effectively responsible/trusted to not free/change the mappings for an active
>> guest.
>
> Wow, could that potentially crash the kernel?
>
> Cheers,
> Christian.
David Stevens March 14, 2024, 11:31 a.m. UTC | #12
On Thu, Mar 14, 2024 at 6:20 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Sending that out once more since the AMD email servers have converted it
> to HTML mail once more :(
>
> Grrr,
> Christian.
>
> Am 14.03.24 um 10:18 schrieb Christian König:
> > Am 13.03.24 um 18:26 schrieb Sean Christopherson:
> >> On Wed, Mar 13, 2024, Christian König wrote:
> >>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
> >>>> [SNIP]
> >>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> >>>>> memory the driver has allocate.
> >>>>>
> >>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> >>>>>> we should instead fix the TTM allocations.  And David pointed out that that was
> >>>>>> tried and got NAK'd.
> >>>>> Well as far as I can see Christoph rejects the complexity coming with the
> >>>>> approach of sometimes grabbing the reference and sometimes not.
> >>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
> >>>>   From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
> >>>>
> >>>>     On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>   wrote:
> >>>>     >
> >>>>     > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> >>>>     > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> >>>>     > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> >>>>     > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> >>>>     > > remove that code and assume every driver in existence will do the right thing.
> >>>>     >
> >>>>     > Agreed.
> >>>>     >
> >>>>     > >
> >>>>     > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> >>>>     > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> >>>>     > > maintenance cost.
> >>>>     >
> >>>>     > I tend to strongly disagree with that, though.  We can't just let these
> >>>>     > non-refcounted pages spread everywhere and instead need to fix their
> >>>>     > usage.
> >>> And I can only repeat myself that I completely agree with Christoph here.
> >> I am so confused.  If you agree with Christoph, why not fix the TTM allocations?
> >
> > Because the TTM allocation isn't broken in any way.
> >
> > See in some configurations TTM even uses the DMA API for those
> > allocations and that is actually something Christoph coded.
> >
> > What Christoph is really pointing out is that absolutely nobody should
> > put non-refcounted pages into a VMA, but again this isn't something
> > TTM does. What TTM does instead is to work with the PFN and puts that
> > into a VMA.
> >
> > It's just that then KVM comes along and converts the PFN back into a
> > struct page again and that is essentially what causes all the
> > problems, including CVE-2021-22543.

Does Christoph's objection come from my poorly worded cover letter and
commit messages, then? Fundamentally, what this series is doing is
allowing pfns returned by follow_pte to be mapped into KVM's shadow
MMU without inadvertently translating them into struct pages. If I'm
understanding this discussion correctly, since KVM's shadow MMU is hooked
up to MMU notifiers, this shouldn't be controversial. However, my
cover letter got a little confused because KVM is currently doing
something that it sounds like it shouldn't - translating pfns returned
by follow_pte into struct pages with kvm_try_get_pfn. Because of that,
the specific type of pfns that don't work right now are pfn_valid() &&
!PG_Reserved && !page_ref_count() - what I called the non-refcounted
pages in a bad choice of words. If that's correct, then perhaps this
series should go a little bit further in modifying
hva_to_pfn_remapped, but it isn't fundamentally wrong.

-David
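
(For context, the class of pfns David describes above corresponds
roughly to the check sketched below; this is an illustrative,
hypothetical helper, not code from the series or from
hva_to_pfn_remapped.)

  /* Hypothetical helper, for illustration only. */
  static bool pfn_is_unrefcounted_page(kvm_pfn_t pfn)
  {
          struct page *page;

          if (!pfn_valid(pfn))
                  return false;

          page = pfn_to_page(pfn);
          /* Has a struct page, is not reserved, but holds no references. */
          return !PageReserved(page) && !page_count(page);
  }
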
Christian König March 14, 2024, 11:51 a.m. UTC | #13
Am 14.03.24 um 12:31 schrieb David Stevens:
> On Thu, Mar 14, 2024 at 6:20 PM Christian König
> <christian.koenig@amd.com> wrote:
>> Sending that out once more since the AMD email servers have converted it
>> to HTML mail once more :(
>>
>> Grrr,
>> Christian.
>>
>> Am 14.03.24 um 10:18 schrieb Christian König:
>>> Am 13.03.24 um 18:26 schrieb Sean Christopherson:
>>>> On Wed, Mar 13, 2024, Christian König wrote:
>>>>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
>>>>>> [SNIP]
>>>>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
>>>>>>> memory the driver has allocate.
>>>>>>>
>>>>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>>>>>>>> we should instead fix the TTM allocations.  And David pointed out that that was
>>>>>>>> tried and got NAK'd.
>>>>>>> Well as far as I can see Christoph rejects the complexity coming with the
>>>>>>> approach of sometimes grabbing the reference and sometimes not.
>>>>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
>>>>>>    From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
>>>>>>
>>>>>>      On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>   wrote:
>>>>>>      >
>>>>>>      > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
>>>>>>      > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
>>>>>>      > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
>>>>>>      > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
>>>>>>      > > remove that code and assume every driver in existence will do the right thing.
>>>>>>      >
>>>>>>      > Agreed.
>>>>>>      >
>>>>>>      > >
>>>>>>      > > With the cleanups done, playing nice with non-refcounted paged instead of outright
>>>>>>      > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
>>>>>>      > > maintenance cost.
>>>>>>      >
>>>>>>      > I tend to strongly disagree with that, though.  We can't just let these
>>>>>>      > non-refcounted pages spread everywhere and instead need to fix their
>>>>>>      > usage.
>>>>> And I can only repeat myself that I completely agree with Christoph here.
>>>> I am so confused.  If you agree with Christoph, why not fix the TTM allocations?
>>> Because the TTM allocation isn't broken in any way.
>>>
>>> See in some configurations TTM even uses the DMA API for those
>>> allocations and that is actually something Christoph coded.
>>>
>>> What Christoph is really pointing out is that absolutely nobody should
>>> put non-refcounted pages into a VMA, but again this isn't something
>>> TTM does. What TTM does instead is to work with the PFN and puts that
>>> into a VMA.
>>>
>>> It's just that then KVM comes along and converts the PFN back into a
>>> struct page again and that is essentially what causes all the
>>> problems, including CVE-2021-22543.
> Does Christoph's objection come from my poorly worded cover letter and
> commit messages, then?

Yes, that could certainly be.

> Fundamentally, what this series is doing is
> allowing pfns returned by follow_pte to be mapped into KVM's shadow
> MMU without inadvertently translating them into struct pages.

As far as I can tell that is really the right thing to do. Yes.

> If I'm understand this discussion correctly, since KVM's shadow MMU is hooked
> up to MMU notifiers, this shouldn't be controversial. However, my
> cover letter got a little confused because KVM is currently doing
> something that it sounds like it shouldn't - translating pfns returned
> by follow_pte into struct pages with kvm_try_get_pfn. Because of that,
> the specific type of pfns that don't work right now are pfn_valid() &&
> !PG_Reserved && !page_ref_count() - what I called the non-refcounted
> pages in a bad choice of words. If that's correct, then perhaps this
> series should go a little bit further in modifying
> hva_to_pfn_remapped, but it isn't fundamentally wrong.

Completely agree. In my thinking when you go a step further and offload 
grabbing the page reference to get_user_pages() then you are always on 
the safe side.

Because then it is no longer the responsibility of the KVM code to get 
all the rules around that right; instead you are relying on core 
functionality which should (at least in theory) do the correct thing.

Regards,
Christian.

>
> -David
Sean Christopherson March 14, 2024, 2:45 p.m. UTC | #14
On Thu, Mar 14, 2024, Christian König wrote:
> Am 14.03.24 um 12:31 schrieb David Stevens:
> > On Thu, Mar 14, 2024 at 6:20 PM Christian König <christian.koenig@amd.com> wrote:
> > > > > > > > Well as far as I can see Christoph rejects the complexity coming with the
> > > > > > > > approach of sometimes grabbing the reference and sometimes not.
> > > > > > > Unless I've wildly misread multiple threads, that is not Christoph's objection.
> > > > > > >    From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
> > > > > > > 
> > > > > > >      On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>   wrote:
> > > > > > >      >
> > > > > > >      > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > > > > > >      > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > > > > > >      > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > > > > > >      > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> > > > > > >      > > remove that code and assume every driver in existence will do the right thing.
> > > > > > >      >
> > > > > > >      > Agreed.
> > > > > > >      >
> > > > > > >      > >
> > > > > > >      > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> > > > > > >      > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > > > > > >      > > maintenance cost.
> > > > > > >      >
> > > > > > >      > I tend to strongly disagree with that, though.  We can't just let these
> > > > > > >      > non-refcounted pages spread everywhere and instead need to fix their
> > > > > > >      > usage.
> > > > > > And I can only repeat myself that I completely agree with Christoph here.
> > > > > I am so confused.  If you agree with Christoph, why not fix the TTM allocations?
> > > > Because the TTM allocation isn't broken in any way.
> > > > 
> > > > See in some configurations TTM even uses the DMA API for those
> > > > allocations and that is actually something Christoph coded.
> > > > 
> > > > What Christoph is really pointing out is that absolutely nobody should
> > > > put non-refcounted pages into a VMA, but again this isn't something
> > > > TTM does. What TTM does instead is to work with the PFN and puts that
> > > > into a VMA.
> > > > 
> > > > It's just that then KVM comes along and converts the PFN back into a
> > > > struct page again and that is essentially what causes all the
> > > > problems, including CVE-2021-22543.
> > Does Christoph's objection come from my poorly worded cover letter and
> > commit messages, then?
> 
> Yes, that could certainly be.
> 
> > Fundamentally, what this series is doing is
> > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > MMU without inadvertently translating them into struct pages.
> 
> As far as I can tell that is really the right thing to do. Yes.

Christoph,

Can you please confirm that you don't object to KVM using follow_pte() to get
PFNs which happen to have an associated struct page?  We've gone in enough circles...
Sean Christopherson March 14, 2024, 4:17 p.m. UTC | #15
-Christ{oph,ian} to avoid creating more noise...

On Thu, Mar 14, 2024, David Stevens wrote:
> Because of that, the specific type of pfns that don't work right now are
> pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> non-refcounted pages in a bad choice of words. If that's correct, then
> perhaps this series should go a little bit further in modifying
> hva_to_pfn_remapped, but it isn't fundamentally wrong.

Loosely related to all of this, I have a mildly ambitious idea.  Well, one mildly
ambitious idea, and one crazy ambitious idea.  Crazy ambitious idea first...

Something we (GCE side of Google) have been eyeballing is adding support for huge
VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
into guests using hugepages.  One of the hiccups is that follow_pte() doesn't play
nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
Teaching follow_pte() to play nice with hugepages is probably doable, but making
sure all existing users are aware, maybe not so much.

My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
to grab references, and replace them with a new converged API that locklessly walks
host userspace page tables, and grabs the hugepage size along the way, e.g. so that
arch code wouldn't have to do a second walk of the page tables just to get the
hugepage size.

In other words, for the common case (mmu_notifier integration, no reference needed),
route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
only for write faults, to avoid CoW complications) before doing anything else.

Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
a user page is wasted effort and actively gets in the way.

I was initially hoping we could go super simple and use something like x86's
host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
be respected, e.g. to avoid mapping memfd_secret pages into KVM guests.  I.e. the
API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.

I can't tell if the payoff would be big enough to justify the effort involved, i.e.
having a single unified API for grabbing PFNs from the primary MMU might just be a
pie-in-the-sky type idea.

My second, less ambitious idea: the previously linked LWN[*] article about the
writeback issues reminded me of something that has bugged me for a long time.  IIUC,
getting a writable mapping from the primary MMU marks the page/folio dirty, and that
page/folio stays dirty until the data is written back and the mapping is made read-only.
And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
RW=>RO conversion completes, i.e. before the page/folio is marked clean.

I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
dropping any mmu_notifier-aware mapping) is completely unnecessary.  If that is the
case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
with FOLL_GET plumbed into hva_to_pfn(), we can:

  - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
    that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
    before the page/folio is cleaned, will *know* that they hold a refcounted struct
    page.

  - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
    KVM never needs to know if a SPTE points at a refcounted page.

In other words, double down on immediately doing put_page() after gup() if FOLL_GET
isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
that are acquired by follow_pte().

I suspect we can simply mark pages as accessed when a page is retrieved from the primary
MMU, as marking a page accessed when it's *removed* from the guest is rather nonsensical.
E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
the page accessed when KVM drops its SPTEs in response to the swap adds no value.  And
through the mmu_notifiers, KVM already plays nice with setups that use idle page
tracking to make reclaim decisions.

[*] https://lwn.net/Articles/930667
Sean Christopherson March 14, 2024, 5:19 p.m. UTC | #16
+Alex, who is looking at the huge-VM_PFNMAP angle in particular.

On Thu, Mar 14, 2024, Sean Christopherson wrote:
> -Christ{oph,ian} to avoid creating more noise...
> 
> On Thu, Mar 14, 2024, David Stevens wrote:
> > Because of that, the specific type of pfns that don't work right now are
> > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > non-refcounted pages in a bad choice of words. If that's correct, then
> > perhaps this series should go a little bit further in modifying
> > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> 
> Loosely related to all of this, I have a mildly ambitious idea.  Well, one mildly
> ambitious idea, and one crazy ambitious idea.  Crazy ambitious idea first...
> 
> Something we (GCE side of Google) have been eyeballing is adding support for huge
> VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> into guests using hugepages.  One of the hiccups is that follow_pte() doesn't play
> nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> Teaching follow_pte() to play nice with hugepage probably is doing, but making
> sure all existing users are aware, maybe not so much.
> 
> My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> to grab references, and replace them with a new converged API that locklessly walks
> host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> arch code wouldn't have to do a second walk of the page tables just to get the
> hugepage size.
> 
> In other words, for the common case (mmu_notifier integration, no reference needed),
> route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> only for write faults, to avoid CoW complications) before doing anything else.
> 
> Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> a user page is wasted effort and actively gets in the way.
> 
> I was initially hoping we could go super simple and use something like x86's
> host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> be respected, e.g. to avoid mapping memfd_secret pages into KVM guests.  I.e. the
> API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
> 
> I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> having a single unified API for grabbing PFNs from the primary MMU might just be a
> pie-in-the-sky type idea.
> 
> My second, less ambitious idea: the previously linked LWN[*] article about the
> writeback issues reminded me of something that has bugged me for a long time.  IIUC,
> getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> page/folio stays dirty until the data is written back and the mapping is made read-only.
> And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> 
> I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> dropping any mmu_notifier-aware mapping) is completely unnecessary.  If that is the
> case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> with FOLL_GET plumbed into hva_to_pfn(), we can:
> 
>   - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
>     that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
>     before the page/folio is cleaned, will *know* that they hold a refcounted struct
>     page.
> 
>   - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
>     KVM never needs to know if a SPTE points at a refcounted page.
> 
> In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> that are acquired by follow_pte().
> 
> I suspect we can simply mark pages as accessed when a page is retrieved from the primary
> MMU, as marking a page accessed when it's *removed* from the guest is rather nonsensical.
> E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> the page accessed when KVM drops its SPTEs in response to the swap adds no value.  And
> through the mmu_notifiers, KVM already plays nice with setups that use idle page
> tracking to make reclaim decisions.
> 
> [*] https://lwn.net/Articles/930667
Sean Christopherson March 15, 2024, 5:59 p.m. UTC | #17
On Thu, Mar 14, 2024, Sean Christopherson wrote:
> +Alex, who is looking at the huge-VM_PFNMAP angle in particular.

Oof, *Axel*.  Sorry Axel.

> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > -Christ{oph,ian} to avoid creating more noise...
> > 
> > On Thu, Mar 14, 2024, David Stevens wrote:
> > > Because of that, the specific type of pfns that don't work right now are
> > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > perhaps this series should go a little bit further in modifying
> > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> > 
> > Loosely related to all of this, I have a mildly ambitious idea.  Well, one mildly
> > ambitious idea, and one crazy ambitious idea.  Crazy ambitious idea first...
> > 
> > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > into guests using hugepages.  One of the hiccups is that follow_pte() doesn't play
> > nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > Teaching follow_pte() to play nice with hugepages probably is doable, but making
> > sure all existing users are aware, maybe not so much.
> > 
> > My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> > to grab references, and replace them with a new converged API that locklessly walks
> > host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> > arch code wouldn't have to do a second walk of the page tables just to get the
> > hugepage size.
> > 
> > In other words, for the common case (mmu_notifier integration, no reference needed),
> > route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> > only for write faults, to avoid CoW complications) before doing anything else.
> > 
> > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> > But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> > a user page is wasted effort and actively gets in the way.
> > 
> > I was initially hoping we could go super simple and use something like x86's
> > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests.  I.e. the
> > API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
> > 
> > I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> > having a single unified API for grabbing PFNs from the primary MMU might just be a
> > pie-in-the-sky type idea.
> > 
> > My second, less ambitious idea: the previously linked LWN[*] article about the
> > writeback issues reminded me of something that has bugged me for a long time.  IIUC,
> > getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> > page/folio stays dirty until the data is written back and the mapping is made read-only.
> > And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> > RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> > 
> > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > dropping any mmu_notifier-aware mapping) is completely unnecessary.  If that is the
> > case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> > with FOLL_GET plumbed into hva_to_pfn(), we can:
> > 
> >   - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> >     that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> >     before the page/folio is cleaned, will *know* that they hold a refcounted struct
> >     page.
> > 
> >   - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> >     KVM never needs to know if a SPTE points at a refcounted page.
> > 
> > In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> > isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> > that are acquired by follow_pte().
> > 
> > I suspect we can simply mark pages as accessed when a page is retrieved from the primary
> > MMU, as marking a page accessed when it's *removed* from the guest is rather nonsensical.
> > E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> > the page accessed when KVM drops its SPTEs in response to the swap adds no value.  And
> > through the mmu_notifiers, KVM already plays nice with setups that use idle page
> > tracking to make reclaim decisions.
> > 
> > [*] https://lwn.net/Articles/930667
Christoph Hellwig March 18, 2024, 1:26 a.m. UTC | #18
On Thu, Mar 14, 2024 at 12:51:40PM +0100, Christian König wrote:
> > Does Christoph's objection come from my poorly worded cover letter and
> > commit messages, then?
> 
> Yes, that could certainly be.

That's definitely a big part of it, but I think not the only one.

> > Fundamentally, what this series is doing is
> > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > MMU without inadvertently translating them into struct pages.
> 
> As far as I can tell that is really the right thing to do. Yes.

IFF your callers don't need pages and you just want to track the
mapping in the shadow mmu and never take a refcount, that is a good
thing.

But unless I completely misunderstood the series that doesn't seem
to be the case - it builds a new kvm_follow_pfn API which is another
of these weird multiplexers like get_user_pages that can do tons of
different things depending on the flags.  And some of that still
grabs the refcount, right?
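
(For reference, the argument-struct shape being discussed looks roughly
like the sketch below; the field list is approximate, only the names
called out in the series changelog are known to match.)

  /* Approximate sketch of the kvm_follow_pfn() argument struct; the
   * field list here is illustrative, not the exact series code. */
  struct kvm_follow_pfn {
          const struct kvm_memory_slot *slot;
          gfn_t gfn;
          unsigned int flags;     /* FOLL_WRITE, FOLL_GET, ... */
          /* in: caller can handle a pfn whose struct page isn't refcounted */
          bool allow_non_refcounted_struct_page;
          /* out: the page a reference was taken on, NULL otherwise */
          struct page *refcounted_page;
  };

  kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp);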

> Completely agree. In my thinking when you go a step further and offload
> grabbing the page reference to get_user_pages() then you are always on the
> safe side.

Agreed.

> Because then it is no longer the responsibility of the KVM code to get all
> the rules around that right, instead you are relying on a core functionality
> which should (at least in theory) do the correct thing.

Exactly.
Paolo Bonzini March 18, 2024, 1:10 p.m. UTC | #19
On Mon, Mar 18, 2024 at 3:07 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > Fundamentally, what this series is doing is
> > > allowing pfns returned by follow_pte to be mapped into KVM's shadow
> > > MMU without inadvertently translating them into struct pages.
> >
> > As far as I can tell that is really the right thing to do. Yes.
>
> IFF your callers don't need pages and you just want to track the
> mapping in the shadow mmu and never take a refcount that is a good
> thing.

Yes, that's the case and for everything else we can use a function
that goes from guest physical address to struct page with a reference
taken, similar to the current gfn_to_page family of functions.

> But unless I completely misunderstood the series that doesn't seem
> to be the case - it builds a new kvm_follow_pfn API which is another
> > of these weird multiplexers like get_user_pages that can do tons of
> different things depending on the flags.  And some of that still
> grabs the refcount, right?

Yes, for a couple reasons. First, a lot of the lookup logic is shared
by the two cases; second, it's easier for both developers and
reviewers if you first convert to the new API, and remove the refcount
in a separate commit. Also, you have to do this for every architecture
since we agree that this is the direction that all of them should move
to.

So what we could do, would be to start with something like David's
patches, and move towards forbidding the FOLL_GET flag (the case that
returns the page with elevated refcount) in the new kvm_follow_pfn()
API.

Another possibility is to have a double-underscore version that allows
FOLL_GET, and have the "clean" kvm_follow_pfn() forbid it. So you
would still have the possibility to convert to __kvm_follow_pfn() with
FOLL_GET first, and then when you remove the refcount you switch to
kvm_follow_pfn().
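
A minimal sketch of that split, purely to show the shape (not code from
the series):

  /* Internal variant: un-migrated callers may still pass FOLL_GET. */
  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *kfp);

  /* "Clean" variant: refuses to take a reference on the caller's behalf. */
  static inline kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp)
  {
          if (WARN_ON_ONCE(kfp->flags & FOLL_GET))
                  return KVM_PFN_ERR_FAULT;

          return __kvm_follow_pfn(kfp);
  }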


Paolo

> > Completely agree. In my thinking when you go a step further and offload
> > grabbing the page reference to get_user_pages() then you are always on the
> > safe side.
>
> Agreed.
>
> > Because then it is no longer the responsibility of the KVM code to get all
> > the rules around that right, instead you are relying on a core functionality
> > which should (at least in theory) do the correct thing.
>
> Exactly.
>
Christoph Hellwig March 18, 2024, 11:20 p.m. UTC | #20
On Mon, Mar 18, 2024 at 02:10:55PM +0100, Paolo Bonzini wrote:
> Another possibility is to have a double-underscore version that allows
> FOLL_GET, and have the "clean" kvm_follow_pfn() forbid it. So you
> would still have the possibility to convert to __kvm_follow_pfn() with
> FOLL_GET first, and then when you remove the refcount you switch to
> kvm_follow_pfn().

That does sound much better.  Then again anything that actually wants
pages (either for a good reason or historic reasons) really should be
using get/pin_user_pages anyway and not use follow_pte.
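
E.g. a caller that genuinely needs to hold on to the page would do
something along these lines (illustrative only, not a real caller):

  /* Illustrative sketch: pin the page via the pin_user_pages family
   * instead of going through follow_pte(). */
  static int example_pin_one_page(unsigned long addr, struct page **page)
  {
          if (pin_user_pages_fast(addr, 1, FOLL_WRITE, page) != 1)
                  return -EFAULT;

          /* ... use *page ... */

          unpin_user_page(*page);
          return 0;
  }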
Axel Rasmussen March 20, 2024, 8:54 p.m. UTC | #21
On Fri, Mar 15, 2024 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > +Alex, who is looking at the huge-VM_PFNMAP angle in particular.
>
> Oof, *Axel*.  Sorry Axel.


No worries. Believe it or not this happens frequently. :) I'm well
past being bothered about it.

>
>
> > On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > > -Christ{oph,ian} to avoid creating more noise...
> > >
> > > On Thu, Mar 14, 2024, David Stevens wrote:
> > > > Because of that, the specific type of pfns that don't work right now are
> > > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > > perhaps this series should go a little bit further in modifying
> > > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> > >
> > > Loosely related to all of this, I have a mildly ambitious idea.  Well, one mildly
> > > ambitious idea, and one crazy ambitious idea.  Crazy ambitious idea first...
> > >
> > > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > > into guests using hugepages.  One of the hiccups is that follow_pte() doesn't play
> > > nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > > Teaching follow_pte() to play nice with hugepages probably is doable, but making
> > > sure all existing users are aware, maybe not so much.


Right. The really basic idea I had was to modify remap_pfn_range so
that, if possible (if it sees a (sub)range which is aligned and big
enough), it just installs a leaf pud or pmd instead of always going
down to the pte level.

follow_pte is problematic, though, because it specifically returns a
pte, so it's unclear to me how to have it "just work" for existing
callers with an area mapped this way. So I think the idea would be to
change follow_pte to detect this case and bail out (-EINVAL I guess),
and then add some new function which can properly support these
mappings.
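
Hand-waving the locking and the none/bad checks, the bail-out could
look roughly like this (the function name is made up):

  /*
   * Hypothetical sketch only: walk down to the PMD and, if the mapping is
   * a leaf (huge) entry, return -EINVAL so existing follow_pte() callers
   * never see a huge PFNMAP by accident.  A separate new helper (not
   * shown) would handle the huge case.  None/bad entry checks omitted.
   */
  static int example_check_for_huge_pfnmap(struct mm_struct *mm,
                                           unsigned long address)
  {
          pgd_t *pgd = pgd_offset(mm, address);
          p4d_t *p4d = p4d_offset(pgd, address);
          pud_t *pud = pud_offset(p4d, address);
          pmd_t *pmd;

          if (pud_leaf(*pud))
                  return -EINVAL;         /* 1GiB leaf mapping */

          pmd = pmd_offset(pud, address);
          if (pmd_leaf(*pmd))
                  return -EINVAL;         /* 2MiB leaf mapping */

          return 0;                       /* safe to walk down to the pte */
  }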

>
> > >
> > > My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> > > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> > > to grab references, and replace them with a new converged API that locklessly walks
> > > host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> > > arch code wouldn't have to do a second walk of the page tables just to get the
> > > hugepage size.
> > >
> > > In other words, for the common case (mmu_notifier integration, no reference needed),
> > > route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> > > only for write faults, to avoid CoW complications) before doing anything else.
> > >
> > > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > > converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> > > But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> > > a user page is wasted effort and actively gets in the way.
> > >
> > > I was initially hoping we could go super simple and use something like x86's
> > > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests.  I.e. the
> > > API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
> > >
> > > I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> > > having a single unified API for grabbing PFNs from the primary MMU might just be a
> > > pie-in-the-sky type idea.

Yeah, I have been thinking about this.

One thing is, KVM is not the only caller of follow_pte today. At least
naively it would be nice if any existing callers could benefit from
this new support, not just KVM.

Another thing I noticed is, most callers don't need much; they don't
need a reference to the page, they don't even really need the pte
itself. Most callers just want a ptl held, and they want to know the
pgprot flags or whether the pte is writable or not, and they want to
know the pfn for this address. IOW follow_pte is sort of overkill for
most callers.
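
Something along these lines (names and shape entirely hypothetical)
would probably cover most of them:

  /*
   * Hypothetical sketch of a minimal mm/ helper covering what most
   * follow_pte() callers actually want: the pfn, its protections, and
   * the page table lock held while they look at it.  Names are made up.
   */
  struct pfn_lookup_args {
          unsigned long pfn;
          pgprot_t pgprot;
          bool writable;
          spinlock_t *ptl;        /* held on success, dropped by _end() */
  };

  int pfn_lookup_start(struct vm_area_struct *vma, unsigned long address,
                       struct pfn_lookup_args *args);
  void pfn_lookup_end(struct pfn_lookup_args *args);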

KVM is a bit different; it does call GUP to get the page. One other
thing is, KVM at least on x86 also cares about the "level" of the
mapping, in host_pfn_mapping_level(). That code is all fairly x86
specific so I'm not sure how to generalize it. Also I haven't looked
closely at what's going on for other architectures.

I'm not sure yet where is the right place to end up, but I at least
think it's worth trying to build some general API under mm/ which
supports these various things.

In general I'm thinking of proceeding in two steps. First,
enlightening remap_pfn_range, updating follow_pte to return -EINVAL,
and then adding some new function which tries to be somewhat close to
a drop-in replacement for existing follow_pte callers. Once I get that
proof of concept working / tested, I think the next step is figuring
out how we can extend it a bit to build some more general solution
like Sean is describing here.

> > >
> > > My second, less ambitious idea: the previously linked LWN[*] article about the
> > > writeback issues reminded me of something that has bugged me for a long time.  IIUC,
> > > getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> > > page/folio stays dirty until the data is written back and the mapping is made read-only.
> > > And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> > > RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> > >
> > > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > > dropping any mmu_notifier-aware mapping) is completely unnecessary.  If that is the
> > > case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> > > with FOLL_GET plumbed into hva_to_pfn(), we can:
> > >
> > >   - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> > >     that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> > >     before the page/folio is cleaned, will *know* that they hold a refcounted struct
> > >     page.
> > >
> > >   - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> > >     KVM never needs to know if a SPTE points at a refcounted page.
> > >
> > > In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> > > isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> > > that are acquired by follow_pte().
> > >
> > > I suspect we can simply mark pages as accessed when a page is retrieved from the primary
> > > MMU, as marking a page accessed when it's *removed* from the guest is rather nonsensical.
> > > E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> > > the page accessed when KVM drops its SPTEs in response to the swap adds no value.  And
> > > through the mmu_notifiers, KVM already plays nice with setups that use idle page
> > > tracking to make reclaim decisions.
> > >
> > > [*] https://lwn.net/Articles/930667
Sean Christopherson June 21, 2024, 6:32 p.m. UTC | #22
On Thu, Feb 29, 2024, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.
> 
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page().
> 
> First, this series replaces the gfn_to_pfn_memslot() API with a more
> extensible kvm_follow_pfn() API. The updated API rearranges
> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> bool arguments into a FOLL_ flags argument. The refactoring changes do
> not change any behavior.
> 
> From there, this series extends the kvm_follow_pfn() API so that
> non-refcounted pages can be safely handled. This involves adding an
> input parameter to indicate whether the caller can safely use
> non-refcounted pfns and an output parameter to tell the caller whether
> or not the returned page is refcounted. This change includes a breaking
> change, by disallowing non-refcounted pfn mappings by default, as such
> mappings are unsafe. To allow such systems to continue to function, an
> opt-in module parameter is added to allow the unsafe behavior.
> 
> This series only adds support for non-refcounted pages to x86. Other
> MMUs can likely be updated without too much difficulty, but it is not
> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> straightforward [2].

FYI, on the off chance that someone else is eyeballing this, I am working on
revamping this series.  It's still a ways out, but I'm optimistic that we'll be
able to address the concerns raised by Christoph and Christian, and maybe even
get KVM out of the weeds straightaway (PPC looks thorny :-/).
Alex Bennée July 31, 2024, 11:41 a.m. UTC | #23
Sean Christopherson <seanjc@google.com> writes:

> On Thu, Feb 29, 2024, David Stevens wrote:
>> From: David Stevens <stevensd@chromium.org>
>> 
>> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
>> that is backed by struct pages that aren't currently being refcounted
>> (e.g. tail pages of non-compound higher order allocations) into the
>> guest.
>> 
>> Our use case is virtio-gpu blob resources [1], which directly map host
>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>> This feature currently does not work on systems using the amdgpu driver,
>> as that driver allocates non-compound higher order pages via
>> ttm_pool_alloc_page().
>> 
>> First, this series replaces the gfn_to_pfn_memslot() API with a more
>> extensible kvm_follow_pfn() API. The updated API rearranges
>> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
>> bool arguments into a FOLL_ flags argument. The refactoring changes do
>> not change any behavior.
>> 
>> From there, this series extends the kvm_follow_pfn() API so that
>> non-refcounted pages can be safely handled. This involves adding an
>> input parameter to indicate whether the caller can safely use
>> non-refcounted pfns and an output parameter to tell the caller whether
>> or not the returned page is refcounted. This change includes a breaking
>> change, by disallowing non-refcounted pfn mappings by default, as such
>> mappings are unsafe. To allow such systems to continue to function, an
>> opt-in module parameter is added to allow the unsafe behavior.
>> 
>> This series only adds support for non-refcounted pages to x86. Other
>> MMUs can likely be updated without too much difficulty, but it is not
>> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
>> straightforward [2].
>
> FYI, on the off chance that someone else is eyeballing this, I am working on
> revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> able to address the concerns raised by Christoph and Christian, and maybe even
> get KVM out of the weeds straightaway (PPC looks thorny :-/).

I've applied this series to the latest 6.9.x while attempting to
diagnose some of the virtio-gpu problems it may or may not address.
However, launching KVM guests keeps triggering a bunch of BUGs that
eventually leave a hung guest:

  12:16:54 [root@draig:~] # dmesg -c                                                                                                                                           
  [252080.141629] RAX: ffffffffffffffda RBX: 0000560a64915500 RCX: 00007faa23e81c5b                                                                                            
  [252080.141629] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017                                                                                            
  [252080.141630] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000                                                                                            
  [252080.141630] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000                                                                                            
  [252080.141631] R13: 0000000000000001 R14: 00000000000000b2 R15: 0000000000000002                                                                                            
  [252080.141632]  </TASK>                                                                                                                                                     
  [252080.141632] BUG: Bad page state in process CPU 0/KVM  pfn:fb1665                                                                                                         
  [252080.141633] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x7fa8117c3 pfn:0xfb1665                                                                          
  [252080.141633] flags: 0x17ffffc00a000c(referenced|uptodate|mappedtodisk|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)                                                       
  [252080.141634] page_type: 0x0()                                                                                                                                             
  [252080.141635] raw: 0017ffffc00a000c dead000000000100 dead000000000122 0000000000000000                                                                                     
  [252080.141635] raw: 00000007fa8117c3 0000000000000000 0000000000000000 0000000000000000                                                                                     
  [252080.141635] page dumped because: nonzero mapcount                                                                                                                        
  [252080.141636] Modules linked in: vhost_net vhost vhost_iotlb tap tun uas usb_storage veth cfg80211 nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter nft_ma
  sq wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer s
  nd_seq xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetl
  ink bridge stp llc qrtr overlay cmac algif_hash algif_skcipher af_alg bnep binfmt_misc squashfs snd_hda_codec_hdmi intel_uncore_frequency snd_ctl_led intel_uncore_frequency_
  common ledtrig_audio x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel soundwire_generic_allocation btu
  sb snd_sof_intel_hda_mlink sd_mod soundwire_cadence btrtl snd_hda_codec_realtek kvm sg snd_sof_intel_hda btintel snd_sof_pci btbcm snd_hda_codec_generic btmtk               
  [252080.141656]  snd_sof_xtensa_dsp crc32_pclmul bluetooth snd_hda_scodec_component ghash_clmulni_intel snd_sof sha256_ssse3 sha1_ssse3 snd_sof_utils snd_soc_hdac_hda snd_hd
  a_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress soundwire_bus sha3_generic jitterentropy_rng aesni_intel snd_hda_intel snd_intel_dspcfg crypto_sim
  d sha512_ssse3 snd_intel_sdw_acpi cryptd sha512_generic uvcvideo snd_hda_codec snd_usb_audio videobuf2_vmalloc uvc ctr videobuf2_memops snd_hda_core snd_usbmidi_lib videobuf
  2_v4l2 snd_rawmidi drbg snd_hwdep dell_wmi snd_seq_device nls_ascii ahci ansi_cprng iTCO_wdt processor_thermal_device_pci videodev nls_cp437 snd_pcm intel_pmc_bxt dell_smbio
  s libahci processor_thermal_device rapl rtsx_pci_sdmmc iTCO_vendor_support ecdh_generic mmc_core mei_hdcp watchdog libata intel_rapl_msr videobuf2_common rfkill vfat process
  or_thermal_wt_hint pl2303 snd_timer dcdbas dell_wmi_ddv dell_wmi_sysman processor_thermal_rfim ucsi_acpi fat intel_cstate usbserial intel_uncore cdc_acm mc battery ecc      
  [252080.141670]  firmware_attributes_class dell_wmi_descriptor wmi_bmof dell_smm_hwmon processor_thermal_rapl pcspkr scsi_mod mei_me intel_lpss_pci snd typec_ucsi igc e1000e
   i2c_i801 rtsx_pci intel_rapl_common intel_lpss roles mei soundcore processor_thermal_wt_req i2c_smbus idma64 scsi_common processor_thermal_power_floor typec processor_therm
  al_mbox button intel_pmc_core int3403_thermal int340x_thermal_zone intel_vsec pmt_telemetry intel_hid int3400_thermal pmt_class sparse_keymap acpi_tad acpi_pad acpi_thermal_
  rel msr parport_pc ppdev lp parport fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_microsoft joydev ff_memless hid_generic usb
  hid hid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq evdev dm_mod i915 i2c_algo_bit drm_buddy ttm drm_display_helper xhci_pci xhci_hcd drm_kms_helper nvme nvm
  e_core drm t10_pi usbcore video crc64_rocksoft crc64 crc_t10dif cec crct10dif_generic crct10dif_pclmul crc32c_intel rc_core usb_common crct10dif_common wmi                  
  [252080.141686]  pinctrl_alderlake                                                                                                                                           
  [252080.141686] CPU: 8 PID: 1819169 Comm: CPU 0/KVM Tainted: G    B   W          6.9.12-ajb-00008-gfcd4b7efbad0 #17                                                          
  [252080.141687] Hardware name: Dell Inc. Precision 3660/0PRR48, BIOS 2.8.1 08/14/2023                                                                                        
  [252080.141688] Call Trace:                                                                                                                                                  
  [252080.141688]  <TASK>                                                                                                                                                      
  [252080.141688]  dump_stack_lvl+0x60/0x80                                                                                                                                    
  [252080.141689]  bad_page+0x70/0x100                                                                                                                                         
  [252080.141690]  free_unref_page_prepare+0x22a/0x370                                                                                                                         
  [252080.141692]  free_unref_folios+0xe5/0x340                                                                                                                                
  [252080.141693]  ? __mem_cgroup_uncharge_folios+0x7a/0xa0                                                                                                                    
  [252080.141694]  folios_put_refs+0x147/0x1e0                                                                                                                                 
  [252080.141696]  ? __pfx_lru_add_fn+0x10/0x10                                                                                                                                
  [252080.141697]  folio_batch_move_lru+0xc8/0x140                                                                                                                             
  [252080.141699]  folio_add_lru+0x51/0xa0                                                                                                                                     
  [252080.141700]  do_wp_page+0x4dd/0xb60                                                                                                                                      
  [252080.141701]  __handle_mm_fault+0xb2a/0xe30          
  [252080.141703]  handle_mm_fault+0x18c/0x320                                                                                                                                 
  [252080.141704]  __get_user_pages+0x164/0x6f0                                                                                                                                
  [252080.141705]  get_user_pages_unlocked+0xe2/0x370                                                                                                                          
  [252080.141706]  hva_to_pfn+0xa0/0x740 [kvm]                                                                                                                                 
  [252080.141724]  kvm_faultin_pfn+0xf3/0x5f0 [kvm]                                                                                                                            
  [252080.141750]  kvm_tdp_page_fault+0x100/0x150 [kvm]                                                                                                                        
  [252080.141774]  kvm_mmu_page_fault+0x27e/0x7f0 [kvm]                                                                                                                        
  [252080.141798]  ? em_rsm+0xad/0x170 [kvm]                                                                                                                                   
  [252080.141823]  ? writeback_registers+0x44/0x80 [kvm]                                                                                                                       
  [252080.141848]  ? vmx_set_cr0+0xc7/0x1320 [kvm_intel]                                                                                                                       
  [252080.141853]  ? x86_emulate_insn+0x484/0xe60 [kvm]                                                                                                                        
  [252080.141877]  ? vmx_vmexit+0x6e/0xd0 [kvm_intel]                                                                                                                          
  [252080.141882]  ? vmx_vmexit+0x99/0xd0 [kvm_intel]                                                                                                                          
  [252080.141887]  vmx_handle_exit+0x129/0x930 [kvm_intel]                                                                                                                     
  [252080.141892]  kvm_arch_vcpu_ioctl_run+0x682/0x15b0 [kvm]                                                                                                                  
  [252080.141918]  kvm_vcpu_ioctl+0x23d/0x6f0 [kvm]                                                                                                                            
  [252080.141936]  ? __seccomp_filter+0x32f/0x500                                                                                                                              
  [252080.141937]  ? kvm_io_bus_read+0x42/0xd0 [kvm]                                                                                                                           
  [252080.141956]  __x64_sys_ioctl+0x90/0xd0                                                                                                                                   
  [252080.141957]  do_syscall_64+0x80/0x190                                                                                                                                    
  [252080.141958]  ? kvm_arch_vcpu_put+0x126/0x160 [kvm]                                                                                                                       
  [252080.141982]  ? vcpu_put+0x1e/0x50 [kvm]                                                                                                                                  
  [252080.141999]  ? kvm_arch_vcpu_ioctl_run+0x757/0x15b0 [kvm]                                                                                                                
  [252080.142023]  ? kvm_vcpu_ioctl+0x29e/0x6f0 [kvm]                                                                                                                          
  [252080.142040]  ? __seccomp_filter+0x32f/0x500                                                                                                                              
  [252080.142042]  ? kvm_on_user_return+0x60/0x90 [kvm]                                                                                                                        
  [252080.142065]  ? fire_user_return_notifiers+0x30/0x60                                                                                                                      
  [252080.142066]  ? syscall_exit_to_user_mode+0x73/0x200                                                                                                                      
  [252080.142067]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142068]  ? kvm_on_user_return+0x60/0x90 [kvm]                                                                                                                        
  [252080.142090]  ? fire_user_return_notifiers+0x30/0x60                                                                                                                      
  [252080.142091]  ? syscall_exit_to_user_mode+0x73/0x200                                                                                                                      
  [252080.142092]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142093]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142094]  ? do_syscall_64+0x8c/0x190                                                                                                                                  
  [252080.142095]  ? exc_page_fault+0x72/0x170                                                                                                                                 
  [252080.142096]  entry_SYSCALL_64_after_hwframe+0x76/0x7e                                                                                                                    

This backtrace repeats for a large chunk of pfns.
Sean Christopherson July 31, 2024, 3:01 p.m. UTC | #24
On Wed, Jul 31, 2024, Alex Bennée wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Feb 29, 2024, David Stevens wrote:
> >> From: David Stevens <stevensd@chromium.org>
> >> 
> >> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> >> that is backed by struct pages that aren't currently being refcounted
> >> (e.g. tail pages of non-compound higher order allocations) into the
> >> guest.
> >> 
> >> Our use case is virtio-gpu blob resources [1], which directly map host
> >> graphics buffers into the guest as "vram" for the virtio-gpu device.
> >> This feature currently does not work on systems using the amdgpu driver,
> >> as that driver allocates non-compound higher order pages via
> >> ttm_pool_alloc_page().
> >> 
> >> First, this series replaces the gfn_to_pfn_memslot() API with a more
> >> extensible kvm_follow_pfn() API. The updated API rearranges
> >> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> >> bool arguments into a FOLL_ flags argument. The refactoring changes do
> >> not change any behavior.
> >> 
> >> From there, this series extends the kvm_follow_pfn() API so that
> >> non-refcounted pages can be safely handled. This involves adding an
> >> input parameter to indicate whether the caller can safely use
> >> non-refcounted pfns and an output parameter to tell the caller whether
> >> or not the returned page is refcounted. This change includes a breaking
> >> change, by disallowing non-refcounted pfn mappings by default, as such
> >> mappings are unsafe. To allow such systems to continue to function, an
> >> opt-in module parameter is added to allow the unsafe behavior.
> >> 
> >> This series only adds support for non-refcounted pages to x86. Other
> >> MMUs can likely be updated without too much difficulty, but it is not
> >> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> >> straightforward [2].
> >
> > FYI, on the off chance that someone else is eyeballing this, I am working on
> > revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> > able to address the concerns raised by Christoph and Christian, and maybe even
> > get KVM out of the weeds straightaway (PPC looks thorny :-/).
> 
> I've applied this series to the latest 6.9.x while attempting to
> diagnose some of the virtio-gpu problems it may or may not address.
> However, launching KVM guests keeps triggering a bunch of BUGs that
> eventually leave a hung guest:

Can you give v12 (which is comically large) a spin?  I still need to do more
testing, but if it too is buggy, I definitely want to know sooner than later.

Thanks!

https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
David Stevens Aug. 5, 2024, 11:44 p.m. UTC | #25
On Wed, Jul 31, 2024 at 8:41 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Thu, Feb 29, 2024, David Stevens wrote:
> >> From: David Stevens <stevensd@chromium.org>
> >>
> >> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> >> that is backed by struct pages that aren't currently being refcounted
> >> (e.g. tail pages of non-compound higher order allocations) into the
> >> guest.
> >>
> >> Our use case is virtio-gpu blob resources [1], which directly map host
> >> graphics buffers into the guest as "vram" for the virtio-gpu device.
> >> This feature currently does not work on systems using the amdgpu driver,
> >> as that driver allocates non-compound higher order pages via
> >> ttm_pool_alloc_page().
> >>
> >> First, this series replaces the gfn_to_pfn_memslot() API with a more
> >> extensible kvm_follow_pfn() API. The updated API rearranges
> >> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
> >> bool arguments into a FOLL_ flags argument. The refactoring changes do
> >> not change any behavior.
> >>
> >> From there, this series extends the kvm_follow_pfn() API so that
> >> non-refcounted pages can be safely handled. This involves adding an
> >> input parameter to indicate whether the caller can safely use
> >> non-refcounted pfns and an output parameter to tell the caller whether
> >> or not the returned page is refcounted. This change includes a breaking
> >> change, by disallowing non-refcounted pfn mappings by default, as such
> >> mappings are unsafe. To allow such systems to continue to function, an
> >> opt-in module parameter is added to allow the unsafe behavior.
> >>
> >> This series only adds support for non-refcounted pages to x86. Other
> >> MMUs can likely be updated without too much difficulty, but it is not
> >> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
> >> straightforward [2].
> >
> > FYI, on the off chance that someone else is eyeballing this, I am working on
> > revamping this series.  It's still a ways out, but I'm optimistic that we'll be
> > able to address the concerns raised by Christoph and Christian, and maybe even
> > get KVM out of the weeds straightaway (PPC looks thorny :-/).
>
> I've applied this series to the latest 6.9.x while attempting to
> diagnose some of the virtio-gpu problems it may or may not address.
> However, launching KVM guests keeps triggering a bunch of BUGs that
> eventually leave a hung guest:
>

Likely the same issue as [1]. Commit d02c357e5bfa added another call
to kvm_release_pfn_clean() in kvm_faultin_pfn(), which ends up
releasing a reference that is no longer being taken. If you replace
that with kvm_set_page_accessed() instead, then things should work
again. I didn't send out a rebased version of the series, since Sean's
work supersedes my series.
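
Concretely, the workaround looks roughly like this (paraphrased from
memory, not an exact diff against that commit):

  /*
   * Paraphrased sketch, not an exact diff: in the early-retry path that
   * commit d02c357e5bfa added to kvm_faultin_pfn(), the extra reference
   * release is replaced by an "accessed" hint, since with this series
   * applied no reference is held at that point.
   */
          if (fault->slot &&
              mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
                  /* was: kvm_release_pfn_clean(fault->pfn); */
                  kvm_set_page_accessed(pfn_to_page(fault->pfn));
                  return RET_PF_RETRY;
          }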

-David

[1] https://lore.kernel.org/lkml/15865985-4688-4b7e-9f2d-89803adb8f5b@collabora.com/