[v9,0/6] KVM: allow mapping non-refcounted pages

Message ID 20230911021637.1941096-1-stevensd@google.com

Message

David Stevens Sept. 11, 2023, 2:16 a.m. UTC
From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.
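
For background, the problematic allocation pattern can be sketched as
follows (a sketch of generic kernel allocator behavior, not code from
the series or from amdgpu):

  /*
   * A higher-order allocation made without __GFP_COMP is non-compound:
   * only the head page carries a reference, while the tail pages have
   * a refcount of zero even though they are valid struct pages backing
   * real memory.
   */
  struct page *head = alloc_pages(GFP_KERNEL, 2);  /* 4 contiguous pages */

  WARN_ON(page_ref_count(head) != 1);      /* head page is refcounted */
  WARN_ON(page_ref_count(head + 1) != 0);  /* tail pages are not */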

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_faultin_pfn API. The updated API gathers
__gfn_to_pfn_memslot's arguments into a struct and, where possible,
packs the boolean arguments into a FOLL_ flags argument. The
refactoring does not change any behavior.
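
As a rough sketch, the new API has the following shape (simplified;
the flags and is_refcounted_page fields come up later in this thread,
while the remaining fields and their names are illustrative
assumptions):

  struct kvm_follow_pfn {
	/* inputs */
	const struct kvm_memory_slot *slot;  /* assumed field */
	gfn_t gfn;                           /* assumed field */
	unsigned int flags;                  /* FOLL_WRITE, FOLL_NOWAIT, ... */

	/* output: whether the resolved pfn's page is refcounted */
	bool is_refcounted_page;
  };

  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);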

From there, this series extends the __kvm_faultin_pfn API so that
non-refcounted pages can be safely handled. This involves adding an
input parameter to indicate whether the caller can safely use
non-refcounted pfns and an output parameter to tell the caller whether
or not the returned page is refcounted. This is a breaking change:
non-refcounted pfn mappings are now disallowed by default, since such
mappings are unsafe. To allow systems that rely on them to continue to
function, an opt-in module parameter is added that permits the unsafe
behavior.
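
A migrated caller might then look roughly like this (a sketch, not
code from the series; .allow_non_refcounted is an assumed name for the
opt-in input parameter described above):

  struct kvm_follow_pfn foll = {
	.slot = slot,
	.gfn = gfn,
	.flags = FOLL_WRITE,
	/* Assumed name: this caller can safely use non-refcounted pfns. */
	.allow_non_refcounted = true,
  };
  kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

  if (is_error_noslot_pfn(pfn))
	return -EFAULT;

  /* ... install the pfn in the guest page tables ... */

  /* Drop the gup() reference only if one was actually taken. */
  if (foll.is_refcounted_page)
	put_page(pfn_to_page(pfn));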

This series only adds support for non-refcounted pages to x86. Other
MMUs can likely be updated without too much difficulty, but it is not
needed at this point. Updating other parts of KVM (e.g. pfncache) is not
straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

v8 -> v9:
 - Make paying attention to is_refcounted_page mandatory. This means
   that FOLL_GET is no longer necessary. For compatibility with
   un-migrated callers, add a temporary parameter to sidestep
   ref-counting issues.
 - Add allow_unsafe_mappings, which is a breaking change.
 - Migrate kvm_vcpu_map and other callsites used by x86 to the new API.
 - Drop arm and ppc changes.
v7 -> v8:
 - Set access bits before releasing mmu_lock.
 - Pass FOLL_GET on 32-bit x86 or !tdp_enabled.
 - Refactor FOLL_GET handling, add kvm_follow_refcounted_pfn helper.
 - Set refcounted bit on >4k pages.
 - Add comments and apply formatting suggestions.
 - Rebase on the kvm/next branch.
v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_faultin_pfn,
   and extend that API to support non-refcounted pages (complete
   rewrite).

David Stevens (5):
  KVM: mmu: Introduce __kvm_follow_pfn function
  KVM: mmu: Improve handling of non-refcounted pfns
  KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn
  KVM: x86: Migrate to __kvm_follow_pfn
  KVM: x86/mmu: Handle non-refcounted pages

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
    accessed/dirty

 arch/x86/kvm/mmu/mmu.c          |  93 +++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |   8 +-
 arch/x86/kvm/mmu/spte.c         |   4 +-
 arch/x86/kvm/mmu/spte.h         |  12 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  22 ++-
 arch/x86/kvm/x86.c              |  12 +-
 include/linux/kvm_host.h        |  42 ++++-
 virt/kvm/kvm_main.c             | 294 +++++++++++++++++++-------------
 virt/kvm/kvm_mm.h               |   3 +-
 virt/kvm/pfncache.c             |  11 +-
 11 files changed, 339 insertions(+), 163 deletions(-)

Comments

Christoph Hellwig Sept. 29, 2023, 5:19 a.m. UTC | #1
On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.
> 
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page.
> 
> First, this series replaces the __gfn_to_pfn_memslot API with a more
> extensible __kvm_faultin_pfn API. The updated API gathers
> __gfn_to_pfn_memslot's arguments into a struct and, where possible,
> packs the boolean arguments into a FOLL_ flags argument. The
> refactoring does not change any behavior.

Instead of adding hacks to kvm you really should fix the driver / TTM
to not do weird memory allocations.
Sean Christopherson Sept. 29, 2023, 4:06 p.m. UTC | #2
On Thu, Sep 28, 2023, Christoph Hellwig wrote:
> On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > 
> > This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> > that is backed by struct pages that aren't currently being refcounted
> > (e.g. tail pages of non-compound higher order allocations) into the
> > guest.
> > 
> > Our use case is virtio-gpu blob resources [1], which directly map host
> > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > This feature currently does not work on systems using the amdgpu driver,
> > as that driver allocates non-compound higher order pages via
> > ttm_pool_alloc_page.
> > 
> > First, this series replaces the __gfn_to_pfn_memslot API with a more
> > extensible __kvm_faultin_pfn API. The updated API gathers
> > __gfn_to_pfn_memslot's arguments into a struct and, where possible,
> > packs the boolean arguments into a FOLL_ flags argument. The
> > refactoring does not change any behavior.
> 
> Instead of adding hacks to kvm you really should fix the driver / TTM
> to not do weird memory allocations.

I agree that the driver/TTM behavior is nasty, but from a KVM perspective the vast
majority of this series is long-overdue cleanups (at least, IMO).  All of those
cleanups were my requirement for adding support for the behavior David and friends
actually care about.

KVM needs to be aware of non-refcounted struct page memory no matter what; see
CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but
non-reference-counted pages").  I don't think it makes any sense whatsoever to
remove that code and assume every driver in existence will do the right thing.

With the cleanups done, playing nice with non-refcounted pages instead of outright
rejecting them is a wash in terms of lines of code, complexity, and ongoing
maintenance cost.
Christoph Hellwig Oct. 2, 2023, 6:25 a.m. UTC | #3
On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> KVM needs to be aware of non-refcounted struct page memory no matter what; see
> CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but
> non-reference-counted pages").  I don't think it makes any sense whatsoever to
> remove that code and assume every driver in existence will do the right thing.

Agreed.

> 
> With the cleanups done, playing nice with non-refcounted pages instead of outright
> rejecting them is a wash in terms of lines of code, complexity, and ongoing
> maintenance cost.

I tend to strongly disagree with that, though.  We can't just let these
non-refcounted pages spread everywhere and instead need to fix their
usage.
David Stevens Oct. 31, 2023, 4:30 a.m. UTC | #4
Sean, have you been waiting for a new patch series with responses to
Maxim's comments? I'm not really familiar with kernel contribution
etiquette, but I was hoping to get your feedback before spending the
time to put together another patch series.

-David
Sean Christopherson Oct. 31, 2023, 2:30 p.m. UTC | #5
On Tue, Oct 31, 2023, David Stevens wrote:
> Sean, have you been waiting for a new patch series with responses to
> Maxim's comments? I'm not really familiar with kernel contribution
> etiquette, but I was hoping to get your feedback before spending the
> time to put together another patch series.

No, I'm working my way back toward it.  The guest_memfd series took precedence
over everything that I wasn't confident would land in 6.7, i.e. larger series
effectively got put on the back burner.  Sorry :-(
David Stevens Dec. 12, 2023, 1:59 a.m. UTC | #6
On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 31, 2023, David Stevens wrote:
> > Sean, have you been waiting for a new patch series with responses to
> > Maxim's comments? I'm not really familiar with kernel contribution
> > etiquette, but I was hoping to get your feedback before spending the
> > time to put together another patch series.
>
> No, I'm working my way back toward it.  The guest_memfd series took precedence
> over everything that I wasn't confident would land in 6.7, i.e. larger series
> effectively got put on the back burner.  Sorry :-(

Is this series something that may be able to make it into 6.8 or 6.9?

-David
Sean Christopherson Dec. 20, 2023, 1:37 a.m. UTC | #7
On Tue, Dec 12, 2023, David Stevens wrote:
> On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 31, 2023, David Stevens wrote:
> > > Sean, have you been waiting for a new patch series with responses to
> > > Maxim's comments? I'm not really familiar with kernel contribution
> > > etiquette, but I was hoping to get your feedback before spending the
> > > time to put together another patch series.
> >
> > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > effectively got put on the back burner.  Sorry :-(
> 
> Is this series something that may be able to make it into 6.8 or 6.9?

6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
frustratingly little code review since early November.  Sorry :-(

I haven't paged this series back into memory, so take this with a grain of salt,
but IIRC there was nothing that would block this from landing in 6.9.  Timing will
likely be tight though, especially for getting testing on all architectures.
Sean Christopherson Feb. 6, 2024, 3:29 a.m. UTC | #8
On Sun, Oct 01, 2023, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > With the cleanups done, playing nice with non-refcounted pages instead of outright
> > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > maintenance cost.
> 
> I tend to strongly disagree with that, though.  We can't just let these
> non-refcounted pages spread everywhere and instead need to fix their
> usage.

Sorry for the horrifically slow reply.

Is there a middle ground somewhere between allowing this willy-nilly and tainting
the kernel?  I too would love to get the TTM stuff fixed up, but every time I look
at that code I am less and less confident that it will happen anytime soon.  It's
not even clear to me what all code needs to be touched.

In other words, is there a way we can unblock David and friends, while still
providing a forcing function of some kind to motivate/heckle the TTM (or whatever
is making the allocations) to change?
Sean Christopherson Feb. 6, 2024, 3:30 a.m. UTC | #9
On Tue, Dec 19, 2023, Sean Christopherson wrote:
> On Tue, Dec 12, 2023, David Stevens wrote:
> > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > Sean, have you been waiting for a new patch series with responses to
> > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > etiquette, but I was hoping to get your feedback before spending the
> > > > time to put together another patch series.
> > >
> > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > effectively got put on the back burner.  Sorry :-(
> > 
> > Is this series something that may be able to make it into 6.8 or 6.9?
> 
> 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> frustratingly little code review since early November.  Sorry :-(
> 
> I haven't paged this series back into memory, so take this with a grain of salt,
> but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> likely be tight though, especially for getting testing on all architectures.

I did a quick-ish pass today.  If you can hold off on v10 until later this week,
I'll try to take a more in-depth look by EOD Thursday.
Sean Christopherson Feb. 13, 2024, 3:39 a.m. UTC | #10
On Mon, Feb 05, 2024, Sean Christopherson wrote:
> On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > On Tue, Dec 12, 2023, David Stevens wrote:
> > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > Sean, have you been waiting for a new patch series with responses to
> > > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > > etiquette, but I was hoping to get your feedback before spending the
> > > > > time to put together another patch series.
> > > >
> > > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > > effectively got put on the back burner.  Sorry :-(
> > > 
> > > Is this series something that may be able to make it into 6.8 or 6.9?
> > 
> > 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> > frustratingly little code review since early November.  Sorry :-(
> > 
> > I haven't paged this series back into memory, so take this with a grain of salt,
> > but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> > likely be tight though, especially for getting testing on all architectures.
> 
> I did a quick-ish pass today.  If you can hold off on v10 until later this week,
> I'll try to take a more in-depth look by EOD Thursday.

So I took a "deeper" look, but honestly it wasn't really any more in-depth than
the previous look.  I think I was somewhat surprised at the relatively small amount
of churn this ended up requiring.

Anywho, no major complaints.  This might be fodder for 6.9?  Maybe.  It'll be
tight.  At the very least though, I expect to shove v10 in a branch and start
beating on it in anticipation of landing it no later than 6.10.

One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?

In a previous version, I suggested:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
                                             struct page *page)
  {
       kvm_pfn_t pfn = page_to_pfn(page);

       foll->is_refcounted_page = true;

       /*
        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
        * doesn't want to grab a reference, but gup() doesn't support getting
        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
        * changes, drop this and simply don't pass FOLL_GET to gup().
        */
       if (!(foll->flags & FOLL_GET))
               put_page(page);

       return pfn;
  }

but in v9 it's simply:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
					     struct page *page)
  {
	kvm_pfn_t pfn = page_to_pfn(page);

	foll->is_refcounted_page = true;
	return pfn;
  }

And instead the x86 page fault handlers manually drop the reference.  Why is that?
David Stevens Feb. 21, 2024, 6:05 a.m. UTC | #11
On Tue, Feb 13, 2024 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
> On Mon, Feb 05, 2024, Sean Christopherson wrote:
> > On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > > On Tue, Dec 12, 2023, David Stevens wrote:
> > > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > > Sean, have you been waiting for a new patch series with responses to
> > > > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > > > etiquette, but I was hoping to get your feedback before spending the
> > > > > > time to put together another patch series.
> > > > >
> > > > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > > > effectively got put on the back burner.  Sorry :-(
> > > >
> > > > Is this series something that may be able to make it into 6.8 or 6.9?
> > >
> > > 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> > > frustratingly little code review since early November.  Sorry :-(
> > >
> > > I haven't paged this series back into memory, so take this with a grain of salt,
> > > but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> > > likely be tight though, especially for getting testing on all architectures.
> >
> > I did a quick-ish pass today.  If you can hold off on v10 until later this week,
> > I'll try to take a more in-depth look by EOD Thursday.
>
> So I took a "deeper" look, but honestly it wasn't really any more in-depth than
> the previous look.  I think I was somewhat surprised at the relatively small amount
> of churn this ended up requiring.
>
> Anywho, no major complaints.  This might be fodder for 6.9?  Maybe.  It'll be
> tight.  At the very least though, I expect to shove v10 in a branch and start
> beating on it in anticipation of landing it no later than 6.10.
>
> One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?
>
> In a previous version, I suggested:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>        kvm_pfn_t pfn = page_to_pfn(page);
>
>        foll->is_refcounted_page = true;
>
>        /*
>         * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
>         * doesn't want to grab a reference, but gup() doesn't support getting
>         * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
>         * changes, drop this and simply don't pass FOLL_GET to gup().
>         */
>        if (!(foll->flags & FOLL_GET))
>                put_page(page);
>
>        return pfn;
>   }
>
> but in v9 it's simply:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>         kvm_pfn_t pfn = page_to_pfn(page);
>
>         foll->is_refcounted_page = true;
>         return pfn;
>   }
>
> And instead the x86 page fault handlers manually drop the reference.  Why is that?

I don't think FOLL_GET adds much to the API if is_refcounted_page is
present. At least right now, all of the callers need to pay attention
to is_refcounted_page so that they can update the access/dirty flags
properly. If they already have to do that anyway, then it's not any
harder for them to call put_page(). Not taking a reference also adds
one more footgun to the API, since the caller needs to make sure it
only accesses the page under an appropriate lock (v7 of this series
had that bug).
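
Concretely, the caller-side pattern I'm describing looks roughly like
this (a sketch; kvm_set_pfn_dirty/accessed are KVM's existing helpers,
and the exact calls in the series may differ):

  /*
   * Access/dirty state is tracked via struct page only for refcounted
   * pages, and the reference taken by gup() must be dropped exactly
   * once, after any access under mmu_lock is complete.
   */
  if (foll.is_refcounted_page) {
	if (write_fault)
		kvm_set_pfn_dirty(pfn);
	else
		kvm_set_pfn_accessed(pfn);
	put_page(pfn_to_page(pfn));
  }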

That said, I don't have particularly strong feelings one way or the
other, so I've added it back to v10 of the series.

-David