Message ID | 20230911021637.1941096-1-stevensd@google.com (mailing list archive)
---|---
Series | KVM: allow mapping non-refcounted pages
On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory that is backed by struct pages that aren't currently being refcounted (e.g. tail pages of non-compound higher order allocations) into the guest.
>
> Our use case is virtio-gpu blob resources [1], which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device. This feature currently does not work on systems using the amdgpu driver, as that driver allocates non-compound higher order pages via ttm_pool_alloc_page.
>
> First, this series replaces the __gfn_to_pfn_memslot API with a more extensible __kvm_faultin_pfn API. The updated API rearranges __gfn_to_pfn_memslot's args into a struct and where possible packs the bool arguments into a FOLL_ flags argument. The refactoring changes do not change any behavior.

Instead of adding hacks to kvm you really should fix the driver / TTM to not do weird memory allocations.
On Thu, Sep 28, 2023, Christoph Hellwig wrote:
> On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > This patch series adds support for mapping VM_IO and VM_PFNMAP memory that is backed by struct pages that aren't currently being refcounted (e.g. tail pages of non-compound higher order allocations) into the guest.
> >
> > Our use case is virtio-gpu blob resources [1], which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device. This feature currently does not work on systems using the amdgpu driver, as that driver allocates non-compound higher order pages via ttm_pool_alloc_page.
> >
> > First, this series replaces the __gfn_to_pfn_memslot API with a more extensible __kvm_faultin_pfn API. The updated API rearranges __gfn_to_pfn_memslot's args into a struct and where possible packs the bool arguments into a FOLL_ flags argument. The refactoring changes do not change any behavior.
>
> Instead of adding hacks to kvm you really should fix the driver / TTM to not do weird memory allocations.

I agree that the driver/TTM behavior is nasty, but from a KVM perspective the vast majority of this series is long-overdue cleanups (at least, IMO). All of those cleanups were my requirement for adding support for the behavior David and friends actually care about.

KVM needs to be aware of non-refcounted struct page memory no matter what; see CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but non-reference-counted pages"). I don't think it makes any sense whatsoever to remove that code and assume every driver in existence will do the right thing.

With the cleanups done, playing nice with non-refcounted pages instead of outright rejecting them is a wash in terms of lines of code, complexity, and ongoing maintenance cost.
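For reference, the guard added by commit f8be156be163 is roughly the check below (a simplified sketch from memory of the upstream change, not the exact hunk): KVM refuses to map a pfn found via follow_pte() unless the backing page's refcount can actually be elevated.

    static int kvm_try_get_pfn(kvm_pfn_t pfn)
    {
            /*
             * Reserved pfns (e.g. MMIO) have no struct page refcount to
             * take, so they are allowed through as before.
             */
            if (kvm_is_reserved_pfn(pfn))
                    return 1;

            /*
             * Only succeed if a reference can be taken; a zero refcount
             * means KVM cannot safely keep the page alive for the mapping.
             */
            return get_page_unless_zero(pfn_to_page(pfn));
    }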
On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> KVM needs to be aware of non-refcounted struct page memory no matter what; see CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but non-reference-counted pages"). I don't think it makes any sense whatsoever to remove that code and assume every driver in existence will do the right thing.

Agreed.

> With the cleanups done, playing nice with non-refcounted pages instead of outright rejecting them is a wash in terms of lines of code, complexity, and ongoing maintenance cost.

I tend to strongly disagree with that, though. We can't just let these non-refcounted pages spread everywhere and instead need to fix their usage.
Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series. -David
On Tue, Oct 31, 2023, David Stevens wrote:
> Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.

No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(
On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Oct 31, 2023, David Stevens wrote:
> > Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.
>
> No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(

Is this series something that may be able to make it into 6.8 or 6.9?

-David
On Tue, Dec 12, 2023, David Stevens wrote:
> On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 31, 2023, David Stevens wrote:
> > > Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.
> >
> > No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(
>
> Is this series something that may be able to make it into 6.8 or 6.9?

6.8 isn't realistic. Between LPC, vacation, and non-upstream stuff, I've done frustratingly little code review since early November. Sorry :-(

I haven't paged this series back into memory, so take this with a grain of salt, but IIRC there was nothing that would block this from landing in 6.9. Timing will likely be tight though, especially for getting testing on all architectures.
On Sun, Oct 01, 2023, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > With the cleanups done, playing nice with non-refcounted pages instead of outright rejecting them is a wash in terms of lines of code, complexity, and ongoing maintenance cost.
>
> I tend to strongly disagree with that, though. We can't just let these non-refcounted pages spread everywhere and instead need to fix their usage.

Sorry for the horrifically slow reply.

Is there a middle ground somewhere between allowing this willy nilly, and tainting the kernel? I too would love to get the TTM stuff fixed up, but every time I look at that code I am less and less confident that it will happen anytime soon. It's not even clear to me what all code needs to be touched.

In other words, is there a way we can unblock David and friends, while still providing a forcing function of some kind to motivate/heckle the TTM (or whatever is making the allocations) to change?
On Tue, Dec 19, 2023, Sean Christopherson wrote:
> On Tue, Dec 12, 2023, David Stevens wrote:
> > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.
> > >
> > > No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(
> >
> > Is this series something that may be able to make it into 6.8 or 6.9?
>
> 6.8 isn't realistic. Between LPC, vacation, and non-upstream stuff, I've done frustratingly little code review since early November. Sorry :-(
>
> I haven't paged this series back into memory, so take this with a grain of salt, but IIRC there was nothing that would block this from landing in 6.9. Timing will likely be tight though, especially for getting testing on all architectures.

I did a quick-ish pass today. If you can hold off on v10 until later this week, I'll try to take a more in-depth look by EOD Thursday.
On Mon, Feb 05, 2024, Sean Christopherson wrote:
> On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > On Tue, Dec 12, 2023, David Stevens wrote:
> > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.
> > > >
> > > > No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(
> > >
> > > Is this series something that may be able to make it into 6.8 or 6.9?
> >
> > 6.8 isn't realistic. Between LPC, vacation, and non-upstream stuff, I've done frustratingly little code review since early November. Sorry :-(
> >
> > I haven't paged this series back into memory, so take this with a grain of salt, but IIRC there was nothing that would block this from landing in 6.9. Timing will likely be tight though, especially for getting testing on all architectures.
>
> I did a quick-ish pass today. If you can hold off on v10 until later this week, I'll try to take a more in-depth look by EOD Thursday.

So I took a "deeper" look, but honestly it wasn't really any more in-depth than the previous look. I think I was somewhat surprised at the relatively small amount of churn this ended up requiring.

Anywho, no major complaints. This might be fodder for 6.9? Maybe. It'll be tight. At the very least though, I expect to shove v10 in a branch and start beating on it in anticipation of landing it no later than 6.10.

One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?

In a previous version, I suggested:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
                                             struct page *page)
  {
          kvm_pfn_t pfn = page_to_pfn(page);

          foll->is_refcounted_page = true;

          /*
           * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
           * doesn't want to grab a reference, but gup() doesn't support getting
           * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
           * changes, drop this and simply don't pass FOLL_GET to gup().
           */
          if (!(foll->flags & FOLL_GET))
                  put_page(page);

          return pfn;
  }

but in v9 it's simply:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
                                             struct page *page)
  {
          kvm_pfn_t pfn = page_to_pfn(page);

          foll->is_refcounted_page = true;
          return pfn;
  }

And instead the x86 page fault handlers manually drop the reference. Why is that?
On Tue, Feb 13, 2024 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
> On Mon, Feb 05, 2024, Sean Christopherson wrote:
> > On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > > On Tue, Dec 12, 2023, David Stevens wrote:
> > > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > > Sean, have you been waiting for a new patch series with responses to Maxim's comments? I'm not really familiar with kernel contribution etiquette, but I was hoping to get your feedback before spending the time to put together another patch series.
> > > > >
> > > > > No, I'm working my way back toward it. The guest_memfd series took precedence over everything that I wasn't confident would land in 6.7, i.e. larger series effectively got put on the back burner. Sorry :-(
> > > >
> > > > Is this series something that may be able to make it into 6.8 or 6.9?
> > >
> > > 6.8 isn't realistic. Between LPC, vacation, and non-upstream stuff, I've done frustratingly little code review since early November. Sorry :-(
> > >
> > > I haven't paged this series back into memory, so take this with a grain of salt, but IIRC there was nothing that would block this from landing in 6.9. Timing will likely be tight though, especially for getting testing on all architectures.
> >
> > I did a quick-ish pass today. If you can hold off on v10 until later this week, I'll try to take a more in-depth look by EOD Thursday.
>
> So I took a "deeper" look, but honestly it wasn't really any more in-depth than the previous look. I think I was somewhat surprised at the relatively small amount of churn this ended up requiring.
>
> Anywho, no major complaints. This might be fodder for 6.9? Maybe. It'll be tight. At the very least though, I expect to shove v10 in a branch and start beating on it in anticipation of landing it no later than 6.10.
>
> One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?
>
> In a previous version, I suggested:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>           kvm_pfn_t pfn = page_to_pfn(page);
>
>           foll->is_refcounted_page = true;
>
>           /*
>            * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
>            * doesn't want to grab a reference, but gup() doesn't support getting
>            * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
>            * changes, drop this and simply don't pass FOLL_GET to gup().
>            */
>           if (!(foll->flags & FOLL_GET))
>                   put_page(page);
>
>           return pfn;
>   }
>
> but in v9 it's simply:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>           kvm_pfn_t pfn = page_to_pfn(page);
>
>           foll->is_refcounted_page = true;
>           return pfn;
>   }
>
> And instead the x86 page fault handlers manually drop the reference. Why is that?

I don't think FOLL_GET adds much to the API if is_refcounted_page is present. At least right now, all of the callers need to pay attention to is_refcounted_page so that they can update the access/dirty flags properly. If they already have to do that anyway, then it's not any harder for them to call put_page(). Not taking a reference also adds one more footgun to the API, since the caller needs to make sure it only accesses the page under an appropriate lock (v7 of this series had that bug).
That said, I don't have particularly strong feelings one way or the other, so I've added it back to v10 of the series. -David
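To make the calling convention David describes concrete, here is a hypothetical caller sketch (the helper name and the writable flag are illustrative, not code from the series): after the fault-in, the caller consults foll->is_refcounted_page to update accessed/dirty state and drops the reference itself, and it must only touch a non-refcounted page while the mapping is known to be stable (e.g. under mmu_lock with an mmu_notifier retry check), since no reference pins such a page.

    /* Hypothetical caller sketch, not code from the series. */
    static void example_finish_faultin(struct kvm_follow_pfn *foll,
                                       struct page *page, bool writable)
    {
            /* Nothing to mark or release for non-refcounted pages. */
            if (!foll->is_refcounted_page)
                    return;

            if (writable)
                    kvm_set_pfn_dirty(page_to_pfn(page));
            kvm_set_pfn_accessed(page_to_pfn(page));

            /* The fault-in took the reference; the caller releases it. */
            put_page(page);
    }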
From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory that is backed by struct pages that aren't currently being refcounted (e.g. tail pages of non-compound higher order allocations) into the guest.

Our use case is virtio-gpu blob resources [1], which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device. This feature currently does not work on systems using the amdgpu driver, as that driver allocates non-compound higher order pages via ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more extensible __kvm_faultin_pfn API. The updated API rearranges __gfn_to_pfn_memslot's args into a struct and where possible packs the bool arguments into a FOLL_ flags argument. The refactoring changes do not change any behavior.

From there, this series extends the __kvm_faultin_pfn API so that non-refcounted pages can be safely handled. This involves adding an input parameter to indicate whether the caller can safely use non-refcounted pfns and an output parameter to tell the caller whether or not the returned page is refcounted. This is a breaking change: non-refcounted pfn mappings are disallowed by default, as such mappings are unsafe. To allow affected systems to continue to function, an opt-in module parameter is added to permit the unsafe behavior.

This series only adds support for non-refcounted pages to x86. Other MMUs can likely be updated without too much difficulty, but it is not needed at this point. Updating other parts of KVM (e.g. pfncache) is not straightforward [2].

[1] https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

v8 -> v9:
 - Make paying attention to is_refcounted_page mandatory. This means that FOLL_GET is no longer necessary. For compatibility with un-migrated callers, add a temporary parameter to sidestep ref-counting issues.
 - Add allow_unsafe_mappings, which is a breaking change.
 - Migrate kvm_vcpu_map and other callsites used by x86 to the new API.
 - Drop arm and ppc changes.

v7 -> v8:
 - Set access bits before releasing mmu_lock.
 - Pass FOLL_GET on 32-bit x86 or !tdp_enabled.
 - Refactor FOLL_GET handling, add kvm_follow_refcounted_pfn helper.
 - Set refcounted bit on >4k pages.
 - Add comments and apply formatting suggestions.
 - Rebase on kvm next branch.

v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_faultin_pfn, and extend that API to support non-refcounted pages (complete rewrite).

David Stevens (5):
  KVM: mmu: Introduce __kvm_follow_pfn function
  KVM: mmu: Improve handling of non-refcounted pfns
  KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn
  KVM: x86: Migrate to __kvm_follow_pfn
  KVM: x86/mmu: Handle non-refcounted pages

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking accessed/dirty

 arch/x86/kvm/mmu/mmu.c          |  93 +++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |   8 +-
 arch/x86/kvm/mmu/spte.c         |   4 +-
 arch/x86/kvm/mmu/spte.h         |  12 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  22 ++-
 arch/x86/kvm/x86.c              |  12 +-
 include/linux/kvm_host.h        |  42 ++++-
 virt/kvm/kvm_main.c             | 294 +++++++++++++++++++-------------
 virt/kvm/kvm_mm.h               |   3 +-
 virt/kvm/pfncache.c             |  11 +-
 11 files changed, 339 insertions(+), 163 deletions(-)
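As a rough illustration of the API shape the cover letter describes, here is a sketch of the argument struct and the opt-in module parameter. Only the names kvm_follow_pfn, flags, is_refcounted_page, and allow_unsafe_mappings come from this thread; the remaining field names are assumptions made for illustration, not the series' actual layout.

    /*
     * Sketch only: the old __gfn_to_pfn_memslot() bool parameters are folded
     * into FOLL_* flags and the in/out state lives in one struct.  Field
     * names other than flags and is_refcounted_page are assumed.
     */
    struct kvm_follow_pfn {
            const struct kvm_memory_slot *slot;     /* in */
            gfn_t gfn;                              /* in */
            unsigned int flags;                     /* in: FOLL_WRITE, FOLL_NOWAIT, ... */
            bool allow_non_refcounted_struct_page;  /* in: caller handles !refcounted (assumed name) */

            bool writable;                          /* out */
            struct page *page;                      /* out (assumed name) */
            bool is_refcounted_page;                /* out: caller must honor this */
    };

    /* Opt-in escape hatch to keep unsafe non-refcounted mappings working. */
    static bool allow_unsafe_mappings;
    module_param(allow_unsafe_mappings, bool, 0444);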