Message ID | 20230911021637.1941096-4-stevensd@google.com
---|---
State | New, archived
Series | KVM: allow mapping non-refcounted pages
On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> KVM's handling of non-refcounted pfns has two problems:
> 
> - struct pages without refcounting (e.g. tail pages of non-compound
>   higher order pages) cannot be used at all, as gfn_to_pfn does not
>   provide enough information for callers to handle the refcount.
> - pfns without struct pages can be accessed without the protection of a
>   mmu notifier. This is unsafe because KVM cannot monitor or control
>   the lifespan of such pfns, so it may continue to access the pfns
>   after they are freed.
> 
> This patch extends the __kvm_follow_pfn API to properly handle these
> cases.
> First, it adds a is_refcounted_page output parameter so that
> callers can tell whether or not a pfn has a struct page that needs to be
> passed to put_page.
> Second, it adds a guarded_by_mmu_notifier parameter
> that is used to avoid returning non-refcounted pages when the caller
> cannot safely use them.
> 
> Since callers need to be updated on a case-by-case basis to pay
> attention to is_refcounted_page, the new behavior of returning
> non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
> parameter. Once all callers have been updated, this parameter should be
> removed.

Small note: since these new parameters are critical for understanding the patch,
maybe it makes sense to re-order their description to match the order in the
struct (or at least put the output parameter at the end of the description),
and give each a separate paragraph as I did above.

> 
> The fact that non-refcounted pfns can no longer be accessed without mmu
> notifier protection is a breaking change. Since there is no timeline for
> updating everything in KVM to use mmu notifiers, this change adds an
> opt-in module parameter called allow_unsafe_mappings to allow such
> mappings. Systems which trust userspace not to tear down such unsafe
> mappings while KVM is using them can set this parameter to re-enable the
> legacy behavior.

Do you have a practical example of a VM that can break with this change?
E.g. will a normal VM break? Will a VM with VFIO devices break? Will a VM with
hugepages mapped into it break?

Will the trick of limiting the kernel memory with 'mem=X', and then using the
extra 'upper memory' for VMs still work?

> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h | 21 ++++++++++
>  virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
>  virt/kvm/pfncache.c      |  1 +
>  3 files changed, 72 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c2e0ddf14dba..2ed08ae1a9be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Try to create a writable mapping even for a read fault */
>  	bool try_map_writable;
> +	/* Usage of the returned pfn will be guared by a mmu notifier. */
> +	bool guarded_by_mmu_notifier;
> +	/*
> +	 * When false, do not return pfns for non-refcounted struct pages.
> +	 *
> +	 * TODO: This allows callers to use kvm_release_pfn on the pfns
> +	 * returned by gfn_to_pfn without worrying about corrupting the
> +	 * refcounted of non-refcounted pages. Once all callers respect

Typo: refcount.

> +	 * is_refcounted_page, this flag should be removed.
> +	 */
> +	bool allow_non_refcounted_struct_page;
> 
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/*
> +	 * True if the returned pfn is for a page with a valid refcount. False
> +	 * if the returned pfn has no struct page or if the struct page is not
> +	 * being refcounted (e.g. tail pages of non-compound higher order
> +	 * allocations from IO/PFNMAP mappings).
> 

Aren't all tail pages not-refcounted (e.g. tail page of a hugepage)?
I haven't researched this topic yet.

> +	 *
> +	 * When this output flag is false, callers should not try to convert
> +	 * the pfn to a struct page.
> +	 */
> +	bool is_refcounted_page;
>  };
> 
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b33a59c6d65..235c5cb3fdac 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
>  module_param(halt_poll_ns_shrink, uint, 0644);
>  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> 
> +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> +static bool allow_unsafe_mappings;
> +module_param(allow_unsafe_mappings, bool, 0444);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>  	return rc == -EHWPOISON;
>  }
> 
> +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> +					   struct page *page)
> +{
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +
> +	foll->is_refcounted_page = true;
> +	return pfn;
> +}

Just a matter of taste but to me this function looks confusing.
IMHO, just duplicating these two lines of code is better.
However if you prefer I won't argue over this.

> +
>  /*
>   * The fast path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned. It's also the
> @@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  		return false;
> 
>  	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> -		*pfn = page_to_pfn(page[0]);
> +		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);

Yep, here just 'foll->is_refcounted_page = true;' looks more readable to me.

>  		foll->writable = true;
>  		return true;
>  	}
> @@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  			page = wpage;
>  		}
>  	}
> -	*pfn = page_to_pfn(page);
> +	*pfn = kvm_follow_refcounted_pfn(foll, page);

Same here and probably in other places.

>  	return npages;
>  }
> 
> @@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>  	return true;
>  }
> 
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> -	struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> -	if (!page)
> -		return 1;
> -
> -	return get_page_unless_zero(page);
> -}
> -
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
>  {
> @@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	pte_t pte;
>  	spinlock_t *ptl;
>  	bool write_fault = foll->flags & FOLL_WRITE;
> +	struct page *page;
>  	int r;
> 
>  	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> 
>  	pte = ptep_get(ptep);
> 
> +	foll->writable = pte_write(pte);
> +	pfn = pte_pfn(pte);
> +
> +	page = kvm_pfn_to_refcounted_page(pfn);
> +
>  	if (write_fault && !pte_write(pte)) {
>  		pfn = KVM_PFN_ERR_RO_FAULT;
>  		goto out;
>  	}
> 
> -	foll->writable = pte_write(pte);
> -	pfn = pte_pfn(pte);
> +	if (!page)
> +		goto out;
> 
> -	/*
> -	 * Get a reference here because callers of *hva_to_pfn* and
> -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> -	 * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> -	 * simply do nothing for reserved pfns.
> -	 *
> -	 * Whoever called remap_pfn_range is also going to call e.g.
> -	 * unmap_mapping_range before the underlying pages are freed,
> -	 * causing a call to our MMU notifier.
> -	 *
> -	 * Certain IO or PFNMAP mappings can be backed with valid
> -	 * struct pages, but be allocated without refcounting e.g.,
> -	 * tail pages of non-compound higher order allocations, which
> -	 * would then underflow the refcount when the caller does the
> -	 * required put_page. Don't allow those pages here.
> -	 */

Why is the comment removed? As far as I see the code still grabs a reference
to the page.

> -	if (!kvm_try_get_pfn(pfn))
> -		r = -EFAULT;
> +	if (get_page_unless_zero(page))
> +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);

Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO. It sets
the 'foll->is_refcounted_page', and yet someone can think that it's only
there for the WARN_ON_ONCE.

That IMHO would read better:

	if (get_page_unless_zero(page))
		foll->is_refcounted_page = true;

	WARN_ON_ONCE(page_to_pfn(page) != pfn);

Note that I moved the warn out of the 'get_page_unless_zero' condition
because I think that this condition should be true for non refcounted pages
as well.

Also I don't understand why 'get_page_unless_zero(page)' result is ignored.
As I understand it, it will increase refcount of a page unless it is zero.

If a refcount of a refcounted page is 0 isn't that a bug?

The page was returned from kvm_pfn_to_refcounted_page which is supposed only to
return pages that are refcounted.

I might not understand something in regard to 'get_page_unless_zero(page)'
usage both in old and the new code.

> 
>  out:
>  	pte_unmap_unlock(ptep, ptl);
> -	*p_pfn = pfn;
> +
> +	/*
> +	 * TODO: Remove the first branch once all callers have been
> +	 * taught to play nice with non-refcounted struct pages.
> +	 */
> +	if (page && !foll->is_refcounted_page &&
> +	    !foll->allow_non_refcounted_struct_page) {
> +		r = -EFAULT;
> +	} else if (!foll->is_refcounted_page &&
> +		   !foll->guarded_by_mmu_notifier &&
> +		   !allow_unsafe_mappings) {
> +		r = -EFAULT;
> +	} else {
> +		*p_pfn = pfn;
> +	}
> 
>  	return r;
>  }
> @@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
>  	foll->writable = false;
> +	foll->is_refcounted_page = false;
> +
>  	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
>  				      foll->flags & FOLL_WRITE);
> 
> @@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  		.flags = 0,
>  		.atomic = atomic,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
> 
>  	if (write_fault)
> @@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		.gfn = gfn,
>  		.flags = write_fault ? FOLL_WRITE : 0,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	pfn = __kvm_follow_pfn(&foll);
>  	if (writable)
> @@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  		.slot = slot,
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> @@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
>  		.atomic = true,
> +		/*
> +		 * Setting atomic means __kvm_follow_pfn will never make it
> +		 * to hva_to_pfn_remapped, so this is vacuously true.
> +		 */
> +		.allow_non_refcounted_struct_page = true,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 86cd40acad11..6bbf972c11f8 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		.gfn = gpa_to_gfn(gpc->gpa),
>  		.flags = FOLL_WRITE,
>  		.hva = gpc->uhva,
> +		.allow_non_refcounted_struct_page = false,
>  	};
> 
>  	lockdep_assert_held(&gpc->refresh_lock);

Best regards,
	Maxim Levitsky
On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > The fact that non-refcounted pfns can no longer be accessed without mmu
> > notifier protection is a breaking change. Since there is no timeline for
> > updating everything in KVM to use mmu notifiers, this change adds an
> > opt-in module parameter called allow_unsafe_mappings to allow such
> > mappings. Systems which trust userspace not to tear down such unsafe
> > mappings while KVM is using them can set this parameter to re-enable the
> > legacy behavior.
> 
> Do you have a practical example of a VM that can break with this change?
> E.g. will a normal VM break? Will a VM with VFIO devices break? Will a VM with
> hugepages mapped into it break?
> 
> Will the trick of limiting the kernel memory with 'mem=X', and then using the
> extra 'upper memory' for VMs still work?

This is the trick that will require an opt-in from the admin. Anything where
KVM is effectively relying on userspace to pinky swear that the memory won't
be migrated, freed, etc.

It's unlikely, but theoretically possible that it might break existing setups
for "normal" VMs. E.g. if a VM is using VM_PFNMAP'd memory for a nested VMCS.
But such setups are already wildly broken, their users just don't know it.
The proposal here is to require admins for such setups to opt-in to the
"unsafe" behavior, i.e. give backwards compatibility, but make the admin
explicitly acknowledge that what they are doing may have unwanted consequences.

> > +	/*
> > +	 * True if the returned pfn is for a page with a valid refcount. False
> > +	 * if the returned pfn has no struct page or if the struct page is not
> > +	 * being refcounted (e.g. tail pages of non-compound higher order
> > +	 * allocations from IO/PFNMAP mappings).
> > 
> 
> Aren't all tail pages not-refcounted (e.g. tail page of a hugepage)?
> I haven't researched this topic yet.

Nope. As Christoph stated, they are most definitely "weird" allocations though.

In this case, IIRC, it's the DRM's Translation Table Manager (TTM) code that
kmalloc's a large chunk of memory, and then stuffs the pfn into the page tables
courtesy of the vmf_insert_pfn_prot() in ttm_bo_vm_fault_reserved(). The head
page has a non-zero refcount, but it's not really refcounted. And the tail
pages have nothing, which IIRC, results in KVM inadvertently causing pages to
be freed due to putting the last references.

[*] https://lore.kernel.org/all/ZRZeaP7W5SuereMX@infradead.org

> > +	 *
> > +	 * When this output flag is false, callers should not try to convert
> > +	 * the pfn to a struct page.

This should explain what the flag tracks, not what callers should or shouldn't
do with the flag. E.g. strictly speaking, there's no danger in grabbing the
corresponding "struct page" if the caller verifies it's a valid PFN. Trying to
use the page outside of mmu_notifier protection is where things would get dicey.

> > +	 */
> > +	bool is_refcounted_page;
> >  };
> > 
> >  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9b33a59c6d65..235c5cb3fdac 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
> >  module_param(halt_poll_ns_shrink, uint, 0644);
> >  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> > 
> > +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> > +static bool allow_unsafe_mappings;
> > +module_param(allow_unsafe_mappings, bool, 0444);
> > +
> >  /*
> >   * Ordering of locks:
> >   *
> > @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >  	return rc == -EHWPOISON;
> >  }
> > 
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > +					   struct page *page)
> > +{
> > +	kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > +	foll->is_refcounted_page = true;
> > +	return pfn;
> > +}
> 
> Just a matter of taste but to me this function looks confusing.
> IMHO, just duplicating these two lines of code is better.
> However if you prefer I won't argue over this.

Hrm, when I suggested this, there was also a put_page() and a comment about
hacky code in there:

static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
					   struct page *page)
{
	kvm_pfn_t pfn = page_to_pfn(page);

	foll->is_refcounted_page = true;

	/*
	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
	 * doesn't want to grab a reference, but gup() doesn't support getting
	 * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
	 * changes, drop this and simply don't pass FOLL_GET to gup().
	 */
	if (!(foll->flags & FOLL_GET))
		put_page(page);

	return pfn;
}

More below.

> > -	/*
> > -	 * Get a reference here because callers of *hva_to_pfn* and
> > -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> > -	 * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> > -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> > -	 * simply do nothing for reserved pfns.
> > -	 *
> > -	 * Whoever called remap_pfn_range is also going to call e.g.
> > -	 * unmap_mapping_range before the underlying pages are freed,
> > -	 * causing a call to our MMU notifier.
> > -	 *
> > -	 * Certain IO or PFNMAP mappings can be backed with valid
> > -	 * struct pages, but be allocated without refcounting e.g.,
> > -	 * tail pages of non-compound higher order allocations, which
> > -	 * would then underflow the refcount when the caller does the
> > -	 * required put_page. Don't allow those pages here.
> > -	 */
> 
> Why is the comment removed? As far as I see the code still grabs a reference
> to the page.

Because what the above comment is saying is effectively obsoleted by
is_refcounted_page. The old comment is essentially saying, "grab a reference
because KVM expects to put a reference", whereas the new code grabs a reference
because it's necessary to honor allow_unsafe_mappings.

So I agree with David that the existing comment should go away, but I agree
with you in that kvm_follow_refcounted_pfn() really needs a comment, e.g. to
explain how KVM manages struct page refcounts.

> > -	if (!kvm_try_get_pfn(pfn))
> > -		r = -EFAULT;
> > +	if (get_page_unless_zero(page))
> > +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> 
> Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO. It sets
> the 'foll->is_refcounted_page', and yet someone can think that it's only
> there for the WARN_ON_ONCE.
> 
> That IMHO would read better:
> 
> 	if (get_page_unless_zero(page))
> 		foll->is_refcounted_page = true;
> 
> 	WARN_ON_ONCE(page_to_pfn(page) != pfn);
> 
> Note that I moved the warn out of the 'get_page_unless_zero' condition
> because I think that this condition should be true for non refcounted pages
> as well.

Yeah, let me see if I can piece together what happened to the put_page() call.

> Also I don't understand why 'get_page_unless_zero(page)' result is ignored.
> As I understand it, it will increase refcount of a page unless it is zero.
> 
> If a refcount of a refcounted page is 0 isn't that a bug?

Yes, the problem is that KVM doesn't know if a page is refcounted or not,
without actually trying to acquire a reference. See the TTM mess mentioned
above.

Note, Christoph is suggesting that KVM instead refuse to play nice and force
the TTM code to properly refcount things. I very much like that idea in theory,
but I have no idea how feasible it is (that code is all kinds of twisty).

> The page was returned from kvm_pfn_to_refcounted_page which is supposed only to
> return pages that are refcounted.
> 
> I might not understand something in regard to 'get_page_unless_zero(page)'
> usage both in old and the new code.
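
To make the calling convention under discussion concrete, here is a rough,
caller-side sketch based on the quoted patch. The struct fields and the
__kvm_follow_pfn() entry point come from the patch; the function
example_map_gfn() and everything around it are hypothetical, purely to
illustrate how the outputs are meant to be consumed:

/*
 * Illustrative only: a hypothetical caller of the proposed API.  Field names
 * match the posted patch; the rest is made up for this sketch.
 */
static int example_map_gfn(const struct kvm_memory_slot *slot, gfn_t gfn)
{
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
		/* This caller has been updated, so non-refcounted pages are OK. */
		.allow_non_refcounted_struct_page = true,
		/* The resulting mapping is torn down from mmu notifiers. */
		.guarded_by_mmu_notifier = true,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	/* ... install the pfn into the stage-2 page tables ... */

	/* Only drop a reference if one was actually taken. */
	if (foll.is_refcounted_page)
		kvm_release_page_clean(pfn_to_page(pfn));

	return 0;
}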
On Mon, Sep 11, 2023, David Stevens wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c2e0ddf14dba..2ed08ae1a9be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Try to create a writable mapping even for a read fault */
>  	bool try_map_writable;
> +	/* Usage of the returned pfn will be guared by a mmu notifier. */
> +	bool guarded_by_mmu_notifier;
> +	/*
> +	 * When false, do not return pfns for non-refcounted struct pages.
> +	 *
> +	 * TODO: This allows callers to use kvm_release_pfn on the pfns
> +	 * returned by gfn_to_pfn without worrying about corrupting the
> +	 * refcounted of non-refcounted pages. Once all callers respect
> +	 * is_refcounted_page, this flag should be removed.
> +	 */
> +	bool allow_non_refcounted_struct_page;
> 
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/*
> +	 * True if the returned pfn is for a page with a valid refcount. False
> +	 * if the returned pfn has no struct page or if the struct page is not
> +	 * being refcounted (e.g. tail pages of non-compound higher order
> +	 * allocations from IO/PFNMAP mappings).
> +	 *
> +	 * When this output flag is false, callers should not try to convert
> +	 * the pfn to a struct page.
> +	 */
> +	bool is_refcounted_page;

Idea. Hopefully a good one.

Rather than tracking a bool, what if we track:

	struct page *refcounted_page;

and then make kvm_xxx_page_clean() wrappers around inner helpers that play
nice with NULL pages, e.g.

static inline void kvm_release_page_clean(struct page *page)
{
	if (!page)
		return;

	__kvm_release_page_clean(page);
}

Then callers of __kvm_follow_pfn() can do:

	kvm_release_page_clean(fault->refcounted_page);

instead of

	if (fault->is_refcounted_page)
		kvm_release_page_clean(pfn_to_page(fault->pfn));
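
For illustration, the producer side of that idea might look roughly like the
sketch below. The refcounted_page output field and the
example_note_refcounted_page() helper are assumptions derived from the
suggestion above, not code from the posted series:

/*
 * Illustrative only: record the refcounted page itself instead of a bool.
 * "refcounted_page" is an assumed output field, not part of the posted patch.
 */
static void example_note_refcounted_page(struct kvm_follow_pfn *foll, kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);

	if (!page || !get_page_unless_zero(page))
		return;	/* no struct page, or the page isn't actually refcounted */

	WARN_ON_ONCE(page_to_pfn(page) != pfn);
	foll->refcounted_page = page;	/* released later via kvm_release_page_clean() */
}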
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c2e0ddf14dba..2ed08ae1a9be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
 	bool atomic;
 	/* Try to create a writable mapping even for a read fault */
 	bool try_map_writable;
+	/* Usage of the returned pfn will be guared by a mmu notifier. */
+	bool guarded_by_mmu_notifier;
+	/*
+	 * When false, do not return pfns for non-refcounted struct pages.
+	 *
+	 * TODO: This allows callers to use kvm_release_pfn on the pfns
+	 * returned by gfn_to_pfn without worrying about corrupting the
+	 * refcounted of non-refcounted pages. Once all callers respect
+	 * is_refcounted_page, this flag should be removed.
+	 */
+	bool allow_non_refcounted_struct_page;
 
 	/* Outputs of __kvm_follow_pfn */
 	hva_t hva;
 	bool writable;
+	/*
+	 * True if the returned pfn is for a page with a valid refcount. False
+	 * if the returned pfn has no struct page or if the struct page is not
+	 * being refcounted (e.g. tail pages of non-compound higher order
+	 * allocations from IO/PFNMAP mappings).
+	 *
+	 * When this output flag is false, callers should not try to convert
+	 * the pfn to a struct page.
+	 */
+	bool is_refcounted_page;
 };
 
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b33a59c6d65..235c5cb3fdac 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
 module_param(halt_poll_ns_shrink, uint, 0644);
 EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
 
+/* Allow non-struct page memory to be mapped without MMU notifier protection. */
+static bool allow_unsafe_mappings;
+module_param(allow_unsafe_mappings, bool, 0444);
+
 /*
  * Ordering of locks:
  *
@@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned. It's also the
@@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;
 
 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
+		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);
 		foll->writable = true;
 		return true;
 	}
@@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 			page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
+	*pfn = kvm_follow_refcounted_pfn(foll, page);
 	return npages;
 }
 
@@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-	struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
-	if (!page)
-		return 1;
-
-	return get_page_unless_zero(page);
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
 {
@@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	pte_t pte;
 	spinlock_t *ptl;
 	bool write_fault = foll->flags & FOLL_WRITE;
+	struct page *page;
 	int r;
 
 	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 
 	pte = ptep_get(ptep);
 
+	foll->writable = pte_write(pte);
+	pfn = pte_pfn(pte);
+
+	page = kvm_pfn_to_refcounted_page(pfn);
+
 	if (write_fault && !pte_write(pte)) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
-	foll->writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+	if (!page)
+		goto out;
 
-	/*
-	 * Get a reference here because callers of *hva_to_pfn* and
-	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-	 * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
-	 * simply do nothing for reserved pfns.
-	 *
-	 * Whoever called remap_pfn_range is also going to call e.g.
-	 * unmap_mapping_range before the underlying pages are freed,
-	 * causing a call to our MMU notifier.
-	 *
-	 * Certain IO or PFNMAP mappings can be backed with valid
-	 * struct pages, but be allocated without refcounting e.g.,
-	 * tail pages of non-compound higher order allocations, which
-	 * would then underflow the refcount when the caller does the
-	 * required put_page. Don't allow those pages here.
-	 */
-	if (!kvm_try_get_pfn(pfn))
-		r = -EFAULT;
+	if (get_page_unless_zero(page))
+		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 
 out:
 	pte_unmap_unlock(ptep, ptl);
-	*p_pfn = pfn;
+
+	/*
+	 * TODO: Remove the first branch once all callers have been
+	 * taught to play nice with non-refcounted struct pages.
+	 */
+	if (page && !foll->is_refcounted_page &&
+	    !foll->allow_non_refcounted_struct_page) {
+		r = -EFAULT;
+	} else if (!foll->is_refcounted_page &&
+		   !foll->guarded_by_mmu_notifier &&
+		   !allow_unsafe_mappings) {
+		r = -EFAULT;
+	} else {
+		*p_pfn = pfn;
+	}
 
 	return r;
 }
@@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 {
 	foll->writable = false;
+	foll->is_refcounted_page = false;
+
 	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
 				      foll->flags & FOLL_WRITE);
 
@@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		.flags = 0,
 		.atomic = atomic,
 		.try_map_writable = !!writable,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	if (write_fault)
@@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		.gfn = gfn,
 		.flags = write_fault ? FOLL_WRITE : 0,
 		.try_map_writable = !!writable,
+		.allow_non_refcounted_struct_page = false,
 	};
 	pfn = __kvm_follow_pfn(&foll);
 	if (writable)
@@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 		.slot = slot,
 		.gfn = gfn,
 		.flags = FOLL_WRITE,
+		.allow_non_refcounted_struct_page = false,
 	};
 	return __kvm_follow_pfn(&foll);
 }
@@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
 		.gfn = gfn,
 		.flags = FOLL_WRITE,
 		.atomic = true,
+		/*
+		 * Setting atomic means __kvm_follow_pfn will never make it
+		 * to hva_to_pfn_remapped, so this is vacuously true.
+		 */
+		.allow_non_refcounted_struct_page = true,
 	};
 	return __kvm_follow_pfn(&foll);
 }
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 86cd40acad11..6bbf972c11f8 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		.gfn = gpa_to_gfn(gpc->gpa),
 		.flags = FOLL_WRITE,
 		.hva = gpc->uhva,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	lockdep_assert_held(&gpc->refresh_lock);