[v1,03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

Message ID 20191024120938.11237-4-david@redhat.com (mailing list archive)
State New, archived
Series mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

Commit Message

David Hildenbrand Oct. 24, 2019, 12:09 p.m. UTC
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Dan Williams Nov. 5, 2019, 4:38 a.m. UTC | #1
On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e9eb666eb6e8..9d18cc67d124 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
> -       if (pfn_valid(pfn))
> -               return PageReserved(pfn_to_page(pfn));
> +       struct page *page = pfn_to_online_page(pfn);
>
> +       /*
> +        * We treat any pages that are not online (not managed by the buddy)
> +        * as reserved - this includes ZONE_DEVICE pages and pages without
> +        * a memmap (e.g., mapped via /dev/mem).
> +        */
> +       if (page)
> +               return PageReserved(page);
>         return true;
>  }

So after this all the pfn_valid() usage in kvm_main.c is replaced with
pfn_to_online_page()? Looks correct to me.

However, I'm worried that kvm is taking a reference on ZONE_DEVICE pages
through some other path, resulting in this:

    https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/

I'll see if this patch set modulates or maintains that failure mode.
David Hildenbrand Nov. 5, 2019, 9:17 a.m. UTC | #2
On 05.11.19 05:38, Dan Williams wrote:
> [...]
> 
> So after this all the pfn_valid() usage in kvm_main.c is replaced with
> pfn_to_online_page()? Looks correct to me.
> 
> However, I'm worried that kvm is taking a reference on ZONE_DEVICE pages
> through some other path, resulting in this:
> 
>      https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
> 
> I'll see if this patch set modulates or maintains that failure mode.
> 

I'd assume that the behavior is unchanged. I think we get a reference to 
these ZONE_DEVICE pages via __get_user_pages_fast() and friends, called 
from hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c.
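
(For reference, hva_to_pfn_fast() of that time was roughly the following, 
condensed from virt/kvm/kvm_main.c; the page reference comes from the 
fast gup:)

static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
			    bool *writable, kvm_pfn_t *pfn)
{
	struct page *page[1];

	/* The fast path only attempts writable pins. */
	if (!(write_fault || writable))
		return false;

	/* On success, __get_user_pages_fast() took a reference on the page. */
	if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
		*pfn = page_to_pfn(page[0]);
		if (writable)
			*writable = true;
		return true;
	}

	return false;
}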
David Hildenbrand Nov. 5, 2019, 9:49 a.m. UTC | #3
On 05.11.19 10:17, David Hildenbrand wrote:
> On 05.11.19 05:38, Dan Williams wrote:
>> [...]
>>
>> So after this all the pfn_valid() usage in kvm_main.c is replaced with
>> pfn_to_online_page()? Looks correct to me.
>>
>> However, I'm worried that kvm is taking a reference on ZONE_DEVICE pages
>> through some other path, resulting in this:
>>
>>       https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
>>
>> I'll see if this patch set modulates or maintains that failure mode.
>>
> 
> I'd assume that the behavior is unchanged. I think we get a reference to
> these ZONE_DEVICE pages via __get_user_pages_fast() and friends, called
> from hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c.
> 

I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a reference; 
however, they are often released via 
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...


E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via 
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
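
Concretely, the imbalance for a ZONE_DEVICE pfn is (sketch):

	pfn = gfn_to_pfn(vcpu->kvm, gfn);   /* gup() took a page reference,
					       ZONE_DEVICE included */
	...
	kvm_release_pfn_clean(pfn);         /* PageReserved() is true for
					       ZONE_DEVICE, so put_page() is
					       skipped and the reference
					       leaks */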

Now, my patch does not change that, the result of 
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would 
probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and 
friends, after you successfully pinned the pages. (not sure if that's 
the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were 
definitely pinned.
David Hildenbrand Nov. 5, 2019, 10:02 a.m. UTC | #4
On 05.11.19 10:49, David Hildenbrand wrote:
> On 05.11.19 10:17, David Hildenbrand wrote:
>> [...]
> 
> I think I know what's going wrong:
> 
> Pages that are pinned via gfn_to_pfn() and friends take a reference;
> however, they are often released via
> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> 
> 
> E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> 
> ...
> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> ...
> kvm_release_pfn_clean(pfn);
> 
> 
> 
> void kvm_release_pfn_clean(kvm_pfn_t pfn)
> {
> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> 		put_page(pfn_to_page(pfn));
> }
> 
> This function makes perfect sense as the counterpart for kvm_get_pfn():
> 
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
> 	if (!kvm_is_reserved_pfn(pfn))
> 		get_page(pfn_to_page(pfn));
> }
> 
> 
> As all ZONE_DEVICE pages are currently reserved, pages pinned via
> gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> 
> Now, my patch does not change that, the result of
> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> probably be
> 
> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> friends, after you successfully pinned the pages. (not sure if that's
> the right thing to do but you're the expert)
> 
> b) To not use kvm_release_pfn_clean() and friends on pages that were
> definitely pinned.
> 

(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively 
also fix that issue. E.g., kvm_release_pfn_clean() and friends would 
always do a put_page() if "pfn_valid() and !PageReserved()", so after 
patch 9 also on ZONE_DEVICE pages.

But it would have side effects that might not be desired. E.g.,:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be 
the right thing to do).

2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be 
okay)
Sean Christopherson Nov. 5, 2019, 4 p.m. UTC | #5
On Tue, Nov 05, 2019 at 11:02:46AM +0100, David Hildenbrand wrote:
> On 05.11.19 10:49, David Hildenbrand wrote:
> >[...]
> >I think I know what's going wrong:
> >
> >Pages that are pinned via gfn_to_pfn() and friends take a reference;
> >however, they are often released via
> >kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> >
> >
> >E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> >
> >...
> >pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> >...
> >kvm_release_pfn_clean(pfn);
> >
> >
> >
> >void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >{
> >	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >		put_page(pfn_to_page(pfn));
> >}
> >
> >This function makes perfect sense as the counterpart for kvm_get_pfn():
> >
> >void kvm_get_pfn(kvm_pfn_t pfn)
> >{
> >	if (!kvm_is_reserved_pfn(pfn))
> >		get_page(pfn_to_page(pfn));
> >}
> >
> >
> >As all ZONE_DEVICE pages are currently reserved, pages pinned via
> >gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
KVM bug.

> >Now, my patch does not change that, the result of
> >kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> >probably be
> >
> >a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> >friends, after you successfully pinned the pages. (not sure if that's
> >the right thing to do but you're the expert)
> >
> >b) To not use kvm_release_pfn_clean() and friends on pages that were
> >definitely pinned.

This is already KVM's intent, i.e. the purpose of the PageReserved() check
is simply to avoid putting a non-existent reference.  The problem is that
KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
true when the code was first added.

> (talking to myself, sorry)
> 
> Thinking again, dropping this patch from this series could effectively also
> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
> ZONE_DEVICE pages.

Yeah, this appears to be the correct fix.

> But it would have side effects that might not be desired. E.g.,:
> 
> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
> right thing to do).

This should be ok, at least on x86.  There are only three users of
kvm_pfn_to_page().  Two of those are on allocations that are controlled by
KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
memory when running a nested guest, and in that case supporting ZONE_DEVICE
memory is desirable, i.e. KVM should play nice with a guest that is backed
by ZONE_DEVICE memory.

> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
> okay)

This is ok from a KVM perspective.

The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.
David Hildenbrand Nov. 5, 2019, 8:30 p.m. UTC | #6
>>> [...]
>>> As all ZONE_DEVICE pages are currently reserved, pages pinned via
>>> gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> 
> Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
> KVM bug.

Yes, it does take a reference AFAIKs. E.g.,

mm/gup.c:gup_pte_range():
...
		if (pte_devmap(pte)) {
			if (unlikely(flags & FOLL_LONGTERM))
				goto pte_unmap;

			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
			if (unlikely(!pgmap)) {
				undo_dev_pagemap(nr, nr_start, pages);
				goto pte_unmap;
			}
		} else if (pte_special(pte))
			goto pte_unmap;

		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
		page = pte_page(pte);

		head = try_get_compound_head(page, 1);

try_get_compound_head() will increment the reference count.
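
(where try_get_compound_head() is roughly, from mm/gup.c of that era:)

/* Grab @refs references on the head page of @page. */
static inline struct page *try_get_compound_head(struct page *page, int refs)
{
	struct page *head = compound_head(page);

	if (WARN_ON_ONCE(page_ref_count(head) < 0))
		return NULL;
	if (unlikely(!page_cache_add_speculative(head, refs)))
		return NULL;
	return head;
}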

> 
> [...]

What about

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}

Is a pure get_page() sufficient in case of ZONE_DEVICE?
(asking because of the live references obtained via 
get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() 
somewhat confuse me :) )


> 
> The scarier code (for me) is transparent_hugepage_adjust() and
> kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> interaction between THP and _PAGE_DEVMAP.

The x86 KVM MMU code is some of the ugliest code I know (sorry, but it 
had to be said :/ ). Luckily, this should be independent of the 
PG_reserved thingy AFAIKs.
Sean Christopherson Nov. 5, 2019, 10:22 p.m. UTC | #7
On Tue, Nov 05, 2019 at 09:30:53PM +0100, David Hildenbrand wrote:
> >>>[...]
> >>>As all ZONE_DEVICE pages are currently reserved, pages pinned via
> >>>gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> >
> >Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
> >KVM bug.
> 
> Yes, it does take a reference AFAIKs. E.g.,
> 
> mm/gup.c:gup_pte_range():
> ...
> 		if (pte_devmap(pte)) {
> 			if (unlikely(flags & FOLL_LONGTERM))
> 				goto pte_unmap;
> 
> 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
> 			if (unlikely(!pgmap)) {
> 				undo_dev_pagemap(nr, nr_start, pages);
> 				goto pte_unmap;
> 			}
> 		} else if (pte_special(pte))
> 			goto pte_unmap;
> 
> 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> 		page = pte_page(pte);
> 
> 		head = try_get_compound_head(page, 1);
> 
> try_get_compound_head() will increment the reference count.

Doh, I looked right at that code and somehow didn't connect the dots.
Thanks!

> [...]
> 
> What about
> 
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
> 	if (!kvm_is_reserved_pfn(pfn))
> 		get_page(pfn_to_page(pfn));
> }
> 
> Is a pure get_page() sufficient in case of ZONE_DEVICE?
> (asking because of the live references obtained via
> get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat
> confuse me :) )

This ties into my concern with thp_adjust().  On x86, kvm_get_pfn() is
only used in two flows, to manually get a ref for VM_IO/VM_PFNMAP pages
and to switch the ref when mapping a non-hugetlbfs compound page, i.e. a
THP.

I assume VM_IO and PFNMAP can't apply to ZONE_DEVICE pages.

In the thp_adjust() case, when a THP is encountered and the original PFN
is for a non-PG_head page, KVM transfers the reference to the associated
PG_head page[*] and maps the associated 2mb chunk/page.  This is where KVM
uses kvm_get_pfn() and could run afoul of the get_dev_pagemap() refcounts.


[*] Technically I don't think it's guaranteed to be a PG_head, e.g. if the
    THP is a 1gb page, as KVM currently only maps THP as 2mb pages.  But
    the idea is the same, transfer the refcount to the PFN that's actually
    going into KVM's page tables.
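
Condensed, the hand-off in question looks roughly like this (from
transparent_hugepage_adjust(), surrounding checks elided):

	mask = KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1;
	VM_BUG_ON((gfn & mask) != (pfn & mask));
	if (pfn & mask) {
		gfn &= ~mask;
		kvm_release_pfn_clean(pfn);	/* drop the gup ref on the tail PFN */
		pfn &= ~mask;
		kvm_get_pfn(pfn);		/* re-take a ref on the aligned PFN */
		*pfnp = pfn;
	}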

Dan Williams Nov. 5, 2019, 11:02 p.m. UTC | #8
On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> [...]
>
> What about
>
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
>         if (!kvm_is_reserved_pfn(pfn))
>                 get_page(pfn_to_page(pfn));
> }
>
> Is a pure get_page() sufficient in case of ZONE_DEVICE?
> (asking because of the live references obtained via
> get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range()
> somewhat confuse me :) )

It is not sufficient. PTE_DEVMAP is there to tell the gup path "be
careful, this pfn has device-lifetime, make sure the device is pinned
and not actively in the process of dying before pinning any pages
associated with this device".

However, if kvm_get_pfn() is honoring kvm_is_reserved_pfn() that
returns true for ZONE_DEVICE, I'm missing how it is messing up the
reference counts.

> > The scarier code (for me) is transparent_hugepage_adjust() and
> > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > interaction between THP and _PAGE_DEVMAP.
>
> The x86 KVM MMU code is some of the ugliest code I know (sorry, but it
> had to be said :/ ). Luckily, this should be independent of the
> PG_reserved thingy AFAIKs.

Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
are honoring kvm_is_reserved_pfn(), so again I'm missing where the
page count gets mismanaged and leads to the reported hang.
Sean Christopherson Nov. 5, 2019, 11:13 p.m. UTC | #9
On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > interaction between THP and _PAGE_DEVMAP.
> >
> > The x86 KVM MMU code is some of the ugliest code I know (sorry, but it
> > had to be said :/ ). Luckily, this should be independent of the
> > PG_reserved thingy AFAIKs.
> 
> Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> page count gets mismanaged and leads to the reported hang.

When mapping pages into the guest, KVM gets the page via gup(), which
increments the page count for ZONE_DEVICE pages.  But KVM puts the page
using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
and so never puts its reference to ZONE_DEVICE pages.

My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
comments were for a post-patch/series scenario where PageReserved() is
no longer true for ZONE_DEVICE pages.
Dan Williams Nov. 5, 2019, 11:30 p.m. UTC | #10
On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > [...]
> > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > page count gets mismanaged and leads to the reported hang.
>
> When mapping pages into the guest, KVM gets the page via gup(), which
> increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> and so never puts its reference to ZONE_DEVICE pages.

Oh, yeah, that's busted.

> My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> comments were for a post-patch/series scenario where PageReserved() is
> no longer true for ZONE_DEVICE pages.

Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning
true for ZONE_DEVICE because pfn_to_online_page() will fail for
ZONE_DEVICE.
Sean Christopherson Nov. 5, 2019, 11:42 p.m. UTC | #11
On Tue, Nov 05, 2019 at 03:30:00PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > [...]
> >
> > When mapping pages into the guest, KVM gets the page via gup(), which
> > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > and so never puts its reference to ZONE_DEVICE pages.
> 
> Oh, yeah, that's busted.
> 
> > My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > comments were for a post-patch/series scenario where PageReserved() is
> > no longer true for ZONE_DEVICE pages.
> 
> Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning
> true for ZONE_DEVICE because pfn_to_online_page() will fail for
> ZONE_DEVICE.

But David's proposed fix for the above refcount bug is to omit the patch
so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
like the right thing to do, including for thp_adjust(), e.g. it would
naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
mapped with a huge page (2mb or above) in the host.  The only hiccup is
figuring out how to correctly transfer the reference.
Dan Williams Nov. 5, 2019, 11:43 p.m. UTC | #12
On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> [...]
> >
> > When mapping pages into the guest, KVM gets the page via gup(), which
> > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > and so never puts its reference to ZONE_DEVICE pages.
>
> Oh, yeah, that's busted.

Ugh, it's extra busted because every other gup user in the kernel
tracks the pages resulting from gup and puts them (put_page()) when
they are done. KVM wants to forget about whether it did a gup to get
the page and optionally trigger put_page() based purely on the pfn.
Outside of VFIO device assignment that needs pages pinned for DMA, why
does KVM itself need to pin pages? If pages are pinned over a return
to userspace, that needs to be a FOLL_LONGTERM gup.
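
(i.e., something along these lines; note that the gup_pte_range() excerpt
quoted earlier in the thread deliberately bails to the slow path for
FOLL_LONGTERM on devmap PTEs:)

	/* Illustrative only: a long-term pin, as a gup user would take it. */
	ret = get_user_pages(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page, NULL);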
Sean Christopherson Nov. 6, 2019, 12:03 a.m. UTC | #13
On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > [...]
> >
> > Oh, yeah, that's busted.
> 
> Ugh, it's extra busted because every other gup user in the kernel
> tracks the pages resulting from gup and puts them (put_page()) when
> they are done. KVM wants to forget about whether it did a gup to get
> the page and optionally trigger put_page() based purely on the pfn.
> Outside of VFIO device assignment that needs pages pinned for DMA, why
> does KVM itself need to pin pages? If pages are pinned over a return
> to userspace that needs to be a FOLL_LONGTERM gup.

Short answer, KVM pins the page to ensure correctness with respect to the
primary MMU invalidating the associated host virtual address, e.g. when
the page is being migrated or unmapped from host userspace.

The main use of gup() is to handle guest page faults and map pages into
the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
PFN and to temporarily pin the page.  The pin is held just long enough to
guarantee that any invalidation via the mmu_notifier will be stalled
until after KVM finishes installing the page into the secondary MMU, i.e.
the pin is short-term and not held across a return to userspace or entry
into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
pulls the PFN from the secondary MMU and uses that to update accessed
and dirty bits in the host.

There are a few other KVM flows that eventually call into gup(), but those
are "traditional" short-term pins and use put_page() directly.
Dan Williams Nov. 6, 2019, 12:08 a.m. UTC | #14
On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> [...]
>
> Short answer, KVM pins the page to ensure correctness with respect to the
> primary MMU invalidating the associated host virtual address, e.g. when
> the page is being migrated or unmapped from host userspace.
>
> The main use of gup() is to handle guest page faults and map pages into
> the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
> PFN and to temporarily pin the page.  The pin is held just long enough to
> guaranteed that any invalidation via the mmu_notifier will be stalled
> until after KVM finishes installing the page into the secondary MMU, i.e.
> the pin is short-term and not held across a return to userspace or entry
> into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
> pulls the PFN from the secondary MMU and uses that to update accessed
> and dirty bits in the host.
>
> There are a few other KVM flows that eventually call into gup(), but those
> are "traditional" short-term pins and use put_page() directly.

Ok, I was misinterpreting the effect of the bug with what KVM is using
the reference to do.

To your other point:

> But David's proposed fix for the above refcount bug is to omit the patch
> so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
> like the right thing to do, including for thp_adjust(), e.g. it would
> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> mapped with a huge page (2mb or above) in the host.  The only hiccup is
> figuring out how to correctly transfer the reference.

That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It seems there are other paths in KVM that assume
that more typical page machinery is active like SetPageDirty() based
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), this patch below (untested)
seems a cleaner path with less surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
                put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

This is safe because the reference that KVM took earlier protects the
is_zone_device_page() lookup from racing device teardown. Otherwise,
if KVM does not have a reference it's unsafe, but that's already even
more broken because KVM would be releasing a page that it never
referenced. Every other KVM operation that assumes page allocator
pages would continue to honor kvm_is_reserved_pfn().
David Hildenbrand Nov. 6, 2019, 6:56 a.m. UTC | #15
On 06.11.19 01:08, Dan Williams wrote:
> [...]
> Ok, I was misinterpreting the effect of the bug with what KVM is using
> the reference to do.
> 
> To your other point:
> 
>> But David's proposed fix for the above refcount bug is to omit the patch
>> so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
>> like the right thing to do, including for thp_adjust(), e.g. it would
>> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
>> mapped with a huge page (2mb or above) in the host.  The only hiccup is
>> figuring out how to correctly transfer the reference.
> 
> That might not be the only hiccup. There's currently no such thing as
> huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
> but the result of pfn_to_page() on such a mapping does not yield a
> huge 'struct page'. It seems there are other paths in KVM that assume
> that more typical page machinery is active like SetPageDirty() based
> on kvm_is_reserved_pfn(). While I told David that I did not want to
> see more usage of is_zone_device_page(), this patch below (untested)
> seems a cleaner path with less surprises:
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4df0aa6b8e5c..fbea17c1810c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
> 
>   void kvm_release_pfn_clean(kvm_pfn_t pfn)
>   {
> -       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> +       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
> +           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
>                  put_page(pfn_to_page(pfn));
>   }
>   EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

I had the same thought, but I do wonder about the kvm_get_pfn() users, 
e.g.,:

hva_to_pfn_remapped():
	r = follow_pfn(vma, addr, &pfn);
	...
	kvm_get_pfn(pfn);
	...

We would not take a reference for ZONE_DEVICE, but later drop one 
reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* 
dangerous to use. I can't tell if this can happen right now.

We do have 3 users of kvm_get_pfn() that we have to audit before this 
change. Also, we should add a comment to kvm_get_pfn() that it should 
never be used with possible ZONE_DEVICE pages.

Also, we should add a comment to kvm_release_pfn_clean(), describing why 
we treat ZONE_DEVICE in a special way here.


We can then progress like this

1. Get this fix upstream, it's somewhat unrelated to this series.
2. This patch here remains as is in this series (+/- documentation update)
3. Long term, rework KVM to not treat ZONE_DEVICE pages like reserved 
pages. E.g., get rid of kvm_get_pfn(). Then, this special zone check 
can go.
Sean Christopherson Nov. 6, 2019, 4:09 p.m. UTC | #16
On Wed, Nov 06, 2019 at 07:56:34AM +0100, David Hildenbrand wrote:
> On 06.11.19 01:08, Dan Williams wrote:
> >On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
> >>But David's proposed fix for the above refcount bug is to omit the patch
> >>so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
> >>like the right thing to do, including for thp_adjust(), e.g. it would
> >>naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> >>mapped with a huge page (2mb or above) in the host.  The only hiccup is
> >>figuring out how to correctly transfer the reference.
> >
> >That might not be the only hiccup. There's currently no such thing as
> >huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
> >but the result of pfn_to_page() on such a mapping does not yield a
> >huge 'struct page'. It seems there are other paths in KVM that assume
> >that more typical page machinery is active like SetPageDirty() based
> >on kvm_is_reserved_pfn(). While I told David that I did not want to
> >see more usage of is_zone_device_page(), this patch below (untested)
> >seems a cleaner path with less surprises:
> >
> >diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >index 4df0aa6b8e5c..fbea17c1810c 100644
> >--- a/virt/kvm/kvm_main.c
> >+++ b/virt/kvm/kvm_main.c
> >@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
> >
> >  void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >  {
> >-       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >+       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||

The is_error_noslot_pfn() check shouldn't be overridden by zone_device.

> >+           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))

But rather than special case kvm_release_pfn_clean(), I'd rather KVM
explicitly handle ZONE_DEVICE pages, there are other flows where KVM
really should be aware of ZONE_DEVICE pages, e.g. for sanity checks and
whatnot.  There are surprisingly few callers of kvm_is_reserved_pfn(), so
it's actually not too big of a change. 

> >                 put_page(pfn_to_page(pfn));
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
> 
> I had the same thought, but I do wonder about the kvm_get_pfn() users,
> e.g.,:
> 
> hva_to_pfn_remapped():
> 	r = follow_pfn(vma, addr, &pfn);
> 	...
> 	kvm_get_pfn(pfn);
> 	...
> 
> We would not take a reference for ZONE_DEVICE, but later drop one reference
> via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* dangerous to
> use. I can't tell if this can happen right now.
> 
> We do have 3 users of kvm_get_pfn() that we have to audit before this
> change. Also, we should add a comment to kvm_get_pfn() that it should never
> be used with possible ZONE_DEVICE pages.
> 
> Also, we should add a comment to kvm_release_pfn_clean(), describing why we
> treat ZONE_DEVICE in a special way here.
> 
> 
> We can then progress like this
> 
> 1. Get this fix upstream, it's somewhat unrelated to this series.
> 2. This patch here remains as is in this series (+/- documentation update)
> 3. Long term, rework KVM to not treat ZONE_DEVICE pages like reserved
> pages. E.g., get rid of kvm_get_pfn(). Then, this special zone check can go.

Dropping kvm_get_pfn() is less than ideal, and at this point unnecessary.
I'm 99% sure the existing call sites for kvm_get_pfn() can never be
reached with ZONE_DEVICE pages.  I think we can do:

  1. Get a fix upstream to have KVM stop treating ZONE_DEVICE pages as
     reserved PFNs, i.e. exempt them in kvm_is_reserved_pfn() and change
     the callers of kvm_is_reserved_pfn() to handle ZONE_DEVICE pages.
  2. Drop this patch from the series, and instead remove the special
     treatment of ZONE_DEVICE pages from kvm_is_reserved_pfn().

Give me a few minutes to prep a patch.
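
(A sketch of where that could land; illustrative only, not the final patch:)

bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
	/*
	 * Only safe to call on PFNs KVM already holds a reference to;
	 * without a reference the page could be freed under us and the
	 * is_zone_device_page() lookup would be racy (see Dan's note above).
	 */
	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
		return false;

	return is_zone_device_page(pfn_to_page(pfn));
}

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn)) &&
		       !kvm_is_zone_device_pfn(pfn);

	return true;
}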


Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@  __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }