
[v4,2/2] kvm: Use huge pages for DAX-backed files

Message ID 20191211213207.215936-3-brho@google.com (mailing list archive)
State New, archived
Series kvm: Use huge pages for DAX-backed files

Commit Message

Barret Rhoden Dec. 11, 2019, 9:32 p.m. UTC
This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound.  The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.  For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini Dec. 12, 2019, 12:21 a.m. UTC | #1
On 11/12/19 22:32, Barret Rhoden wrote:
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:
> +		return true;
> +	}
> +	return false;

Should this simply be "> PAGE_SHIFT"?

Paolo
David Hildenbrand Dec. 12, 2019, 12:22 p.m. UTC | #2
On 11.12.19 22:32, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  	return -EFAULT;
>  }
>  
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:

Shouldn't this be PUD_SHIFT?

But I agree with Paolo that this is simply

return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

> +		return true;
> +	}
> +	return false;
> +}
> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  					gfn_t gfn, kvm_pfn_t *pfnp,
>  					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 * here.
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 * mapping if the indirect sp has level = 1.
>  		 */
>  		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>  			pte_list_remove(rmap_head, sptep);
>  
>  			if (kvm_available_flush_tlb_with_range())
> 

The patch itself looks good to me (in particular, it cleans up these two
places a bit). I am not an expert on the locking part, so I can't give my RB.
Liran Alon Dec. 12, 2019, 12:33 p.m. UTC | #3
> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> 
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.

For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
will return the page-size of the hugetlbfs without the need to parse the page-tables.
See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().

Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.

Therefore, it seems more logical to me that:
(a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
(b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.

> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> 	return -EFAULT;
> }
> 
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {

Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
Was this changed by a commit I missed?

-Liran

> +	case PMD_SHIFT:
> +	case PUD_SIZE:
> +		return true;
> +	}
> +	return false;
> +}
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> 					gfn_t gfn, kvm_pfn_t *pfnp,
> 					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> 	 * here.
> 	 */
> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> 		unsigned long mask;
> 		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> 		 * mapping if the indirect sp has level = 1.
> 		 */
> 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> 			pte_list_remove(rmap_head, sptep);
> 
> 			if (kvm_available_flush_tlb_with_range())
> -- 
> 2.24.0.525.g8f36a354ae-goog
>
Barret Rhoden Dec. 12, 2019, 4:31 p.m. UTC | #4
On 12/12/19 7:22 AM, David Hildenbrand wrote:
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> +	case PMD_SHIFT:
>> +	case PUD_SIZE:
> 
> Shouldn't this be PUD_SHIFT?
> 
> But I agree with Paolo, that this is simply
> 
> return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

Yes, good call.  I'll fix that in the next version.
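
For reference, with the simplification Paolo and David suggested folded in,
the helper would look roughly like this (an illustrative sketch, not the
posted patch):

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);
	unsigned long hva;

	if (!is_zone_device_page(page))
		return PageTransCompoundMap(page);

	/* DAX pages are not compound; consult the host page table instead. */
	hva = gfn_to_hva(kvm, gfn);
	if (kvm_is_error_hva(hva))
		return false;

	/*
	 * Safe to walk the page table: the caller holds the KVM mmu_lock
	 * after a successful mmu_notifier_retry.
	 */
	return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;
}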
Dan Williams Dec. 12, 2019, 4:54 p.m. UTC | #5
On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> >
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound.  The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not.  For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
>
> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>
> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>
> Therefore, it seems more logical to me that:
> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.

A given dax-mapped vma may have mixed page sizes, so ->page_size()
can't be used reliably to enumerate the mapping size.

> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.

DAX pages do not participate in THP and do not have the
PageTransCompound accounting. The only mechanism that records the
mapping size for dax is the page tables themselves.
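
Concretely, the walk that recovers the mapping size looks roughly like the
following (a simplified sketch of dev_pagemap_mapping_shift() in the
(address, mm) form that patch 1 of this series introduces; the real helper
lives in mm/memory-failure.c and differs in detail):

static unsigned long dev_pagemap_mapping_shift(unsigned long address,
					       struct mm_struct *mm)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return 0;
	p4d = p4d_offset(pgd, address);
	if (!p4d_present(*p4d))
		return 0;
	pud = pud_offset(p4d, address);
	if (!pud_present(*pud))
		return 0;
	if (pud_devmap(*pud))		/* 1GB device-dax mapping */
		return PUD_SHIFT;
	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return 0;
	if (pmd_devmap(*pmd))		/* 2MB DAX mapping */
		return PMD_SHIFT;
	pte = pte_offset_map(pmd, address);
	if (!pte_present(*pte))
		return 0;
	if (pte_devmap(*pte))		/* base-page DAX mapping */
		return PAGE_SHIFT;
	return 0;
}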


>
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <brho@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >       return -EFAULT;
> > }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +     struct page *page = pfn_to_page(pfn);
> > +     unsigned long hva;
> > +
> > +     if (!is_zone_device_page(page))
> > +             return PageTransCompoundMap(page);
> > +
> > +     /*
> > +      * DAX pages do not use compound pages.  The page should have already
> > +      * been mapped into the host-side page table during try_async_pf(), so
> > +      * we can check the page tables directly.
> > +      */
> > +     hva = gfn_to_hva(kvm, gfn);
> > +     if (kvm_is_error_hva(hva))
> > +             return false;
> > +
> > +     /*
> > +      * Our caller grabbed the KVM mmu_lock with a successful
> > +      * mmu_notifier_retry, so we're safe to walk the page table.
> > +      */
> > +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?
>
> -Liran
>
> > +     case PMD_SHIFT:
> > +     case PUD_SIZE:
> > +             return true;
> > +     }
> > +     return false;
> > +}
> > +
> > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >                                       gfn_t gfn, kvm_pfn_t *pfnp,
> >                                       int *levelp)
> > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >        * here.
> >        */
> >       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > -         !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> > -         PageTransCompoundMap(pfn_to_page(pfn)) &&
> > +         level == PT_PAGE_TABLE_LEVEL &&
> > +         pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> >           !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> >               unsigned long mask;
> >               /*
> > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                * mapping if the indirect sp has level = 1.
> >                */
> >               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> > -                 !kvm_is_zone_device_pfn(pfn) &&
> > -                 PageTransCompoundMap(pfn_to_page(pfn))) {
> > +                 pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> >                       pte_list_remove(rmap_head, sptep);
> >
> >                       if (kvm_available_flush_tlb_with_range())
> > --
> > 2.24.0.525.g8f36a354ae-goog
> >
>
Barret Rhoden Dec. 12, 2019, 5:03 p.m. UTC | #6
On 12/12/19 7:33 AM, Liran Alon wrote:
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?

I changed this in Patch 1.  The place I call it in KVM has the address 
and mm available, which is the only thing dev_pagemap_mapping_shift() 
really needs.  (The first thing it did was convert the page to an address.)

I'll add some more text to patch 1's commit message about that.

Thanks,

Barret
Sean Christopherson Dec. 12, 2019, 5:34 p.m. UTC | #7
On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  	return -EFAULT;
>  }
>  
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:

I assume this means DAX can have 1GB pages?  I ask because KVM's THP logic
has historically relied on THP only supporting 2MB.  I cleaned this up in
a recent series[*], which is in kvm/queue, but I obviously didn't actually
test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

The easiest thing is probably to rebase on kvm/queue.  You'll need to do
that anyways, and I suspect doing so will help shake out any hiccups.

[*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com

> +		return true;
> +	}
> +	return false;
> +}
> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  					gfn_t gfn, kvm_pfn_t *pfnp,
>  					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 * here.
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 * mapping if the indirect sp has level = 1.
>  		 */
>  		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>  			pte_list_remove(rmap_head, sptep);
>  
>  			if (kvm_available_flush_tlb_with_range())
> -- 
> 2.24.0.525.g8f36a354ae-goog
>
Dan Williams Dec. 12, 2019, 5:37 p.m. UTC | #8
On Thu, Dec 12, 2019 at 9:34 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound.  The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not.  For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <brho@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >       return -EFAULT;
> >  }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +     struct page *page = pfn_to_page(pfn);
> > +     unsigned long hva;
> > +
> > +     if (!is_zone_device_page(page))
> > +             return PageTransCompoundMap(page);
> > +
> > +     /*
> > +      * DAX pages do not use compound pages.  The page should have already
> > +      * been mapped into the host-side page table during try_async_pf(), so
> > +      * we can check the page tables directly.
> > +      */
> > +     hva = gfn_to_hva(kvm, gfn);
> > +     if (kvm_is_error_hva(hva))
> > +             return false;
> > +
> > +     /*
> > +      * Our caller grabbed the KVM mmu_lock with a successful
> > +      * mmu_notifier_retry, so we're safe to walk the page table.
> > +      */
> > +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> > +     case PMD_SHIFT:
> > +     case PUD_SIZE:
>
> I assume this means DAX can have 1GB pages?

Correct, it can. Not in the filesystem-dax case, but device-dax
supports 1GB pages.

> I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB.  I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

Yeah, since device-dax is the only path to support longterm page
pinning for vfio device assignment, testing with device-dax + 1GB
pages would be a useful sanity check.
Liran Alon Dec. 12, 2019, 5:39 p.m. UTC | #9
> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
>>> 
>>> This change allows KVM to map DAX-backed files made of huge pages with
>>> huge mappings in the EPT/TDP.
>>> 
>>> DAX pages are not PageTransCompound.  The existing check is trying to
>>> determine if the mapping for the pfn is a huge mapping or not.  For
>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>> For DAX, we can check the page table itself.
>> 
>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>> 
>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>> 
>> Therefore, it seems more logical to me that:
>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
> 
> A given dax-mapped vma may have mixed page sizes so ->page_size()
> can't be used reliably to enumerating the mapping size.

Naive question: Why not split the VMA in this case into multiple VMAs with different results for ->page_size()?
What you are describing sounds like DAX is breaking this callback's semantics in an unpredictable manner.

> 
>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
> 
> DAX pages do not participate in THP and do not have the
> PageTransCompound accounting. The only mechanism that records the
> mapping size for dax is the page tables themselves.

What is the rationale behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs into a larger page?

-Liran

> 
> 
>> 
>>> 
>>> Note that KVM already faulted in the page (or huge page) in the host's
>>> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
>>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>>> 
>>> Signed-off-by: Barret Rhoden <brho@google.com>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>>> 1 file changed, 32 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 6f92b40d798c..cd07bc4e595f 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>>>      return -EFAULT;
>>> }
>>> 
>>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>> +{
>>> +     struct page *page = pfn_to_page(pfn);
>>> +     unsigned long hva;
>>> +
>>> +     if (!is_zone_device_page(page))
>>> +             return PageTransCompoundMap(page);
>>> +
>>> +     /*
>>> +      * DAX pages do not use compound pages.  The page should have already
>>> +      * been mapped into the host-side page table during try_async_pf(), so
>>> +      * we can check the page tables directly.
>>> +      */
>>> +     hva = gfn_to_hva(kvm, gfn);
>>> +     if (kvm_is_error_hva(hva))
>>> +             return false;
>>> +
>>> +     /*
>>> +      * Our caller grabbed the KVM mmu_lock with a successful
>>> +      * mmu_notifier_retry, so we're safe to walk the page table.
>>> +      */
>>> +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> 
>> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
>> Was this changed by a commit I missed?
>> 
>> -Liran
>> 
>>> +     case PMD_SHIFT:
>>> +     case PUD_SIZE:
>>> +             return true;
>>> +     }
>>> +     return false;
>>> +}
>>> +
>>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>                                      gfn_t gfn, kvm_pfn_t *pfnp,
>>>                                      int *levelp)
>>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>       * here.
>>>       */
>>>      if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>>> -         !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>>> -         PageTransCompoundMap(pfn_to_page(pfn)) &&
>>> +         level == PT_PAGE_TABLE_LEVEL &&
>>> +         pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>>>          !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>>>              unsigned long mask;
>>>              /*
>>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>>               * mapping if the indirect sp has level = 1.
>>>               */
>>>              if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>>> -                 !kvm_is_zone_device_pfn(pfn) &&
>>> -                 PageTransCompoundMap(pfn_to_page(pfn))) {
>>> +                 pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>>>                      pte_list_remove(rmap_head, sptep);
>>> 
>>>                      if (kvm_available_flush_tlb_with_range())
>>> --
>>> 2.24.0.525.g8f36a354ae-goog
>>> 
>>
Liran Alon Dec. 12, 2019, 5:45 p.m. UTC | #10
> On 12 Dec 2019, at 19:34, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
>> This change allows KVM to map DAX-backed files made of huge pages with
>> huge mappings in the EPT/TDP.
>> 
>> DAX pages are not PageTransCompound.  The existing check is trying to
>> determine if the mapping for the pfn is a huge mapping or not.  For
>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>> For DAX, we can check the page table itself.
>> 
>> Note that KVM already faulted in the page (or huge page) in the host's
>> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>> 
>> Signed-off-by: Barret Rhoden <brho@google.com>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>> 1 file changed, 32 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 6f92b40d798c..cd07bc4e595f 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>> 	return -EFAULT;
>> }
>> 
>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>> +{
>> +	struct page *page = pfn_to_page(pfn);
>> +	unsigned long hva;
>> +
>> +	if (!is_zone_device_page(page))
>> +		return PageTransCompoundMap(page);
>> +
>> +	/*
>> +	 * DAX pages do not use compound pages.  The page should have already
>> +	 * been mapped into the host-side page table during try_async_pf(), so
>> +	 * we can check the page tables directly.
>> +	 */
>> +	hva = gfn_to_hva(kvm, gfn);
>> +	if (kvm_is_error_hva(hva))
>> +		return false;
>> +
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> +	case PMD_SHIFT:
>> +	case PUD_SIZE:
> 
> I assume this means DAX can have 1GB pages?  I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB.  I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

KVM doesn’t handle 1GB correctly for all types of non-hugetlbfs pages.
One example we have noticed internally, but haven’t yet submitted an upstream patch for, is
pages without “struct page”. In that case, hva_to_pfn() will notice that vma->vm_flags has VM_PFNMAP set
and call hva_to_pfn_remapped() -> follow_pfn().
However, follow_pfn() currently just calls follow_pte(), which uses __follow_pte_pmd(), which doesn’t handle a huge PUD entry.

> 
> The easiest thing is probably to rebase on kvm/queue.  You'll need to do
> that anyways, and I suspect doing so will help shake out any hiccups.
> 
> [*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com
> 
>> +		return true;
>> +	}
>> +	return false;
>> +}
>> +
>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> 					gfn_t gfn, kvm_pfn_t *pfnp,
>> 					int *levelp)
>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> 	 * here.
>> 	 */
>> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
>> +	    level == PT_PAGE_TABLE_LEVEL &&
>> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>> 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>> 		unsigned long mask;
>> 		/*
>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> 		 * mapping if the indirect sp has level = 1.
>> 		 */
>> 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>> -		    !kvm_is_zone_device_pfn(pfn) &&
>> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
>> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>> 			pte_list_remove(rmap_head, sptep);
>> 
>> 			if (kvm_available_flush_tlb_with_range())
>> -- 
>> 2.24.0.525.g8f36a354ae-goog
>>
Dan Williams Dec. 12, 2019, 5:59 p.m. UTC | #11
On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
> >>
> >>
> >>
> >>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> >>>
> >>> This change allows KVM to map DAX-backed files made of huge pages with
> >>> huge mappings in the EPT/TDP.
> >>>
> >>> DAX pages are not PageTransCompound.  The existing check is trying to
> >>> determine if the mapping for the pfn is a huge mapping or not.  For
> >>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> >>> For DAX, we can check the page table itself.
> >>
> >> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> >> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> >> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
> >>
> >> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
> >>
> >> Therefore, it seems more logical to me that:
> >> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
> >
> > A given dax-mapped vma may have mixed page sizes so ->page_size()
> > can't be used reliably to enumerating the mapping size.
>
> Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()?

Filesystems traditionally have not populated ->pagesize() in their
vm_operations; there was no compelling reason to go add it, and the
complexity seems prohibitive.

> What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner.

It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE. Huge
pages in the page cache have a similar issue.
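
The fallback in question is roughly the following (a simplified sketch of
vma_kernel_pagesize(), see mm/hugetlb.c; not verbatim kernel source):

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
	/*
	 * Only hugetlbfs wires up ->pagesize, so DAX and page-cache VMAs
	 * fall through to PAGE_SIZE regardless of how they are mapped.
	 */
	if (vma->vm_ops && vma->vm_ops->pagesize)
		return vma->vm_ops->pagesize(vma);
	return PAGE_SIZE;
}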

> >> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
> >
> > DAX pages do not participate in THP and do not have the
> > PageTransCompound accounting. The only mechanism that records the
> > mapping size for dax is the page tables themselves.
>
> What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?

THP accounting is a function of the page allocator. ZONE_DEVICE pages
are excluded from the page allocator. ZONE_DEVICE is just enough
infrastructure to support pfn_to_page(), page_address(), and
get_user_pages(). Other page allocator services beyond that are not
present.
Liran Alon Dec. 12, 2019, 6:32 p.m. UTC | #12
> On 12 Dec 2019, at 19:59, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
>>>>> 
>>>>> This change allows KVM to map DAX-backed files made of huge pages with
>>>>> huge mappings in the EPT/TDP.
>>>>> 
>>>>> DAX pages are not PageTransCompound.  The existing check is trying to
>>>>> determine if the mapping for the pfn is a huge mapping or not.  For
>>>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>>>> For DAX, we can check the page table itself.
>>>> 
>>>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>>>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>>>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>>>> 
>>>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>>>> 
>>>> Therefore, it seems more logical to me that:
>>>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
>>> 
>>> A given dax-mapped vma may have mixed page sizes so ->page_size()
>>> can't be used reliably to enumerating the mapping size.
>> 
>> Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()?
> 
> Filesystems traditionally have not populated ->pagesize() in their
> vm_operations, there was no compelling reason to go add it and the
> complexity seems prohibitive.

I understand. Though this is technical debt that breaks ->page_size() semantics which might cause a complex bug some day...

> 
>> What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner.
> 
> It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE.

Of course. :) I meant it may be unexpected by the caller.

> Huge
> pages in the page cache has a similar issue.

Ok. I didn't know that. Thanks for the explanation.

> 
>>>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
>>> 
>>> DAX pages do not participate in THP and do not have the
>>> PageTransCompound accounting. The only mechanism that records the
>>> mapping size for dax is the page tables themselves.
>> 
>> What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?
> 
> THP accounting is a function of the page allocator. ZONE_DEVICE pages
> are excluded from the page allocator. ZONE_DEVICE is just enough
> infrastructure to support pfn_to_page(), page_address(), and
> get_user_pages(). Other page allocator services beyond that are not
> present.

Ok.
Barret Rhoden Dec. 12, 2019, 7:16 p.m. UTC | #13
On 12/12/19 12:37 PM, Dan Williams wrote:
> Yeah, since device-dax is the only path to support longterm page
> pinning for vfio device assignment, testing with device-dax + 1GB
> pages would be a useful sanity check.

What are the issues with fs-dax and page pinning?  Is that limitation 
something that is permanent and unfixable (by me or anyone)?

I'd like to put a lot more in a DAX/pmem region than just a guest's 
memory, and having a mountable filesystem would be extremely convenient.

Thanks,

Barret
Dan Williams Dec. 12, 2019, 7:48 p.m. UTC | #14
On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote:
>
> On 12/12/19 12:37 PM, Dan Williams wrote:
> > Yeah, since device-dax is the only path to support longterm page
> > pinning for vfio device assignment, testing with device-dax + 1GB
> > pages would be a useful sanity check.
>
> What are the issues with fs-dax and page pinning?  Is that limitation
> something that is permanent and unfixable (by me or anyone)?

It's a surprisingly painful point of contention...

File-backed DAX pages cannot be truncated while the page is pinned,
because the pin may indicate that DMA is ongoing to the file block /
DAX page. When that pin is from RDMA or VFIO, that creates a situation
where filesystem operations are blocked indefinitely. More details
here: 94db151dc892 "vfio: disable filesystem-dax page pinning".

Currently, to prevent the deadlock, RDMA, VFIO, and IO_URING memory
registration is blocked if the mapping is filesystem-dax backed (see
the FOLL_LONGTERM flag to get_user_pages).
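
As a rough illustration only (pin_user_buffer is a made-up name, not an
existing kernel function), a driver asking for a long-term pin does
something like:

static int pin_user_buffer(unsigned long uaddr, int npages,
			   struct page **pages)
{
	int got;

	/* FOLL_LONGTERM is what gets refused for fs-dax backed mappings. */
	got = get_user_pages_fast(uaddr, npages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (got < 0)
		return got;
	if (got < npages) {
		/* Partial pin: drop the pages we did get and fail. */
		while (got--)
			put_page(pages[got]);
		return -EFAULT;
	}
	return 0;
}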

One of the proposals to break the impasse was to allow the filesystem
to forcibly revoke the mapping. I.e. to use the IOMMU to forcibly kick
the RDMA device out of its registration. That was rejected by RDMA
folks because RDMA applications are not prepared for this revocation
to happen and the application that performed the registration may not
be the application that uses the registration. There was an attempt to
use a file lease to indicate the presence of a file /
memory-registration that is blocking file-system operations, but that
was still less palatable to filesystem folks than just keeping the
status quo of blocking longterm pinning.

That said, the VFIO use case seems a different situation than RDMA.
There's often a 1:1 relationship between the application performing
the memory registration and the application consuming it (the VMM), and
there is always an IOMMU present that could revoke access and kill the
guest if the mapping got truncated. It seems in theory that VFIO could
tolerate a "revoke pin on truncate" mechanism where RDMA could not.

> I'd like to put a lot more in a DAX/pmem region than just a guest's
> memory, and having a mountable filesystem would be extremely convenient.

Why would page pinning be involved in allowing the guest to mount a
filesystem on guest-pmem? That already works today, it's just the
device-passthrough that causes guest memory to be pinned indefinitely.
Barret Rhoden Dec. 12, 2019, 8:08 p.m. UTC | #15
On 12/12/19 2:48 PM, Dan Williams wrote:
> On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote:
>>
>> On 12/12/19 12:37 PM, Dan Williams wrote:
>>> Yeah, since device-dax is the only path to support longterm page
>>> pinning for vfio device assignment, testing with device-dax + 1GB
>>> pages would be a useful sanity check.
>>
>> What are the issues with fs-dax and page pinning?  Is that limitation
>> something that is permanent and unfixable (by me or anyone)?
> 
> It's a surprisingly painful point of contention...

Thanks for the info; I'll check out those threads.

[snip]

>> I'd like to put a lot more in a DAX/pmem region than just a guest's
>> memory, and having a mountable filesystem would be extremely convenient.
> 
> Why would page pinning be involved in allowing the guest to mount a
> filesystem on guest-pmem? That already works today, it's just the
> device-passthrough that causes guest memory to be pinned indefinitely.

I'd like to mount the pmem filesystem on the *host* and use its files 
for the guest's memory.  So far I've just been making an ext4 FS on 
/dev/pmem0 and creating a bunch of files in the FS.  Some of the files 
are the guest memory: one file for each VM.  Other files are just 
metadata that the host uses.

That all works right now, but I'd also like to use VFIO with the guests.

Thanks,

Barret

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..cd07bc4e595f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3384,6 +3384,35 @@  static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long hva;
+
+	if (!is_zone_device_page(page))
+		return PageTransCompoundMap(page);
+
+	/*
+	 * DAX pages do not use compound pages.  The page should have already
+	 * been mapped into the host-side page table during try_async_pf(), so
+	 * we can check the page tables directly.
+	 */
+	hva = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(hva))
+		return false;
+
+	/*
+	 * Our caller grabbed the KVM mmu_lock with a successful
+	 * mmu_notifier_retry, so we're safe to walk the page table.
+	 */
+	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
+	case PMD_SHIFT:
+	case PUD_SIZE:
+		return true;
+	}
+	return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t gfn, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3398,8 +3427,8 @@  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * here.
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	    level == PT_PAGE_TABLE_LEVEL &&
+	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
@@ -6015,8 +6044,7 @@  static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * mapping if the indirect sp has level = 1.
 		 */
 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageTransCompoundMap(pfn_to_page(pfn))) {
+		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())