
[10/19] KVM: Use follow_pfnmap API

Message ID 20240809160909.1023470-11-peterx@redhat.com
State New
Series mm: Support huge pfnmaps

Commit Message

Peter Xu Aug. 9, 2024, 4:09 p.m. UTC
Use the new pfnmap API to allow huge MMIO mappings for VMs.  The remaining
work is handled on the other side (host_pfn_mapping_level()).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)
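
For context, the follow_pfnmap API used here follows a start/end pattern: the
caller holds the mmap lock, follow_pfnmap_start() locks the page table and
fills in the args struct if a valid leaf mapping exists, and
follow_pfnmap_end() drops the lock again.  A minimal sketch of that pattern
(not upstream code; only the args fields this diff actually touches are
assumed):

static int pfnmap_lookup_sketch(struct vm_area_struct *vma, unsigned long addr,
				unsigned long *pfn, bool *writable)
{
	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
	int r;

	/* Caller is assumed to hold the mmap lock, as hva_to_pfn_remapped() does. */
	r = follow_pfnmap_start(&args);		/* locks the page table, fills args */
	if (r)
		return r;			/* no valid mapping at addr */

	*pfn = args.pfn;			/* stable until follow_pfnmap_end() */
	if (writable)
		*writable = args.writable;

	follow_pfnmap_end(&args);		/* drops the page table lock */
	return 0;
}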

Comments

Axel Rasmussen Aug. 9, 2024, 5:23 p.m. UTC | #1
On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> is done perfectly on the other side (host_pfn_mapping_level()).

I don't think it has to be done in this series, but a future
optimization to consider is having follow_pfnmap just tell the caller
about the mapping level directly. It already found this information as
part of its walk. I think there's a possibility to simplify KVM /
avoid it having to do its own walk again later.
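
To make that concrete (purely illustrative, not part of this series; "level"
and "p_level" are invented names for this sketch), the caller side could
hypothetically look like:

	/*
	 * Hypothetical only: if follow_pfnmap_start() also reported the level
	 * of the leaf entry it found, KVM could size its stage-2 mapping
	 * without a second host page-table walk.
	 */
	r = follow_pfnmap_start(&args);
	if (r)
		return r;

	*p_pfn = args.pfn;
	/* e.g. PG_LEVEL_4K / PG_LEVEL_2M / PG_LEVEL_1G in x86 terms */
	*p_level = args.level;

	follow_pfnmap_end(&args);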

>
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 19 +++++++------------
>  1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..9fb1c527a8e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2862,13 +2862,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                                unsigned long addr, bool write_fault,
>                                bool *writable, kvm_pfn_t *p_pfn)
>  {
> +       struct follow_pfnmap_args args = { .vma = vma, .address = addr };
>         kvm_pfn_t pfn;
> -       pte_t *ptep;
> -       pte_t pte;
> -       spinlock_t *ptl;
>         int r;
>
> -       r = follow_pte(vma, addr, &ptep, &ptl);
> +       r = follow_pfnmap_start(&args);
>         if (r) {
>                 /*
>                  * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> @@ -2883,21 +2881,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                 if (r)
>                         return r;
>
> -               r = follow_pte(vma, addr, &ptep, &ptl);
> +               r = follow_pfnmap_start(&args);
>                 if (r)
>                         return r;
>         }
>
> -       pte = ptep_get(ptep);
> -
> -       if (write_fault && !pte_write(pte)) {
> +       if (write_fault && !args.writable) {
>                 pfn = KVM_PFN_ERR_RO_FAULT;
>                 goto out;
>         }
>
>         if (writable)
> -               *writable = pte_write(pte);
> -       pfn = pte_pfn(pte);
> +               *writable = args.writable;
> +       pfn = args.pfn;
>
>         /*
>          * Get a reference here because callers of *hva_to_pfn* and
> @@ -2918,9 +2914,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>          */
>         if (!kvm_try_get_pfn(pfn))
>                 r = -EFAULT;
> -
>  out:
> -       pte_unmap_unlock(ptep, ptl);
> +       follow_pfnmap_end(&args);
>         *p_pfn = pfn;
>
>         return r;
> --
> 2.45.0
>
Peter Xu Aug. 12, 2024, 6:58 p.m. UTC | #2
On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > is done perfectly on the other side (host_pfn_mapping_level()).
> 
> I don't think it has to be done in this series, but a future
> optimization to consider is having follow_pfnmap just tell the caller
> about the mapping level directly. It already found this information as
> part of its walk. I think there's a possibility to simplify KVM /
> avoid it having to do its own walk again later.

AFAIU pfnmap isn't special in this case: we take the same "walk the pgtable
twice" approach for regular pages here too, so this probably isn't directly
relevant to this patch alone.

But I agree with you; it sounds like something we can consider trying.  I
would be curious whether the perf difference would be measurable in this
specific case, though.  The first walk will warm everything up, so I'd
expect the 2nd walk (which is lockless) to be pretty fast normally.

Thanks,
Axel Rasmussen Aug. 12, 2024, 10:47 p.m. UTC | #3
On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > > is done perfectly on the other side (host_pfn_mapping_level()).
> >
> > I don't think it has to be done in this series, but a future
> > optimization to consider is having follow_pfnmap just tell the caller
> > about the mapping level directly. It already found this information as
> > part of its walk. I think there's a possibility to simplify KVM /
> > avoid it having to do its own walk again later.
>
> AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> idea also to a generic page here, so probably not directly relevant to this
> patch alone.
>
> But I agree with you, sounds like something we can consider trying.  I
> would be curious on whether the perf difference would be measurable in this
> specific case, though.  I mean, this first walk will heat up all the
> things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> normally.

Agreed, the main benefit is probably just code simplification.

>
> Thanks,
>
> --
> Peter Xu
>
Sean Christopherson Aug. 12, 2024, 11:44 p.m. UTC | #4
On Mon, Aug 12, 2024, Axel Rasmussen wrote:
> On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > > > is done perfectly on the other side (host_pfn_mapping_level()).
> > >
> > > I don't think it has to be done in this series, but a future
> > > optimization to consider is having follow_pfnmap just tell the caller
> > > about the mapping level directly. It already found this information as
> > > part of its walk. I think there's a possibility to simplify KVM /
> > > avoid it having to do its own walk again later.
> >
> > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > idea also to a generic page here, so probably not directly relevant to this
> > patch alone.

Ya.  My original hope was that KVM could simply walk the host page tables and get
whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
regular pages.  That might be feasible after dropping all of KVM's refcounting
shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
I too think it won't provide any meaningful performance boost.

> > But I agree with you, sounds like something we can consider trying.  I
> > would be curious on whether the perf difference would be measurable in this
> > specific case, though.  I mean, this first walk will heat up all the
> > things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> > normally.
> 
> Agreed, the main benefit is probably just code simplification.

+1.  I wouldn't spend much time, if any, trying to plumb the size back out.
Unless we can convert regular pages as well, it'd probably be more confusing to
have separate ways of getting the mapping size.
Jason Gunthorpe Aug. 14, 2024, 1:15 p.m. UTC | #5
On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:

> > > > I don't think it has to be done in this series, but a future
> > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > about the mapping level directly. It already found this information as
> > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > avoid it having to do its own walk again later.
> > >
> > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > idea also to a generic page here, so probably not directly relevant to this
> > > patch alone.
> 
> Ya.  My original hope was that KVM could simply walk the host page tables and get
> whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> regular pages.  That might be feasible after dropping all of KVM's refcounting
> shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
> I too think it won't provide any meaningful performance boost.

The main thing, from my perspective, is that KVM reliably creates 1G
mappings in its table if the VMA has 1G mappings, across all arches
and scenarios. For normal memory and PFNMAP equally.

Not returning the size here makes me wonder if that actually happens?
Does KVM have another way to know what size entry to create?

Jason
Sean Christopherson Aug. 14, 2024, 2:23 p.m. UTC | #6
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
> 
> > > > > I don't think it has to be done in this series, but a future
> > > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > > about the mapping level directly. It already found this information as
> > > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > > avoid it having to do its own walk again later.
> > > >
> > > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > > idea also to a generic page here, so probably not directly relevant to this
> > > > patch alone.
> > 
> > Ya.  My original hope was that KVM could simply walk the host page tables and get
> > whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> > regular pages.  That might be feasible after dropping all of KVM's refcounting
> > shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
> > I too think it won't provide any meaningful performance boost.
> 
> The main thing, from my perspective, is that KVM reliably creates 1G
> mappings in its table if the VMA has 1G mappings, across all arches
> and scenarios. For normal memory and PFNMAP equally.

Yes, KVM walks the host page tables for the user virtual address and uses
whatever page size it finds, regardless of the mapping type.

> Not returning the size here makes me wonder if that actually happens?

It does happen; the idea here was purely to avoid the second page-table walk.

> Does KVM have another way to know what size entry to create?
> 
> Jason
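
For reference, the second walk being discussed is a lockless walk of the host
page tables, in the spirit of KVM's host_pfn_mapping_level().  A simplified
sketch (not the exact upstream code, which also runs with IRQs disabled and
relies on TLB-flush synchronization to keep the snapshotted entries valid;
PG_LEVEL_* are the x86 names):

static int host_mapping_level_sketch(struct mm_struct *mm, unsigned long hva)
{
	pgd_t pgd;
	p4d_t p4d;
	pud_t pud;
	pmd_t pmd;

	/* Snapshot each level with READ_ONCE() and stop at the first hole. */
	pgd = READ_ONCE(*pgd_offset(mm, hva));
	if (pgd_none(pgd))
		return PG_LEVEL_4K;

	p4d = READ_ONCE(*p4d_offset(&pgd, hva));
	if (p4d_none(p4d) || !p4d_present(p4d))
		return PG_LEVEL_4K;

	pud = READ_ONCE(*pud_offset(&p4d, hva));
	if (pud_none(pud) || !pud_present(pud))
		return PG_LEVEL_4K;
	if (pud_leaf(pud))			/* 1G leaf in the host table */
		return PG_LEVEL_1G;

	pmd = READ_ONCE(*pmd_offset(&pud, hva));
	if (pmd_none(pmd) || !pmd_present(pmd))
		return PG_LEVEL_4K;
	if (pmd_leaf(pmd))			/* 2M leaf in the host table */
		return PG_LEVEL_2M;

	return PG_LEVEL_4K;
}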

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9fb1c527a8e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2862,13 +2862,11 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool write_fault,
 			       bool *writable, kvm_pfn_t *p_pfn)
 {
+	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
 	kvm_pfn_t pfn;
-	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
 	int r;
 
-	r = follow_pte(vma, addr, &ptep, &ptl);
+	r = follow_pfnmap_start(&args);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -2883,21 +2881,19 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma, addr, &ptep, &ptl);
+		r = follow_pfnmap_start(&args);
 		if (r)
 			return r;
 	}
 
-	pte = ptep_get(ptep);
-
-	if (write_fault && !pte_write(pte)) {
+	if (write_fault && !args.writable) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
 	if (writable)
-		*writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+		*writable = args.writable;
+	pfn = args.pfn;
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -2918,9 +2914,8 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	if (!kvm_try_get_pfn(pfn))
 		r = -EFAULT;
-
 out:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 	*p_pfn = pfn;
 
 	return r;