
[10/19] KVM: Use follow_pfnmap API

Message ID 20240809160909.1023470-11-peterx@redhat.com
State New
Series mm: Support huge pfnmaps

Commit Message

Peter Xu Aug. 9, 2024, 4:09 p.m. UTC
Use the new pfnmap API to allow huge MMIO mappings for VMs.  The remaining
work is handled on the other side (host_pfn_mapping_level()).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)
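
For context, the follow_pfnmap API used here follows a start/end pattern: the
caller holds the mmap lock, follow_pfnmap_start() locks the page table and
fills in the args struct if a valid leaf mapping exists, and
follow_pfnmap_end() drops the lock again.  A minimal sketch of that pattern
(not upstream code; only the args fields this diff actually touches are
assumed):

static int pfnmap_lookup_sketch(struct vm_area_struct *vma, unsigned long addr,
				unsigned long *pfn, bool *writable)
{
	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
	int r;

	/* Caller is assumed to hold the mmap lock, as hva_to_pfn_remapped() does. */
	r = follow_pfnmap_start(&args);		/* locks the page table, fills args */
	if (r)
		return r;			/* no valid mapping at addr */

	*pfn = args.pfn;			/* stable until follow_pfnmap_end() */
	if (writable)
		*writable = args.writable;

	follow_pfnmap_end(&args);		/* drops the page table lock */
	return 0;
}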

Comments

Axel Rasmussen Aug. 9, 2024, 5:23 p.m. UTC | #1
On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> is done perfectly on the other side (host_pfn_mapping_level()).

I don't think it has to be done in this series, but a future
optimization to consider is having follow_pfnmap just tell the caller
about the mapping level directly. It already found this information as
part of its walk. I think there's a possibility to simplify KVM /
avoid it having to do its own walk again later.
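
To make that concrete (purely illustrative, not part of this series; "level"
and "p_level" are invented names for this sketch), the caller side could
hypothetically look like:

	/*
	 * Hypothetical only: if follow_pfnmap_start() also reported the level
	 * of the leaf entry it found, KVM could size its stage-2 mapping
	 * without a second host page-table walk.
	 */
	r = follow_pfnmap_start(&args);
	if (r)
		return r;

	*p_pfn = args.pfn;
	/* e.g. PG_LEVEL_4K / PG_LEVEL_2M / PG_LEVEL_1G in x86 terms */
	*p_level = args.level;

	follow_pfnmap_end(&args);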

>
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 19 +++++++------------
>  1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..9fb1c527a8e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2862,13 +2862,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                                unsigned long addr, bool write_fault,
>                                bool *writable, kvm_pfn_t *p_pfn)
>  {
> +       struct follow_pfnmap_args args = { .vma = vma, .address = addr };
>         kvm_pfn_t pfn;
> -       pte_t *ptep;
> -       pte_t pte;
> -       spinlock_t *ptl;
>         int r;
>
> -       r = follow_pte(vma, addr, &ptep, &ptl);
> +       r = follow_pfnmap_start(&args);
>         if (r) {
>                 /*
>                  * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> @@ -2883,21 +2881,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                 if (r)
>                         return r;
>
> -               r = follow_pte(vma, addr, &ptep, &ptl);
> +               r = follow_pfnmap_start(&args);
>                 if (r)
>                         return r;
>         }
>
> -       pte = ptep_get(ptep);
> -
> -       if (write_fault && !pte_write(pte)) {
> +       if (write_fault && !args.writable) {
>                 pfn = KVM_PFN_ERR_RO_FAULT;
>                 goto out;
>         }
>
>         if (writable)
> -               *writable = pte_write(pte);
> -       pfn = pte_pfn(pte);
> +               *writable = args.writable;
> +       pfn = args.pfn;
>
>         /*
>          * Get a reference here because callers of *hva_to_pfn* and
> @@ -2918,9 +2914,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>          */
>         if (!kvm_try_get_pfn(pfn))
>                 r = -EFAULT;
> -
>  out:
> -       pte_unmap_unlock(ptep, ptl);
> +       follow_pfnmap_end(&args);
>         *p_pfn = pfn;
>
>         return r;
> --
> 2.45.0
>
Peter Xu Aug. 12, 2024, 6:58 p.m. UTC | #2
On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > is done perfectly on the other side (host_pfn_mapping_level()).
> 
> I don't think it has to be done in this series, but a future
> optimization to consider is having follow_pfnmap just tell the caller
> about the mapping level directly. It already found this information as
> part of its walk. I think there's a possibility to simplify KVM /
> avoid it having to do its own walk again later.

AFAIU pfnmap isn't special in this case: we take the same "walk the pgtable
twice" approach for regular pages here too, so this probably isn't directly
relevant to this patch alone.

But I agree with you; it sounds like something we can consider trying.  I
would be curious whether the perf difference would be measurable in this
specific case, though.  The first walk will warm everything up, so I'd
expect the 2nd walk (which is lockless) to be pretty fast normally.

Thanks,
Axel Rasmussen Aug. 12, 2024, 10:47 p.m. UTC | #3
On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > > is done perfectly on the other side (host_pfn_mapping_level()).
> >
> > I don't think it has to be done in this series, but a future
> > optimization to consider is having follow_pfnmap just tell the caller
> > about the mapping level directly. It already found this information as
> > part of its walk. I think there's a possibility to simplify KVM /
> > avoid it having to do its own walk again later.
>
> AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> idea also to a generic page here, so probably not directly relevant to this
> patch alone.
>
> But I agree with you, sounds like something we can consider trying.  I
> would be curious on whether the perf difference would be measurable in this
> specific case, though.  I mean, this first walk will heat up all the
> things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> normally.

Agreed, the main benefit is probably just code simplification.

>
> Thanks,
>
> --
> Peter Xu
>
Sean Christopherson Aug. 12, 2024, 11:44 p.m. UTC | #4
On Mon, Aug 12, 2024, Axel Rasmussen wrote:
> On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest work
> > > > is done perfectly on the other side (host_pfn_mapping_level()).
> > >
> > > I don't think it has to be done in this series, but a future
> > > optimization to consider is having follow_pfnmap just tell the caller
> > > about the mapping level directly. It already found this information as
> > > part of its walk. I think there's a possibility to simplify KVM /
> > > avoid it having to do its own walk again later.
> >
> > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > idea also to a generic page here, so probably not directly relevant to this
> > patch alone.

Ya.  My original hope was that KVM could simply walk the host page tables and get
whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
regular pages.  That might be feasible after dropping all of KVM's refcounting
shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
I too think it won't provide any meaningful performance boost.

> > But I agree with you, sounds like something we can consider trying.  I
> > would be curious on whether the perf difference would be measurable in this
> > specific case, though.  I mean, this first walk will heat up all the
> > things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> > normally.
> 
> Agreed, the main benefit is probably just code simplification.

+1.  I wouldn't spend much time, if any, trying to plumb the size back out.
Unless we can convert regular pages as well, it'd probably be more confusing to
have separate ways of getting the mapping size.
Jason Gunthorpe Aug. 14, 2024, 1:15 p.m. UTC | #5
On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:

> > > > I don't think it has to be done in this series, but a future
> > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > about the mapping level directly. It already found this information as
> > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > avoid it having to do its own walk again later.
> > >
> > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > idea also to a generic page here, so probably not directly relevant to this
> > > patch alone.
> 
> Ya.  My original hope was that KVM could simply walk the host page tables and get
> whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> regular pages.  That might be feasible after dropping all of KVM's refcounting
> shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
> I too think it won't provide any meaningful performance boost.

The main thing, from my perspective, is that KVM reliably creates 1G
mappings in its table if the VMA has 1G mappings, across all arches
and scenarios. For normal memory and PFNMAP equally.

Not returning the size here makes me wonder if that actually happens?
Does KVM have another way to know what size entry to create?

Jason
Sean Christopherson Aug. 14, 2024, 2:23 p.m. UTC | #6
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
> 
> > > > > I don't think it has to be done in this series, but a future
> > > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > > about the mapping level directly. It already found this information as
> > > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > > avoid it having to do its own walk again later.
> > > >
> > > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > > idea also to a generic page here, so probably not directly relevant to this
> > > > patch alone.
> > 
> > Ya.  My original hope was that KVM could simply walk the host page tables and get
> > whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> > regular pages.  That might be feasible after dropping all of KVM's refcounting
> > shenanigans[*]?  Not sure, haven't thought too much about it, precisely because
> > I too think it won't provide any meaningful performance boost.
> 
> The main thing, from my perspective, is that KVM reliably creates 1G
> mappings in its table if the VMA has 1G mappings, across all arches
> and scenarios. For normal memory and PFNMAP equally.

Yes, KVM walks the host page tables for the user virtual address and uses
whatever page size it finds, regardless of the mapping type.

> Not returning the size here makes me wonder if that actually happens?

It does happen; the idea here was purely to avoid the second page-table walk.

> Does KVM have another way to know what size entry to create?
> 
> Jason
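
For reference, the second walk being discussed is a lockless walk of the host
page tables, in the spirit of KVM's host_pfn_mapping_level().  A simplified
sketch (not the exact upstream code, which also runs with IRQs disabled and
relies on TLB-flush synchronization to keep the snapshotted entries valid;
PG_LEVEL_* are the x86 names):

static int host_mapping_level_sketch(struct mm_struct *mm, unsigned long hva)
{
	pgd_t pgd;
	p4d_t p4d;
	pud_t pud;
	pmd_t pmd;

	/* Snapshot each level with READ_ONCE() and stop at the first hole. */
	pgd = READ_ONCE(*pgd_offset(mm, hva));
	if (pgd_none(pgd))
		return PG_LEVEL_4K;

	p4d = READ_ONCE(*p4d_offset(&pgd, hva));
	if (p4d_none(p4d) || !p4d_present(p4d))
		return PG_LEVEL_4K;

	pud = READ_ONCE(*pud_offset(&p4d, hva));
	if (pud_none(pud) || !pud_present(pud))
		return PG_LEVEL_4K;
	if (pud_leaf(pud))			/* 1G leaf in the host table */
		return PG_LEVEL_1G;

	pmd = READ_ONCE(*pmd_offset(&pud, hva));
	if (pmd_none(pmd) || !pmd_present(pmd))
		return PG_LEVEL_4K;
	if (pmd_leaf(pmd))			/* 2M leaf in the host table */
		return PG_LEVEL_2M;

	return PG_LEVEL_4K;
}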

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9fb1c527a8e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2862,13 +2862,11 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool write_fault,
 			       bool *writable, kvm_pfn_t *p_pfn)
 {
+	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
 	kvm_pfn_t pfn;
-	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
 	int r;
 
-	r = follow_pte(vma, addr, &ptep, &ptl);
+	r = follow_pfnmap_start(&args);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -2883,21 +2881,19 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma, addr, &ptep, &ptl);
+		r = follow_pfnmap_start(&args);
 		if (r)
 			return r;
 	}
 
-	pte = ptep_get(ptep);
-
-	if (write_fault && !pte_write(pte)) {
+	if (write_fault && !args.writable) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
 	if (writable)
-		*writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+		*writable = args.writable;
+	pfn = args.pfn;
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -2918,9 +2914,8 @@  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	if (!kvm_try_get_pfn(pfn))
 		r = -EFAULT;
-
 out:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 	*p_pfn = pfn;
 
 	return r;