[00/19] mm: Support huge pfnmaps

Message ID: 20240809160909.1023470-1-peterx@redhat.com

Message

Peter Xu Aug. 9, 2024, 4:08 p.m. UTC
Overview
========

This series is based on mm-unstable, commit 98808d08fc0f (the latest as of
Aug 7th), plus the dax 1g fix [1].  Note that this series should also apply
without the dax 1g fix series, but without it, mprotect() will trigger
similar errors on PUD mappings.

This series implements huge pfnmap support for mm in general.  Huge pfnmap
allows e.g. VM_PFNMAP vmas to map at either the PMD or PUD level, similar
to what we do with dax / thp / hugetlb so far, to benefit from TLB hits.
Now we extend that idea to PFN mappings, e.g. PCI MMIO bars, which can grow
as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
patch (from Alex Williamson) is the first user of huge pfnmap, enabling the
vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's specialty of having no vmemmap
backing it, so most of the kernel routines on huge mappings already simply
fail for them, like GUP or the old-school follow_page() (which was recently
rewritten into the folio_walk* APIs by David).

One trick here is that PUD handling is still immature in generic paths here
and there, as DAX is so far the only user.  This patchset adds the 2nd user
of it.  Hugetlb can be a 3rd user if the hugetlb unification work goes on
smoothly, but that is to be discussed later.

The other trick is how to make gup-fast work for such huge mappings even if
there's no direct sign of knowing whether it's a normal page or an MMIO
mapping.  This series chose to keep the pte_special solution, reusing the
same idea of setting a special bit on pfnmap PMDs/PUDs so that gup-fast
will be able to identify them and fail properly.
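
As a rough illustration of that idea (a minimal sketch, not a hunk from
this series; pmd_special() is assumed to be one of the new helpers the
series introduces), a gup-fast PMD walker can simply bail out when it sees
the special bit, the same way pte_special() already stops gup-fast on
pte-level pfnmaps:

  /* Sketch only: reject huge pfnmap entries in gup-fast. */
  static int gup_fast_pmd_leaf(pmd_t pmd, unsigned long addr, unsigned long end,
                               unsigned int flags, struct page **pages, int *nr)
  {
          /* Special bit set: MMIO/pfnmap, no struct page behind it. */
          if (pmd_special(pmd))
                  return 0;       /* fail fast; the slow path will sort it out */

          /* ... normal huge page handling would continue here ... */
          return 1;
  }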

Along the way, we'll also notice that the major pgtable pfn walker, aka
follow_pte(), will need to retire soon due to the fact that it only works
with ptes.  A new set of simple APIs is introduced (the follow_pfnmap* API)
that can do whatever follow_pte() already does, plus process huge pfnmaps.
Half of this series is about that and about converting all existing pfnmap
walkers to use the new API properly.  Hopefully the new API also looks
better by not exposing e.g. pgtable lock details to the callers, so that it
can be used in an even more straightforward way.
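
To make the conversion concrete, here is a rough usage sketch (the struct
and field names are illustrative of the follow_pfnmap* idea described
above, not necessarily the exact final API):

  /* Sketch: a pfn walker converted from follow_pte() to follow_pfnmap*(). */
  static int lookup_mmio_pfn(struct vm_area_struct *vma, unsigned long addr,
                             unsigned long *pfn)
  {
          struct follow_pfnmap_args args = {
                  .vma = vma,
                  .address = addr,
          };
          int ret;

          ret = follow_pfnmap_start(&args);       /* pgtable lock taken internally */
          if (ret)
                  return ret;

          *pfn = args.pfn;        /* valid for pte, pmd and pud level pfnmaps */

          follow_pfnmap_end(&args);               /* lock dropped here */
          return 0;
  }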

Here, three new Kconfig options are introduced for huge pfnmap:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers will need to select this option in the arch's Kconfig
    when huge pfnmap is supported.  After this patchset is applied, both
    x86_64 and arm64 will enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options are for driver developers to identify whether the current
    arch / config supports huge pfnmaps, so they can decide whether to use
    the huge pfnmap APIs to inject them.  One can refer to the last
    vfio-pci patch from Alex for how to use them properly in a device
    driver; a simplified sketch also follows this list.
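
Below is a simplified sketch of how a driver can key off these options in
its huge_fault handler (modeled loosely on the vfio-pci patch, but not
copied from it; demo_bar_pfn() is a hypothetical helper, and the insertion
helper signatures are assumptions here):

  static vm_fault_t demo_huge_fault(struct vm_fault *vmf, unsigned int order)
  {
          unsigned long pfn = demo_bar_pfn(vmf);  /* hypothetical helper */
          bool write = vmf->flags & FAULT_FLAG_WRITE;

          switch (order) {
          case 0:
                  return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
  #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
          case PMD_ORDER:         /* PMD_SHIFT - PAGE_SHIFT */
                  return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), write);
  #endif
  #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
          case PUD_ORDER:         /* PUD_SHIFT - PAGE_SHIFT */
                  return vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), write);
  #endif
          default:
                  /* Core mm will retry the fault with a smaller order. */
                  return VM_FAULT_FALLBACK;
          }
  }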

So after the whole set is applied, and if one enables the relevant dynamic
debug lines in the vfio-pci core files, we should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it shows that vfio-pci properly faults in PMDs
(order 9) for a few BAR 0 offsets.

Patch Layout
============

Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
Patch 2:         A tiny cleanup
Patch 3-8:       Preparation patches for huge pfnmap (including introducing
                 the special bit for pmd/pud)
Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                 then drop follow_pte() API
Patch 17:        Add huge pfnmap support for x86_64
Patch 18:        Add huge pfnmap support for arm64
Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

Nothing I plan to do myself, as in our VM use case most of these don't yet
apply, but I'll still list some things I think might be interesting.

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.

For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
on 1G will be automatically enabled.

Generally speaking, an arch will need to first support THP / THP_1G, then
provide a special bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new options,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements are met on the VAs.
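
Purely as an illustration of the alignment condition such a future
remap_pfn_range() would need to check before installing a PMD/PUD instead
of PTEs (nothing like this is in the series yet):

  /* Illustrative only: could this chunk be mapped with one PMD/PUD entry? */
  static bool remap_can_use_huge(unsigned long vaddr, unsigned long paddr,
                                 unsigned long size, unsigned long huge_size)
  {
          return IS_ALIGNED(vaddr, huge_size) &&
                 IS_ALIGNED(paddr, huge_size) &&
                 size >= huge_size;
  }

I.e. try PUD_SIZE first, fall back to PMD_SIZE, and use PTEs for whatever
remains.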

When that support is there, it should silently benefit all drivers that use
remap_pfn_range() in their mmap() handlers, with a better TLB hit rate and
overall faster MMIO accesses, similar to what the processor gets with
hugepages.

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() for better VA alignment
at PMD/PUD sizes.  This may, iiuc, normally require userspace changes, as
the driver doesn't normally decide the VA at which a bar is mapped.  But I
don't think I know all the drivers well enough to have the full picture.
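
As a userspace-side illustration of the kind of change that may be needed
(not part of this series; the size and names are examples only), one way to
get an aligned VA is to over-reserve and then map the bar at an aligned
address inside the reservation:

  #include <stdint.h>
  #include <sys/mman.h>

  #define ALIGN_1G        (1UL << 30)     /* PUD size on x86_64 */

  static void *map_bar_aligned(int bar_fd, size_t len)
  {
          /* Over-reserve so a 1G-aligned start always exists in the window. */
          void *res = mmap(NULL, len + ALIGN_1G, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (res == MAP_FAILED)
                  return NULL;

          uintptr_t va = ((uintptr_t)res + ALIGN_1G - 1) & ~(ALIGN_1G - 1);

          /* MAP_FIXED over our own reservation keeps the computed alignment. */
          return mmap((void *)va, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, bar_fd, 0);
  }

(The unused head and tail of the reservation can be munmap()ed afterwards.)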

Tests
=====

- Cross-build tests that I normally do.  I only saw one bluetooth driver
  build failure on i386 PAE on top of the latest mm-unstable, but it
  shouldn't be relevant.

- run_vmtests.sh whole set, no new failures (e.g. the mlock2 tests already
  fail on mm-unstable)

- Hacked e1000e in QEMU with a 128MB BAR 0, with some prefault tests, plus
  mprotect() and fork() tests on the mapped bar

- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment to make
    sure all pages are mapped with PUDs
  - Main BAR (8GB) starts to use PUD mappings
  - Sub BAR (??MBs?) starts to use PMD mappings
  - Performance-wise, slight improvement compared to the old PTE mappings

- aarch64 + NIC
  - Detached NIC test to make sure the driver loads fine with PMD mappings

Credits all go to Alex for helping test the GPU/NIC use cases above.

Comments welcome, thanks.

[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com

Alex Williamson (1):
  vfio/pci: Implement huge_fault support

Peter Xu (18):
  mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: Drop is_huge_zero_pud()
  mm: Mark special bits for huge pfn mappings when inject
  mm: Allow THP orders for PFNMAPs
  mm/gup: Detect huge pfnmap entries in gup-fast
  mm/pagewalk: Check pfnmap early for folio_walk_start()
  mm/fork: Accept huge pfnmap entries
  mm: Always define pxx_pgprot()
  mm: New follow_pfnmap API
  KVM: Use follow_pfnmap API
  s390/pci_mmio: Use follow_pfnmap API
  mm/x86/pat: Use the new follow_pfnmap API
  vfio: Use the new follow_pfnmap API
  acrn: Use the new follow_pfnmap API
  mm/access_process_vm: Use the new follow_pfnmap API
  mm: Remove follow_pte()
  mm/x86: Support large pfn mappings
  mm/arm64: Support large pfn mappings

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  30 +++++
 arch/powerpc/include/asm/pgtable.h  |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/s390/pci/pci_mmio.c            |  22 ++--
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/include/asm/pgtable.h      |  80 ++++++++-----
 arch/x86/mm/pat/memtype.c           |  17 ++-
 drivers/vfio/pci/vfio_pci_core.c    |  60 +++++++---
 drivers/vfio/vfio_iommu_type1.c     |  16 +--
 drivers/virt/acrn/mm.c              |  16 +--
 include/linux/huge_mm.h             |  16 +--
 include/linux/mm.h                  |  57 ++++++++-
 include/linux/pgtable.h             |  12 ++
 mm/Kconfig                          |  13 ++
 mm/gup.c                            |   6 +
 mm/huge_memory.c                    |  48 +++++---
 mm/memory.c                         | 178 ++++++++++++++++++++--------
 mm/pagewalk.c                       |   5 +
 virt/kvm/kvm_main.c                 |  19 ++-
 21 files changed, 422 insertions(+), 178 deletions(-)

Comments

David Hildenbrand Aug. 9, 2024, 6:12 p.m. UTC | #1
On 09.08.24 18:08, Peter Xu wrote:
> Overview
> ========
> 
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1].  Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
> 
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
> 
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
> 
> Implementation
> ==============
> 
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).

Indeed, skimming most patches, there is very limited core-mm impact. I 
expected much more :)

I suspect primarily because DAX already paved the way. And DAX likely 
supports fault-after-fork, which is why the fork() case wasn't relevant 
before.
Jason Gunthorpe Aug. 14, 2024, 12:37 p.m. UTC | #2
On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> Overview
> ========
> 
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1].  Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
> 
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.

FWIW, I've started to hear people talk about needing this in the VFIO
context with VMs.

vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
setup the IOMMU, but KVM is not able to do it so reliably. There is a
notable performance gap with two-dimensional paging between 4k and 1G
entries in the KVM table. The platforms are being architected with the
assumption that 1G TLB entries will be used throughout the hypervisor
environment.

> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  

There is definitely interest here in extending ARM to support the 1G
size too, what is missing?

> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping.  This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.

Make sense

> More architectures / More page sizes
> ------------------------------------
> 
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> 
> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> on 1G will be automatically enabled.

Oh that sounds like a bigger step..
 
> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied.  Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar.  But I don't
> think I know all the drivers to know the full picture.

How does alignment work? In most cases I'm aware of, userspace does
not use MAP_FIXED, so the expectation would be for the kernel to
automatically select a high alignment. I suppose your cases are
working because qemu uses MAP_FIXED and naturally aligns the BAR
addresses?

> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs

Oh :(

Jason
Sean Christopherson Aug. 14, 2024, 2:35 p.m. UTC | #3
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > Overview
> > ========
> > 
> > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > plus dax 1g fix [1].  Note that this series should also apply if without
> > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > errors otherwise on PUD mappings.
> > 
> > This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > as large as 8GB or even bigger.
> 
> FWIW, I've started to hear people talk about needing this in the VFIO
> context with VMs.
> 
> vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> setup the IOMMU, but KVM is not able to do it so reliably.

Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
a huge page unless the mapping is huge in the primary MMU.  And that's very much
by design, as KVM has no knowledge of what actually resides at a given PFN, and
thus can't determine whether or not its safe to create a huge page if KVM happens
to realize the VM has access to a contiguous range of memory.
Paolo Bonzini Aug. 14, 2024, 2:42 p.m. UTC | #4
On Wed, Aug 14, 2024 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU.  And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not its safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.

Indeed: the EPT is managed as a secondary MMU. It replays the contents
of the primary MMU, apart from A/D bits (which are independent) and
permissions possibly being more restrictive, and that includes the
page size.

Which in turn explains why the VA has to be aligned for KVM to pick up
the hint: aligning the VA allows the primary MMU to use a hugepage,
which is a prerequisite for using it in EPT.

Paolo
Jason Gunthorpe Aug. 14, 2024, 2:43 p.m. UTC | #5
On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > Overview
> > > ========
> > > 
> > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > plus dax 1g fix [1].  Note that this series should also apply if without
> > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > errors otherwise on PUD mappings.
> > > 
> > > This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > as large as 8GB or even bigger.
> > 
> > FWIW, I've started to hear people talk about needing this in the VFIO
> > context with VMs.
> > 
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
> 
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU.  And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not its safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.

Oh? Someone told me recently x86 kvm had code to reassemble contiguous
ranges?

I don't quite understand your safety argument, if the VMA has 1G of
contiguous physical memory described with 4K it is definitely safe for
KVM to reassemble that same memory and represent it as 1G.

Jason
Sean Christopherson Aug. 14, 2024, 8:54 p.m. UTC | #6
+Marc and Oliver

On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > Overview
> > > > ========
> > > > 
> > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > plus dax 1g fix [1].  Note that this series should also apply if without
> > > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > > errors otherwise on PUD mappings.
> > > > 
> > > > This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > as large as 8GB or even bigger.
> > > 
> > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > context with VMs.
> > > 
> > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > setup the IOMMU, but KVM is not able to do it so reliably.
> > 
> > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > a huge page unless the mapping is huge in the primary MMU.  And that's very much
> > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > thus can't determine whether or not its safe to create a huge page if KVM happens
> > to realize the VM has access to a contiguous range of memory.
> 
> Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> ranges?

Nope.  KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
a win in very select use cases, and is overall a non-trivial loss.  

> I don't quite understand your safety argument, if the VMA has 1G of
> contiguous physical memory described with 4K it is definitely safe for
> KVM to reassemble that same memory and represent it as 1G.

That would require taking mmap_lock to get the VMA, which would be a net negative,
especially for workloads that are latency sensitive.   E.g. if userspace is doing
mprotect(), madvise(), etc. on a VMA that is NOT mapped into the guest, taking
mmap_lock in the guest page fault path means vCPUs block waiting for the unrelated
host operation to complete.  And vice versa, subsequent host operations can be
blocked waiting on vCPUs.

Which reminds me...

Marc/Oliver,

TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
on arm64, specifically the mprotect() testcase[1], as performance is significantly
worse compared to x86, and there might be bugs lurking in the mmu_notifier flows.

When running mmu_stress_test the mprotect() phase that makes guest memory read-only
takes more than three times as long on arm64 versus x86.  The time to initially
populate memory (run1) is also notably higher on arm64, as is the time to
mprotect() back to RW protections.

The test doesn't go super far out of its way to control the environment, but it
should be a fairly reasonable apples-to-apples comparison.  

Ouch.  I take that back, it's not apples-to-apples, because the test does more
work for x86.  On x86, during mprotect(PROT_READ), the userspace side skips the
faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
Other architectures stop running the vCPU after the first write -EFAULT and wait
for the mprotect() to complete.  If I comment out the x86-only logic and have
vCPUs stop on the first -EFAULT, the mprotect() time goes way down.

/me fiddles with arm64

And if I have arm64 vCPUs keep faulting, the time goes up, as expected.

With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:

 x86:
  run1 =  6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro =  6.149083106s, rw = 7.713627355s

 arm64:
  run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786

and skipping on -EFAULT and thus writing throughout mprotect():

 x86:
  run1 =  6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s

 arm64:
  run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s

I originally suspected that the main source of the difference is that user_mem_abort()
takes mmap_lock for read.  But I doubt that's the case now that I realize arm64
vCPUs stop after the first -EFAULT, i.e. won't contend mmap_lock.

And it shouldn't be the lack of support for mmu_invalidate_retry_gfn() (on arm64
vs. x86), because the mprotect() is relevant to guest memory (though range-based
retry is something that KVM ARM likely should support).  And again, arm64 is still
much slower when vCPUs stop on the first -EFAULT.

However, before I realized mmap_lock likely wasn't the main problem, I tried to
prove that taking mmap_lock is problematic, and that didn't end well.  When I
hacked user_mem_abort() to not take mmap_lock, the host reboots when the mprotect()
read-only phase kicks in.  AFAICT, there is no crash, e.g. no kdump and nothing
printed to the console, the host suddenly just starts executing firmware code.

To try and rule out a hidden dependency I'm missing, I tried trimming out pretty
much everything that runs under mmap_lock (and only running the selftest, which
doesn't do anything "odd", e.g. doesn't passthrough device memory).  I also forced
small pages, e.g. in case transparent_hugepage_adjust() is somehow reliant on
mmap_lock being taken, to no avail.  Just in case someone can spot an obvious
flaw in my hack, the final diff I tried is below.

Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
fault path for valid VMAs does introduce a performance regression, but only ~30%,
not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
is the main problem, though it's still good justification for avoiding mmap_lock in
the page fault path.

[1] https://lore.kernel.org/all/20240809194335.1726916-9-seanjc@google.com

---
 arch/arm64/kvm/mmu.c | 167 +++----------------------------------------
 1 file changed, 9 insertions(+), 158 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..df551c19f626 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1424,14 +1424,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  bool fault_is_perm)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
-	bool exec_fault, mte_allowed;
-	bool device = false, vfio_allow_any_uc = false;
+	bool write_fault, writable;
+	bool exec_fault;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
-	struct vm_area_struct *vma;
 	short vma_shift;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
@@ -1451,6 +1449,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
+	if (WARN_ON_ONCE(nested) || WARN_ON_ONCE(kvm_has_mte(kvm)))
+		return -EIO;
+
 	/*
 	 * Permission faults just need to update the existing leaf entry,
 	 * and so normally don't require allocations from the memcache. The
@@ -1464,92 +1465,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			return ret;
 	}
 
-	/*
-	 * Let's check if we will get back a huge page backed by hugetlbfs, or
-	 * get block mapping for device MMIO region.
-	 */
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, hva);
-	if (unlikely(!vma)) {
-		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-		mmap_read_unlock(current->mm);
-		return -EFAULT;
-	}
-
-	/*
-	 * logging_active is guaranteed to never be true for VM_PFNMAP
-	 * memslots.
-	 */
-	if (logging_active) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	} else {
-		vma_shift = get_vma_page_shift(vma, hva);
-	}
-
-	switch (vma_shift) {
-#ifndef __PAGETABLE_PMD_FOLDED
-	case PUD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-			break;
-		fallthrough;
-#endif
-	case CONT_PMD_SHIFT:
-		vma_shift = PMD_SHIFT;
-		fallthrough;
-	case PMD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
-			break;
-		fallthrough;
-	case CONT_PTE_SHIFT:
-		vma_shift = PAGE_SHIFT;
-		force_pte = true;
-		fallthrough;
-	case PAGE_SHIFT:
-		break;
-	default:
-		WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
-	}
-
+	vma_shift = PAGE_SHIFT;
 	vma_pagesize = 1UL << vma_shift;
-
-	if (nested) {
-		unsigned long max_map_size;
-
-		max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
-
-		ipa = kvm_s2_trans_output(nested);
-
-		/*
-		 * If we're about to create a shadow stage 2 entry, then we
-		 * can only create a block mapping if the guest stage 2 page
-		 * table uses at least as big a mapping.
-		 */
-		max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
-
-		/*
-		 * Be careful that if the mapping size falls between
-		 * two host sizes, take the smallest of the two.
-		 */
-		if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
-			max_map_size = PMD_SIZE;
-		else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
-			max_map_size = PAGE_SIZE;
-
-		force_pte = (max_map_size == PAGE_SIZE);
-		vma_pagesize = min(vma_pagesize, (long)max_map_size);
-	}
-
-	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
-		fault_ipa &= ~(vma_pagesize - 1);
-
 	gfn = ipa >> PAGE_SHIFT;
-	mte_allowed = kvm_vma_mte_allowed(vma);
-
-	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
-
-	/* Don't use the VMA after the unlock -- it may have vanished */
-	vma = NULL;
 
 	/*
 	 * Read mmu_invalidate_seq so that KVM can detect if the results of
@@ -1560,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * with the smp_wmb() in kvm_mmu_invalidate_end().
 	 */
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
-	mmap_read_unlock(current->mm);
+	smp_rmb();
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
 				   write_fault, &writable, NULL);
@@ -1571,19 +1489,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
-	if (kvm_is_device_pfn(pfn)) {
-		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map.  If instead it was mapped
-		 * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
-		 */
-		device = true;
-	} else if (logging_active && !write_fault) {
+	if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
 		 * fault.
@@ -1591,28 +1497,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	if (exec_fault && device)
-		return -ENOEXEC;
-
-	/*
-	 * Potentially reduce shadow S2 permissions to match the guest's own
-	 * S2. For exec faults, we'd only reach this point if the guest
-	 * actually allowed it (see kvm_s2_handle_perm_fault).
-	 *
-	 * Also encode the level of the original translation in the SW bits
-	 * of the leaf entry as a proxy for the span of that translation.
-	 * This will be retrieved on TLB invalidation from the guest and
-	 * used to limit the invalidation scope if a TTL hint or a range
-	 * isn't provided.
-	 */
-	if (nested) {
-		writable &= kvm_s2_trans_writable(nested);
-		if (!kvm_s2_trans_readable(nested))
-			prot &= ~KVM_PGTABLE_PROT_R;
-
-		prot |= kvm_encode_nested_level(nested);
-	}
-
 	read_lock(&kvm->mmu_lock);
 	pgt = vcpu->arch.hw_mmu->pgt;
 	if (mmu_invalidate_retry(kvm, mmu_seq)) {
@@ -1620,46 +1504,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		goto out_unlock;
 	}
 
-	/*
-	 * If we are not forced to use page mapping, check if we are
-	 * backed by a THP and thus use block mapping if possible.
-	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
-		if (fault_is_perm && fault_granule > PAGE_SIZE)
-			vma_pagesize = fault_granule;
-		else
-			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
-								   hva, &pfn,
-								   &fault_ipa);
-
-		if (vma_pagesize < 0) {
-			ret = vma_pagesize;
-			goto out_unlock;
-		}
-	}
-
-	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
-		/* Check the VMM hasn't introduced a new disallowed VMA */
-		if (mte_allowed) {
-			sanitise_mte_tags(kvm, pfn, vma_pagesize);
-		} else {
-			ret = -EFAULT;
-			goto out_unlock;
-		}
-	}
-
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device) {
-		if (vfio_allow_any_uc)
-			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
-		else
-			prot |= KVM_PGTABLE_PROT_DEVICE;
-	} else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
+	if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
 		   (!nested || kvm_s2_trans_executable(nested))) {
 		prot |= KVM_PGTABLE_PROT_X;
 	}

base-commit: 15e1c3d65975524c5c792fcd59f7d89f00402261
--
Sean Christopherson Aug. 14, 2024, 10 p.m. UTC | #7
On Wed, Aug 14, 2024, Sean Christopherson wrote:
> TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> on arm64, specifically the mprotect() testcase[1], as performance is significantly
> worse compared to x86, and there might be bugs lurking the mmu_notifier flows.
> 
> When running mmu_stress_test the mprotect() phase that makes guest memory read-only
> takes more than three times as long on arm64 versus x86.  The time to initially
> popuplate memory (run1) is also notably higher on arm64, as is the time to
> mprotect() back to RW protections.
> 
> The test doesn't go super far out of its way to control the environment, but it
> should be a fairly reasonable apples-to-apples comparison.  
> 
> Ouch.  I take that back, it's not apples-to-apples, because the test does more
> work for x86.  On x86, during mprotect(PROT_READ), the userspace side skips the
> faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
> Other architectures stop running the vCPU after the first write -EFAULT and wait
> for the mproptect() to complete.  If I comment out the x86-only logic and have
> vCPUs stop on the first -EFAULT, the mprotect() goes way down.
> 
> /me fiddles with arm64
> 
> And if I have arm64 vCPUs keep faulting, the time goes up, as exptected.
> 
> With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
> and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:
> 
>  x86:
>   run1 =  6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro =  6.149083106s, rw = 7.713627355s
> 
>  arm64:
>   run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786
> 
> and skipping on -EFAULT and thus writing throughout mprotect():
> 
>  x86:
>   run1 =  6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s
> 
>  arm64:
>   run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s

Oliver pointed out off-list that the hardware I was using doesn't have forced
write-back, and so the overhead on arm64 is likely due to cache maintenance.
Jason Gunthorpe Aug. 14, 2024, 10:10 p.m. UTC | #8
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> +Marc and Oliver
> 
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > > Overview
> > > > > ========
> > > > > 
> > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > > plus dax 1g fix [1].  Note that this series should also apply if without
> > > > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > > > errors otherwise on PUD mappings.
> > > > > 
> > > > > This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > > as large as 8GB or even bigger.
> > > > 
> > > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > > context with VMs.
> > > > 
> > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > > setup the IOMMU, but KVM is not able to do it so reliably.
> > > 
> > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > > a huge page unless the mapping is huge in the primary MMU.  And that's very much
> > > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > > thus can't determine whether or not its safe to create a huge page if KVM happens
> > > to realize the VM has access to a contiguous range of memory.
> > 
> > Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> > ranges?
> 
> Nope.  KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
> a win in very select use cases, and is overall a non-trivial loss.  

Ah that ARM behavior was probably what was being mentioned then! So
take my original remark as applying to this :)

> > I don't quite understand your safety argument, if the VMA has 1G of
> > contiguous physical memory described with 4K it is definitely safe for
> > KVM to reassemble that same memory and represent it as 1G.
>
> That would require taking mmap_lock to get the VMA, which would be a net negative,
> especially for workloads that are latency sensitive.

You can aggregate if the read and aggregating logic are protected by
mmu notifiers, I think. An invalidation would still have enough
information to clear the aggregate shadow entry. If you get a sequence
number collision then you'd throw away the aggregation.

But yes, I also think it would be slow to have aggregation logic in
KVM. Doing in the main mmu is much better.

Jason
Oliver Upton Aug. 14, 2024, 11:27 p.m. UTC | #9
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> on arm64, specifically the mprotect() testcase[1], as performance is significantly
> worse compared to x86,

Sharing what we discussed offline:

Sean was using a machine w/o FEAT_FWB for this test, so the increased
runtime on arm64 is likely explained by the CMOs we're doing when
creating or invalidating a stage-2 PTE.

Using a machine w/ FEAT_FWB would be better for making these sorts of
cross-architecture comparisons. Beyond CMOs, we do have some 

> and there might be bugs lurking the mmu_notifier flows.

Impossible! :)

> Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> fault path for valid VMAs does introduce a performance regression, but only ~30%,
> not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
> is the main problem, though it's still good justification for avoid mmap_lock in
> the page fault path.

I'm curious how much of that 30% in a microbenchmark would translate to
real world performance, since it isn't *that* egregious. We also have
other uses for getting at the VMA beyond mapping granularity (MTE and
the VFIO Normal-NC hint) that'd require some attention too.
Oliver Upton Aug. 14, 2024, 11:36 p.m. UTC | #10
On Wed, Aug 14, 2024 at 07:10:31PM -0300, Jason Gunthorpe wrote:

[...]

> > Nope.  KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
> > a win in very select use cases, and is overall a non-trivial loss.  
> 
> Ah that ARM behavior was probably what was being mentioned then! So
> take my original remark as applying to this :)
> 
> > > I don't quite understand your safety argument, if the VMA has 1G of
> > > contiguous physical memory described with 4K it is definitely safe for
> > > KVM to reassemble that same memory and represent it as 1G.
> >
> > That would require taking mmap_lock to get the VMA, which would be a net negative,
> > especially for workloads that are latency sensitive.
> 
> You can aggregate if the read and aggregating logic are protected by
> mmu notifiers, I think. A invalidation would still have enough
> information to clear the aggregate shadow entry. If you get a sequence
> number collision then you'd throw away the aggregation.
> 
> But yes, I also think it would be slow to have aggregation logic in
> KVM. Doing in the main mmu is much better.

+1.

For KVM/arm64 I'm quite hesitant to change the behavior to PTE mappings
in this situation (i.e. dump get_vma_page_shift()), as I'm quite certain
that'll have a performance regression on someone's workload. But once we
can derive huge PFNMAP from the primary MMU then we should just normalize
on that.
Oliver Upton Aug. 14, 2024, 11:38 p.m. UTC | #11
On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > worse compared to x86,
> 
> Sharing what we discussed offline:
> 
> Sean was using a machine w/o FEAT_FWB for this test, so the increased
> runtime on arm64 is likely explained by the CMOs we're doing when
> creating or invalidating a stage-2 PTE.
> 
> Using a machine w/ FEAT_FWB would be better for making these sort of
> cross-architecture comparisons. Beyond CMOs, we do have some 

... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table
updates are visible to the system. So there could still be some
arch-specific quirks that'll show up in the test.

> > and there might be bugs lurking the mmu_notifier flows.
> 
> Impossible! :)
> 
> > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
> > is the main problem, though it's still good justification for avoid mmap_lock in
> > the page fault path.
> 
> I'm curious how much of that 30% in a microbenchmark would translate to
> real world performance, since it isn't *that* egregious. We also have
> other uses for getting at the VMA beyond mapping granularity (MTE and
> the VFIO Normal-NC hint) that'd require some attention too.
> 
> -- 
> Thanks,
> Oliver
Sean Christopherson Aug. 15, 2024, 12:23 a.m. UTC | #12
On Wed, Aug 14, 2024, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > > worse compared to x86,
> > 
> > Sharing what we discussed offline:
> > 
> > Sean was using a machine w/o FEAT_FWB for this test, so the increased
> > runtime on arm64 is likely explained by the CMOs we're doing when
> > creating or invalidating a stage-2 PTE.
> > 
> > Using a machine w/ FEAT_FWB would be better for making these sort of
> > cross-architecture comparisons. Beyond CMOs, we do have some 
> 
> ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table
> updates are visible to the system. So there could still be some
> arch-specific quirks that'll show up in the test.

Nope, 'twas FWB.  On a system with FWB, ARM nicely outperforms x86 on mprotect()
when vCPUs stop on the first -EFAULT.  I suspect because ARM can do broadcast TLB
invalidations and doesn't need to interrupt and wait for every vCPU to respond.

  run1 = 10.723194154s, reset = 0.000014732s, run2 = 0.013790876s, ro = 2.151261587s, rw = 10.624272116s

However, having vCPUs continue faulting while mprotect() is running turns the
tables, I suspect due to mmap_lock:

  run1 = 10.768003815s, reset = 0.000012051s, run2 = 0.013781921s, ro = 23.277624455s, rw = 10.649136889s

The x86 numbers since they're out of sight now:

 -EFAULT once
  run1 =  6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro =  6.149083106s, rw = 7.713627355s

 -EFAULT forever
  run1 =  6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s

> > > and there might be bugs lurking the mmu_notifier flows.
> > 
> > Impossible! :)
> > 
> > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > > not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
> > > is the main problem, though it's still good justification for avoid mmap_lock in
> > > the page fault path.
> > 
> > I'm curious how much of that 30% in a microbenchmark would translate to
> > real world performance, since it isn't *that* egregious.

vCPU jitter is the big problem, especially if userspace is doing something odd,
and/or if the kernel is preemptible (which also triggers yield-on-contention logic
for spinlocks, ew).  E.g. the range-based retry to avoid spinning and waiting on
an unrelated MM operation was added by the ChromeOS folks[1] to resolve issues
where an MM operation got preempted and so blocked vCPU faults.

But even for cloud setups with a non-preemptible kernel, contending with unrelated
userspace VMM modification can be problematic, e.g. it turns out even the
gfn_to_pfn_cache logic needs range-based retry[2] (though that's a rather
pathological case where userspace is spamming madvise() to the point where vCPUs
can't even make forward progress).

> > We also have other uses for getting at the VMA beyond mapping granularity
> > (MTE and the VFIO Normal-NC hint) that'd require some attention too.

Yeah, though it seems like it'd be easy enough to take mmap_lock if and only if
it's necessary, e.g. similar to how common KVM takes it only if it encounters
VM_PFNMAP'd memory.

E.g. take mmap_lock if and only if MTE is active (I assume that's uncommon?), or
if the fault is to device memory.

[1] https://lore.kernel.org/all/20210222024522.1751719-1-stevensd@google.com
[2] https://lore.kernel.org/all/f862cefff2ed3f4211b69d785670f41667703cf3.camel@infradead.org
Peter Xu Aug. 15, 2024, 7:20 p.m. UTC | #13
On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
> > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  
> 
> There is definitely interest here in extending ARM to support the 1G
> size too, what is missing?

Currently PUD pfnmap relies on THP_PUD config option:

config ARCH_SUPPORTS_PUD_PFNMAP
	def_bool y
	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.

Ideally, pfnmap is simple enough compared to real THPs that it shouldn't
need to depend on THP at all, but we'll need things like the below to land
first:

https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com

I sent that first a while ago, but I didn't collect enough inputs, and I
decided to unblock this series from that, so x86_64 shouldn't be affected,
and arm64 will at least start to have 2M.

> 
> > The other trick is how to allow gup-fast working for such huge mappings
> > even if there's no direct sign of knowing whether it's a normal page or
> > MMIO mapping.  This series chose to keep the pte_special solution, so that
> > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> > gup-fast will be able to identify them and fail properly.
> 
> Make sense
> 
> > More architectures / More page sizes
> > ------------------------------------
> > 
> > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> > 
> > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> > on 1G will be automatically enabled.
> 
> Oh that sounds like a bigger step..

Just to mention, no real 1G THP is needed here for pfnmaps.  The real gap
is only the pud helpers, which so far only exist under CONFIG_THP_PUD
in huge_memory.c.

>  
> > VFIO is so far the only consumer for the huge pfnmaps after this series
> > applied.  Besides above remap_pfn_range() generic optimization, device
> > driver can also try to optimize its mmap() on a better VA alignment for
> > either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> > as the driver doesn't normally decide the VA to map a bar.  But I don't
> > think I know all the drivers to know the full picture.
> 
> How does alignment work? In most caes I'm aware of the userspace does
> not use MAP_FIXED so the expectation would be for the kernel to
> automatically select a high alignment. I suppose your cases are
> working because qemu uses MAP_FIXED and naturally aligns the BAR
> addresses?
> 
> > - x86_64 + AMD GPU
> >   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
> >     sure all pages to be mapped with PUDs
> 
> Oh :(

So I suppose this answers above. :) Yes, alignment needed.
Kefeng Wang Aug. 16, 2024, 3:05 a.m. UTC | #14
On 2024/8/16 3:20, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>
>> There is definitely interest here in extending ARM to support the 1G
>> size too, what is missing?
> 
> Currently PUD pfnmap relies on THP_PUD config option:
> 
> config ARCH_SUPPORTS_PUD_PFNMAP
> 	def_bool y
> 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
> 
> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
> require to depend on THP at all, but we'll need things like below to land
> first:
> 
> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> 
> I sent that first a while ago, but I didn't collect enough inputs, and I
> decided to unblock this series from that, so x86_64 shouldn't be affected,
> and arm64 will at least start to have 2M.
> 
>>
>>> The other trick is how to allow gup-fast working for such huge mappings
>>> even if there's no direct sign of knowing whether it's a normal page or
>>> MMIO mapping.  This series chose to keep the pte_special solution, so that
>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
>>> gup-fast will be able to identify them and fail properly.
>>
>> Make sense
>>
>>> More architectures / More page sizes
>>> ------------------------------------
>>>
>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>
>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
>>> on 1G will be automatically enabled.

Here is a draft patch to enable THP_PUD on arm64.  It has only passed
DEBUG_VM_PGTABLE, but with it we may test pud pfnmaps on arm64.

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..ff0d27c72020 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -184,6 +184,7 @@ config ARM64
  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
  	select HAVE_ARCH_TRACEHOOK
  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
  	select HAVE_ARCH_VMAP_STACK
  	select HAVE_ARM_SMCCC
  	select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..e013fe458476 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
  #define pud_valid(pud)		pte_valid(pud_pte(pud))
  #define pud_user(pud)		pte_user(pud_pte(pud))
  #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
+#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
+#define pud_devmap(pud)		pte_devmap(pud_pte(pud))
+#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_mkwrite(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_trans_huge(pud_t pud)
+{
+	return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+	return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
+}
+#endif

  static inline bool pgtable_l4_enabled(void);

@@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
  							pmd_pte(entry), dirty);
  }

+static inline int pudp_set_access_flags(struct vm_area_struct *vma,
+					unsigned long address, pud_t *pudp,
+					pud_t entry, int dirty)
+{
+	return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
+							pud_pte(entry), dirty);
+}
+
+#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
  static inline int pud_devmap(pud_t pud)
  {
  	return 0;
  }
+#endif

  static inline int pgd_devmap(pgd_t pgd)
  {
@@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
  {
  	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
  }
+
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pud_t *pudp)
+{
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
+}
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

  static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
@@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
  #define update_mmu_cache(vma, addr, ptep) \
  	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
  #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
+#define update_mmu_cache_pud(vma, address, pud) do { } while (0)

  #ifdef CONFIG_ARM64_PA_BITS_52
  #define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
Peter Xu Aug. 16, 2024, 2:33 p.m. UTC | #15
On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/8/16 3:20, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
> > > > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
> > > 
> > > There is definitely interest here in extending ARM to support the 1G
> > > size too, what is missing?
> > 
> > Currently PUD pfnmap relies on THP_PUD config option:
> > 
> > config ARCH_SUPPORTS_PUD_PFNMAP
> > 	def_bool y
> > 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> > 
> > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
> > 
> > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
> > require to depend on THP at all, but we'll need things like below to land
> > first:
> > 
> > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> > 
> > I sent that first a while ago, but I didn't collect enough inputs, and I
> > decided to unblock this series from that, so x86_64 shouldn't be affected,
> > and arm64 will at least start to have 2M.
> > 
> > > 
> > > > The other trick is how to allow gup-fast working for such huge mappings
> > > > even if there's no direct sign of knowing whether it's a normal page or
> > > > MMIO mapping.  This series chose to keep the pte_special solution, so that
> > > > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> > > > gup-fast will be able to identify them and fail properly.
> > > 
> > > Make sense
> > > 
> > > > More architectures / More page sizes
> > > > ------------------------------------
> > > > 
> > > > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> > > > 
> > > > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> > > > on 1G will be automatically enabled.
> 
> A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE,
> we may test pud pfnmaps on arm64.

Thanks, Kefeng.  It'll be great if it really is this simple and already works.

It might be interesting to know whether it already works if you have a
GPU with a few GBs of BAR around on one of your systems.

Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected
below, 1g pfnmap will be automatically enabled when you rebuild the kernel.
You can double check that by looking for this:

  CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y

And you can try to observe the mappings by enabling dynamic debug for
vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read
something from it.

> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..ff0d27c72020 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -184,6 +184,7 @@ config ARM64
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
>  	select HAVE_ARCH_VMAP_STACK
>  	select HAVE_ARM_SMCCC
>  	select HAVE_ASM_MODVERSIONS
> diff --git a/arch/arm64/include/asm/pgtable.h
> b/arch/arm64/include/asm/pgtable.h
> index 7a4f5604be3f..e013fe458476 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>  #define pud_valid(pud)		pte_valid(pud_pte(pud))
>  #define pud_user(pud)		pte_user(pud_pte(pud))
>  #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
> +#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
> +#define pud_devmap(pud)		pte_devmap(pud_pte(pud))
> +#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
> +#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
> +#define pud_mkwrite(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
> +#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
> +#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline int pud_trans_huge(pud_t pud)
> +{
> +	return pud_val(pud) && pud_present(pud) && !(pud_val(pud) &
> PUD_TABLE_BIT);
> +}
> +
> +static inline pud_t pud_mkdevmap(pud_t pud)
> +{
> +	return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
> +}
> +#endif
> 
>  static inline bool pgtable_l4_enabled(void);
> 
> @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct
> vm_area_struct *vma,
>  							pmd_pte(entry), dirty);
>  }
> 
> +static inline int pudp_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long address, pud_t *pudp,
> +					pud_t entry, int dirty)
> +{
> +	return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
> +							pud_pte(entry), dirty);
> +}
> +
> +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>  static inline int pud_devmap(pud_t pud)
>  {
>  	return 0;
>  }
> +#endif
> 
>  static inline int pgd_devmap(pgd_t pgd)
>  {
> @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct
> vm_area_struct *vma,
>  {
>  	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
>  }
> +
> +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
> +					    unsigned long address,
> +					    pud_t *pudp)
> +{
> +	return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> 
>  static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct
> vm_fault *vmf,
>  #define update_mmu_cache(vma, addr, ptep) \
>  	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
>  #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
> +#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
> 
>  #ifdef CONFIG_ARM64_PA_BITS_52
>  #define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
> -- 
> 2.27.0
Kefeng Wang Aug. 19, 2024, 1:14 p.m. UTC | #16
On 2024/8/16 22:33, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote:
>>
>>
>> On 2024/8/16 3:20, Peter Xu wrote:
>>> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>>>
>>>> There is definitely interest here in extending ARM to support the 1G
>>>> size too, what is missing?
>>>
>>> Currently PUD pfnmap relies on THP_PUD config option:
>>>
>>> config ARCH_SUPPORTS_PUD_PFNMAP
>>> 	def_bool y
>>> 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
>>>
>>> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
>>> require to depend on THP at all, but we'll need things like below to land
>>> first:
>>>
>>> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>>>
>>> I sent that first a while ago, but I didn't collect enough inputs, and I
>>> decided to unblock this series from that, so x86_64 shouldn't be affected,
>>> and arm64 will at least start to have 2M.
>>>
>>>>
>>>>> The other trick is how to allow gup-fast working for such huge mappings
>>>>> even if there's no direct sign of knowing whether it's a normal page or
>>>>> MMIO mapping.  This series chose to keep the pte_special solution, so that
>>>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
>>>>> gup-fast will be able to identify them and fail properly.
>>>>
>>>> Make sense
>>>>
>>>>> More architectures / More page sizes
>>>>> ------------------------------------
>>>>>
>>>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>>>
>>>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
>>>>> on 1G will be automatically enabled.
>>
>> A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE,
>> we may test pud pfnmaps on arm64.
> 
> Thanks, Kefeng.  It'll be great if this works already, as simple.
> 
> Might be interesting to know whether it works already if you have some
> few-GBs GPU around on the systems.
> 
> Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected
> below, 1g pfnmap will be automatically enabled when you rebuild the kernel.
> You can double check that by looking for this:
> 
>    CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y
> 
> And you can try to observe the mappings by enabling dynamic debug for
> vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read
> something from it.


I don't have such a device, but we wrote a driver which uses
vmf_insert_pfn_pmd/pud in huge_fault,

static const struct vm_operations_struct test_vm_ops = {
         .huge_fault = test_huge_fault,
	...
}

and read/write it after mmap(,2M/1G,test_fd,...); it works as expected.
Since it could be used by dax, let's send it separately.
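
For reference, the interesting part of such a test driver is roughly the
handler below (a sketch only; test_bar_base_pfn is a placeholder for
wherever the test pfns come from, and the vmf_insert_pfn_pmd/pud signatures
are assumed):

  static vm_fault_t test_huge_fault(struct vm_fault *vmf, unsigned int order)
  {
          /* Align the faulting page offset down to the requested order. */
          unsigned long pgoff = vmf->pgoff & ~((1UL << order) - 1);
          pfn_t pfn = __pfn_to_pfn_t(test_bar_base_pfn + pgoff, PFN_DEV);
          bool write = vmf->flags & FAULT_FLAG_WRITE;

          if (order == PMD_ORDER)
                  return vmf_insert_pfn_pmd(vmf, pfn, write);
          if (order == PUD_ORDER)
                  return vmf_insert_pfn_pud(vmf, pfn, write);

          /* Any other order: let the core mm retry with smaller mappings. */
          return VM_FAULT_FALLBACK;
  }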