Message ID: 20240809160909.1023470-1-peterx@redhat.com
Series: mm: Support huge pfnmaps
On 09.08.24 18:08, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1]. Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).

Indeed, skimming most patches, there is very limited core-mm impact. I
expected much more :) I suspect that's primarily because DAX already paved
the way, and because DAX likely already supports fault-after-fork, which is
why the fork() case wasn't relevant before.
On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1]. Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.

FWIW, I've started to hear people talk about needing this in the VFIO
context with VMs.

vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
set up the IOMMU, but KVM is not able to do it so reliably. There is a
notable performance gap with two-dimensional paging between 4k and 1G
entries in the KVM table. The platforms are being architected with the
assumption that 1G TLB entries will be used throughout the hypervisor
environment.

> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.

There is definitely interest here in extending ARM to support the 1G
size too, what is missing?

> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping. This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.

Makes sense.

> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>
> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> on 1G will be automatically enabled.

Oh, that sounds like a bigger step..

> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied. Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar. But I don't
> think I know all the drivers to know the full picture.

How does alignment work? In most cases I'm aware of the userspace does
not use MAP_FIXED, so the expectation would be for the kernel to
automatically select a high alignment. I suppose your cases are working
because qemu uses MAP_FIXED and naturally aligns the BAR addresses?

> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs

Oh :(

Jason
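To make the special-bit trick above concrete: a huge pfnmap PMD/PUD carries a
"special" marker, analogous to pte_special() for 4K pfnmaps, so the lockless
gup-fast walker can tell from the entry alone that no struct page backs it and
punt to the slow path (which then fails the GUP for VM_IO/VM_PFNMAP as usual).
A minimal illustrative sketch of that check, with helper and function names
following the cover letter's description rather than the series' exact code:

static int gup_fast_pmd_leaf_sketch(pmd_t orig, unsigned long addr,
				    unsigned long end, unsigned int flags,
				    struct page **pages, int *nr)
{
	/* Huge pfnmap entry: no vmemmap/struct page behind it, bail out. */
	if (pmd_special(orig))
		return 0;	/* returning 0 forces the gup slow path */

	/* ... normal huge-page refcounting and page accumulation ... */
	return 1;
}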
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > Overview
> > ========
> >
> > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > plus dax 1g fix [1]. Note that this series should also apply if without
> > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > errors otherwise on PUD mappings.
> >
> > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > as large as 8GB or even bigger.
>
> FWIW, I've started to hear people talk about needing this in the VFIO
> context with VMs.
>
> vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> setup the IOMMU, but KVM is not able to do it so reliably.

Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
a huge page unless the mapping is huge in the primary MMU.  And that's very much
by design, as KVM has no knowledge of what actually resides at a given PFN, and
thus can't determine whether or not it's safe to create a huge page if KVM happens
to realize the VM has access to a contiguous range of memory.
On Wed, Aug 14, 2024 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU. And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not it's safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.

Indeed: the EPT is managed as a secondary MMU. It replays the contents of
the primary MMU, apart from A/D bits (which are independent) and permissions
possibly being more restrictive, and that includes the page size. Which in
turn explains why the VA has to be aligned for KVM to pick up the hint:
aligning the VA allows the primary MMU to use a hugepage, which is a
prerequisite for using it in EPT.

Paolo
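To make the alignment requirement concrete, here is a minimal userspace sketch
(not taken from the series; map_bar_aligned() and the sizes are illustrative) of
how a VMM could hand the primary MMU a PUD-aligned VA for a BAR: reserve an
over-sized PROT_NONE region, then MAP_FIXED the BAR at the first 1G-aligned
address inside it.

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PUD_ALIGN	(1UL << 30)	/* 1G; use 2M for PMD-sized mappings */

static void *map_bar_aligned(int device_fd, off_t bar_offset, size_t bar_size)
{
	/* Over-reserve so an aligned address is guaranteed to exist inside. */
	uint8_t *base = mmap(NULL, bar_size + PUD_ALIGN, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return MAP_FAILED;

	uint8_t *aligned = (uint8_t *)(((uintptr_t)base + PUD_ALIGN - 1) &
				       ~(PUD_ALIGN - 1));

	/* MAP_FIXED replaces the reserved PROT_NONE pages at the aligned VA. */
	return mmap(aligned, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, device_fd, bar_offset);
}

Only with an aligned (and suitably sized) VA can the kernel install PUD-level
pfnmap entries, which in turn is what lets KVM use 1G entries in the EPT; this
matches Jason's observation that QEMU happens to work because it maps BARs with
MAP_FIXED at naturally aligned addresses.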
On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > Overview
> > > ========
> > >
> > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > plus dax 1g fix [1]. Note that this series should also apply if without
> > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > errors otherwise on PUD mappings.
> > >
> > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > as large as 8GB or even bigger.
> >
> > FWIW, I've started to hear people talk about needing this in the VFIO
> > context with VMs.
> >
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU. And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not it's safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.

Oh? Someone told me recently x86 kvm had code to reassemble contiguous
ranges?

I don't quite understand your safety argument: if the VMA has 1G of
contiguous physical memory described with 4K, it is definitely safe for
KVM to reassemble that same memory and represent it as 1G.

Jason
+Marc and Oliver

On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > Overview
> > > > ========
> > > >
> > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > plus dax 1g fix [1]. Note that this series should also apply if without
> > > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > > errors otherwise on PUD mappings.
> > > >
> > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > as large as 8GB or even bigger.
> > >
> > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > context with VMs.
> > >
> > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > setup the IOMMU, but KVM is not able to do it so reliably.
> >
> > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > a huge page unless the mapping is huge in the primary MMU. And that's very much
> > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > thus can't determine whether or not it's safe to create a huge page if KVM happens
> > to realize the VM has access to a contiguous range of memory.
>
> Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> ranges?

Nope.  KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
a win in very select use cases, and is overall a non-trivial loss.

> I don't quite understand your safety argument, if the VMA has 1G of
> contiguous physical memory described with 4K it is definitely safe for
> KVM to reassemble that same memory and represent it as 1G.

That would require taking mmap_lock to get the VMA, which would be a net negative,
especially for workloads that are latency sensitive.  E.g. if userspace is doing
mprotect(), madvise(), etc. on a VMA that is NOT mapped into the guest, taking
mmap_lock in the guest page fault path means vCPUs block waiting for the unrelated
host operation to complete.  And vice versa, subsequent host operations can be
blocked waiting on vCPUs.

Which reminds me...  Marc/Oliver,

TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
on arm64, specifically the mprotect() testcase[1], as performance is significantly
worse compared to x86, and there might be bugs lurking in the mmu_notifier flows.

When running mmu_stress_test the mprotect() phase that makes guest memory read-only
takes more than three times as long on arm64 versus x86.  The time to initially
populate memory (run1) is also notably higher on arm64, as is the time to
mprotect() back to RW protections.

The test doesn't go super far out of its way to control the environment, but it
should be a fairly reasonable apples-to-apples comparison.

Ouch.  I take that back, it's not apples-to-apples, because the test does more
work for x86.  On x86, during mprotect(PROT_READ), the userspace side skips the
faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
Other architectures stop running the vCPU after the first write -EFAULT and wait
for the mprotect() to complete.
If I comment out the x86-only logic and have vCPUs stop on the first -EFAULT, the
mprotect() time goes way down.

/me fiddles with arm64

And if I have arm64 vCPUs keep faulting, the time goes up, as expected.

With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:

 x86:
 run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s

 arm64:
 run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786s

and skipping on -EFAULT and thus writing throughout mprotect():

 x86:
 run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s

 arm64:
 run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s

I originally suspected that the main source of difference is that user_mem_abort()
takes mmap_lock for read.  But I doubt that's the case now that I realize arm64
vCPUs stop after the first -EFAULT, i.e. won't contend mmap_lock.  And it shouldn't
be the lack of support for mmu_invalidate_retry_gfn() (on arm64 vs. x86), because
the mprotect() is relevant to guest memory (though range-based retry is something
that KVM ARM likely should support).  And again, arm64 is still much slower when
vCPUs stop on the first -EFAULT.

However, before I realized mmap_lock likely wasn't the main problem, I tried to
prove that taking mmap_lock is problematic, and that didn't end well.  When I
hacked user_mem_abort() to not take mmap_lock, the host reboots when the
mprotect() read-only phase kicks in.  AFAICT, there is no crash, e.g. no kdump
and nothing printed to the console; the host suddenly just starts executing
firmware code.

To try and rule out a hidden dependency I'm missing, I tried trimming out pretty
much everything that runs under mmap_lock (and only running the selftest, which
doesn't do anything "odd", e.g. doesn't passthrough device memory).  I also forced
small pages, e.g. in case transparent_hugepage_adjust() is somehow reliant on
mmap_lock being taken, to no avail.  Just in case someone can spot an obvious flaw
in my hack, the final diff I tried is below.

Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
fault path for valid VMAs does introduce a performance regression, but only ~30%,
not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
is the main problem, though it's still good justification for avoiding mmap_lock
in the page fault path.
[1] https://lore.kernel.org/all/20240809194335.1726916-9-seanjc@google.com --- arch/arm64/kvm/mmu.c | 167 +++---------------------------------------- 1 file changed, 9 insertions(+), 158 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 6981b1bc0946..df551c19f626 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1424,14 +1424,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, bool fault_is_perm) { int ret = 0; - bool write_fault, writable, force_pte = false; - bool exec_fault, mte_allowed; - bool device = false, vfio_allow_any_uc = false; + bool write_fault, writable; + bool exec_fault; unsigned long mmu_seq; phys_addr_t ipa = fault_ipa; struct kvm *kvm = vcpu->kvm; struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; - struct vm_area_struct *vma; short vma_shift; gfn_t gfn; kvm_pfn_t pfn; @@ -1451,6 +1449,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT; } + if (WARN_ON_ONCE(nested) || WARN_ON_ONCE(kvm_has_mte(kvm))) + return -EIO; + /* * Permission faults just need to update the existing leaf entry, * and so normally don't require allocations from the memcache. The @@ -1464,92 +1465,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return ret; } - /* - * Let's check if we will get back a huge page backed by hugetlbfs, or - * get block mapping for device MMIO region. - */ - mmap_read_lock(current->mm); - vma = vma_lookup(current->mm, hva); - if (unlikely(!vma)) { - kvm_err("Failed to find VMA for hva 0x%lx\n", hva); - mmap_read_unlock(current->mm); - return -EFAULT; - } - - /* - * logging_active is guaranteed to never be true for VM_PFNMAP - * memslots. - */ - if (logging_active) { - force_pte = true; - vma_shift = PAGE_SHIFT; - } else { - vma_shift = get_vma_page_shift(vma, hva); - } - - switch (vma_shift) { -#ifndef __PAGETABLE_PMD_FOLDED - case PUD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) - break; - fallthrough; -#endif - case CONT_PMD_SHIFT: - vma_shift = PMD_SHIFT; - fallthrough; - case PMD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) - break; - fallthrough; - case CONT_PTE_SHIFT: - vma_shift = PAGE_SHIFT; - force_pte = true; - fallthrough; - case PAGE_SHIFT: - break; - default: - WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); - } - + vma_shift = PAGE_SHIFT; vma_pagesize = 1UL << vma_shift; - - if (nested) { - unsigned long max_map_size; - - max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE; - - ipa = kvm_s2_trans_output(nested); - - /* - * If we're about to create a shadow stage 2 entry, then we - * can only create a block mapping if the guest stage 2 page - * table uses at least as big a mapping. - */ - max_map_size = min(kvm_s2_trans_size(nested), max_map_size); - - /* - * Be careful that if the mapping size falls between - * two host sizes, take the smallest of the two. 
- */ - if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE) - max_map_size = PMD_SIZE; - else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) - max_map_size = PAGE_SIZE; - - force_pte = (max_map_size == PAGE_SIZE); - vma_pagesize = min(vma_pagesize, (long)max_map_size); - } - - if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE) - fault_ipa &= ~(vma_pagesize - 1); - gfn = ipa >> PAGE_SHIFT; - mte_allowed = kvm_vma_mte_allowed(vma); - - vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; - - /* Don't use the VMA after the unlock -- it may have vanished */ - vma = NULL; /* * Read mmu_invalidate_seq so that KVM can detect if the results of @@ -1560,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * with the smp_wmb() in kvm_mmu_invalidate_end(). */ mmu_seq = vcpu->kvm->mmu_invalidate_seq; - mmap_read_unlock(current->mm); + smp_rmb(); pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, write_fault, &writable, NULL); @@ -1571,19 +1489,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (is_error_noslot_pfn(pfn)) return -EFAULT; - if (kvm_is_device_pfn(pfn)) { - /* - * If the page was identified as device early by looking at - * the VMA flags, vma_pagesize is already representing the - * largest quantity we can map. If instead it was mapped - * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE - * and must not be upgraded. - * - * In both cases, we don't let transparent_hugepage_adjust() - * change things at the last minute. - */ - device = true; - } else if (logging_active && !write_fault) { + if (logging_active && !write_fault) { /* * Only actually map the page as writable if this was a write * fault. @@ -1591,28 +1497,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, writable = false; } - if (exec_fault && device) - return -ENOEXEC; - - /* - * Potentially reduce shadow S2 permissions to match the guest's own - * S2. For exec faults, we'd only reach this point if the guest - * actually allowed it (see kvm_s2_handle_perm_fault). - * - * Also encode the level of the original translation in the SW bits - * of the leaf entry as a proxy for the span of that translation. - * This will be retrieved on TLB invalidation from the guest and - * used to limit the invalidation scope if a TTL hint or a range - * isn't provided. - */ - if (nested) { - writable &= kvm_s2_trans_writable(nested); - if (!kvm_s2_trans_readable(nested)) - prot &= ~KVM_PGTABLE_PROT_R; - - prot |= kvm_encode_nested_level(nested); - } - read_lock(&kvm->mmu_lock); pgt = vcpu->arch.hw_mmu->pgt; if (mmu_invalidate_retry(kvm, mmu_seq)) { @@ -1620,46 +1504,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, goto out_unlock; } - /* - * If we are not forced to use page mapping, check if we are - * backed by a THP and thus use block mapping if possible. 
- */ - if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) { - if (fault_is_perm && fault_granule > PAGE_SIZE) - vma_pagesize = fault_granule; - else - vma_pagesize = transparent_hugepage_adjust(kvm, memslot, - hva, &pfn, - &fault_ipa); - - if (vma_pagesize < 0) { - ret = vma_pagesize; - goto out_unlock; - } - } - - if (!fault_is_perm && !device && kvm_has_mte(kvm)) { - /* Check the VMM hasn't introduced a new disallowed VMA */ - if (mte_allowed) { - sanitise_mte_tags(kvm, pfn, vma_pagesize); - } else { - ret = -EFAULT; - goto out_unlock; - } - } - if (writable) prot |= KVM_PGTABLE_PROT_W; if (exec_fault) prot |= KVM_PGTABLE_PROT_X; - if (device) { - if (vfio_allow_any_uc) - prot |= KVM_PGTABLE_PROT_NORMAL_NC; - else - prot |= KVM_PGTABLE_PROT_DEVICE; - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) && + if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) && (!nested || kvm_s2_trans_executable(nested))) { prot |= KVM_PGTABLE_PROT_X; } base-commit: 15e1c3d65975524c5c792fcd59f7d89f00402261 --
On Wed, Aug 14, 2024, Sean Christopherson wrote: > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > on arm64, specifically the mprotect() testcase[1], as performance is significantly > worse compared to x86, and there might be bugs lurking the mmu_notifier flows. > > When running mmu_stress_test the mprotect() phase that makes guest memory read-only > takes more than three times as long on arm64 versus x86. The time to initially > popuplate memory (run1) is also notably higher on arm64, as is the time to > mprotect() back to RW protections. > > The test doesn't go super far out of its way to control the environment, but it > should be a fairly reasonable apples-to-apples comparison. > > Ouch. I take that back, it's not apples-to-apples, because the test does more > work for x86. On x86, during mprotect(PROT_READ), the userspace side skips the > faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration. > Other architectures stop running the vCPU after the first write -EFAULT and wait > for the mproptect() to complete. If I comment out the x86-only logic and have > vCPUs stop on the first -EFAULT, the mprotect() goes way down. > > /me fiddles with arm64 > > And if I have arm64 vCPUs keep faulting, the time goes up, as exptected. > > With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory), > and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault: > > x86: > run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s > > arm64: > run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786 > > and skipping on -EFAULT and thus writing throughout mprotect(): > > x86: > run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s > > arm64: > run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s Oliver pointed out off-list that the hardware I was using doesn't have forced write-back, and so the overhead on arm64 is likely due to cache maintenance.
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > +Marc and Oliver > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote: > > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote: > > > > > Overview > > > > > ======== > > > > > > > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest, > > > > > plus dax 1g fix [1]. Note that this series should also apply if without > > > > > the dax 1g fix series, but when without it, mprotect() will trigger similar > > > > > errors otherwise on PUD mappings. > > > > > > > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap > > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to > > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now > > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow > > > > > as large as 8GB or even bigger. > > > > > > > > FWIW, I've started to hear people talk about needing this in the VFIO > > > > context with VMs. > > > > > > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to > > > > setup the IOMMU, but KVM is not able to do it so reliably. > > > > > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create > > > a huge page unless the mapping is huge in the primary MMU. And that's very much > > > by design, as KVM has no knowledge of what actually resides at a given PFN, and > > > thus can't determine whether or not its safe to create a huge page if KVM happens > > > to realize the VM has access to a contiguous range of memory. > > > > Oh? Someone told me recently x86 kvm had code to reassemble contiguous > > ranges? > > Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only > a win in very select use cases, and is overall a non-trivial loss. Ah that ARM behavior was probably what was being mentioned then! So take my original remark as applying to this :) > > I don't quite understand your safety argument, if the VMA has 1G of > > contiguous physical memory described with 4K it is definitely safe for > > KVM to reassemble that same memory and represent it as 1G. > > That would require taking mmap_lock to get the VMA, which would be a net negative, > especially for workloads that are latency sensitive. You can aggregate if the read and aggregating logic are protected by mmu notifiers, I think. A invalidation would still have enough information to clear the aggregate shadow entry. If you get a sequence number collision then you'd throw away the aggregation. But yes, I also think it would be slow to have aggregation logic in KVM. Doing in the main mmu is much better. Jason
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > on arm64, specifically the mprotect() testcase[1], as performance is significantly > worse compared to x86, Sharing what we discussed offline: Sean was using a machine w/o FEAT_FWB for this test, so the increased runtime on arm64 is likely explained by the CMOs we're doing when creating or invalidating a stage-2 PTE. Using a machine w/ FEAT_FWB would be better for making these sort of cross-architecture comparisons. Beyond CMOs, we do have some > and there might be bugs lurking the mmu_notifier flows. Impossible! :) > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > fault path for valid VMAs does introduce a performance regression, but only ~30%, > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > is the main problem, though it's still good justification for avoid mmap_lock in > the page fault path. I'm curious how much of that 30% in a microbenchmark would translate to real world performance, since it isn't *that* egregious. We also have other uses for getting at the VMA beyond mapping granularity (MTE and the VFIO Normal-NC hint) that'd require some attention too.
On Wed, Aug 14, 2024 at 07:10:31PM -0300, Jason Gunthorpe wrote: [...] > > Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only > > a win in very select use cases, and is overall a non-trivial loss. > > Ah that ARM behavior was probably what was being mentioned then! So > take my original remark as applying to this :) > > > > I don't quite understand your safety argument, if the VMA has 1G of > > > contiguous physical memory described with 4K it is definitely safe for > > > KVM to reassemble that same memory and represent it as 1G. > > > > That would require taking mmap_lock to get the VMA, which would be a net negative, > > especially for workloads that are latency sensitive. > > You can aggregate if the read and aggregating logic are protected by > mmu notifiers, I think. A invalidation would still have enough > information to clear the aggregate shadow entry. If you get a sequence > number collision then you'd throw away the aggregation. > > But yes, I also think it would be slow to have aggregation logic in > KVM. Doing in the main mmu is much better. +1. For KVM/arm64 I'm quite hesitant to change the behavior to PTE mappings in this situation (i.e. dump get_vma_page_shift()), as I'm quite certain that'll have a performance regression on someone's workload. But once we can derive huge PFNMAP from the primary MMU then we should just normalize on that.
On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote: > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > > on arm64, specifically the mprotect() testcase[1], as performance is significantly > > worse compared to x86, > > Sharing what we discussed offline: > > Sean was using a machine w/o FEAT_FWB for this test, so the increased > runtime on arm64 is likely explained by the CMOs we're doing when > creating or invalidating a stage-2 PTE. > > Using a machine w/ FEAT_FWB would be better for making these sort of > cross-architecture comparisons. Beyond CMOs, we do have some ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table updates are visible to the system. So there could still be some arch-specific quirks that'll show up in the test. > > and there might be bugs lurking the mmu_notifier flows. > > Impossible! :) > > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > > fault path for valid VMAs does introduce a performance regression, but only ~30%, > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > > is the main problem, though it's still good justification for avoid mmap_lock in > > the page fault path. > > I'm curious how much of that 30% in a microbenchmark would translate to > real world performance, since it isn't *that* egregious. We also have > other uses for getting at the VMA beyond mapping granularity (MTE and > the VFIO Normal-NC hint) that'd require some attention too. > > -- > Thanks, > Oliver
On Wed, Aug 14, 2024, Oliver Upton wrote: > On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote: > > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > > > on arm64, specifically the mprotect() testcase[1], as performance is significantly > > > worse compared to x86, > > > > Sharing what we discussed offline: > > > > Sean was using a machine w/o FEAT_FWB for this test, so the increased > > runtime on arm64 is likely explained by the CMOs we're doing when > > creating or invalidating a stage-2 PTE. > > > > Using a machine w/ FEAT_FWB would be better for making these sort of > > cross-architecture comparisons. Beyond CMOs, we do have some > > ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table > updates are visible to the system. So there could still be some > arch-specific quirks that'll show up in the test. Nope, 'twas FWB. On a system with FWB, ARM nicely outperforms x86 on mprotect() when vCPUs stop on the first -EFAULT. I suspect because ARM can do broadcast TLB invalidations and doesn't need to interrupt and wait for every vCPU to respond. run1 = 10.723194154s, reset = 0.000014732s, run2 = 0.013790876s, ro = 2.151261587s, rw = 10.624272116s However, having vCPUs continue faulting while mprotect() is running turns the tables, I suspect due to mmap_lock run1 = 10.768003815s, reset = 0.000012051s, run2 = 0.013781921s, ro = 23.277624455s, rw = 10.649136889s The x86 numbers since they're out of sight now: -EFAULT once run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s -EFAULT forever run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s > > > and there might be bugs lurking the mmu_notifier flows. > > > > Impossible! :) > > > > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > > > fault path for valid VMAs does introduce a performance regression, but only ~30%, > > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > > > is the main problem, though it's still good justification for avoid mmap_lock in > > > the page fault path. > > > > I'm curious how much of that 30% in a microbenchmark would translate to > > real world performance, since it isn't *that* egregious. vCPU jitter is the big problem, especially if userspace is doing something odd, and/or if the kernel is preemptible (which also triggers yeild-on-contention logic for spinlocks, ew). E.g. the range-based retry to avoid spinning and waiting on an unrelated MM operation was added by the ChromeOS folks[1] to resolve issues where an MM operation got preempted and so blocked vCPU faults. But even for cloud setups with a non-preemptible kernel, contending with unrelated userspace VMM modification can be problematic, e.g. it turns out even the gfn_to_pfn_cache logic needs range-based retry[2] (though that's a rather pathological case where userspace is spamming madvise() to the point where vCPUs can't even make forward progress). > > We also have other uses for getting at the VMA beyond mapping granularity > > (MTE and the VFIO Normal-NC hint) that'd require some attention too. Yeah, though it seems like it'd be easy enough to take mmap_lock if and only if it's necessary, e.g. similar to how common KVM takes it only if it encounters VM_PFNMAP'd memory. E.g. 
take mmap_lock if and only if MTE is active (I assume that's uncommon?), or if the fault is to device memory. [1] https://lore.kernel.org/all/20210222024522.1751719-1-stevensd@google.com [2] https://lore.kernel.org/all/f862cefff2ed3f4211b69d785670f41667703cf3.camel@infradead.org
On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: > > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. > > There is definitely interest here in extending ARM to support the 1G > size too, what is missing? Currently PUD pfnmap relies on THP_PUD config option: config ARCH_SUPPORTS_PUD_PFNMAP def_bool y depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. Ideally, pfnmap is too simple comparing to real THPs and it shouldn't require to depend on THP at all, but we'll need things like below to land first: https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com I sent that first a while ago, but I didn't collect enough inputs, and I decided to unblock this series from that, so x86_64 shouldn't be affected, and arm64 will at least start to have 2M. > > > The other trick is how to allow gup-fast working for such huge mappings > > even if there's no direct sign of knowing whether it's a normal page or > > MMIO mapping. This series chose to keep the pte_special solution, so that > > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that > > gup-fast will be able to identify them and fail properly. > > Make sense > > > More architectures / More page sizes > > ------------------------------------ > > > > Currently only x86_64 (2M+1G) and arm64 (2M) are supported. > > > > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap > > on 1G will be automatically enabled. > > Oh that sounds like a bigger step.. Just to mention, no real THP 1G needed here for pfnmaps. The real gap here is only about the pud helpers that only exists so far with CONFIG_THP_PUD in huge_memory.c. > > > VFIO is so far the only consumer for the huge pfnmaps after this series > > applied. Besides above remap_pfn_range() generic optimization, device > > driver can also try to optimize its mmap() on a better VA alignment for > > either PMD/PUD sizes. This may, iiuc, normally require userspace changes, > > as the driver doesn't normally decide the VA to map a bar. But I don't > > think I know all the drivers to know the full picture. > > How does alignment work? In most caes I'm aware of the userspace does > not use MAP_FIXED so the expectation would be for the kernel to > automatically select a high alignment. I suppose your cases are > working because qemu uses MAP_FIXED and naturally aligns the BAR > addresses? > > > - x86_64 + AMD GPU > > - Needs Alex's modified QEMU to guarantee proper VA alignment to make > > sure all pages to be mapped with PUDs > > Oh :( So I suppose this answers above. :) Yes, alignment needed.
On 2024/8/16 3:20, Peter Xu wrote: > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: >>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. >> >> There is definitely interest here in extending ARM to support the 1G >> size too, what is missing? > > Currently PUD pfnmap relies on THP_PUD config option: > > config ARCH_SUPPORTS_PUD_PFNMAP > def_bool y > depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. > > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't > require to depend on THP at all, but we'll need things like below to land > first: > > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com > > I sent that first a while ago, but I didn't collect enough inputs, and I > decided to unblock this series from that, so x86_64 shouldn't be affected, > and arm64 will at least start to have 2M. > >> >>> The other trick is how to allow gup-fast working for such huge mappings >>> even if there's no direct sign of knowing whether it's a normal page or >>> MMIO mapping. This series chose to keep the pte_special solution, so that >>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that >>> gup-fast will be able to identify them and fail properly. >> >> Make sense >> >>> More architectures / More page sizes >>> ------------------------------------ >>> >>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. >>> >>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap >>> on 1G will be automatically enabled. A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, we may test pud pfnmaps on arm64. diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a2f8ff354ca6..ff0d27c72020 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -184,6 +184,7 @@ config ARM64 select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE + select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2 select HAVE_ARCH_VMAP_STACK select HAVE_ARM_SMCCC select HAVE_ASM_MODVERSIONS diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7a4f5604be3f..e013fe458476 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) #define pud_valid(pud) pte_valid(pud_pte(pud)) #define pud_user(pud) pte_user(pud_pte(pud)) #define pud_user_exec(pud) pte_user_exec(pud_pte(pud)) +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) +#define pud_devmap(pud) pte_devmap(pud_pte(pud)) +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) +#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud))) +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) + +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +static inline int pud_trans_huge(pud_t pud) +{ + return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT); +} + +static inline pud_t pud_mkdevmap(pud_t pud) +{ + return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP))); +} +#endif static inline bool pgtable_l4_enabled(void); @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma, pmd_pte(entry), dirty); } +static inline int pudp_set_access_flags(struct vm_area_struct *vma, + unsigned 
long address, pud_t *pudp, + pud_t entry, int dirty) +{ + return __ptep_set_access_flags(vma, address, (pte_t *)pudp, + pud_pte(entry), dirty); +} + +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD static inline int pud_devmap(pud_t pud) { return 0; } +#endif static inline int pgd_devmap(pgd_t pgd) { @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, { return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp); } + +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long address, + pud_t *pudp) +{ + return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp); +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm, @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1) #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) +#define update_mmu_cache_pud(vma, address, pud) do { } while (0) #ifdef CONFIG_ARM64_PA_BITS_52 #define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote: > > > On 2024/8/16 3:20, Peter Xu wrote: > > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: > > > > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. > > > > > > There is definitely interest here in extending ARM to support the 1G > > > size too, what is missing? > > > > Currently PUD pfnmap relies on THP_PUD config option: > > > > config ARCH_SUPPORTS_PUD_PFNMAP > > def_bool y > > depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. > > > > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't > > require to depend on THP at all, but we'll need things like below to land > > first: > > > > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com > > > > I sent that first a while ago, but I didn't collect enough inputs, and I > > decided to unblock this series from that, so x86_64 shouldn't be affected, > > and arm64 will at least start to have 2M. > > > > > > > > > The other trick is how to allow gup-fast working for such huge mappings > > > > even if there's no direct sign of knowing whether it's a normal page or > > > > MMIO mapping. This series chose to keep the pte_special solution, so that > > > > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that > > > > gup-fast will be able to identify them and fail properly. > > > > > > Make sense > > > > > > > More architectures / More page sizes > > > > ------------------------------------ > > > > > > > > Currently only x86_64 (2M+1G) and arm64 (2M) are supported. > > > > > > > > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap > > > > on 1G will be automatically enabled. > > A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, > we may test pud pfnmaps on arm64. Thanks, Kefeng. It'll be great if this works already, as simple. Might be interesting to know whether it works already if you have some few-GBs GPU around on the systems. Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected below, 1g pfnmap will be automatically enabled when you rebuild the kernel. You can double check that by looking for this: CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y And you can try to observe the mappings by enabling dynamic debug for vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read something from it. 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index a2f8ff354ca6..ff0d27c72020 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -184,6 +184,7 @@ config ARM64 > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_TRACEHOOK > select HAVE_ARCH_TRANSPARENT_HUGEPAGE > + select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2 > select HAVE_ARCH_VMAP_STACK > select HAVE_ARM_SMCCC > select HAVE_ASM_MODVERSIONS > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 7a4f5604be3f..e013fe458476 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) > #define pud_valid(pud) pte_valid(pud_pte(pud)) > #define pud_user(pud) pte_user(pud_pte(pud)) > #define pud_user_exec(pud) pte_user_exec(pud_pte(pud)) > +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) > +#define pud_devmap(pud) pte_devmap(pud_pte(pud)) > +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) > +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) > +#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud))) > +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) > +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) > + > +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > +static inline int pud_trans_huge(pud_t pud) > +{ > + return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & > PUD_TABLE_BIT); > +} > + > +static inline pud_t pud_mkdevmap(pud_t pud) > +{ > + return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP))); > +} > +#endif > > static inline bool pgtable_l4_enabled(void); > > @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct > vm_area_struct *vma, > pmd_pte(entry), dirty); > } > > +static inline int pudp_set_access_flags(struct vm_area_struct *vma, > + unsigned long address, pud_t *pudp, > + pud_t entry, int dirty) > +{ > + return __ptep_set_access_flags(vma, address, (pte_t *)pudp, > + pud_pte(entry), dirty); > +} > + > +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > static inline int pud_devmap(pud_t pud) > { > return 0; > } > +#endif > > static inline int pgd_devmap(pgd_t pgd) > { > @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct > vm_area_struct *vma, > { > return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp); > } > + > +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma, > + unsigned long address, > + pud_t *pudp) > +{ > + return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp); > +} > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > static inline pte_t __ptep_get_and_clear(struct mm_struct *mm, > @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct > vm_fault *vmf, > #define update_mmu_cache(vma, addr, ptep) \ > update_mmu_cache_range(NULL, vma, addr, ptep, 1) > #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) > +#define update_mmu_cache_pud(vma, address, pud) do { } while (0) > > #ifdef CONFIG_ARM64_PA_BITS_52 > #define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52) > -- > 2.27.0
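Concretely, the check Peter suggests above could look like the following userspace
sketch: query the BAR0 region geometry, mmap() and read it, then look for the
vfio_pci_mmap_huge_fault() dynamic-debug print in the kernel log.  It assumes the
vfio device fd was already obtained through the usual group/container (or
cdev/iommufd) setup, which is omitted, and touch_bar0() is a hypothetical helper:

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static int touch_bar0(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	volatile uint32_t *bar = mmap(NULL, info.size, PROT_READ | PROT_WRITE,
				      MAP_SHARED, device_fd, info.offset);
	if (bar == MAP_FAILED)
		return -1;

	/* A single read faults the mapping in; check dmesg for the order used. */
	printf("BAR0[0] = 0x%x\n", bar[0]);
	munmap((void *)bar, info.size);
	return 0;
}

Combined with a PUD-aligned and PUD-sized VA (see the earlier mmap sketch), the
read should be served by vfio_pci_mmap_huge_fault() at PMD or PUD order rather
than by 4K faults.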
On 2024/8/16 22:33, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote: >> >> >> On 2024/8/16 3:20, Peter Xu wrote: >>> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: >>>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. >>>> >>>> There is definitely interest here in extending ARM to support the 1G >>>> size too, what is missing? >>> >>> Currently PUD pfnmap relies on THP_PUD config option: >>> >>> config ARCH_SUPPORTS_PUD_PFNMAP >>> def_bool y >>> depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD >>> >>> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. >>> >>> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't >>> require to depend on THP at all, but we'll need things like below to land >>> first: >>> >>> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com >>> >>> I sent that first a while ago, but I didn't collect enough inputs, and I >>> decided to unblock this series from that, so x86_64 shouldn't be affected, >>> and arm64 will at least start to have 2M. >>> >>>> >>>>> The other trick is how to allow gup-fast working for such huge mappings >>>>> even if there's no direct sign of knowing whether it's a normal page or >>>>> MMIO mapping. This series chose to keep the pte_special solution, so that >>>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that >>>>> gup-fast will be able to identify them and fail properly. >>>> >>>> Make sense >>>> >>>>> More architectures / More page sizes >>>>> ------------------------------------ >>>>> >>>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. >>>>> >>>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap >>>>> on 1G will be automatically enabled. >> >> A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, >> we may test pud pfnmaps on arm64. > > Thanks, Kefeng. It'll be great if this works already, as simple. > > Might be interesting to know whether it works already if you have some > few-GBs GPU around on the systems. > > Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected > below, 1g pfnmap will be automatically enabled when you rebuild the kernel. > You can double check that by looking for this: > > CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y > > And you can try to observe the mappings by enabling dynamic debug for > vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read > something from it. I don't have such device, but we write a driver which use vmf_insert_pfn_pmd/pud in huge_fault, static const struct vm_operations_struct test_vm_ops = { .huge_fault = test_huge_fault, ... } and read/write it after mmap(,2M/1G,test_fd,...), it works as expected, since it could be used by dax, let's send it separately.
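For reference, the fault path of such a test driver could look roughly like the
sketch below.  Everything prefixed test_ is hypothetical, the MMIO base is made
up, and the exact vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() signatures (e.g.
whether they take a pfn_t) can differ between kernel versions:

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

static phys_addr_t test_mmio_base;	/* hypothetical MMIO region to expose */

static vm_fault_t test_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	unsigned long pfn = (test_mmio_base >> PAGE_SHIFT) + vmf->pgoff;
	bool write = vmf->flags & FAULT_FLAG_WRITE;

	/*
	 * Round down so the whole PMD/PUD maps a naturally aligned pfn range
	 * (assumes the region is naturally aligned and the VMA starts at
	 * pgoff 0).
	 */
	pfn &= ~((1UL << order) - 1);

	switch (order) {
	case 0:
		return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	case PMD_ORDER:
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), write);
#endif
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
	case PUD_ORDER:
		return vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn, PFN_DEV), write);
#endif
	default:
		return VM_FAULT_FALLBACK;	/* retry at a smaller size */
	}
}

static vm_fault_t test_fault(struct vm_fault *vmf)
{
	return test_huge_fault(vmf, 0);
}

static const struct vm_operations_struct test_vm_ops = {
	.fault		= test_fault,
	.huge_fault	= test_huge_fault,
};

The driver's mmap() would also need to mark the VMA VM_PFNMAP (plus VM_IO etc.),
and, as discussed earlier in the thread, userspace has to mmap() at a 2M/1G
aligned VA for the PMD/PUD paths to be taken at all.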