Message ID | 20240809160909.1023470-1-peterx@redhat.com (mailing list archive)
---|---
Series | mm: Support huge pfnmaps
On 09.08.24 18:08, Peter Xu wrote:
> Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> However that's unnecessary if the vma is stable, and when it's mapped under
> VM_PFNMAP | VM_IO.
>
> Instead of adding similar checks in all the levels for huge pfnmaps, let
> folio_walk_start() fail even earlier for these mappings.  It's also
> something gup-slow already does, so make them match.
>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/pagewalk.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index cd79fb3b89e5..fd3965efe773 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  	p4d_t *p4dp;
>
>  	mmap_assert_locked(vma->vm_mm);
> +
> +	/* It has no folio backing the mappings at all.. */
> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +		return NULL;
> +

That is in general not what we want, and we still have some places that
wrongly hard-code that behavior.

In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.

vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
vm_normal_page_pud()] should be able to identify PFN maps and reject them,
no?
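As an illustration of the alternative David suggests here -- rejecting
raw-PFN entries per page-table level via vm_normal_page() rather than via
VMA flags, so anon folios COWed into a MAP_PRIVATE pfnmap are still found --
a minimal, hypothetical sketch (not the merged mm/pagewalk.c code):

#include <linux/mm.h>

/*
 * Hypothetical sketch only: vm_normal_page() returns NULL for special /
 * raw-PFN entries, so a per-level check rejects PFN maps while still
 * walking anon pages COWed into a MAP_PRIVATE pfnmap.
 */
static struct folio *sketch_walk_pte(struct vm_area_struct *vma,
				     unsigned long addr, pte_t pte)
{
	struct page *page;

	if (!pte_present(pte))
		return NULL;

	page = vm_normal_page(vma, addr, pte);
	if (!page)
		return NULL;	/* special mapping, no struct page behind it */

	return page_folio(page);
}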
On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> > However that's unnecessary if the vma is stable, and when it's mapped under
> > VM_PFNMAP | VM_IO.
> >
> > Instead of adding similar checks in all the levels for huge pfnmaps, let
> > folio_walk_start() fail even earlier for these mappings.  It's also
> > something gup-slow already does, so make them match.
> >
> > Cc: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/pagewalk.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index cd79fb3b89e5..fd3965efe773 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> >  	p4d_t *p4dp;
> >  	mmap_assert_locked(vma->vm_mm);
> > +
> > +	/* It has no folio backing the mappings at all.. */
> > +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > +		return NULL;
> > +
>
> That is in general not what we want, and we still have some places that
> wrongly hard-code that behavior.
>
> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>
> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> no?

Yep, I think we can also rely on the special bit.

When I was working on this whole series I must confess I was already
confused about the real users of MAP_PRIVATE pfnmaps.  E.g. we probably
don't need pfnmap support in mprotect/fork/... at least for our use case,
and MAP_PRIVATE pfnmaps are one step further still.

Here I chose to follow gup-slow, and I suppose you meant that's also wrong?

If so, would it make sense to keep them aligned for now, and change them
all together later?  Or do you think we should just rely on the special
bits?

And, just curious: is there any use case you're aware of that can benefit
from caring about PRIVATE pfnmaps so far, especially in this path?

As far as I read, none of folio_walk_start()'s users so far should even
stumble on top of a pfnmap, shared or private.  But that's from a fairly
quick glimpse only.  IOW, I was wondering whether I'm just being
overcautious here.

Thanks,
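For reference, the gup-slow behavior Peter refers to is a VMA-level flag
check; roughly the following (a paraphrased sketch of the check in
mm/gup.c's check_vma_flags(), not an exact copy of any kernel version):

#include <linux/mm.h>

/*
 * Paraphrased sketch: gup-slow refuses the whole VMA based on its flags,
 * regardless of whether individual PTEs map real folios.
 */
static int sketch_check_vma_flags(struct vm_area_struct *vma)
{
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;

	return 0;
}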
On 09.08.24 18:54, Peter Xu wrote: > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote: >> On 09.08.24 18:08, Peter Xu wrote: >>> Pfnmaps can always be identified with special bits in the ptes/pmds/puds. >>> However that's unnecessary if the vma is stable, and when it's mapped under >>> VM_PFNMAP | VM_IO. >>> >>> Instead of adding similar checks in all the levels for huge pfnmaps, let >>> folio_walk_start() fail even earlier for these mappings. It's also >>> something gup-slow already does, so make them match. >>> >>> Cc: David Hildenbrand <david@redhat.com> >>> Signed-off-by: Peter Xu <peterx@redhat.com> >>> --- >>> mm/pagewalk.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c >>> index cd79fb3b89e5..fd3965efe773 100644 >>> --- a/mm/pagewalk.c >>> +++ b/mm/pagewalk.c >>> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, >>> p4d_t *p4dp; >>> mmap_assert_locked(vma->vm_mm); >>> + >>> + /* It has no folio backing the mappings at all.. */ >>> + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) >>> + return NULL; >>> + >> >> That is in general not what we want, and we still have some places that >> wrongly hard-code that behavior. >> >> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >> >> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >> no? > > Yep, I think we can also rely on special bit. > > When I was working on this whole series I must confess I am already > confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't > need either PFNMAP for either mprotect/fork/... at least for our use case, > then VM_PRIVATE is even one step further. Yes, it's rather a corner case indeed. > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? I assume just nobody really noticed, just like nobody noticed that walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). Your process memory stats will likely miss anon folios on COW PFNMAP mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem). > If so, would it make sense we keep them aligned as of now, and change them > altogether? Or do you think we should just rely on the special bits? GUP already refuses to work on a lot of other stuff, so likely not a good use of time unless somebody complains. But yes, long-term we should make all code either respect that it could happen (and bury less awkward checks in page table walkers) or rip support for MAP_PRIVATE PFNMAP out completely. > > And, just curious: is there any use case you're aware of that can benefit > from caring PRIVATE pfnmaps yet so far, especially in this path? In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. There was a discussion (in VM_PAT) some time ago whether we could remove MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW mappings on /dev/mem, although not many (and they might not actually write to these areas). I'm happy if someone wants to try ripping that out, I'm not brave enough :) [1] https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com > > As far as I read, none of folio_walk_start() users so far should even > stumble on top of a pfnmap, share or private. But that's a fairly quick > glimps only. do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you pass "nodes=NULL" to move_pages(). 
Maybe s390x could be tricked into it, but likely as you say, most code shouldn't trigger it. The function itself should be handling it correctly as of today, though.
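For context, the do_pages_stat() path mentioned above is reachable from
userspace through move_pages(2) with a NULL nodes array, which only queries
placement; a small sketch using libnuma's wrapper (illustrative error
handling, link with -lnuma):

#include <numaif.h>
#include <stdio.h>

/*
 * Query which node backs one page of an existing mapping.  With
 * nodes == NULL, move_pages() does not migrate anything and ends up in
 * the kernel's do_pages_stat() path discussed above.
 */
static int query_node(void *addr)
{
	void *pages[1] = { addr };
	int status[1] = { -1 };

	if (move_pages(0 /* current process */, 1, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return -1;
	}
	printf("page at %p: status/node %d\n", addr, status[0]);
	return status[0];	/* node id, or a negative errno such as -EFAULT */
}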
On 09.08.24 18:08, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1].  Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).

Indeed, skimming most patches, there is very limited core-mm impact.  I
expected much more :)  I suspect that's primarily because DAX already paved
the way.  And DAX likely supports fault-after-fork, which is why the fork()
case wasn't relevant before.
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > On 09.08.24 18:54, Peter Xu wrote: > > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote: > > > On 09.08.24 18:08, Peter Xu wrote: > > > > Pfnmaps can always be identified with special bits in the ptes/pmds/puds. > > > > However that's unnecessary if the vma is stable, and when it's mapped under > > > > VM_PFNMAP | VM_IO. > > > > > > > > Instead of adding similar checks in all the levels for huge pfnmaps, let > > > > folio_walk_start() fail even earlier for these mappings. It's also > > > > something gup-slow already does, so make them match. > > > > > > > > Cc: David Hildenbrand <david@redhat.com> > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > --- > > > > mm/pagewalk.c | 5 +++++ > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > > > index cd79fb3b89e5..fd3965efe773 100644 > > > > --- a/mm/pagewalk.c > > > > +++ b/mm/pagewalk.c > > > > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, > > > > p4d_t *p4dp; > > > > mmap_assert_locked(vma->vm_mm); > > > > + > > > > + /* It has no folio backing the mappings at all.. */ > > > > + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) > > > > + return NULL; > > > > + > > > > > > That is in general not what we want, and we still have some places that > > > wrongly hard-code that behavior. > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > no? > > > > Yep, I think we can also rely on special bit. > > > > When I was working on this whole series I must confess I am already > > confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't > > need either PFNMAP for either mprotect/fork/... at least for our use case, > > then VM_PRIVATE is even one step further. > > Yes, it's rather a corner case indeed. > > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? > > I assume just nobody really noticed, just like nobody noticed that > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). I noticed it, and that's one of the reasons why this series can be small, as walk page callers are intact. > > Your process memory stats will likely miss anon folios on COW PFNMAP > mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem). Do you mean /proc/$PID/status? I thought that (aka, mm counters) should be fine with anon pages CoWed on top of private pfnmaps, but possibly I misunderstood what you meant. > > > If so, would it make sense we keep them aligned as of now, and change them > > altogether? Or do you think we should just rely on the special bits? > > GUP already refuses to work on a lot of other stuff, so likely not a good > use of time unless somebody complains. > > But yes, long-term we should make all code either respect that it could > happen (and bury less awkward checks in page table walkers) or rip support > for MAP_PRIVATE PFNMAP out completely. > > > > > And, just curious: is there any use case you're aware of that can benefit > > from caring PRIVATE pfnmaps yet so far, especially in this path? > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. > > There was a discussion (in VM_PAT) some time ago whether we could remove > MAP_PRIVATE PFNMAPs completely [1]. 
At least some users still use COW > mappings on /dev/mem, although not many (and they might not actually write > to these areas). Ah, looks like the private map on /dev/mem is the only thing we know. > > I'm happy if someone wants to try ripping that out, I'm not brave enough :) > > [1] > https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com > > > > > As far as I read, none of folio_walk_start() users so far should even > > stumble on top of a pfnmap, share or private. But that's a fairly quick > > glimps only. > > do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you > pass "nodes=NULL" to move_pages(). .. so assume this is also about private mapping over /dev/mem, then: someone tries to write some pages there to some MMIO regions, then tries to use move_pages() to fetch which node those pages locate? Hmm.. OK :) > > Maybe s390x could be tricked into it, but likely as you say, most code > shouldn't trigger it. The function itself should be handling it correctly as > of today, though. So indeed I cannot justify it won't be used, and it's not a huge deal indeed if we stick with special bits. Let me go with that in the next version for folio_walk_start(). Thanks,
On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1].  Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.

FWIW, I've started to hear people talk about needing this in the VFIO
context with VMs.

vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
set up the IOMMU, but KVM is not able to do it so reliably.

There is a notable performance gap with two-dimensional paging between 4k
and 1G entries in the KVM table.  The platforms are being architected with
the assumption that 1G TLB entries will be used throughout the hypervisor
environment.

> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.

There is definitely interest here in extending ARM to support the 1G size
too, what is missing?

> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping.  This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.

Makes sense.

> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>
> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> on 1G will be automatically enabled.

Oh, that sounds like a bigger step..

> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied.  Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar.  But I don't
> think I know all the drivers to know the full picture.

How does alignment work?  In most cases I'm aware of, userspace does not
use MAP_FIXED, so the expectation would be for the kernel to automatically
select a high alignment.  I suppose your cases are working because qemu
uses MAP_FIXED and naturally aligns the BAR addresses?

> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs

Oh :(

Jason
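For illustration, one way userspace could obtain a PUD-aligned VA for a BAR
without depending on the kernel's default placement is to over-reserve
address space and map into an aligned window inside the reservation; a
hedged sketch, not QEMU's or VFIO's actual code:

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PUD_ALIGN	(1UL << 30)	/* 1 GiB, illustrative */

/*
 * Hedged sketch: reserve extra VA with PROT_NONE, then MAP_FIXED the BAR
 * into the 1G-aligned window inside our own reservation.  Trimming the
 * unused slack around the window is omitted for brevity.
 */
static void *map_bar_pud_aligned(int device_fd, off_t bar_offset,
				 size_t bar_size)
{
	size_t reserve = bar_size + PUD_ALIGN;
	void *base = mmap(NULL, reserve, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t aligned = ((uintptr_t)base + PUD_ALIGN - 1) & ~(PUD_ALIGN - 1);

	/* MAP_FIXED inside our own reservation replaces the placeholder. */
	return mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, device_fd, bar_offset);
}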
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > > That is in general not what we want, and we still have some places that
> > > wrongly hard-code that behavior.
> > >
> > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > >
> > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > no?
> >
> > Yep, I think we can also rely on special bit.

It is more than just relying on the special bit..

VM_PFNMAP/VM_MIXEDMAP should really only be used inside vm_normal_page()
because they are, effectively, support for a limited emulation of the
special bit on arches that don't have it.  There are a bunch of weird rules
that are used to try and make that work properly that have to be followed.

On arches with the special bit they should possibly never be checked, since
the special bit does everything you need.

Arguably any place reading those flags outside of vm_normal_page()/etc is
suspect.

> > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
>
> I assume just nobody really noticed, just like nobody noticed that
> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).

Like here..

> > And, just curious: is there any use case you're aware of that can benefit
> > from caring PRIVATE pfnmaps yet so far, especially in this path?
>
> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
>
> There was a discussion (in VM_PAT) some time ago whether we could remove
> MAP_PRIVATE PFNMAPs completely [1].  At least some users still use COW
> mappings on /dev/mem, although not many (and they might not actually write
> to these areas).

I've squashed many bugs where kernel drivers don't demand that userspace
use MAP_SHARED when asking for a PFNMAP, and of course userspace has gained
the wrong flags too.  I don't know if anyone needs this, but it has crept
wrongly into the API.

Maybe an interesting place to start is a warning printk about using an
obsolete feature, and see where things go from there??

Jason
On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote: > > Overview > > ======== > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest, > > plus dax 1g fix [1]. Note that this series should also apply if without > > the dax 1g fix series, but when without it, mprotect() will trigger similar > > errors otherwise on PUD mappings. > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow > > as large as 8GB or even bigger. > > FWIW, I've started to hear people talk about needing this in the VFIO > context with VMs. > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to > setup the IOMMU, but KVM is not able to do it so reliably. Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create a huge page unless the mapping is huge in the primary MMU. And that's very much by design, as KVM has no knowledge of what actually resides at a given PFN, and thus can't determine whether or not its safe to create a huge page if KVM happens to realize the VM has access to a contiguous range of memory.
On Wed, Aug 14, 2024 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU.  And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not its safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.

Indeed: the EPT is managed as a secondary MMU.  It replays the contents of
the primary MMU, apart from A/D bits (which are independent) and
permissions possibly being more restrictive, and that includes the page
size.

Which in turn explains why the VA has to be aligned for KVM to pick up the
hint: aligning the VA allows the primary MMU to use a hugepage, which is a
prerequisite for using it in EPT.

Paolo
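In other words, a secondary MMU can only reuse a huge mapping when the
guest-physical and host-virtual addresses are congruent modulo the huge
page size -- roughly the following condition (an illustrative check, not
KVM's exact code):

#include <linux/types.h>

/*
 * Illustrative only: hva and gpa must share the same offset within the
 * huge page for a PMD/PUD-level stage-2 mapping to be possible.
 */
static bool can_map_huge(unsigned long hva, unsigned long gpa,
			 unsigned long huge_size)
{
	return ((hva ^ gpa) & (huge_size - 1)) == 0;
}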
On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote: > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote: > > > Overview > > > ======== > > > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest, > > > plus dax 1g fix [1]. Note that this series should also apply if without > > > the dax 1g fix series, but when without it, mprotect() will trigger similar > > > errors otherwise on PUD mappings. > > > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow > > > as large as 8GB or even bigger. > > > > FWIW, I've started to hear people talk about needing this in the VFIO > > context with VMs. > > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to > > setup the IOMMU, but KVM is not able to do it so reliably. > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create > a huge page unless the mapping is huge in the primary MMU. And that's very much > by design, as KVM has no knowledge of what actually resides at a given PFN, and > thus can't determine whether or not its safe to create a huge page if KVM happens > to realize the VM has access to a contiguous range of memory. Oh? Someone told me recently x86 kvm had code to reassemble contiguous ranges? I don't quite understand your safety argument, if the VMA has 1G of contiguous physical memory described with 4K it is definitely safe for KVM to reassemble that same memory and represent it as 1G. Jason
+Marc and Oliver On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote: > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote: > > > > Overview > > > > ======== > > > > > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest, > > > > plus dax 1g fix [1]. Note that this series should also apply if without > > > > the dax 1g fix series, but when without it, mprotect() will trigger similar > > > > errors otherwise on PUD mappings. > > > > > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow > > > > as large as 8GB or even bigger. > > > > > > FWIW, I've started to hear people talk about needing this in the VFIO > > > context with VMs. > > > > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to > > > setup the IOMMU, but KVM is not able to do it so reliably. > > > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create > > a huge page unless the mapping is huge in the primary MMU. And that's very much > > by design, as KVM has no knowledge of what actually resides at a given PFN, and > > thus can't determine whether or not its safe to create a huge page if KVM happens > > to realize the VM has access to a contiguous range of memory. > > Oh? Someone told me recently x86 kvm had code to reassemble contiguous > ranges? Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only a win in very select use cases, and is overall a non-trivial loss. > I don't quite understand your safety argument, if the VMA has 1G of > contiguous physical memory described with 4K it is definitely safe for > KVM to reassemble that same memory and represent it as 1G. That would require taking mmap_lock to get the VMA, which would be a net negative, especially for workloads that are latency sensitive. E.g. if userspace is doing mprotect(), madvise(), etc. on VMA that is NOT mapped into the guest, taking mmap_lock in the guest page fault path means vCPUs block waiting for the unrelated host operation to complete. And vise versa, subsequent host operations can be blocked waiting on vCPUs. Which reminds me... Marc/Oliver, TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) on arm64, specifically the mprotect() testcase[1], as performance is significantly worse compared to x86, and there might be bugs lurking the mmu_notifier flows. When running mmu_stress_test the mprotect() phase that makes guest memory read-only takes more than three times as long on arm64 versus x86. The time to initially popuplate memory (run1) is also notably higher on arm64, as is the time to mprotect() back to RW protections. The test doesn't go super far out of its way to control the environment, but it should be a fairly reasonable apples-to-apples comparison. Ouch. I take that back, it's not apples-to-apples, because the test does more work for x86. On x86, during mprotect(PROT_READ), the userspace side skips the faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration. Other architectures stop running the vCPU after the first write -EFAULT and wait for the mproptect() to complete. 
If I comment out the x86-only logic and have vCPUs stop on the first -EFAULT, the mprotect() goes way down. /me fiddles with arm64 And if I have arm64 vCPUs keep faulting, the time goes up, as exptected. With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory), and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault: x86: run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s arm64: run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786 and skipping on -EFAULT and thus writing throughout mprotect(): x86: run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s arm64: run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s I originally suspected at the main source of difference is that user_mem_abort() takes mmap_lock for read. But I doubt that's the case now that I realize arm64 vCPUs stop after the first -EFAULT, i.e. won't contend mmap_lock. And it's shouldn't be the lack of support for mmu_invalidate_retry_gfn() (on arm64 vs. x86), because the mprotect() is relevant to guest memory (though range-based retry is something that KVM ARM likely should support). And again, arm64 is still much slower when vCPUs stop on the first -EFAULT. However, before I realized mmap_lock likely wasn't the main problem, I tried to prove that taking mmap_lock is problematic, and that didn't end well. When I hacked user_mem_abort() to not take mmap_lock, the host reboots when the mprotect() read-only phase kicks in. AFAICT, there is no crash, e.g. no kdump and nothing printed to the console, the host suddenly just starts executing firmware code. To try and rule out a hidden dependency I'm missing, I tried trimming out pretty much everything that runs under mmap_lock (and only running the selftest, which doesn't do anything "odd", e.g. doesn't passthrough device memory). I also forced small pages, e.g. in case transparent_hugepage_adjust() is somehow reliant on mmap_lock being taken, to no avail. Just in case someone can spot an obvious flaw in my hack, the final diff I tried is below. Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page fault path for valid VMAs does introduce a performance regression, but only ~30%, not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock is the main problem, though it's still good justification for avoid mmap_lock in the page fault path. 
[1] https://lore.kernel.org/all/20240809194335.1726916-9-seanjc@google.com --- arch/arm64/kvm/mmu.c | 167 +++---------------------------------------- 1 file changed, 9 insertions(+), 158 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 6981b1bc0946..df551c19f626 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1424,14 +1424,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, bool fault_is_perm) { int ret = 0; - bool write_fault, writable, force_pte = false; - bool exec_fault, mte_allowed; - bool device = false, vfio_allow_any_uc = false; + bool write_fault, writable; + bool exec_fault; unsigned long mmu_seq; phys_addr_t ipa = fault_ipa; struct kvm *kvm = vcpu->kvm; struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; - struct vm_area_struct *vma; short vma_shift; gfn_t gfn; kvm_pfn_t pfn; @@ -1451,6 +1449,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT; } + if (WARN_ON_ONCE(nested) || WARN_ON_ONCE(kvm_has_mte(kvm))) + return -EIO; + /* * Permission faults just need to update the existing leaf entry, * and so normally don't require allocations from the memcache. The @@ -1464,92 +1465,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return ret; } - /* - * Let's check if we will get back a huge page backed by hugetlbfs, or - * get block mapping for device MMIO region. - */ - mmap_read_lock(current->mm); - vma = vma_lookup(current->mm, hva); - if (unlikely(!vma)) { - kvm_err("Failed to find VMA for hva 0x%lx\n", hva); - mmap_read_unlock(current->mm); - return -EFAULT; - } - - /* - * logging_active is guaranteed to never be true for VM_PFNMAP - * memslots. - */ - if (logging_active) { - force_pte = true; - vma_shift = PAGE_SHIFT; - } else { - vma_shift = get_vma_page_shift(vma, hva); - } - - switch (vma_shift) { -#ifndef __PAGETABLE_PMD_FOLDED - case PUD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) - break; - fallthrough; -#endif - case CONT_PMD_SHIFT: - vma_shift = PMD_SHIFT; - fallthrough; - case PMD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) - break; - fallthrough; - case CONT_PTE_SHIFT: - vma_shift = PAGE_SHIFT; - force_pte = true; - fallthrough; - case PAGE_SHIFT: - break; - default: - WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); - } - + vma_shift = PAGE_SHIFT; vma_pagesize = 1UL << vma_shift; - - if (nested) { - unsigned long max_map_size; - - max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE; - - ipa = kvm_s2_trans_output(nested); - - /* - * If we're about to create a shadow stage 2 entry, then we - * can only create a block mapping if the guest stage 2 page - * table uses at least as big a mapping. - */ - max_map_size = min(kvm_s2_trans_size(nested), max_map_size); - - /* - * Be careful that if the mapping size falls between - * two host sizes, take the smallest of the two. 
- */ - if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE) - max_map_size = PMD_SIZE; - else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) - max_map_size = PAGE_SIZE; - - force_pte = (max_map_size == PAGE_SIZE); - vma_pagesize = min(vma_pagesize, (long)max_map_size); - } - - if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE) - fault_ipa &= ~(vma_pagesize - 1); - gfn = ipa >> PAGE_SHIFT; - mte_allowed = kvm_vma_mte_allowed(vma); - - vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; - - /* Don't use the VMA after the unlock -- it may have vanished */ - vma = NULL; /* * Read mmu_invalidate_seq so that KVM can detect if the results of @@ -1560,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * with the smp_wmb() in kvm_mmu_invalidate_end(). */ mmu_seq = vcpu->kvm->mmu_invalidate_seq; - mmap_read_unlock(current->mm); + smp_rmb(); pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, write_fault, &writable, NULL); @@ -1571,19 +1489,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (is_error_noslot_pfn(pfn)) return -EFAULT; - if (kvm_is_device_pfn(pfn)) { - /* - * If the page was identified as device early by looking at - * the VMA flags, vma_pagesize is already representing the - * largest quantity we can map. If instead it was mapped - * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE - * and must not be upgraded. - * - * In both cases, we don't let transparent_hugepage_adjust() - * change things at the last minute. - */ - device = true; - } else if (logging_active && !write_fault) { + if (logging_active && !write_fault) { /* * Only actually map the page as writable if this was a write * fault. @@ -1591,28 +1497,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, writable = false; } - if (exec_fault && device) - return -ENOEXEC; - - /* - * Potentially reduce shadow S2 permissions to match the guest's own - * S2. For exec faults, we'd only reach this point if the guest - * actually allowed it (see kvm_s2_handle_perm_fault). - * - * Also encode the level of the original translation in the SW bits - * of the leaf entry as a proxy for the span of that translation. - * This will be retrieved on TLB invalidation from the guest and - * used to limit the invalidation scope if a TTL hint or a range - * isn't provided. - */ - if (nested) { - writable &= kvm_s2_trans_writable(nested); - if (!kvm_s2_trans_readable(nested)) - prot &= ~KVM_PGTABLE_PROT_R; - - prot |= kvm_encode_nested_level(nested); - } - read_lock(&kvm->mmu_lock); pgt = vcpu->arch.hw_mmu->pgt; if (mmu_invalidate_retry(kvm, mmu_seq)) { @@ -1620,46 +1504,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, goto out_unlock; } - /* - * If we are not forced to use page mapping, check if we are - * backed by a THP and thus use block mapping if possible. 
- */ - if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) { - if (fault_is_perm && fault_granule > PAGE_SIZE) - vma_pagesize = fault_granule; - else - vma_pagesize = transparent_hugepage_adjust(kvm, memslot, - hva, &pfn, - &fault_ipa); - - if (vma_pagesize < 0) { - ret = vma_pagesize; - goto out_unlock; - } - } - - if (!fault_is_perm && !device && kvm_has_mte(kvm)) { - /* Check the VMM hasn't introduced a new disallowed VMA */ - if (mte_allowed) { - sanitise_mte_tags(kvm, pfn, vma_pagesize); - } else { - ret = -EFAULT; - goto out_unlock; - } - } - if (writable) prot |= KVM_PGTABLE_PROT_W; if (exec_fault) prot |= KVM_PGTABLE_PROT_X; - if (device) { - if (vfio_allow_any_uc) - prot |= KVM_PGTABLE_PROT_NORMAL_NC; - else - prot |= KVM_PGTABLE_PROT_DEVICE; - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) && + if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) && (!nested || kvm_s2_trans_executable(nested))) { prot |= KVM_PGTABLE_PROT_X; } base-commit: 15e1c3d65975524c5c792fcd59f7d89f00402261 --
On Wed, Aug 14, 2024, Sean Christopherson wrote: > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > on arm64, specifically the mprotect() testcase[1], as performance is significantly > worse compared to x86, and there might be bugs lurking the mmu_notifier flows. > > When running mmu_stress_test the mprotect() phase that makes guest memory read-only > takes more than three times as long on arm64 versus x86. The time to initially > popuplate memory (run1) is also notably higher on arm64, as is the time to > mprotect() back to RW protections. > > The test doesn't go super far out of its way to control the environment, but it > should be a fairly reasonable apples-to-apples comparison. > > Ouch. I take that back, it's not apples-to-apples, because the test does more > work for x86. On x86, during mprotect(PROT_READ), the userspace side skips the > faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration. > Other architectures stop running the vCPU after the first write -EFAULT and wait > for the mproptect() to complete. If I comment out the x86-only logic and have > vCPUs stop on the first -EFAULT, the mprotect() goes way down. > > /me fiddles with arm64 > > And if I have arm64 vCPUs keep faulting, the time goes up, as exptected. > > With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory), > and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault: > > x86: > run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s > > arm64: > run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786 > > and skipping on -EFAULT and thus writing throughout mprotect(): > > x86: > run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s > > arm64: > run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s Oliver pointed out off-list that the hardware I was using doesn't have forced write-back, and so the overhead on arm64 is likely due to cache maintenance.
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > +Marc and Oliver > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote: > > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote: > > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote: > > > > > Overview > > > > > ======== > > > > > > > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest, > > > > > plus dax 1g fix [1]. Note that this series should also apply if without > > > > > the dax 1g fix series, but when without it, mprotect() will trigger similar > > > > > errors otherwise on PUD mappings. > > > > > > > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap > > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to > > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now > > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow > > > > > as large as 8GB or even bigger. > > > > > > > > FWIW, I've started to hear people talk about needing this in the VFIO > > > > context with VMs. > > > > > > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to > > > > setup the IOMMU, but KVM is not able to do it so reliably. > > > > > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create > > > a huge page unless the mapping is huge in the primary MMU. And that's very much > > > by design, as KVM has no knowledge of what actually resides at a given PFN, and > > > thus can't determine whether or not its safe to create a huge page if KVM happens > > > to realize the VM has access to a contiguous range of memory. > > > > Oh? Someone told me recently x86 kvm had code to reassemble contiguous > > ranges? > > Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only > a win in very select use cases, and is overall a non-trivial loss. Ah that ARM behavior was probably what was being mentioned then! So take my original remark as applying to this :) > > I don't quite understand your safety argument, if the VMA has 1G of > > contiguous physical memory described with 4K it is definitely safe for > > KVM to reassemble that same memory and represent it as 1G. > > That would require taking mmap_lock to get the VMA, which would be a net negative, > especially for workloads that are latency sensitive. You can aggregate if the read and aggregating logic are protected by mmu notifiers, I think. A invalidation would still have enough information to clear the aggregate shadow entry. If you get a sequence number collision then you'd throw away the aggregation. But yes, I also think it would be slow to have aggregation logic in KVM. Doing in the main mmu is much better. Jason
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > on arm64, specifically the mprotect() testcase[1], as performance is significantly > worse compared to x86, Sharing what we discussed offline: Sean was using a machine w/o FEAT_FWB for this test, so the increased runtime on arm64 is likely explained by the CMOs we're doing when creating or invalidating a stage-2 PTE. Using a machine w/ FEAT_FWB would be better for making these sort of cross-architecture comparisons. Beyond CMOs, we do have some > and there might be bugs lurking the mmu_notifier flows. Impossible! :) > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > fault path for valid VMAs does introduce a performance regression, but only ~30%, > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > is the main problem, though it's still good justification for avoid mmap_lock in > the page fault path. I'm curious how much of that 30% in a microbenchmark would translate to real world performance, since it isn't *that* egregious. We also have other uses for getting at the VMA beyond mapping granularity (MTE and the VFIO Normal-NC hint) that'd require some attention too.
On Wed, Aug 14, 2024 at 07:10:31PM -0300, Jason Gunthorpe wrote: [...] > > Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only > > a win in very select use cases, and is overall a non-trivial loss. > > Ah that ARM behavior was probably what was being mentioned then! So > take my original remark as applying to this :) > > > > I don't quite understand your safety argument, if the VMA has 1G of > > > contiguous physical memory described with 4K it is definitely safe for > > > KVM to reassemble that same memory and represent it as 1G. > > > > That would require taking mmap_lock to get the VMA, which would be a net negative, > > especially for workloads that are latency sensitive. > > You can aggregate if the read and aggregating logic are protected by > mmu notifiers, I think. A invalidation would still have enough > information to clear the aggregate shadow entry. If you get a sequence > number collision then you'd throw away the aggregation. > > But yes, I also think it would be slow to have aggregation logic in > KVM. Doing in the main mmu is much better. +1. For KVM/arm64 I'm quite hesitant to change the behavior to PTE mappings in this situation (i.e. dump get_vma_page_shift()), as I'm quite certain that'll have a performance regression on someone's workload. But once we can derive huge PFNMAP from the primary MMU then we should just normalize on that.
On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote: > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > > on arm64, specifically the mprotect() testcase[1], as performance is significantly > > worse compared to x86, > > Sharing what we discussed offline: > > Sean was using a machine w/o FEAT_FWB for this test, so the increased > runtime on arm64 is likely explained by the CMOs we're doing when > creating or invalidating a stage-2 PTE. > > Using a machine w/ FEAT_FWB would be better for making these sort of > cross-architecture comparisons. Beyond CMOs, we do have some ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table updates are visible to the system. So there could still be some arch-specific quirks that'll show up in the test. > > and there might be bugs lurking the mmu_notifier flows. > > Impossible! :) > > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > > fault path for valid VMAs does introduce a performance regression, but only ~30%, > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > > is the main problem, though it's still good justification for avoid mmap_lock in > > the page fault path. > > I'm curious how much of that 30% in a microbenchmark would translate to > real world performance, since it isn't *that* egregious. We also have > other uses for getting at the VMA beyond mapping granularity (MTE and > the VFIO Normal-NC hint) that'd require some attention too. > > -- > Thanks, > Oliver
On Wed, Aug 14, 2024, Oliver Upton wrote: > On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote: > > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote: > > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test) > > > on arm64, specifically the mprotect() testcase[1], as performance is significantly > > > worse compared to x86, > > > > Sharing what we discussed offline: > > > > Sean was using a machine w/o FEAT_FWB for this test, so the increased > > runtime on arm64 is likely explained by the CMOs we're doing when > > creating or invalidating a stage-2 PTE. > > > > Using a machine w/ FEAT_FWB would be better for making these sort of > > cross-architecture comparisons. Beyond CMOs, we do have some > > ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table > updates are visible to the system. So there could still be some > arch-specific quirks that'll show up in the test. Nope, 'twas FWB. On a system with FWB, ARM nicely outperforms x86 on mprotect() when vCPUs stop on the first -EFAULT. I suspect because ARM can do broadcast TLB invalidations and doesn't need to interrupt and wait for every vCPU to respond. run1 = 10.723194154s, reset = 0.000014732s, run2 = 0.013790876s, ro = 2.151261587s, rw = 10.624272116s However, having vCPUs continue faulting while mprotect() is running turns the tables, I suspect due to mmap_lock run1 = 10.768003815s, reset = 0.000012051s, run2 = 0.013781921s, ro = 23.277624455s, rw = 10.649136889s The x86 numbers since they're out of sight now: -EFAULT once run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s -EFAULT forever run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s > > > and there might be bugs lurking the mmu_notifier flows. > > > > Impossible! :) > > > > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page > > > fault path for valid VMAs does introduce a performance regression, but only ~30%, > > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock > > > is the main problem, though it's still good justification for avoid mmap_lock in > > > the page fault path. > > > > I'm curious how much of that 30% in a microbenchmark would translate to > > real world performance, since it isn't *that* egregious. vCPU jitter is the big problem, especially if userspace is doing something odd, and/or if the kernel is preemptible (which also triggers yeild-on-contention logic for spinlocks, ew). E.g. the range-based retry to avoid spinning and waiting on an unrelated MM operation was added by the ChromeOS folks[1] to resolve issues where an MM operation got preempted and so blocked vCPU faults. But even for cloud setups with a non-preemptible kernel, contending with unrelated userspace VMM modification can be problematic, e.g. it turns out even the gfn_to_pfn_cache logic needs range-based retry[2] (though that's a rather pathological case where userspace is spamming madvise() to the point where vCPUs can't even make forward progress). > > We also have other uses for getting at the VMA beyond mapping granularity > > (MTE and the VFIO Normal-NC hint) that'd require some attention too. Yeah, though it seems like it'd be easy enough to take mmap_lock if and only if it's necessary, e.g. similar to how common KVM takes it only if it encounters VM_PFNMAP'd memory. E.g. 
take mmap_lock if and only if MTE is active (I assume that's uncommon?), or if the fault is to device memory. [1] https://lore.kernel.org/all/20210222024522.1751719-1-stevensd@google.com [2] https://lore.kernel.org/all/f862cefff2ed3f4211b69d785670f41667703cf3.camel@infradead.org
On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
> > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>
> There is definitely interest here in extending ARM to support the 1G
> size too, what is missing?

Currently PUD pfnmap relies on the THP_PUD config option:

config ARCH_SUPPORTS_PUD_PFNMAP
	def_bool y
	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Arm64 unfortunately doesn't yet support dax 1G, so it's not applicable yet.

Ideally, pfnmaps are simple enough compared to real THPs that they
shouldn't need to depend on THP at all, but we'll need things like the
below to land first:

https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com

I sent that a while ago, but I didn't collect enough input, and I decided
to unblock this series from it, so x86_64 shouldn't be affected, and arm64
will at least start to have 2M.

> > The other trick is how to allow gup-fast working for such huge mappings
> > even if there's no direct sign of knowing whether it's a normal page or
> > MMIO mapping.  This series chose to keep the pte_special solution, so that
> > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> > gup-fast will be able to identify them and fail properly.
>
> Make sense
>
> > More architectures / More page sizes
> > ------------------------------------
> >
> > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> >
> > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> > on 1G will be automatically enabled.
>
> Oh that sounds like a bigger step..

Just to mention, no real 1G THP is needed here for pfnmaps.  The real gap
is only the pud helpers, which so far only exist with CONFIG_THP_PUD in
huge_memory.c.

> > VFIO is so far the only consumer for the huge pfnmaps after this series
> > applied.  Besides above remap_pfn_range() generic optimization, device
> > driver can also try to optimize its mmap() on a better VA alignment for
> > either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> > as the driver doesn't normally decide the VA to map a bar.  But I don't
> > think I know all the drivers to know the full picture.
>
> How does alignment work?  In most caes I'm aware of the userspace does
> not use MAP_FIXED so the expectation would be for the kernel to
> automatically select a high alignment.  I suppose your cases are
> working because qemu uses MAP_FIXED and naturally aligns the BAR
> addresses?
>
> > - x86_64 + AMD GPU
> >   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
> >     sure all pages to be mapped with PUDs
>
> Oh :(

So I suppose this answers the above. :)  Yes, alignment is needed.
On 2024/8/16 3:20, Peter Xu wrote: > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: >>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. >> >> There is definitely interest here in extending ARM to support the 1G >> size too, what is missing? > > Currently PUD pfnmap relies on THP_PUD config option: > > config ARCH_SUPPORTS_PUD_PFNMAP > def_bool y > depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. > > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't > require to depend on THP at all, but we'll need things like below to land > first: > > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com > > I sent that first a while ago, but I didn't collect enough inputs, and I > decided to unblock this series from that, so x86_64 shouldn't be affected, > and arm64 will at least start to have 2M. > >> >>> The other trick is how to allow gup-fast working for such huge mappings >>> even if there's no direct sign of knowing whether it's a normal page or >>> MMIO mapping. This series chose to keep the pte_special solution, so that >>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that >>> gup-fast will be able to identify them and fail properly. >> >> Make sense >> >>> More architectures / More page sizes >>> ------------------------------------ >>> >>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. >>> >>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap >>> on 1G will be automatically enabled. A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, we may test pud pfnmaps on arm64. diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a2f8ff354ca6..ff0d27c72020 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -184,6 +184,7 @@ config ARM64 select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE + select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2 select HAVE_ARCH_VMAP_STACK select HAVE_ARM_SMCCC select HAVE_ASM_MODVERSIONS diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7a4f5604be3f..e013fe458476 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) #define pud_valid(pud) pte_valid(pud_pte(pud)) #define pud_user(pud) pte_user(pud_pte(pud)) #define pud_user_exec(pud) pte_user_exec(pud_pte(pud)) +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) +#define pud_devmap(pud) pte_devmap(pud_pte(pud)) +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) +#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud))) +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) + +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +static inline int pud_trans_huge(pud_t pud) +{ + return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT); +} + +static inline pud_t pud_mkdevmap(pud_t pud) +{ + return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP))); +} +#endif static inline bool pgtable_l4_enabled(void); @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma, pmd_pte(entry), dirty); } +static inline int pudp_set_access_flags(struct vm_area_struct *vma, + unsigned 
long address, pud_t *pudp, + pud_t entry, int dirty) +{ + return __ptep_set_access_flags(vma, address, (pte_t *)pudp, + pud_pte(entry), dirty); +} + +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD static inline int pud_devmap(pud_t pud) { return 0; } +#endif static inline int pgd_devmap(pgd_t pgd) { @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, { return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp); } + +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma, + unsigned long address, + pud_t *pudp) +{ + return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp); +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm, @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1) #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) +#define update_mmu_cache_pud(vma, address, pud) do { } while (0) #ifdef CONFIG_ARM64_PA_BITS_52 #define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
On 14.08.24 15:05, Jason Gunthorpe wrote: > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > >>>> That is in general not what we want, and we still have some places that >>>> wrongly hard-code that behavior. >>>> >>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >>>> >>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >>>> no? >>> >>> Yep, I think we can also rely on special bit. > > It is more than just relying on the special bit.. > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > vm_normal_page() because thay are, effectively, support for a limited > emulation of the special bit on arches that don't have them. There are > a bunch of weird rules that are used to try and make that work > properly that have to be followed. > > On arches with the sepcial bit they should possibly never be checked > since the special bit does everything you need. > > Arguably any place reading those flags out side of vm_normal_page/etc > is suspect. IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, GUP-fast is special (one of the reason for "pte_special()" and friends after all). > >>> Here I chose to follow gup-slow, and I suppose you meant that's also wrong? >> >> I assume just nobody really noticed, just like nobody noticed that >> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). > > Like here.. > >>> And, just curious: is there any use case you're aware of that can benefit >>> from caring PRIVATE pfnmaps yet so far, especially in this path? >> >> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. >> >> There was a discussion (in VM_PAT) some time ago whether we could remove >> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW >> mappings on /dev/mem, although not many (and they might not actually write >> to these areas). > > I've squashed many bugs where kernel drivers don't demand userspace > use MAP_SHARED when asking for a PFNMAP, and of course userspace has > gained the wrong flags too. I don't know if anyone needs this, but it > has crept wrongly into the API. > > Maybe an interesting place to start is a warning printk about using an > obsolete feature and see where things go from there?? Maybe we should start with some way to pr_warn_ONCE() whenever we get a COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate the fresh anon folio. Then we don't only know who mmaps() something like that, but who actually relies on getting anon folios in there.
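A hedged sketch of the kind of one-shot warning being floated here -- the
helper name and its placement are hypothetical, not an actual patch:

#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: something a COW/unshare fault path could call right
 * before installing a fresh anon folio into a MAP_PRIVATE VM_PFNMAP
 * mapping, to learn who actually relies on that behavior.
 */
static void warn_private_pfnmap_cow(struct vm_area_struct *vma)
{
	if (unlikely(vma->vm_flags & VM_PFNMAP))
		pr_warn_once("%s (%d): CoW fault in MAP_PRIVATE pfnmap\n",
			     current->comm, task_pid_nr(current));
}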
On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > On 14.08.24 15:05, Jason Gunthorpe wrote: > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > That is in general not what we want, and we still have some places that > > > > > wrongly hard-code that behavior. > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > no? > > > > > > > > Yep, I think we can also rely on special bit. > > > > It is more than just relying on the special bit.. > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > vm_normal_page() because thay are, effectively, support for a limited > > emulation of the special bit on arches that don't have them. There are > > a bunch of weird rules that are used to try and make that work > > properly that have to be followed. > > > > On arches with the sepcial bit they should possibly never be checked > > since the special bit does everything you need. > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > is suspect. > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > GUP-fast is special (one of the reason for "pte_special()" and friends after > all). The issue is at least GUP currently doesn't work with pfnmaps, while there're potentially users who wants to be able to work on both page + !page use cases. Besides access_process_vm(), KVM also uses similar thing, and maybe more; these all seem to be valid use case of reference the vma flags for PFNMAP and such, so they can identify "it's pfnmap" or more generic issues like "permission check error on pgtable". The whole private mapping thing definitely made it complicated. > > > > > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? > > > > > > I assume just nobody really noticed, just like nobody noticed that > > > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). > > > > Like here.. > > > > > > And, just curious: is there any use case you're aware of that can benefit > > > > from caring PRIVATE pfnmaps yet so far, especially in this path? > > > > > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. > > > > > > There was a discussion (in VM_PAT) some time ago whether we could remove > > > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW > > > mappings on /dev/mem, although not many (and they might not actually write > > > to these areas). > > > > I've squashed many bugs where kernel drivers don't demand userspace > > use MAP_SHARED when asking for a PFNMAP, and of course userspace has > > gained the wrong flags too. I don't know if anyone needs this, but it > > has crept wrongly into the API. > > > > Maybe an interesting place to start is a warning printk about using an > > obsolete feature and see where things go from there?? > > Maybe we should start with some way to pr_warn_ONCE() whenever we get a > COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate > the fresh anon folio. > > Then we don't only know who mmaps() something like that, but who actually > relies on getting anon folios in there. 
Sounds useful to me, since nobody yet has a solid understanding of those private mappings and we'd want to collect some info. My gut feeling is that we'll see some valid uses of them, but I hope I'm wrong.. I hope we can still leave that as a separate effort so that this series can stay focused on huge mappings. And yes, I'll stick with the special bits here rather than adding one more reference to the VMA flags.

Thanks,
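To illustrate the special-bit approach referred to here, below is a rough sketch of a vm_normal_page_pmd()-style check that rejects huge pfnmap entries. pmd_special() is the helper this series introduces for huge mappings; the function name, the config guard, and the simplified fallback are assumptions (real vm_normal_page() logic also handles VM_MIXEDMAP and the remap_pfn_range() vm_pgoff trick):

#include <linux/mm.h>

/* Sketch: decide whether a huge PMD entry has a folio behind it. */
static struct page *sketch_normal_page_pmd(struct vm_area_struct *vma, pmd_t pmd)
{
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
	/* The special bit alone says "no folio behind this entry". */
	if (pmd_special(pmd))
		return NULL;
#else
	/* No special bit: fall back to the VMA flags, as GUP-slow does. */
	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return NULL;
#endif
	return pmd_page(pmd);
}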
On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote: > > > On 2024/8/16 3:20, Peter Xu wrote: > > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: > > > > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. > > > > > > There is definitely interest here in extending ARM to support the 1G > > > size too, what is missing? > > > > Currently PUD pfnmap relies on THP_PUD config option: > > > > config ARCH_SUPPORTS_PUD_PFNMAP > > def_bool y > > depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. > > > > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't > > require to depend on THP at all, but we'll need things like below to land > > first: > > > > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com > > > > I sent that first a while ago, but I didn't collect enough inputs, and I > > decided to unblock this series from that, so x86_64 shouldn't be affected, > > and arm64 will at least start to have 2M. > > > > > > > > > The other trick is how to allow gup-fast working for such huge mappings > > > > even if there's no direct sign of knowing whether it's a normal page or > > > > MMIO mapping. This series chose to keep the pte_special solution, so that > > > > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that > > > > gup-fast will be able to identify them and fail properly. > > > > > > Make sense > > > > > > > More architectures / More page sizes > > > > ------------------------------------ > > > > > > > > Currently only x86_64 (2M+1G) and arm64 (2M) are supported. > > > > > > > > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap > > > > on 1G will be automatically enabled. > > A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, > we may test pud pfnmaps on arm64. Thanks, Kefeng. It'll be great if this works already, as simple. Might be interesting to know whether it works already if you have some few-GBs GPU around on the systems. Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected below, 1g pfnmap will be automatically enabled when you rebuild the kernel. You can double check that by looking for this: CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y And you can try to observe the mappings by enabling dynamic debug for vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read something from it. 
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..ff0d27c72020 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -184,6 +184,7 @@ config ARM64
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
>  	select HAVE_ARCH_VMAP_STACK
>  	select HAVE_ARM_SMCCC
>  	select HAVE_ASM_MODVERSIONS
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7a4f5604be3f..e013fe458476 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>  #define pud_valid(pud)		pte_valid(pud_pte(pud))
>  #define pud_user(pud)		pte_user(pud_pte(pud))
>  #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
> +#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
> +#define pud_devmap(pud)		pte_devmap(pud_pte(pud))
> +#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
> +#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
> +#define pud_mkwrite(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
> +#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
> +#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline int pud_trans_huge(pud_t pud)
> +{
> +	return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
> +}
> +
> +static inline pud_t pud_mkdevmap(pud_t pud)
> +{
> +	return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
> +}
> +#endif
> 
>  static inline bool pgtable_l4_enabled(void);
> 
> @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
>  					pmd_pte(entry), dirty);
>  }
> 
> +static inline int pudp_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long address, pud_t *pudp,
> +					pud_t entry, int dirty)
> +{
> +	return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
> +					pud_pte(entry), dirty);
> +}
> +
> +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>  static inline int pud_devmap(pud_t pud)
>  {
>  	return 0;
>  }
> +#endif
> 
>  static inline int pgd_devmap(pgd_t pgd)
>  {
> @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>  {
>  	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
>  }
> +
> +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
> +					    unsigned long address,
> +					    pud_t *pudp)
> +{
> +	return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> 
>  static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>  #define update_mmu_cache(vma, addr, ptep)	\
>  	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
>  #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
> +#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
> 
>  #ifdef CONFIG_ARM64_PA_BITS_52
>  #define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
> --
> 2.27.0
On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > > On 14.08.24 15:05, Jason Gunthorpe wrote: > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > > > That is in general not what we want, and we still have some places that > > > > > > wrongly hard-code that behavior. > > > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > > no? > > > > > > > > > > Yep, I think we can also rely on special bit. > > > > > > It is more than just relying on the special bit.. > > > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > > vm_normal_page() because thay are, effectively, support for a limited > > > emulation of the special bit on arches that don't have them. There are > > > a bunch of weird rules that are used to try and make that work > > > properly that have to be followed. > > > > > > On arches with the sepcial bit they should possibly never be checked > > > since the special bit does everything you need. > > > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > > is suspect. > > > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > > GUP-fast is special (one of the reason for "pte_special()" and friends after > > all). > > The issue is at least GUP currently doesn't work with pfnmaps, while > there're potentially users who wants to be able to work on both page + > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > and maybe more; these all seem to be valid use case of reference the vma > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > generic issues like "permission check error on pgtable". Why are those valid compared with calling vm_normal_page() per-page instead? What reason is there to not do something based only on the PFNMAP flag? Jason
On 16.08.24 16:21, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: >> On 14.08.24 15:05, Jason Gunthorpe wrote: >>> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: >>> >>>>>> That is in general not what we want, and we still have some places that >>>>>> wrongly hard-code that behavior. >>>>>> >>>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >>>>>> >>>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >>>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >>>>>> no? >>>>> >>>>> Yep, I think we can also rely on special bit. >>> >>> It is more than just relying on the special bit.. >>> >>> VM_PFNMAP/VM_MIXEDMAP should really only be used inside >>> vm_normal_page() because thay are, effectively, support for a limited >>> emulation of the special bit on arches that don't have them. There are >>> a bunch of weird rules that are used to try and make that work >>> properly that have to be followed. >>> >>> On arches with the sepcial bit they should possibly never be checked >>> since the special bit does everything you need. >>> >>> Arguably any place reading those flags out side of vm_normal_page/etc >>> is suspect. >> >> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... >> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, >> GUP-fast is special (one of the reason for "pte_special()" and friends after >> all). > > The issue is at least GUP currently doesn't work with pfnmaps, while > there're potentially users who wants to be able to work on both page + > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > and maybe more; these all seem to be valid use case of reference the vma > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > generic issues like "permission check error on pgtable".

What at least VFIO does is first try GUP, and if that fails, try follow_fault_pfn()->follow_pte(). There is a VM_PFNMAP check in there, yes. Ideally, follow_pte() would never return refcounted/normal pages, then the PFNMAP check might only be a performance improvement (maybe).

>
> The whole private mapping thing definitely made it complicated.

Yes, and follow_pte() for now could even return ordinary anon pages. I spotted that when I was working on that VM_PAT stuff, but I was too unsure what to do (see below: KVM with MAP_PRIVATE /dev/mem might just work, no idea if there are use cases?). Fortunately, vfio calls is_invalid_reserved_pfn() and refuses anything that has a struct page.

I think KVM does something nasty: if it's something with a "struct page", and it's not PageReserved, it would take a reference (if I read kvm_pfn_to_refcounted_page() correctly) independent of whether it's a "normal" or "not normal" page -- it essentially ignores the vm_normal_page() information in the page tables ...

So anon pages in private mappings from follow_pte() might currently work with KVM ... because of the way KVM uses follow_pte(). I did not play with it, so I'm not sure if I am missing some detail.
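For reference, the vfio check mentioned above boils down to something like the sketch below: a pfn is usable without taking a reference only if it has no struct page at all, or if that page is reserved. This is a paraphrase of is_invalid_reserved_pfn()'s intent written from memory, not a verbatim copy of the vfio source:

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Sketch: true if @pfn must not be treated as refcountable memory. */
static bool pfn_is_not_refcountable(unsigned long pfn)
{
	if (!pfn_valid(pfn))
		return true;	/* pure MMIO / pfnmap: no struct page at all */

	return PageReserved(pfn_to_page(pfn));
}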
On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote: > I think KVM does something nasty: if it something with a "struct page", and > it's not PageReserved, it would take a reference (if I get > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal" > page -- it essentially ignores the vm_normal_page() information in the page > tables ... Oh that's nasty. Nothing should be upgrading the output of the follow functions to refcounted. That's what GUP is for. And PFNMAP pages, even if they have struct pages for some reason, should *NEVER* be refcounted because they are in a PFNMAP VMA. That is completely against the whole point :\ If they could be safely refcounted then it would be a MIXEDMAP. Jason
On 2024/8/16 22:33, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote: >> >> >> On 2024/8/16 3:20, Peter Xu wrote: >>> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote: >>>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. >>>> >>>> There is definitely interest here in extending ARM to support the 1G >>>> size too, what is missing? >>> >>> Currently PUD pfnmap relies on THP_PUD config option: >>> >>> config ARCH_SUPPORTS_PUD_PFNMAP >>> def_bool y >>> depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD >>> >>> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet. >>> >>> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't >>> require to depend on THP at all, but we'll need things like below to land >>> first: >>> >>> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com >>> >>> I sent that first a while ago, but I didn't collect enough inputs, and I >>> decided to unblock this series from that, so x86_64 shouldn't be affected, >>> and arm64 will at least start to have 2M. >>> >>>> >>>>> The other trick is how to allow gup-fast working for such huge mappings >>>>> even if there's no direct sign of knowing whether it's a normal page or >>>>> MMIO mapping. This series chose to keep the pte_special solution, so that >>>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that >>>>> gup-fast will be able to identify them and fail properly. >>>> >>>> Make sense >>>> >>>>> More architectures / More page sizes >>>>> ------------------------------------ >>>>> >>>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. >>>>> >>>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap >>>>> on 1G will be automatically enabled. >> >> A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE, >> we may test pud pfnmaps on arm64. > > Thanks, Kefeng. It'll be great if this works already, as simple. > > Might be interesting to know whether it works already if you have some > few-GBs GPU around on the systems. > > Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected > below, 1g pfnmap will be automatically enabled when you rebuild the kernel. > You can double check that by looking for this: > > CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y > > And you can try to observe the mappings by enabling dynamic debug for > vfio_pci_mmap_huge_fault(), then map the bar with vfio-pci and read > something from it.

I don't have such a device, but we wrote a driver which uses vmf_insert_pfn_pmd/pud in huge_fault, roughly:

static const struct vm_operations_struct test_vm_ops = {
	.huge_fault = test_huge_fault,
	...
};

and we read/write the area after mmap(,2M/1G,test_fd,...); it works as expected. Since it could be used by dax, let's send it separately.
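A fuller sketch of the handler behind that .huge_fault hook is below, assuming the order-based ->huge_fault() signature and using the hypothetical helper test_region_pfn() for the driver-specific pfn lookup; this is an illustration of how such a test driver could look, not Kefeng's actual code:

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>

/*
 * Hypothetical helper: driver-specific lookup of the pfn backing @addr in
 * @vma, aligned to the requested order.  Not a real kernel API.
 */
static unsigned long test_region_pfn(struct vm_area_struct *vma,
				     unsigned long addr, unsigned int order);

/* Sketch: serve faults at PMD/PUD granularity, fall back to 4K otherwise. */
static vm_fault_t test_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	unsigned long pfn = test_region_pfn(vmf->vma, vmf->address, order);
	bool write = vmf->flags & FAULT_FLAG_WRITE;

	if (order == PMD_SHIFT - PAGE_SHIFT)
		return vmf_insert_pfn_pmd(vmf, pfn_to_pfn_t(pfn), write);
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
	if (order == PUD_SHIFT - PAGE_SHIFT)
		return vmf_insert_pfn_pud(vmf, pfn_to_pfn_t(pfn), write);
#endif
	return VM_FAULT_FALLBACK;
}

static const struct vm_operations_struct test_vm_ops = {
	.huge_fault	= test_huge_fault,
	/* .fault for the 4K case omitted from this sketch */
};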
On Mon, Aug 19, 2024, Jason Gunthorpe wrote: > On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote: > > > I think KVM does something nasty: if it something with a "struct page", and > > it's not PageReserved, it would take a reference (if I get > > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal" > > page -- it essentially ignores the vm_normal_page() information in the page > > tables ... > > Oh that's nasty. Nothing should be upgrading the output of the follow > functions to refcounted. That's what GUP is for. > > And PFNMAP pages, even if they have struct pages for some reason, > should *NEVER* be refcounted because they are in a PFNMAP VMA. That is > completely against the whole point :\ If they could be safely > refcounted then it would be a MIXEDMAP. Yeah yeah, I'm working on it. https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
On Fri, Aug 16, 2024 at 02:38:36PM -0300, Jason Gunthorpe wrote: > On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote: > > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > > > On 14.08.24 15:05, Jason Gunthorpe wrote: > > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > > > > > That is in general not what we want, and we still have some places that > > > > > > > wrongly hard-code that behavior. > > > > > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > > > no? > > > > > > > > > > > > Yep, I think we can also rely on special bit. > > > > > > > > It is more than just relying on the special bit.. > > > > > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > > > vm_normal_page() because thay are, effectively, support for a limited > > > > emulation of the special bit on arches that don't have them. There are > > > > a bunch of weird rules that are used to try and make that work > > > > properly that have to be followed. > > > > > > > > On arches with the sepcial bit they should possibly never be checked > > > > since the special bit does everything you need. > > > > > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > > > is suspect. > > > > > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > > > GUP-fast is special (one of the reason for "pte_special()" and friends after > > > all). > > > > The issue is at least GUP currently doesn't work with pfnmaps, while > > there're potentially users who wants to be able to work on both page + > > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > > and maybe more; these all seem to be valid use case of reference the vma > > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > > generic issues like "permission check error on pgtable". > > Why are those valid compared with calling vm_normal_page() per-page > instead? > > What reason is there to not do something based only on the PFNMAP > flag? My comment was for answering "why VM_PFNMAP flags is needed outside vm_normal_page()", because GUP lacks supports of it. Are you suggesting we should support VM_PFNMAP in GUP, perhaps? Thanks,
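For context, the pattern such callers follow today is roughly "try GUP first, and when the mapping has no struct page behind it, fall back to a VMA-level accessor". The sketch below illustrates that shape only; it is not the real __access_remote_vm() code, assumes the access fits within one page, and assumes the caller holds the mmap_lock of vma->vm_mm:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Sketch: access memory that may be folio-backed or a pfnmap. */
static int access_page_or_pfnmap(struct vm_area_struct *vma, unsigned long addr,
				 void *buf, int len, bool write)
{
	unsigned int gup_flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
	struct page *page;
	long got;

	got = get_user_pages_remote(vma->vm_mm, addr, 1, gup_flags, &page, NULL);
	if (got == 1) {
		/* Folio-backed memory: copy through a temporary kernel mapping. */
		void *kaddr = kmap_local_page(page);

		if (write)
			memcpy(kaddr + offset_in_page(addr), buf, len);
		else
			memcpy(buf, kaddr + offset_in_page(addr), len);
		kunmap_local(kaddr);
		/* Real code would also mark the page dirty after a write. */
		put_page(page);
		return len;
	}

	/* No folio behind the mapping (pfnmap/MMIO): defer to the driver. */
	if (vma->vm_ops && vma->vm_ops->access)
		return vma->vm_ops->access(vma, addr, buf, len, write);

	return -EFAULT;
}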