Message ID | 20240731122103.382509-1-david@redhat.com |
---|---|
State | New |
Series | [v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking |
On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table locking
> for hugetlb folios that are not mapped using a single PMD/PUD.
>
> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with a 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb code that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> This issue can be reproduced [1], for example triggering:
>
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: [...]
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522]  try_grab_folio+0x11c/0x188
> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
> [ 3106.058118]  __get_user_pages+0xf0/0x348
> [ 3106.060647]  faultin_page_range+0xb0/0x360
> [ 3106.063651]  do_madvise+0x340/0x598
>
> Let's make huge_pte_lockptr() effectively use the same PT locks as any
> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
> page table lock using a pte pointer -- unfortunately we cannot convert
> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page
> tables we can have with CONFIG_HIGHPTE.
>
> Take care of PTE tables possibly spanning multiple pages, and take care of
> CONFIG_PGTABLE_LEVELS complexity when e.g., PMD_SIZE == PUD_SIZE. For
> example, with CONFIG_PGTABLE_LEVELS == 2, core-mm would detect
> pmd_leaf() with hugepagesize==PMD_SIZE and use pmd_lockptr(), which
> would end up just mapping to the per-MM PT lock.
>
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of the two
> PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP and consequently doesn't use split PT locks.
>
> [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Reviewed-by: James Houghton <jthoughton@google.com>
> Cc: <stable@vger.kernel.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Nitpick: I wonder whether some of the lines can be simplified if we write
it downwards from PUD, like,

huge_pte_lockptr()
{
        if (size >= PUD_SIZE)
                return pud_lockptr(...);
        if (size >= PMD_SIZE)
                return pmd_lockptr(...);
        /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
        WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
        return ptep_lockptr(...);
}

The ">=" checks should avoid checking over the pgtable level, iiuc.

The other nitpick is, I didn't yet find any arch that uses a non-zero-order
page for pte pgtables. I would give it a shot with dropping the mask thing
and then see what explodes (which I don't expect any, per my read..), but yeah
I understand we saw some already due to other things, so I think it's fine
if in this hugetlb path (that we're removing) we do a bit more math, if you
think that's easier for you.

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,
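To make the mismatch concrete, here is a minimal sketch of the two lock
choices (simplified from pte_lockptr() and the pre-fix huge_pte_lockptr()
quoted later in this thread; the function names gup_side/hugetlb_side are
made up for illustration):

	/* What GUP effectively uses via pte_offset_map_lock() for a
	 * hugetlb folio mapped by a PTE table: the split per-table lock.
	 */
	spinlock_t *gup_side(struct mm_struct *mm, pmd_t *pmd)
	{
		return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
	}

	/* What huge_pte_lockptr() returned before this fix: */
	spinlock_t *hugetlb_side(struct hstate *h, struct mm_struct *mm,
				 pte_t *pte)
	{
		if (huge_page_size(h) == PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *) pte);
		return &mm->page_table_lock;	/* a different lock! */
	}

For a 64 KiB hugetlb folio on arm64 with 4 KiB base pages, huge_page_size()
is neither PMD_SIZE nor PAGE_SIZE, so with USE_SPLIT_PTE_PTLOCKS hugetlb
serializes on mm->page_table_lock while GUP spins on the per-table lock;
neither excludes the other.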
On 31.07.24 16:54, Peter Xu wrote:
> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:

[...]

> Nitpick: I wonder whether some of the lines can be simplified if we write
> it downwards from PUD, like,
>
> huge_pte_lockptr()
> {
>         if (size >= PUD_SIZE)
>                 return pud_lockptr(...);
>         if (size >= PMD_SIZE)
>                 return pmd_lockptr(...);
>         /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
>         WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
>         return ptep_lockptr(...);
> }

Let me think about it. For PUD_SIZE == PMD_SIZE, instead of calling
pmd_lockptr() like core-mm, we'd call pud_lockptr(). Likely it would work
because we default in most cases to the per-MM lock:

arch/x86/Kconfig:	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)

> The ">=" checks should avoid checking over the pgtable level, iiuc.
>
> The other nitpick is, I didn't yet find any arch that uses a non-zero-order
> page for pte pgtables. I would give it a shot with dropping the mask thing
> and then see what explodes (which I don't expect any, per my read..), but yeah
> I understand we saw some already due to other things, so I think it's fine
> if in this hugetlb path (that we're removing) we do a bit more math, if you
> think that's easier for you.

I threw

	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);

into pte_lockptr() and did a bunch of cross-compiles. And for some reason it
blows up for powernv (powernv_defconfig) and pseries (pseries_defconfig).
In function 'pte_lockptr',
    inlined from 'pte_offset_map_nolock' at mm/pgtable-generic.c:316:11:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
  510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                             ^
././include/linux/compiler_types.h:491:25: note: in definition of macro '__compiletime_assert'
  491 |                 prefix ## suffix();                             \
      |                         ^~~~~~
././include/linux/compiler_types.h:510:9: note: in expansion of macro '_compiletime_assert'
  510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |         ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
      |         ^~~~~~~~~~~~~~~~
./include/linux/mm.h:2926:9: note: in expansion of macro 'BUILD_BUG_ON'
 2926 |         BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
      |         ^~~~~~~~~~~~
In function 'pte_lockptr',
    inlined from '__pte_offset_map_lock' at mm/pgtable-generic.c:374:8:
[... same error and expansion notes repeated ...]

pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
end up calling pagetable_alloc(, 0).

And fragments are supposed to be <= a single page.

Now I'm confused what's wrong here ... am I missing something obvious?

CCing some powerpc folks. Is this some pte_t oddity? But in
mm_inc_nr_ptes/mm_dec_nr_ptes we use the exact same calculation :/
David Hildenbrand <david@redhat.com> writes:
> On 31.07.24 16:54, Peter Xu wrote:
...
> I threw
> 	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> into pte_lockptr() and did a bunch of cross-compiles.
>
> And for some reason it blows up for powernv (powernv_defconfig) and
> pseries (pseries_defconfig).
...
> pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
> end up calling pagetable_alloc(, 0).
>
> And fragments are supposed to be <= a single page.
>
> Now I'm confused what's wrong here ... am I missing something obvious?
>
> CCing some powerpc folks. Is this some pte_t oddity?

It will be because PTRS_PER_PTE is not a compile time constant :(

$ git grep "define PTRS_PER_PTE" arch/powerpc/include/asm/book3s/64
arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)

$ git grep "define PTE_INDEX_SIZE" arch/powerpc/include/asm/book3s/64
arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTE_INDEX_SIZE	__pte_index_size

$ git grep __pte_index_size arch/powerpc/mm/pgtable_64.c
arch/powerpc/mm/pgtable_64.c:unsigned long __pte_index_size;

Which is because the pseries/powernv (book3s64) kernel supports either
the HPT or Radix MMU at runtime, and they have different page table
geometry.

If you change it to use MAX_PTRS_PER_PTE it should work (that's defined
for all arches).

cheers

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 381750f41767..1fd9c296c0b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2924,6 +2924,8 @@ static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pte_t *pte)
 {
 	/* PTE page tables don't currently exceed a single page. */
+	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
+
 	return ptlock_ptr(virt_to_ptdesc(pte));
 }
>> pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
>> end up calling pagetable_alloc(, 0).
>>
>> And fragments are supposed to be <= a single page.
>>
>> Now I'm confused what's wrong here ... am I missing something obvious?
>>
>> CCing some powerpc folks. Is this some pte_t oddity?
>
> It will be because PTRS_PER_PTE is not a compile time constant :(

Oh, sneaky! Thanks a bunch!

> $ git grep "define PTRS_PER_PTE" arch/powerpc/include/asm/book3s/64
> arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
>
> $ git grep "define PTE_INDEX_SIZE" arch/powerpc/include/asm/book3s/64
> arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTE_INDEX_SIZE	__pte_index_size
>
> $ git grep __pte_index_size arch/powerpc/mm/pgtable_64.c
> arch/powerpc/mm/pgtable_64.c:unsigned long __pte_index_size;
>
> Which is because the pseries/powernv (book3s64) kernel supports either
> the HPT or Radix MMU at runtime, and they have different page table
> geometry.
>
> If you change it to use MAX_PTRS_PER_PTE it should work (that's defined
> for all arches).

Yes, that should fly, thanks again!
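Boiled down, the failure mode Michael describes looks like this (a reduced
sketch assembled from the grepped book3s64 definitions above, not the actual
headers):

	/* On book3s64 the page-table geometry is chosen at boot (HPT vs.
	 * Radix), so PTRS_PER_PTE expands to a runtime variable:
	 */
	extern unsigned long __pte_index_size;		/* set during early boot */
	#define PTE_INDEX_SIZE	__pte_index_size
	#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)	/* not a compile-time constant */

	/* BUILD_BUG_ON() needs an integer constant expression, so this fails
	 * to build even if the condition would be false at runtime:
	 *
	 *	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
	 *
	 * MAX_PTRS_PER_PTE is the compile-time upper bound, so this works:
	 *
	 *	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
	 */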
On 31.07.24 14:21, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table locking
> for hugetlb folios that are not mapped using a single PMD/PUD.

James, Peter, the following seems to get the job done. Thoughts?

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e462205400d..776dc3914d9e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -938,10 +938,40 @@ static inline bool htlb_allow_alloc_fallback(int reason)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	unsigned long size = huge_page_size(h);
+
+	VM_WARN_ON(size == PAGE_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 *
+	 * Note that with e.g., CONFIG_PGTABLE_LEVELS=2 where
+	 * PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE, we'd use the MM PT lock
+	 * directly with a PMD hugetlb size, whereby core-mm would call
+	 * pmd_lockptr() instead. However, in such configurations split PMD
+	 * locks are disabled -- split locks don't make sense on a single
+	 * PGDIR page table -- and the end result is the same.
+	 */
+	if (size >= P4D_SIZE)
+		return &mm->page_table_lock;
+	else if (size >= PUD_SIZE)
+		return pud_lockptr(mm, (pud_t *) pte);
+	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
-	return &mm->page_table_lock;
+	/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
+	return ptep_lockptr(mm, pte);
 }
 
 #ifndef hugepages_supported
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a890a1731c14..bd219ac9c026 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2869,6 +2869,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
+	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
+	return ptlock_ptr(virt_to_ptdesc(pte));
+}
+
 static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
 	/*
@@ -2893,6 +2900,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
 	return &mm->page_table_lock;
 }
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return &mm->page_table_lock;
+}
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
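For context, callers consume the returned lock through the existing
huge_pte_lock() wrapper, which looks roughly like this (a simplified sketch
of the helper in include/linux/hugetlb.h, not part of the patch):

	/* Lock the page table backing a hugetlb PTE and return the lock.
	 * With the change above, this resolves to the same spinlock that a
	 * core-mm walker of the same page table would take.
	 */
	static inline spinlock_t *huge_pte_lock(struct hstate *h,
						struct mm_struct *mm, pte_t *pte)
	{
		spinlock_t *ptl;

		ptl = huge_pte_lockptr(h, mm, pte);
		spin_lock(ptl);
		return ptl;
	}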
On Thu, Aug 01, 2024 at 10:50:18AM +0200, David Hildenbrand wrote:
> On 31.07.24 14:21, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code also walk hugetlb
>> VMAs without most hugetlb special-casing, preparing for the future of
>> having less hugetlb-specific page table walking code in the codebase.
>> Turns out that we missed one page table locking detail: page table locking
>> for hugetlb folios that are not mapped using a single PMD/PUD.
>
> James, Peter,
>
> the following seems to get the job done. Thoughts?

OK to me, so my A-b can keep, but let me still comment; again, all nitpicks.

[...]

> +	if (size >= P4D_SIZE)
> +		return &mm->page_table_lock;

I'd drop this so the mm lock fallback will be done below (especially since in
reality the pud lock is always the mm lock for now..). Also this line reads
like there can be P4D-size huge pages, but in reality PUD is the largest
(nopxx doesn't count). We also save some cycles in most cases if it's removed.

> +	else if (size >= PUD_SIZE)
> +		return pud_lockptr(mm, (pud_t *) pte);
> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))

I thought this HIGHPTE check can also be dropped? Because with HIGHPTE it
should never have lower-than-PMD huge pages or we're in trouble. That's why I
kept one WARN_ON() in my pseudo code but only before trying to take the pte
lockptr.

[...]

> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> +	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> +	return ptlock_ptr(virt_to_ptdesc(pte));
> +}

Great to know we can drop the mask..

Thanks,
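For reference, the "pud lock is always the mm lock for now" remark refers to
pud_lockptr(); at the time of this thread it is simply the per-MM lock
(simplified from include/linux/mm.h -- there is no split-PUD-lock variant):

	static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
	{
		return &mm->page_table_lock;
	}

So a "size >= PUD_SIZE" branch and a plain &mm->page_table_lock fallback
currently return the same lock.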
Hi Peter,

[...]

>> +	if (size >= P4D_SIZE)
>> +		return &mm->page_table_lock;
>
> I'd drop this so the mm lock fallback will be done below (especially since in
> reality the pud lock is always the mm lock for now..). Also this line reads
> like there can be P4D-size huge pages, but in reality PUD is the largest
> (nopxx doesn't count). We also save some cycles in most cases if it's removed.

The compiler should be smart enough to optimize most of that out. But I agree
that there is no need to be that future-proof here.

These two are interesting:

arch/powerpc/mm/hugetlbpage.c:	if (!mm_pud_folded(mm) && sz >= P4D_SIZE)
arch/riscv/mm/hugetlbpage.c:	else if (sz >= P4D_SIZE)

But I assume they are only forward-looking. (GUP would already be pretty
broken if that existed)

>> +	else if (size >= PUD_SIZE)
>> +		return pud_lockptr(mm, (pud_t *) pte);
>> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
>
> I thought this HIGHPTE check can also be dropped? Because with HIGHPTE it
> should never have lower-than-PMD huge pages or we're in trouble. That's why I
> kept one WARN_ON() in my pseudo code but only before trying to take the pte
> lockptr.

Then the compiler won't optimize out the ptep_lockptr() call and we'll run
into a build error. And I think the HIGHPTE build error serves a good purpose.

In file included from <command-line>:
In function 'ptep_lockptr',
    inlined from 'huge_pte_lockptr' at ./include/linux/hugetlb.h:974:9,
    inlined from 'huge_pte_lock' at ./include/linux/hugetlb.h:1248:8,
    inlined from 'pagemap_scan_hugetlb_entry' at fs/proc/task_mmu.c:2581:8:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_256' declared with attribute error: BUILD_BUG_ON failed: IS_ENABLED(CONFIG_HIGHPTE)
  510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                             ^
[... macro expansion notes ...]
./include/linux/mm.h:2874:9: note: in expansion of macro 'BUILD_BUG_ON'
 2874 |         BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));

It would be even better to have a way to know whether an arch even has
hugetlb < PMD_SIZE (a config option?) and use that instead here; but I'll
leave that as an optimization.
On Thu, Aug 01, 2024 at 05:35:20PM +0200, David Hildenbrand wrote:
> Hi Peter,

[...]

>>> +	else if (size >= PUD_SIZE)
>>> +		return pud_lockptr(mm, (pud_t *) pte);
>>> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
>>
>> I thought this HIGHPTE check can also be dropped? Because with HIGHPTE it
>> should never have lower-than-PMD huge pages or we're in trouble. That's why I
>> kept one WARN_ON() in my pseudo code but only before trying to take the pte
>> lockptr.
>
> Then the compiler won't optimize out the ptep_lockptr() call and we'll run
> into a build error. And I think the HIGHPTE build error serves a good purpose.
>
> [... build error ...]

Ahh.. this is in the "ifdef USE_SPLIT_PTE_PTLOCKS" section. I'm thinking maybe
we should drop this BUILD_BUG_ON - it says "HIGHPTE shouldn't co-exist with
USE_SPLIT_PTE_PTLOCKS", but I think it can?

Said that, I think I can also understand your point, where you see
ptep_lockptr() as a hugetlb-only function; in that case the BUILD_BUG_ON would
make sense in the hugetlb world.

So.. per my previous nitpick suggestion, IIUC we'll need to drop this
BUILD_BUG_ON, just to say "USE_SPLIT_PTE_PTLOCKS can work with HIGHPTE" and
perhaps be slightly more readable; we'll rely on the WARN_ON to guard that
HIGHPTE won't use the pte lock.

Either way works for me, thanks!
On 01.08.24 18:07, Peter Xu wrote:
> On Thu, Aug 01, 2024 at 05:35:20PM +0200, David Hildenbrand wrote:

[...]

> Ahh.. this is in the "ifdef USE_SPLIT_PTE_PTLOCKS" section. I'm thinking maybe
> we should drop this BUILD_BUG_ON - it says "HIGHPTE shouldn't co-exist with
> USE_SPLIT_PTE_PTLOCKS", but I think it can?
>
> Said that, I think I can also understand your point, where you see
> ptep_lockptr() as a hugetlb-only function; in that case the BUILD_BUG_ON would
> make sense in the hugetlb world.
>
> So.. per my previous nitpick suggestion, IIUC we'll need to drop this
> BUILD_BUG_ON, just to say "USE_SPLIT_PTE_PTLOCKS can work with HIGHPTE" and
> perhaps be slightly more readable; we'll rely on the WARN_ON to guard that
> HIGHPTE won't use the pte lock.

I really don't want to drop the BUILD_BUG_ON. The function cannot possibly
work with HIGHPTE, especially once used in another context by accident.

So I'll leave it like that. Feel free to optimize the hugetlb code further
once the fix has landed (e.g., really optimize it out if we cannot possibly
have such hugetlb sizes).

Thanks!
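Spelled out, the constant-folding David relies on works like this (a sketch
based on the proposed branch in huge_pte_lockptr(), with explanatory comments
added; not new code):

	/* With CONFIG_HIGHPTE=y, IS_ENABLED(CONFIG_HIGHPTE) is the constant 1,
	 * so the condition below is always true, the compiler discards the
	 * trailing ptep_lockptr() branch as dead code, and the
	 * BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE)) inside ptep_lockptr() is
	 * never instantiated. Dropping the IS_ENABLED() check from the caller
	 * would leave the call reachable and turn every HIGHPTE build into
	 * the error quoted above.
	 */
	if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
		return pmd_lockptr(mm, (pmd_t *) pte);
	return ptep_lockptr(mm, pte);	/* compiled out when HIGHPTE=y */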
Le 31/07/2024 à 18:33, David Hildenbrand a écrit :
> On 31.07.24 16:54, Peter Xu wrote:
>> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:

[...]

>> Nitpick: I wonder whether some of the lines can be simplified if we write
>> it downwards from PUD, like, [...]
>
> Let me think about it. For PUD_SIZE == PMD_SIZE, instead of calling
> pmd_lockptr() like core-mm, we'd call pud_lockptr().

I guess it is only when including asm-generic/pgtable-nopmd.h.

Otherwise you would have more than one entry in the PMD table, so PMD_SIZE
would always be smaller than PUD_SIZE, wouldn't it?

So maybe some simplification could be done, like having pud_lock() be a nop
in that case?

Christophe
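The folding Christophe refers to comes from the generic nopmd header; in
essence (trimmed from include/asm-generic/pgtable-nopmd.h):

	/* With a folded PMD level the PMD "table" has a single entry and
	 * simply aliases the PUD, so PMD_SIZE == PUD_SIZE by construction.
	 */
	#define PMD_SHIFT	PUD_SHIFT
	#define PTRS_PER_PMD	1
	#define PMD_SIZE	(1UL << PMD_SHIFT)
	#define PMD_MASK	(~(PMD_SIZE - 1))

With more than one PMD entry per table, PMD_SIZE is necessarily a strict
subdivision of PUD_SIZE, which is Christophe's point.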
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a01..e6437a06e2346 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -944,9 +944,32 @@ static inline bool htlb_allow_alloc_fallback(int reason)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	unsigned long size = huge_page_size(h);
+
+	VM_WARN_ON(size == PAGE_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 */
+	if (size < PMD_SIZE && !IS_ENABLED(CONFIG_HIGHPTE))
+		/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
+		return ptep_lockptr(mm, pte);
+	else if (size < PUD_SIZE || CONFIG_PGTABLE_LEVELS == 2)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
+	else if (size < P4D_SIZE || CONFIG_PGTABLE_LEVELS == 3)
+		return pud_lockptr(mm, (pud_t *) pte);
 	return &mm->page_table_lock;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b100df8cb5857..f6c7fe8f5746f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2926,6 +2926,24 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
+static inline struct page *ptep_pgtable_page(pte_t *pte)
+{
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
+	return virt_to_page((void *)((unsigned long)pte & mask));
+}
+
+static inline struct ptdesc *ptep_ptdesc(pte_t *pte)
+{
+	return page_ptdesc(ptep_pgtable_page(pte));
+}
+
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return ptlock_ptr(ptep_ptdesc(pte));
+}
+
 static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
 	/*
@@ -2950,6 +2968,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
 	return &mm->page_table_lock;
 }
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return &mm->page_table_lock;
+}
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
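To see what the mask in ptep_pgtable_page() computes, here is a tiny
standalone demo (a userspace sketch with illustrative values -- 512
eight-byte entries, i.e. a 4 KiB PTE table; the address is made up and this
is not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* Round a pte pointer down to the start of its page table,
		 * mirroring: pte & ~(PTRS_PER_PTE * sizeof(pte_t) - 1)
		 */
		const uint64_t entries = 512, entry_size = 8;
		uint64_t mask = ~(entries * entry_size - 1);
		uint64_t pte = 0xffff888012345e78ULL;	/* some entry mid-table */

		printf("table base: %#llx\n",
		       (unsigned long long)(pte & mask));
		/* prints 0xffff888012345000 -- a table-aligned virtual
		 * address suitable for virt_to_page()
		 */
		return 0;
	}

As discussed earlier in the thread, this rounding only matters if a PTE table
could span more than one page; since no architecture does that, the follow-up
version drops the mask, asserts MAX_PTRS_PER_PTE * sizeof(pte_t) <= PAGE_SIZE,
and calls virt_to_ptdesc(pte) directly.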