[v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking

Message ID 20240731122103.382509-1-david@redhat.com (mailing list archive)
State New
Series: [v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking

Commit Message

David Hildenbrand July 31, 2024, 12:21 p.m. UTC
We recently made GUP's common page table walking code also walk hugetlb
VMAs without most hugetlb special-casing, preparing for the future of
having less hugetlb-specific page table walking code in the codebase.
Turns out that we missed one page table locking detail: page table locking
for hugetlb folios that are not mapped using a single PMD/PUD.

Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
page tables, will perform a pte_offset_map_lock() to grab the PTE table
lock.

However, hugetlb code that concurrently modifies these page tables would
actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
locks would differ. Something similar can happen right now with hugetlb
folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
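
To illustrate, a condensed sketch of the two paths (simplified, not the
literal call chains):

	/* GUP, walking a PTE-mapped hugetlb folio: */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);	/* split PTE PT lock */
	...
	/* hugetlb, concurrently modifying the same PTE table, so far: */
	ptl = huge_pte_lock(h, mm, ptep);	/* -> &mm->page_table_lock */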

This issue can be reproduced [1], for example triggering:

[ 3105.936100] ------------[ cut here ]------------
[ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
[ 3105.944634] Modules linked in: [...]
[ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
[ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
[ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3105.991108] pc : try_grab_folio+0x11c/0x188
[ 3105.994013] lr : follow_page_pte+0xd8/0x430
[ 3105.996986] sp : ffff80008eafb8f0
[ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
[ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
[ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
[ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
[ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
[ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
[ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
[ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
[ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
[ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
[ 3106.047957] Call trace:
[ 3106.049522]  try_grab_folio+0x11c/0x188
[ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
[ 3106.055527]  follow_page_mask+0x1a0/0x2b8
[ 3106.058118]  __get_user_pages+0xf0/0x348
[ 3106.060647]  faultin_page_range+0xb0/0x360
[ 3106.063651]  do_madvise+0x340/0x598

Let's make huge_pte_lockptr() effectively use the same PT locks as any
core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
page table lock using a pte pointer -- unfortunately we cannot convert
pte_lockptr() because virt_to_page() doesn't work with the kmap'ed page
tables we can have with CONFIG_HIGHPTE.

Take care of PTE tables possibly spanning multiple pages, and take care of
CONFIG_PGTABLE_LEVELS complexity when e.g., PMD_SIZE == PUD_SIZE. For
example, with CONFIG_PGTABLE_LEVELS == 2 and hugepagesize == PMD_SIZE,
core-mm would detect pmd_leaf() and use pmd_lockptr(), which would end
up just mapping to the per-MM PT lock.

There is one ugly case: powerpc 8xx, where an 8 MiB hugetlb folio is
mapped using two PTE page tables.  While hugetlb wants to take the PMD
table lock, core-mm would grab the PTE table lock of one of the two PTE
page tables.  In such corner cases, we have to make sure that both locks
match, which is (fortunately!) currently guaranteed for 8xx as it does
not support SMP and consequently doesn't use split PT locks.

[1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/

Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Reviewed-by: James Houghton <jthoughton@google.com>
Cc: <stable@vger.kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

Third time is the charm?

Retested on arm64 and x86-64. Cross-compiled on a bunch of others.

v2 -> v3:
* Handle CONFIG_PGTABLE_LEVELS oddities as well as possible. It's a mess.
  Remove the size >= P4D_SIZE check and simply default to the
  &mm->page_table_lock.
* Align the PTE pointer to the start of the page table to handle PTE page
  tables bigger than a single page (unclear if this could currently trigger).
* Extend patch description

v1 -> v2:
* Extend patch description
* Drop "mm: let pte_lockptr() consume a pte_t pointer"
* Introduce ptep_lockptr() in this patch

---
 include/linux/hugetlb.h | 27 +++++++++++++++++++++++++--
 include/linux/mm.h      | 22 ++++++++++++++++++++++
 2 files changed, 47 insertions(+), 2 deletions(-)

Comments

Peter Xu July 31, 2024, 2:54 p.m. UTC | #1
On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table locking
> for hugetlb folios that are not mapped using a single PMD/PUD.
> 
> [...]
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Nitpick: I wonder whether some of the lines can be simplified if we write
it downwards from PUD, like,

huge_pte_lockptr()
{
        if (size >= PUD_SIZE)
           return pud_lockptr(...);
        if (size >= PMD_SIZE)
           return pmd_lockptr(...);
        /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
        WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
        return ptep_lockptr(...);
}

The ">=" checks should avoid having to check against the pgtable levels
explicitly, iiuc.

The other nitpick is, I didn't yet find any arch that uses non-zero order
pages for pte pgtables.  I would give it a shot with dropping the mask thing
then see what explodes (which I don't expect, per my read..), but yeah
I understand we saw some surprises already due to other things, so I think
it's fine if in this hugetlb path (that we're removing) we do a bit more
math, if you think that's easier for you.
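
(By "the mask thing" I mean the PTE-table alignment in ptep_pgtable_page()
in the patch below:
	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);)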

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,
David Hildenbrand July 31, 2024, 4:33 p.m. UTC | #2
On 31.07.24 16:54, Peter Xu wrote:
> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code also walk hugetlb
>> VMAs without most hugetlb special-casing, preparing for the future of
>> having less hugetlb-specific page table walking code in the codebase.
>>
>> [...]
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Nitpick: I wonder whether some of the lines can be simplified if we write
> it downwards from PUD, like,
> 
> huge_pte_lockptr()
> {
>          if (size >= PUD_SIZE)
>             return pud_lockptr(...);
>          if (size >= PMD_SIZE)
>             return pmd_lockptr(...);
>          /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
>          WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
>          return ptep_lockptr(...);
> }

Let me think about it. For PUD_SIZE == PMD_SIZE, we'd call pud_lockptr()
where core-mm would call pmd_lockptr().

Likely it would work because we default in most cases to the per-MM lock:

arch/x86/Kconfig:       select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
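
For reference, the split-PMD gate (as defined in mainline
include/linux/mm_types_task.h; quoted from memory):

#define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))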



> 
> The ">=" checks should avoid having to check against the pgtable levels
> explicitly, iiuc.
> 
> The other nitpick is, I didn't yet find any arch that uses non-zero order
> pages for pte pgtables.  I would give it a shot with dropping the mask thing
> then see what explodes (which I don't expect, per my read..), but yeah
> I understand we saw some surprises already due to other things, so I think
> it's fine if in this hugetlb path (that we're removing) we do a bit more
> math, if you think that's easier for you.

I threw
	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
into pte_lockptr() and did a bunch of cross-compiles.

And for some reason it blows up for powernv (powernv_defconfig) and
pseries (pseries_defconfig).


In function 'pte_lockptr',
     inlined from 'pte_offset_map_nolock' at mm/pgtable-generic.c:316:11:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
   510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
       |                                             ^
././include/linux/compiler_types.h:491:25: note: in definition of macro '__compiletime_assert'
   491 |                         prefix ## suffix();                             \
       |                         ^~~~~~
././include/linux/compiler_types.h:510:9: note: in expansion of macro '_compiletime_assert'
   510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
       |         ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
    39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
       |                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
       |         ^~~~~~~~~~~~~~~~
./include/linux/mm.h:2926:9: note: in expansion of macro 'BUILD_BUG_ON'
  2926 |         BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
       |         ^~~~~~~~~~~~
In function 'pte_lockptr',
     inlined from '__pte_offset_map_lock' at mm/pgtable-generic.c:374:8:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
[... same macro-expansion notes as above ...]


pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
end up calling pagetable_alloc(, 0).

And fragments are supposed to be <= a single page.

Now I'm confused what's wrong here ... am I missing something obvious?

CCing some powerpc folks. Is this some pte_t oddity?

But in mm_inc_nr_ptes/mm_dec_nr_ptes we use the exact same calculation :/
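
For reference, the accounting helper in question, from mainline
include/linux/mm.h (quoted from memory):

static inline void mm_inc_nr_ptes(struct mm_struct *mm)
{
	atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
}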
Michael Ellerman Aug. 1, 2024, 2:03 a.m. UTC | #3
David Hildenbrand <david@redhat.com> writes:
> On 31.07.24 16:54, Peter Xu wrote:
...
>> 
>> [...]
>
> I threw
> 	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> into pte_lockptr() and did a bunch of cross-compiles.
>
> And for some reason it blows up for powernv (powernv_defconfig) and
> pseries (pseries_defconfig).
>
>
> In function 'pte_lockptr',
>      inlined from 'pte_offset_map_nolock' at mm/pgtable-generic.c:316:11:
> ././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
> [...]
...
>
> pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
> end up calling pagetable_alloc(, 0).
>
> And fragments are supposed to be <= a single page.
>
> Now I'm confused what's wrong here ... am I missing something obvious?
>
> CCing some powerpc folks. Is this some pte_t oddity?

It will be because PTRS_PER_PTE is not a compile time constant :(

  $ git grep "define PTRS_PER_PTE" arch/powerpc/include/asm/book3s/64
  arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTRS_PER_PTE        (1 << PTE_INDEX_SIZE)
  
  $ git grep "define PTE_INDEX_SIZE" arch/powerpc/include/asm/book3s/64
  arch/powerpc/include/asm/book3s/64/pgtable.h:#define PTE_INDEX_SIZE  __pte_index_size
  
  $ git grep __pte_index_size arch/powerpc/mm/pgtable_64.c
  arch/powerpc/mm/pgtable_64.c:unsigned long __pte_index_size;

Which is because the pseries/powernv (book3s64) kernel supports either
the HPT or Radix MMU at runtime, and they have different page table
geometry.

If you change it to use MAX_PTRS_PER_PTE it should work (that's defined
for all arches).

cheers


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 381750f41767..1fd9c296c0b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2924,6 +2924,8 @@ static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pte_t *pte)
 {
        /* PTE page tables don't currently exceed a single page. */
+       BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
+
        return ptlock_ptr(virt_to_ptdesc(pte));
 }
David Hildenbrand Aug. 1, 2024, 7:35 a.m. UTC | #4
>> pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
>> end up calling pagetable_alloc(, 0).
>>
>> And fragments are supposed to be <= a single page.
>>
>> Now I'm confused what's wrong here ... am I missing something obvious?
>>
>> CCing some powerpc folks. Is this some pte_t oddity?
> 
> It will be because PTRS_PER_PTE is not a compile time constant :(

Oh, sneaky! Thanks a bunch!

> 
>    [...]
> 
> Which is because the pseries/powernv (book3s64) kernel supports either
> the HPT or Radix MMU at runtime, and they have different page table
> geometry.
> 
> If you change it to use MAX_PTRS_PER_PTE it should work (that's defined
> for all arches).

Yes, that should fly, thanks again!
David Hildenbrand Aug. 1, 2024, 8:50 a.m. UTC | #5
On 31.07.24 14:21, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table locking
> for hugetlb folios that are not mapped using a single PMD/PUD.

James, Peter,

the following seems to get the job done. Thoughts?

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e462205400d..776dc3914d9e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -938,10 +938,40 @@ static inline bool htlb_allow_alloc_fallback(int reason)
  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
  					   struct mm_struct *mm, pte_t *pte)
  {
-	if (huge_page_size(h) == PMD_SIZE)
+	unsigned long size = huge_page_size(h);
+
+	VM_WARN_ON(size == PAGE_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 *
+	 * Note that with e.g., CONFIG_PGTABLE_LEVELS=2 where
+	 * PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE, we'd use the MM PT lock
+	 * directly with a PMD hugetlb size, whereby core-mm would call
+	 * pmd_lockptr() instead. However, in such configurations split PMD
+	 * locks are disabled -- split locks don't make sense on a single
+	 * PGDIR page table -- and the end result is the same.
+	 */
+	if (size >= P4D_SIZE)
+		return &mm->page_table_lock;
+	else if (size >= PUD_SIZE)
+		return pud_lockptr(mm, (pud_t *) pte);
+	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
  		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
-	return &mm->page_table_lock;
+	/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
+	return ptep_lockptr(mm, pte);
  }
  
  #ifndef hugepages_supported
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a890a1731c14..bd219ac9c026 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2869,6 +2869,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
  }
  
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
+	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
+	return ptlock_ptr(virt_to_ptdesc(pte));
+}
+
  static inline bool ptlock_init(struct ptdesc *ptdesc)
  {
  	/*
@@ -2893,6 +2900,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
  {
  	return &mm->page_table_lock;
  }
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return &mm->page_table_lock;
+}
  static inline void ptlock_cache_init(void) {}
  static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
  static inline void ptlock_free(struct ptdesc *ptdesc) {}
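
For context, the (unchanged) wrapper that consumes this lockptr, as in
mainline include/linux/hugetlb.h (quoted from memory):

static inline spinlock_t *huge_pte_lock(struct hstate *h,
					struct mm_struct *mm, pte_t *pte)
{
	spinlock_t *ptl;

	ptl = huge_pte_lockptr(h, mm, pte);
	spin_lock(ptl);
	return ptl;
}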
Peter Xu Aug. 1, 2024, 1:52 p.m. UTC | #6
On Thu, Aug 01, 2024 at 10:50:18AM +0200, David Hildenbrand wrote:
> On 31.07.24 14:21, David Hildenbrand wrote:
> > We recently made GUP's common page table walking code also walk hugetlb
> > VMAs without most hugetlb special-casing, preparing for the future of
> > having less hugetlb-specific page table walking code in the codebase.
> > Turns out that we missed one page table locking detail: page table locking
> > for hugetlb folios that are not mapped using a single PMD/PUD.
> 
> James, Peter,
> 
> the following seems to get the job done. Thoughts?

OK to me, so my A-b can stay, but let me still comment; again, all
nitpicks.

> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e462205400d..776dc3914d9e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -938,10 +938,40 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  					   struct mm_struct *mm, pte_t *pte)
>  {
> -	if (huge_page_size(h) == PMD_SIZE)
> +	unsigned long size = huge_page_size(h);
> +
> +	VM_WARN_ON(size == PAGE_SIZE);
> +
> +	/*
> +	 * hugetlb must use the exact same PT locks as core-mm page table
> +	 * walkers would. When modifying a PTE table, hugetlb must take the
> +	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
> +	 * PT lock etc.
> +	 *
> +	 * The expectation is that any hugetlb folio smaller than a PMD is
> +	 * always mapped into a single PTE table and that any hugetlb folio
> +	 * smaller than a PUD (but at least as big as a PMD) is always mapped
> +	 * into a single PMD table.
> +	 *
> +	 * If that does not hold for an architecture, then that architecture
> +	 * must disable split PT locks such that all *_lockptr() functions
> +	 * will give us the same result: the per-MM PT lock.
> +	 *
> +	 * Note that with e.g., CONFIG_PGTABLE_LEVELS=2 where
> +	 * PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE, we'd use the MM PT lock
> +	 * directly with a PMD hugetlb size, whereby core-mm would call
> +	 * pmd_lockptr() instead. However, in such configurations split PMD
> +	 * locks are disabled -- split locks don't make sense on a single
> +	 * PGDIR page table -- and the end result is the same.
> +	 */
> +	if (size >= P4D_SIZE)
> +		return &mm->page_table_lock;

I'd drop this so the mm lock fallback will be done below (especially since in
reality the pud lock is always the mm lock for now..).  Also this line reads
like there can be a P4D-sized huge page but in reality PUD is the largest
(nopxx doesn't count).  We also save some cycles in most cases if removed.

> +	else if (size >= PUD_SIZE)
> +		return pud_lockptr(mm, (pud_t *) pte);
> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))

I thought this HIGHPTE check can also be dropped? Because with HIGHPTE we
should never have lower-than-PMD huge pages or we're in trouble.  That's why I
kept one WARN_ON() in my pseudo code, but only before trying to take the pte
lockptr.

>  		return pmd_lockptr(mm, (pmd_t *) pte);
> -	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> -	return &mm->page_table_lock;
> +	/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
> +	return ptep_lockptr(mm, pte);
>  }
>  #ifndef hugepages_supported
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a890a1731c14..bd219ac9c026 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2869,6 +2869,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>  }
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> +	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> +	return ptlock_ptr(virt_to_ptdesc(pte));
> +}

Great to know we can drop the mask..

Thanks,

David Hildenbrand Aug. 1, 2024, 3:35 p.m. UTC | #7
Hi Peter,

>> [...]
>> +	if (size >= P4D_SIZE)
>> +		return &mm->page_table_lock;
> 
> I'd drop this so the mm lock fallback will be done below (especially since in
> reality the pud lock is always the mm lock for now..).  Also this line reads
> like there can be a P4D-sized huge page but in reality PUD is the largest
> (nopxx doesn't count).  We also save some cycles in most cases if removed.

The compiler should be smart enough to optimize most of that out. But
I agree that there is no need to be that future-proof here.

These two are interesting:

arch/powerpc/mm/hugetlbpage.c:  if (!mm_pud_folded(mm) && sz >= P4D_SIZE)
arch/riscv/mm/hugetlbpage.c:    else if (sz >= P4D_SIZE)

But I assume they are only forward-looking. (GUP would already be
pretty broken if that existed)


>> +	else if (size >= PUD_SIZE)
>> +		return pud_lockptr(mm, (pud_t *) pte);
>> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
> 
> I thought this HIGHPTE check can also be dropped? Because with HIGHPTE we
> should never have lower-than-PMD huge pages or we're in trouble.  That's why I
> kept one WARN_ON() in my pseudo code, but only before trying to take the pte
> lockptr.

Then the compiler won't optimize out the ptep_lockptr() call and we'll run
into a build error. And I think the HIGHPTE build error serves a good purpose.

In file included from <command-line>:
In function 'ptep_lockptr',
     inlined from 'huge_pte_lockptr' at ./include/linux/hugetlb.h:974:9,
     inlined from 'huge_pte_lock' at ./include/linux/hugetlb.h:1248:8,
     inlined from 'pagemap_scan_hugetlb_entry' at fs/proc/task_mmu.c:2581:8:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_256' declared with attribute error: BUILD_BUG_ON failed: IS_ENABLED(CONFIG_HIGHPTE)
   510 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
       |                                             ^
[... same macro-expansion notes as above ...]
./include/linux/mm.h:2874:9: note: in expansion of macro 'BUILD_BUG_ON'
  2874 |         BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));



It would be even better to have a way to know whether an arch even has
hugetlb < PMD_SIZE (config option?) and use that instead here; but I'll leave
that as an optimization.
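
A purely hypothetical sketch of that idea -- CONFIG_ARCH_WANTS_SUB_PMD_HUGETLB
below is an invented placeholder, not an existing Kconfig symbol:

	/* Hypothetical: archs without sub-PMD hugetlb compile this out. */
	if (!IS_ENABLED(CONFIG_ARCH_WANTS_SUB_PMD_HUGETLB))
		return pmd_lockptr(mm, (pmd_t *) pte);
	return ptep_lockptr(mm, pte);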
Peter Xu Aug. 1, 2024, 4:07 p.m. UTC | #8
On Thu, Aug 01, 2024 at 05:35:20PM +0200, David Hildenbrand wrote:
> Hi Peter,

[...]

> > > +	else if (size >= PUD_SIZE)
> > > +		return pud_lockptr(mm, (pud_t *) pte);
> > > +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
> > 
> > I thought this HIGHPTE check can also be dropped? Because with HIGHPTE we
> > should never have lower-than-PMD huge pages or we're in trouble.  That's why I
> > kept one WARN_ON() in my pseudo code, but only before trying to take the pte
> > lockptr.
> 
> Then the compiler won't optimize out the ptep_lockptr() call and we'll run
> into a build error. And I think the HIGHPTE build error serves a good purpose.
> 
> In file included from <command-line>:
> In function 'ptep_lockptr',
>     inlined from 'huge_pte_lockptr' at ./include/linux/hugetlb.h:974:9,
>     inlined from 'huge_pte_lock' at ./include/linux/hugetlb.h:1248:8,
>     inlined from 'pagemap_scan_hugetlb_entry' at fs/proc/task_mmu.c:2581:8:
> ././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_256' declared with attribute error: BUILD_BUG_ON failed: IS_ENABLED(CONFIG_HIGHPTE)
> [...]

Ahh.. this is in the "ifdef USE_SPLIT_PTE_PTLOCKS" section.  I'm thinking maybe
we should drop this BUILD_BUG_ON - it says "HIGHPTE shouldn't co-exist with
USE_SPLIT_PTE_PTLOCKS", but I think it can?

That said, I think I can also understand your point, where you see
ptep_lockptr() as a hugetlb-only function; in that case the BUILD_BUG_ON
would make sense in the hugetlb world.

So.. per my previous nitpick suggestion, IIUC we'll need to drop this
BUILD_BUG_ON, just to say "USE_SPLIT_PTE_PTLOCKS can work with HIGHPTE",
which is perhaps slightly more readable; we'll rely on the WARN_ON to
ensure HIGHPTE won't use the pte lock.

Either way works for me, thanks!
David Hildenbrand Aug. 1, 2024, 4:24 p.m. UTC | #9
On 01.08.24 18:07, Peter Xu wrote:
> On Thu, Aug 01, 2024 at 05:35:20PM +0200, David Hildenbrand wrote:
>> Hi Peter,
> 
> [...]
> 
> 
> Ahh.. this is in the "ifdef USE_SPLIT_PTE_PTLOCKS" section.  I'm thinking maybe
> we should drop this BUILD_BUG_ON - it says "HIGHPTE shouldn't co-exist with
> USE_SPLIT_PTE_PTLOCKS", but I think it can?
> 
> That said, I think I can also understand your point, where you see
> ptep_lockptr() as a hugetlb-only function; in that case the BUILD_BUG_ON
> would make sense in the hugetlb world.
> 
> So.. per my previous nitpick suggestion, IIUC we'll need to drop this
> BUILD_BUG_ON, just to say "USE_SPLIT_PTE_PTLOCKS can work with HIGHPTE",
> which is perhaps slightly more readable; we'll rely on the WARN_ON to
> ensure HIGHPTE won't use the pte lock.

I really don't want to drop the BUILD_BUG_ON. The function cannot
possibly work with HIGHPTE, especially once it gets used in another
context by accident.

So I'll leave it like that. Feel free to optimize the hugetlb code 
further once the fix has landed (e.g., really optimize it out if we 
cannot possibly have such hugetlb sizes).

Thanks!
Christophe Leroy Aug. 14, 2024, 5:25 p.m. UTC | #10
On 31/07/2024 at 18:33, David Hildenbrand wrote:
> On 31.07.24 16:54, Peter Xu wrote:
>> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
>>> We recently made GUP's common page table walking code also walk hugetlb
>>> VMAs without most hugetlb special-casing, preparing for the future of
>>> having less hugetlb-specific page table walking code in the codebase.
>>>
>>> [...]
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> Nitpick: I wonder whether some of the lines can be simplified if we write
>> it downwards from PUD, like,
>>
>> huge_pte_lockptr()
>> {
>>          if (size >= PUD_SIZE)
>>             return pud_lockptr(...);
>>          if (size >= PMD_SIZE)
>>             return pmd_lockptr(...);
>>          /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
>>          WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
>>          return ptep_lockptr(...);
>> }
> 
> Let me think about it. For PUD_SIZE == PMD_SIZE, we'd call pud_lockptr()
> where core-mm would call pmd_lockptr().

I guess that is only the case when including asm-generic/pgtable-nopmd.h.

Otherwise you should have more than one entry in the PMD table, so
PMD_SIZE would always be smaller than PUD_SIZE, wouldn't it?

So maybe some simplification could be done, like making pud_lock() a
no-op in that case?
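
For reference, the fold that makes PMD_SIZE == PUD_SIZE, from
include/asm-generic/pgtable-nopmd.h (quoted from memory):

#define PMD_SHIFT	PUD_SHIFT
#define PTRS_PER_PMD	1
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE-1))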


Christophe

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a01..e6437a06e2346 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -944,9 +944,32 @@  static inline bool htlb_allow_alloc_fallback(int reason)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	unsigned long size = huge_page_size(h);
+
+	VM_WARN_ON(size == PAGE_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 */
+	if (size < PMD_SIZE && !IS_ENABLED(CONFIG_HIGHPTE))
+		/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
+		return ptep_lockptr(mm, pte);
+	else if (size < PUD_SIZE || CONFIG_PGTABLE_LEVELS == 2)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
+	else if (size < P4D_SIZE || CONFIG_PGTABLE_LEVELS == 3)
+		return pud_lockptr(mm, (pud_t *) pte);
 	return &mm->page_table_lock;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b100df8cb5857..f6c7fe8f5746f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2926,6 +2926,24 @@  static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
+static inline struct page *ptep_pgtable_page(pte_t *pte)
+{
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
+	return virt_to_page((void *)((unsigned long)pte & mask));
+}
+
+static inline struct ptdesc *ptep_ptdesc(pte_t *pte)
+{
+	return page_ptdesc(ptep_pgtable_page(pte));
+}
+
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return ptlock_ptr(ptep_ptdesc(pte));
+}
+
 static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
 	/*
@@ -2950,6 +2968,10 @@  static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
 	return &mm->page_table_lock;
 }
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return &mm->page_table_lock;
+}
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}