[v4,00/13] mm/gup: Unify hugetlb, part 2

Message ID 20240327152332.950956-1-peterx@redhat.com (mailing list archive)

Message

Peter Xu March 27, 2024, 3:23 p.m. UTC
From: Peter Xu <peterx@redhat.com>

v4:
- Fix build issues, tested on more archs/configs ([x86_64, i386, arm, arm64,
  powerpc, riscv, s390] x [allno, alldef, allmod]).
  - Squashed the fixup series into v3, touched up commit messages [1]
  - Added the patch to fix pud_pfn() into the series [2]
  - Fixed one more build issue on arm+alldefconfig, where pgd_t is a
    two-item array.
- Manage R-bs: add some, remove some (due to the squashes above)
- Rebase to latest mm-unstable (2f6182cd23a7, March 26th)

rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
v3:  https://lore.kernel.org/r/20240321220802.679544-1-peterx@redhat.com

The series removes the hugetlb slow gup path, building on earlier refactoring
work [1], so that slow gup now uses exactly the same path to process all
kinds of memory, including hugetlb.

In the long term, we may want to remove most, if not all, call sites of
huge_pte_offset().  Ideally that API can be dropped completely from the
arch hugetlb API.  This series is one small step towards merging
hugetlb-specific code into generic mm paths.  From that POV, this series
removes one reference to huge_pte_offset() out of many.

One goal of such a route is that we can reconsider merging hugetlb features
like High Granularity Mapping (HGM).  It was not accepted in the past
because it would add lots of hugetlb-specific code and make the mm code even
harder to maintain.  With a merged codeset, features like HGM can hopefully
share some code with THP, legacy (PMD+) or modern (contiguous PTEs).

To make it work, the generic slow gup code needs to at least understand
hugepd, which fast-gup already does.  Because hugepd is a software-only
solution (no hardware recognizes the hugepd format, so these are purely
artificial structures), there's a chance we can merge some or all hugepd
formats with cont_pte in the future.  That question is still unsettled,
pending an acknowledgement from the Power side.  As of now for this series,
I kept the hugepd handling because we may still need it until we get a
clearer picture of the future of hugepd.  The other reason is simply that
we already did it for fast-gup and most of the code is still around to be
reused.  It makes more sense to keep slow/fast gup behaving the same until
a decision is made to remove hugepd.

There's one major difference for slow-gup in cont_pte / cont_pmd handling,
currently supported on three architectures (aarch64, riscv, ppc).  Before
the series, slow gup was able to recognize e.g. cont_pte entries with the
help of huge_pte_offset() when an hstate was around.  That help is now gone,
but the lookup still works, by walking pgtable entries one by one.

It's not ideal, but hopefully this change will not affect major workloads
yet.  There's some more information in the commit message of the last
patch.  If this becomes a concern, we can consider teaching slow gup to
recognize cont pte/pmd entries, which should recover the lost performance.
But I doubt it's necessary for now, so I kept it as simple as it can be.

Test Done
=========

For x86_64, tested full gup_test matrix over 2MB huge pages. For aarch64,
tested the same over 64KB cont_pte huge pages.

One note is that this version didn't go through any ppc test, as finding
such a system always takes time.  That rests on the fact that it was
tested in previous versions, and this version should have zero changes
in the hugepd sections.

If anyone (Christophe?) wants to give it a shot on PowerPC, please do and I
would appreciate it: "./run_vmtests.sh -a -t gup_test" should do well
enough (please consider [2] applied if hugepd is <1MB), as long as we're
sure the hugepd pages are touched as expected.

Patch layout
=============

Patch 1-9:    Preparation work, or cleanups in relevant code paths
Patch 10-12:  Teach slow gup about all kinds of huge entries (pXd, hugepd)
Patch 13:     Drop hugetlb_follow_page_mask()

More information can be found in the commit message of each patch.  Any
comments are welcome.  Thanks.

[1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
[2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com

Peter Xu (13):
  mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
  mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
  mm: Make HPAGE_PXD_* macros even if !THP
  mm: Introduce vma_pgtable_walk_{begin|end}()
  mm/arch: Provide pud_pfn() fallback
  mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  mm/gup: Refactor record_subpages() to find 1st small page
  mm/gup: Handle hugetlb for no_page_table()
  mm/gup: Cache *pudp in follow_pud_mask()
  mm/gup: Handle huge pud for follow_pud_mask()
  mm/gup: Handle huge pmd for follow_pmd_mask()
  mm/gup: Handle hugepd for follow_page()
  mm/gup: Handle hugetlb in the generic follow_page_mask code

 arch/riscv/include/asm/pgtable.h    |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/include/asm/pgtable.h      |   1 +
 include/linux/huge_mm.h             |  37 +-
 include/linux/hugetlb.h             |  16 +-
 include/linux/mm.h                  |   3 +
 include/linux/pgtable.h             |  10 +
 mm/Kconfig                          |   6 +
 mm/gup.c                            | 518 ++++++++++++++++++++--------
 mm/huge_memory.c                    | 133 +------
 mm/hugetlb.c                        |  75 +---
 mm/internal.h                       |   7 +-
 mm/memory.c                         |  12 +
 14 files changed, 441 insertions(+), 380 deletions(-)

Comments

Ryan Roberts April 2, 2024, 2:48 p.m. UTC | #1
Hi Peter,

On 27/03/2024 15:23, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Now follow_page() is ready to handle hugetlb pages in whatever form, and
> over all architectures.  Switch to the generic code path.
> 
> Time to retire hugetlb_follow_page_mask(), following the previous
> retirement of follow_hugetlb_page() in 4849807114b8.
> 
> There may be a slight difference in how the loops run when processing slow
> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> loop of __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of the hugetlb hstate, which may
> cover multiple entries in one loop.
> 
> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
> degradation over a tight loop of slow gup after the path switch.  That
> shouldn't be a problem because slow-gup should not be a hot path for GUP in
> general: when the page is commonly present, fast-gup will already succeed,
> while when the page is indeed missing and requires a follow-up page fault,
> the slow gup degradation will probably be buried in the fault paths anyway.
> It also explains why slow gup for THP used to be very slow before
> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
> landed, the latter being not part of a performance analysis but a side
> benefit.  If performance becomes a concern, we can consider handling
> CONT_PTE in follow_page().
> 
> Until that is shown to be necessary, keep everything clean and simple.
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:


[    9.340416] kernel BUG at mm/gup.c:778!
[    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[    9.341199] Modules linked in:
[    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
[    9.342232] Hardware name: linux,dummy-virt (DT)
[    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.343195] pc : follow_page_mask+0x4d4/0x880
[    9.343580] lr : follow_page_mask+0x4d4/0x880
[    9.344018] sp : ffff8000898b3aa0
[    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
[    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
[    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
[    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
[    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
[    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
[    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
[    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
[    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
[    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
[    9.351097] Call trace:
[    9.351312]  follow_page_mask+0x4d4/0x880
[    9.351700]  __get_user_pages+0xf4/0x3e8
[    9.352089]  __gup_longterm_locked+0x204/0xa70
[    9.352516]  pin_user_pages+0x88/0xc0
[    9.352873]  gup_test_ioctl+0x860/0xc40
[    9.353249]  __arm64_sys_ioctl+0xb0/0x100
[    9.353648]  invoke_syscall+0x50/0x128
[    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
[    9.354488]  do_el0_svc+0x28/0x40
[    9.354822]  el0_svc+0x34/0xe0
[    9.355128]  el0t_64_sync_handler+0x13c/0x158
[    9.355489]  el0t_64_sync+0x190/0x198
[    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000) 
[    9.356280] ---[ end trace 0000000000000000 ]---
[    9.356651] note: gup_longterm[1159] exited with irqs disabled
[    9.357141] note: gup_longterm[1159] exited with preempt_count 2
[    9.358033] ------------[ cut here ]------------
[    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
[    9.360157] Modules linked in:
[    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
[    9.361626] Hardware name: linux,dummy-virt (DT)
[    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
[    9.363306] lr : ct_idle_enter+0x10/0x20
[    9.363845] sp : ffff8000801abdc0
[    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
[    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
[    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
[    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
[    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
[    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
[    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
[    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
[    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
[    9.372279] Call trace:
[    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
[    9.373216]  ct_idle_enter+0x10/0x20
[    9.373562]  default_idle_call+0x3c/0x160
[    9.374055]  do_idle+0x21c/0x280
[    9.374394]  cpu_startup_entry+0x3c/0x50
[    9.374797]  secondary_start_kernel+0x140/0x168
[    9.375220]  __secondary_switched+0xb8/0xc0
[    9.375875] ---[ end trace 0000000000000000 ]---


The oops trigger is at mm/gup.c:778:
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);


This is the output of gup_longterm (last output is just before oops):

# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 32768 KiB
# [INFO] detected hugetlb page size: 64 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
TAP version 13
1..70
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 1 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 2 Should have failed
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 3 Should have failed
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 4 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)


So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?


I'm running with defconfig plus these:

./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS
./scripts/config --enable CONFIG_FTRACE
./scripts/config --enable CONFIG_FUNCTION_TRACER
./scripts/config --enable CONFIG_KPROBES
./scripts/config --enable CONFIG_HIST_TRIGGERS
./scripts/config --enable CONFIG_FTRACE_SYSCALLS
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
./scripts/config --enable CONFIG_USERFAULTFD
./scripts/config --enable CONFIG_TEST_VMALLOC
./scripts/config --enable CONFIG_GUP_TEST


Thanks,
Ryan




> ---
>  include/linux/hugetlb.h |  7 ----
>  mm/gup.c                | 15 +++------
>  mm/hugetlb.c            | 71 -----------------------------------------
>  3 files changed, 5 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 294c78b3549f..a546140f89cd 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
>  {
>  }
>  
> -static inline struct page *hugetlb_follow_page_mask(
> -    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> -    unsigned int *page_mask)
> -{
> -	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> -}
> -
>  static inline int copy_hugetlb_page_range(struct mm_struct *dst,
>  					  struct mm_struct *src,
>  					  struct vm_area_struct *dst_vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index a02463c9420e..c803d0b0f358 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1135,18 +1135,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  {
>  	pgd_t *pgd;
>  	struct mm_struct *mm = vma->vm_mm;
> +	struct page *page;
>  
> -	ctx->page_mask = 0;
> -
> -	/*
> -	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
> -	 * special hugetlb page table walking code.  This eliminates the
> -	 * need to check for hugetlb entries in the general walking code.
> -	 */
> -	if (is_vm_hugetlb_page(vma))
> -		return hugetlb_follow_page_mask(vma, address, flags,
> -						&ctx->page_mask);
> +	vma_pgtable_walk_begin(vma);
>  
> +	ctx->page_mask = 0;
>  	pgd = pgd_offset(mm, address);
>  
>  	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
> @@ -1157,6 +1150,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  	else
>  		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
>  
> +	vma_pgtable_walk_end(vma);
> +
>  	return page;
>  }
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 65b9c9a48fd2..cc79891a3597 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6870,77 +6870,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  }
>  #endif /* CONFIG_USERFAULTFD */
>  
> -struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				      unsigned long address, unsigned int flags,
> -				      unsigned int *page_mask)
> -{
> -	struct hstate *h = hstate_vma(vma);
> -	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long haddr = address & huge_page_mask(h);
> -	struct page *page = NULL;
> -	spinlock_t *ptl;
> -	pte_t *pte, entry;
> -	int ret;
> -
> -	hugetlb_vma_lock_read(vma);
> -	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> -	if (!pte)
> -		goto out_unlock;
> -
> -	ptl = huge_pte_lock(h, mm, pte);
> -	entry = huge_ptep_get(pte);
> -	if (pte_present(entry)) {
> -		page = pte_page(entry);
> -
> -		if (!huge_pte_write(entry)) {
> -			if (flags & FOLL_WRITE) {
> -				page = NULL;
> -				goto out;
> -			}
> -
> -			if (gup_must_unshare(vma, flags, page)) {
> -				/* Tell the caller to do unsharing */
> -				page = ERR_PTR(-EMLINK);
> -				goto out;
> -			}
> -		}
> -
> -		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
> -
> -		/*
> -		 * Note that page may be a sub-page, and with vmemmap
> -		 * optimizations the page struct may be read only.
> -		 * try_grab_page() will increase the ref count on the
> -		 * head page, so this will be OK.
> -		 *
> -		 * try_grab_page() should always be able to get the page here,
> -		 * because we hold the ptl lock and have verified pte_present().
> -		 */
> -		ret = try_grab_page(page, flags);
> -
> -		if (WARN_ON_ONCE(ret)) {
> -			page = ERR_PTR(ret);
> -			goto out;
> -		}
> -
> -		*page_mask = (1U << huge_page_order(h)) - 1;
> -	}
> -out:
> -	spin_unlock(ptl);
> -out_unlock:
> -	hugetlb_vma_unlock_read(vma);
> -
> -	/*
> -	 * Fixup retval for dump requests: if pagecache doesn't exist,
> -	 * don't try to allocate a new page but just skip it.
> -	 */
> -	if (!page && (flags & FOLL_DUMP) &&
> -	    !hugetlbfs_pagecache_present(h, vma, address))
> -		page = ERR_PTR(-EFAULT);
> -
> -	return page;
> -}
> -
>  long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)
David Hildenbrand April 2, 2024, 3:26 p.m. UTC | #2
On 02.04.24 16:48, Ryan Roberts wrote:
> Hi Peter,
> 
> On 27/03/2024 15:23, peterx@redhat.com wrote:
>> From: Peter Xu <peterx@redhat.com>
>>
>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>> over all architectures.  Switch to the generic code path.
>>
>> Time to retire hugetlb_follow_page_mask(), following the previous
>> retirement of follow_hugetlb_page() in 4849807114b8.
>>
>> There may be a slight difference in how the loops run when processing slow
>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>> applied, rather than relying on the size of the hugetlb hstate, which may
>> cover multiple entries in one loop.
>>
>> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
>> degradation over a tight loop of slow gup after the path switch.  That
>> shouldn't be a problem because slow-gup should not be a hot path for GUP in
>> general: when the page is commonly present, fast-gup will already succeed,
>> while when the page is indeed missing and requires a follow-up page fault,
>> the slow gup degradation will probably be buried in the fault paths anyway.
>> It also explains why slow gup for THP used to be very slow before
>> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
>> landed, the latter being not part of a performance analysis but a side
>> benefit.  If performance becomes a concern, we can consider handling
>> CONT_PTE in follow_page().
>>
>> Until that is shown to be necessary, keep everything clean and simple.
>>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
> 
> 
> [    9.340416] kernel BUG at mm/gup.c:778!
> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.341199] Modules linked in:
> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> [    9.342232] Hardware name: linux,dummy-virt (DT)
> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.343195] pc : follow_page_mask+0x4d4/0x880
> [    9.343580] lr : follow_page_mask+0x4d4/0x880
> [    9.344018] sp : ffff8000898b3aa0
> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> [    9.351097] Call trace:
> [    9.351312]  follow_page_mask+0x4d4/0x880
> [    9.351700]  __get_user_pages+0xf4/0x3e8
> [    9.352089]  __gup_longterm_locked+0x204/0xa70
> [    9.352516]  pin_user_pages+0x88/0xc0
> [    9.352873]  gup_test_ioctl+0x860/0xc40
> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> [    9.353648]  invoke_syscall+0x50/0x128
> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.354488]  do_el0_svc+0x28/0x40
> [    9.354822]  el0_svc+0x34/0xe0
> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> [    9.355489]  el0t_64_sync+0x190/0x198
> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> [    9.356280] ---[ end trace 0000000000000000 ]---
> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> [    9.358033] ------------[ cut here ]------------
> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.360157] Modules linked in:
> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> [    9.361626] Hardware name: linux,dummy-virt (DT)
> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.363306] lr : ct_idle_enter+0x10/0x20
> [    9.363845] sp : ffff8000801abdc0
> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> [    9.372279] Call trace:
> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.373216]  ct_idle_enter+0x10/0x20
> [    9.373562]  default_idle_call+0x3c/0x160
> [    9.374055]  do_idle+0x21c/0x280
> [    9.374394]  cpu_startup_entry+0x3c/0x50
> [    9.374797]  secondary_start_kernel+0x140/0x168
> [    9.375220]  __secondary_switched+0xb8/0xc0
> [    9.375875] ---[ end trace 0000000000000000 ]---
> 
> 
> The oops trigger is at mm/gup.c:778:
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> 
> This is the output of gup_longterm (last output is just before oops):
> 
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 32768 KiB
> # [INFO] detected hugetlb page size: 64 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> TAP version 13
> 1..70
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> ok 1 Should have worked
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> ok 2 Should have failed
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> ok 3 Should have failed
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 4 Should have worked
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> 
> 
> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?

I assume we find the expected tail page, it's just that the check

VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

Doesn't make sense with hugetlb folios. We might have a tail page mapped 
in a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the 
first cont-pmd entry", we trigger this check.

Likely this sanity check must also allow for hugetlb folios. Or we 
should just remove it completely.

In the past, we wanted to make sure that we never get tail pages of THP 
from PMD entries, because something would currently be broken (we don't 
support THP > PMD).
Matthew Wilcox April 2, 2024, 4 p.m. UTC | #3
On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> > The oops trigger is at mm/gup.c:778:
> > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > 
> > So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> 
> I assume we find the expected tail page, it's just that the check
> 
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> cont-pmd entry", we trigger this check.
> 
> Likely this sanity check must also allow for hugetlb folios. Or we should
> just remove it completely.
> 
> In the past, we wanted to make sure that we never get tail pages of THP from
> PMD entries, because something would currently be broken (we don't support
> THP > PMD).

That was a practical limitation on my part.  We have various parts of
the MM which assume that pmd_page() returns a head page and until we
get all of those fixed, adding support for folios larger than PMD_SIZE
was only going to cause trouble for no significant wins.

I agree with you we should get rid of this assertion entirely.  We should
fix all the places which assume that pmd_page() returns a head page,
but that may take some time.

As an example, filemap_map_pmd() has:

        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
                struct page *page = folio_file_page(folio, start);
                vm_fault_t ret = do_set_pmd(vmf, page);

and then do_set_pmd() has:

        if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
                return ret;

so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
There's a lot of work to be done to make this work generally (not to
mention figuring out how to handle mapcount for such folios ;-).

This particular case seems straightforward though.  Just remove the
assertion.
Ryan Roberts April 2, 2024, 4:18 p.m. UTC | #4
On 02/04/2024 17:00, Matthew Wilcox wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> That was a practical limitation on my part.  We have various parts of
> the MM which assume that pmd_page() returns a head page and until we
> get all of those fixed, adding support for folios larger than PMD_SIZE
> was only going to cause trouble for no significant wins.
> 
> I agree with you we should get rid of this assertion entirely.  We should
> fix all the places which assume that pmd_page() returns a head page,
> but that may take some time.
> 
> As an example, filemap_map_pmd() has:
> 
>        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
>                 struct page *page = folio_file_page(folio, start);
>                 vm_fault_t ret = do_set_pmd(vmf, page);
> 
> and then do_set_pmd() has:
> 
>         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
>                 return ret;
> 
> so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> There's a lot of work to be done to make this work generally (not to
> mention figuring out how to handle mapcount for such folios ;-).
> 
> This particular case seems straightforward though.  Just remove the
> assertion.

Removing the assertion gets me further, but then I end up with this:

[    9.748422] kernel BUG at include/linux/page-flags.h:1098!
[    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[    9.749590] Modules linked in:
[    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
[    9.750682] Hardware name: linux,dummy-virt (DT)
[    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.751729] pc : follow_page_mask+0x730/0x850
[    9.752152] lr : follow_page_mask+0x730/0x850
[    9.752573] sp : ffff8000898f3aa0
[    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
[    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
[    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
[    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
[    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
[    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
[    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
[    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
[    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
[    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
[    9.759663] Call trace:
[    9.759901]  follow_page_mask+0x730/0x850
[    9.760293]  __get_user_pages+0xf4/0x3e8
[    9.760683]  __gup_longterm_locked+0x204/0xa70
[    9.761110]  pin_user_pages+0x88/0xc0
[    9.761486]  gup_test_ioctl+0x860/0xc40
[    9.761866]  __arm64_sys_ioctl+0xb0/0x100
[    9.762254]  invoke_syscall+0x50/0x128
[    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
[    9.763104]  do_el0_svc+0x28/0x40
[    9.763413]  el0_svc+0x34/0xe0
[    9.763699]  el0t_64_sync_handler+0x13c/0x158
[    9.764139]  el0t_64_sync+0x190/0x198
[    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
[    9.765053] ---[ end trace 0000000000000000 ]---
[    9.765520] note: gup_longterm[1155] exited with irqs disabled
[    9.766146] note: gup_longterm[1155] exited with preempt_count 2
[    9.767366] ------------[ cut here ]------------
[    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
[    9.769146] Modules linked in:
[    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
[    9.770338] Hardware name: linux,dummy-virt (DT)
[    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
[    9.772150] lr : ct_idle_enter+0x10/0x20
[    9.772539] sp : ffff8000801b3dc0
[    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
[    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
[    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
[    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
[    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
[    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
[    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
[    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
[    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
[    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
[    9.780703] Call trace:
[    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
[    9.781949]  ct_idle_enter+0x10/0x20
[    9.782246]  default_idle_call+0x3c/0x160
[    9.782624]  do_idle+0x21c/0x280
[    9.782945]  cpu_startup_entry+0x3c/0x50
[    9.783268]  secondary_start_kernel+0x140/0x168
[    9.783818]  __secondary_switched+0xb8/0xc0
[    9.784163] ---[ end trace 0000000000000000 ]---


Which is caused by this:

static __always_inline int PageAnonExclusive(const struct page *page)
{
	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
}

Which is called from can_follow_write_pmd(), called just after the assert I just commented out.


It's triggered by this test:

# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)

Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
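
For readers following along, here is a minimal userspace sketch of the cont-pmd geometry involved. It assumes arm64 with 4 KiB base pages, so that one PMD entry maps 2 MiB and a 32 MiB hugetlb page spans 16 contiguous PMD entries. All names are hypothetical and are not kernel APIs. It only computes which subpage each cont-pmd entry points at, to show that every entry except the first resolves to a tail page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry: arm64 with 4 KiB base pages (an assumption, not from the thread). */
#define TOY_PAGE_SIZE   (4UL << 10)
#define TOY_PMD_SIZE    (2UL << 20)
#define TOY_HUGE_SIZE   (32UL << 20)

/* Number of PMD entries that together map one 32 MiB hugetlb page. */
static inline size_t toy_cont_pmds(void)
{
	return TOY_HUGE_SIZE / TOY_PMD_SIZE;
}

/* Index (in base pages) of the subpage that cont-pmd entry 'idx' points at. */
static inline size_t toy_subpage_index(size_t idx)
{
	assert(idx < toy_cont_pmds());
	return idx * (TOY_PMD_SIZE / TOY_PAGE_SIZE);
}

/* Only subpage 0 is the head page; every other subpage is a tail page. */
static inline int toy_is_head(size_t idx)
{
	return toy_subpage_index(idx) == 0;
}
```

Under these assumptions, a walk that lands on any cont-pmd entry other than entry 0 sees a tail page, which would be consistent with the 2 MiB tests passing and the 32 MiB test tripping a head-page assertion.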


Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.

I don't think we only keep the PAE flag in the head page for hugetlb pages? So we can't just remove this assert?

I tried just commenting it out and then hit an assert further down, in follow_huge_pmd():

VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
			!PageAnonExclusive(page), page);
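
One possible direction, sketched in userspace as an assumption rather than a tested kernel change: if the anon-exclusive state of a hugetlb page were tracked on the head page only, a lookup on a tail page could redirect to the head before reading the flag. The struct and helpers below are a toy model with hypothetical names. They are not the kernel's struct page or its page-flag API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page descriptor; NOT the kernel's struct page. */
struct toy_page {
	struct toy_page *head;	/* points at itself for a head page */
	bool anon_exclusive;	/* only meaningful on the head in this model */
};

/* Initialise 'n' descriptors as one compound page with pages[0] as head. */
static inline void toy_init_compound(struct toy_page *pages, size_t n, bool excl)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		pages[i].head = &pages[0];
		pages[i].anon_exclusive = false;
	}
	pages[0].anon_exclusive = excl;
}

/* Strict variant: models a check that only accepts head pages. */
static inline bool toy_excl_strict_ok(const struct toy_page *page)
{
	return page->head == page;
}

/* Relaxed variant: redirect a tail to its head, then read the flag. */
static inline bool toy_anon_exclusive(const struct toy_page *page)
{
	return page->head->anon_exclusive;
}
```

In this model the strict check rejects every tail page, while the relaxed lookup gives the same answer for the head and for all of its tails. Whether that matches how hugetlb actually stores the flag is exactly the open question above.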

Thanks,
Ryan
Peter Xu April 2, 2024, 4:26 p.m. UTC | #5
On Tue, Apr 02, 2024 at 05:18:36PM +0100, Ryan Roberts wrote:
> On 02/04/2024 17:00, Matthew Wilcox wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> >>> The oops trigger is at mm/gup.c:778:
> >>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>>
> >>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> >>
> >> I assume we find the expected tail page, it's just that the check
> >>
> >> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>
> >> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> >> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> >> cont-pmd entry", we trigger this check.
> >>
> >> Likely this sanity check must also allow for hugetlb folios. Or we should
> >> just remove it completely.
> >>
> >> In the past, we wanted to make sure that we never get tail pages of THP from
> >> PMD entries, because something would currently be broken (we don't support
> >> THP > PMD).
> > 
> > That was a practical limitation on my part.  We have various parts of
> > the MM which assume that pmd_page() returns a head page and until we
> > get all of those fixed, adding support for folios larger than PMD_SIZE
> > was only going to cause trouble for no significant wins.
> > 
> > I agree with you we should get rid of this assertion entirely.  We should
> > fix all the places which assume that pmd_page() returns a head page,
> > but that may take some time.
> > 
> > As an example, filemap_map_pmd() has:
> > 
> >        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
> >                 struct page *page = folio_file_page(folio, start);
> >                 vm_fault_t ret = do_set_pmd(vmf, page);
> > 
> > and then do_set_pmd() has:
> > 
> >         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> >                 return ret;
> > 
> > so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> > There's a lot of work to be done to make this work generally (not to
> > mention figuring out how to handle mapcount for such folios ;-).

Hmm, I think it means there's more work than I was expecting... but that's
okay, let's move one step at a time.

> > 
> > This particular case seems straightforward though.  Just remove the
> > assertion.
> 
> Removing the assertion gets me further, but then I end up with this:
> 
> [    9.748422] kernel BUG at include/linux/page-flags.h:1098!
> [    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.749590] Modules linked in:
> [    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.750682] Hardware name: linux,dummy-virt (DT)
> [    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.751729] pc : follow_page_mask+0x730/0x850
> [    9.752152] lr : follow_page_mask+0x730/0x850
> [    9.752573] sp : ffff8000898f3aa0
> [    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
> [    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
> [    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
> [    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
> [    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
> [    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
> [    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
> [    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
> [    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
> [    9.759663] Call trace:
> [    9.759901]  follow_page_mask+0x730/0x850
> [    9.760293]  __get_user_pages+0xf4/0x3e8
> [    9.760683]  __gup_longterm_locked+0x204/0xa70
> [    9.761110]  pin_user_pages+0x88/0xc0
> [    9.761486]  gup_test_ioctl+0x860/0xc40
> [    9.761866]  __arm64_sys_ioctl+0xb0/0x100
> [    9.762254]  invoke_syscall+0x50/0x128
> [    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.763104]  do_el0_svc+0x28/0x40
> [    9.763413]  el0_svc+0x34/0xe0
> [    9.763699]  el0t_64_sync_handler+0x13c/0x158
> [    9.764139]  el0t_64_sync+0x190/0x198
> [    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
> [    9.765053] ---[ end trace 0000000000000000 ]---
> [    9.765520] note: gup_longterm[1155] exited with irqs disabled
> [    9.766146] note: gup_longterm[1155] exited with preempt_count 2
> [    9.767366] ------------[ cut here ]------------
> [    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.769146] Modules linked in:
> [    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.770338] Hardware name: linux,dummy-virt (DT)
> [    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.772150] lr : ct_idle_enter+0x10/0x20
> [    9.772539] sp : ffff8000801b3dc0
> [    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
> [    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
> [    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
> [    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
> [    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
> [    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
> [    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
> [    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
> [    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
> [    9.780703] Call trace:
> [    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.781949]  ct_idle_enter+0x10/0x20
> [    9.782246]  default_idle_call+0x3c/0x160
> [    9.782624]  do_idle+0x21c/0x280
> [    9.782945]  cpu_startup_entry+0x3c/0x50
> [    9.783268]  secondary_start_kernel+0x140/0x168
> [    9.783818]  __secondary_switched+0xb8/0xc0
> [    9.784163] ---[ end trace 0000000000000000 ]---
> 
> 
> Which is caused by this:
> 
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
> 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
> 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> }
> 
> Which is called from can_follow_write_pmd(), called just after the assert I just commented out.
> 
> 
> It's triggered by this test:
> 
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
> 
> Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
> 
> 
> Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
> 
> I don't think we only keep the PAE flag in the head page for hugetlb pages? So we can't just remove this assert?
> 
> I tried just commenting it out and then hit an assert further down, in follow_huge_pmd():
> 
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> 			!PageAnonExclusive(page), page);

I just replied in another email; we can try the two patches I attached, or
we can wait until I do some tests (but will be mostly unavailable this
afternoon).

Thanks,
David Hildenbrand April 2, 2024, 4:40 p.m. UTC | #6
On 02.04.24 18:00, Matthew Wilcox wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> That was a practical limitation on my part.  We have various parts of
> the MM which assume that pmd_page() returns a head page and until we
> get all of those fixed, adding support for folios larger than PMD_SIZE
> was only going to cause trouble for no significant wins.
> 
> I agree with you we should get rid of this assertion entirely.  We should
> fix all the places which assume that pmd_page() returns a head page,
> but that may take some time.
> 
> As an example, filemap_map_pmd() has:
> 
>         if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
>                  struct page *page = folio_file_page(folio, start);
>                  vm_fault_t ret = do_set_pmd(vmf, page);
> 
> and then do_set_pmd() has:
> 
>          if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
>                  return ret;
> 
> so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> There's a lot of work to be done to make this work generally (not to
> mention figuring out how to handle mapcount for such folios ;-).

Yes :)
Ryan Roberts April 2, 2024, 4:46 p.m. UTC | #7
On 02/04/2024 17:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and requires a follow-up page fault, the slow gup
>>>> degradation will probably be buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier if we remove it; actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..

I'll leave you to do the testing on this, if that's ok.

Just to make my config explicit, I have this kernel command line, which reserves
hugetlbs of all sizes for the tests:

"transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"
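
As a rough check of how much memory this command line reserves, here is a small arithmetic sketch. It reads the line as four size/count pairs, each applied to NUMA nodes 0 and 1, and sums the reservation for a single node. The struct and function names are hypothetical; only the sizes and counts come from the command line above.

```c
#include <assert.h>
#include <stddef.h>

/* One "<size> hugepages=<node>:<count>,..." pair, viewed per node. */
struct toy_pool {
	unsigned long size_kib;	/* huge page size in KiB */
	unsigned long per_node;	/* pages reserved on each node */
};

/* Values copied from the command line above. */
static const struct toy_pool toy_pools[] = {
	{ 1048576, 2 },		/* hugepagesz=1G  hugepages=0:2,1:2 */
	{   32768, 2 },		/* hugepagesz=32M hugepages=0:2,1:2 */
	{    2048, 64 },	/* default_hugepagesz=2M hugepages=0:64,1:64 */
	{      64, 2 },		/* hugepagesz=64K hugepages=0:2,1:2 */
};

/* Total KiB reserved on a single NUMA node across all pools. */
static inline unsigned long toy_reserved_kib_per_node(void)
{
	unsigned long total = 0;

	for (size_t i = 0; i < sizeof(toy_pools) / sizeof(toy_pools[0]); i++)
		total += toy_pools[i].size_kib * toy_pools[i].per_node;
	return total;
}
```

Read this way, each node reserves 2293888 KiB (about 2.19 GiB), so a VM reproducing the setup needs somewhat more than 4.4 GiB across two nodes for hugetlb alone.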

Thanks,
Ryan

> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?
> 
> Thanks,
> 
> ===8<===
> 
> From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 2 Apr 2024 11:52:28 -0400
> Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages
> 
> PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
> to be called mostly in hugetlb specific paths and the head page was
> guaranteed.
> 
> As we move forward towards merging hugetlb paths into generic mm, we may
> start to pass in tail hugetlb pages (with cont-pte/cont-pmd huge
> pages) for such a check.  Allow it to properly fetch the head, in which case
> the anon-exclusiveness of the head will always represent the tail page.
> 
> There's already a sign of it when we look at the fast-gup which already
> contains the hugetlb processing altogether: we used to have a specific
> commit 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in
> gup_must_unshare() via GUP-fast") covering that area.  Now with this more
> generic change, that can also go away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/page-flags.h |  8 +++++++-
>  mm/internal.h              | 10 ----------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c0..225357f48a79 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>  static __always_inline int PageAnonExclusive(const struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +	/*
> +	 * Allow the anon-exclusive check to work on hugetlb tail pages.
> +	 * Here hugetlb pages will always guarantee the anon-exclusiveness
> +	 * of the head page represents the tail pages.
> +	 */
> +	if (PageHuge(page) && !PageHead(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 9512de7398d5..87f6e4fd56a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>  		smp_rmb();
>  
> -	/*
> -	 * During GUP-fast we might not get called on the head page for a
> -	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
> -	 * not work with the abstracted hugetlb PTEs that always point at the
> -	 * head page. For hugetlb, PageAnonExclusive only applies on the head
> -	 * page (as it cannot be partially COW-shared), so lookup the head page.
> -	 */
> -	if (unlikely(!PageHead(page) && PageHuge(page)))
> -		page = compound_head(page);
> -
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.
Peter Xu April 2, 2024, 5:57 p.m. UTC | #8
On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:
> On 02.04.24 18:20, Peter Xu wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> > > On 02.04.24 16:48, Ryan Roberts wrote:
> > > > Hi Peter,
> > 
> > Hey, Ryan,
> > 
> > Thanks for the report!
> > 
> > > > 
> > > > On 27/03/2024 15:23, peterx@redhat.com wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > 
> > > > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > > > over all architectures.  Switch to the generic code path.
> > > > > 
> > > > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > > > 
> > > > > There may be a slight difference of how the loops run when processing slow
> > > > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > > > applied, rather than relying on the size of hugetlb hstate, the latter may
> > > > > cover multiple entries in one loop.
> > > > > 
> > > > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > > > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > > > problem because slow-gup should not be a hot path for GUP in general: when
> > > > > page is commonly present, fast-gup will already succeed, while when the
> > > > > > page is indeed missing and requires a follow-up page fault, the slow gup
> > > > > > degradation will probably be buried in the fault paths anyway.  It also explains
> > > > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > > > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > > > > a performance analysis but a side benefit.  If the performance will be a
> > > > > concern, we can consider handle CONT_PTE in follow_page().
> > > > > 
> > > > > Before that is justified to be necessary, keep everything clean and simple.
> > > > > 
> > > > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > 
> > > > Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
> > > > 
> > > > 
> > > > [    9.340416] kernel BUG at mm/gup.c:778!
> > > > [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > [    9.341199] Modules linked in:
> > > > [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> > > > [    9.342232] Hardware name: linux,dummy-virt (DT)
> > > > [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [    9.343195] pc : follow_page_mask+0x4d4/0x880
> > > > [    9.343580] lr : follow_page_mask+0x4d4/0x880
> > > > [    9.344018] sp : ffff8000898b3aa0
> > > > [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> > > > [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> > > > [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> > > > [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> > > > [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> > > > [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> > > > [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> > > > [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> > > > [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> > > > [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> > > > [    9.351097] Call trace:
> > > > [    9.351312]  follow_page_mask+0x4d4/0x880
> > > > [    9.351700]  __get_user_pages+0xf4/0x3e8
> > > > [    9.352089]  __gup_longterm_locked+0x204/0xa70
> > > > [    9.352516]  pin_user_pages+0x88/0xc0
> > > > [    9.352873]  gup_test_ioctl+0x860/0xc40
> > > > [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> > > > [    9.353648]  invoke_syscall+0x50/0x128
> > > > [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> > > > [    9.354488]  do_el0_svc+0x28/0x40
> > > > [    9.354822]  el0_svc+0x34/0xe0
> > > > [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> > > > [    9.355489]  el0t_64_sync+0x190/0x198
> > > > [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> > > > [    9.356280] ---[ end trace 0000000000000000 ]---
> > > > [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> > > > [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> > > > [    9.358033] ------------[ cut here ]------------
> > > > [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.360157] Modules linked in:
> > > > [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> > > > [    9.361626] Hardware name: linux,dummy-virt (DT)
> > > > [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.363306] lr : ct_idle_enter+0x10/0x20
> > > > [    9.363845] sp : ffff8000801abdc0
> > > > [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> > > > [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> > > > [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> > > > [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> > > > [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> > > > [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> > > > [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> > > > [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> > > > [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> > > > [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> > > > [    9.372279] Call trace:
> > > > [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.373216]  ct_idle_enter+0x10/0x20
> > > > [    9.373562]  default_idle_call+0x3c/0x160
> > > > [    9.374055]  do_idle+0x21c/0x280
> > > > [    9.374394]  cpu_startup_entry+0x3c/0x50
> > > > [    9.374797]  secondary_start_kernel+0x140/0x168
> > > > [    9.375220]  __secondary_switched+0xb8/0xc0
> > > > [    9.375875] ---[ end trace 0000000000000000 ]---
> > > > 
> > > > 
> > > > The oops trigger is at mm/gup.c:778:
> > > > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > > > 
> > > > 
> > > > This is the output of gup_longterm (last output is just before oops):
> > > > 
> > > > # [INFO] detected hugetlb page size: 2048 KiB
> > > > # [INFO] detected hugetlb page size: 32768 KiB
> > > > # [INFO] detected hugetlb page size: 64 KiB
> > > > # [INFO] detected hugetlb page size: 1048576 KiB
> > > > TAP version 13
> > > > 1..70
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> > > > ok 1 Should have worked
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> > > > ok 2 Should have failed
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> > > > ok 3 Should have failed
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> > > > ok 4 Should have worked
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> > > > 
> > > > 
> > > > So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> > > 
> > > I assume we find the expected tail page, it's just that the check
> > > 
> > > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > > 
> > > Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> > > a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> > > cont-pmd entry", we trigger this check.
> > > 
> > > Likely this sanity check must also allow for hugetlb folios. Or we should
> > > just remove it completely.
> > 
> > Right, IMHO it'll be easier if we remove it; actually I see there's one
> > more at the end, so I think we need to remove both.
> > 
> > > 
> > > In the past, we wanted to make sure that we never get tail pages of THP from
> > > PMD entries, because something would currently be broken (we don't support
> > > THP > PMD).
> > 
> > There's probably one more thing we need to do, on allowing
> > PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> > warnings and if I read the code right, we can BUG_ON again on checking tail
> > pages over anon-exclusive for PageHuge.
> > 
> > So I assume to fix it completely, we may need two changes: Patch 1 to
> > prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> > squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> > Note: not this patch to fixup, as this patch only does the "switchover" to
> > the new path, the culprit should be the other patch..
> > 
> > I have them attached below first, before I'll also go and see whether I can
> > run some arm tests later today or tomorrow.  David, any comments from
> > anon-exclusive side?
> 
> I added the PageAnonExclusive checks for hugetlb back then, because calling
> it on a tail page indicated real trouble for hugetlb.
> 
> Well, and I didn't want to have runtime-hugetlb checks in PageAnonExclusive
> code called on certainly-not-hugetlb code paths.
> 
> Personally, I'd fixup the problematic callsite where we know nothing nasty
> is happening (like we did for gup_must_unshare(), because we don't expect
> hugetlb tail pages from arbitrary other code).
> 
> But as I'm getting closer to a folio_test_anon_exclusive() implementation as
> we speak (closer, but not done :) ... ), where I'd remove any such hugetlb
> special handling, I don't particularly care how we handle GUP here in the
> meantime.

That's what I was looking for and found missing just now, when I wanted to
let follow_huge_pmd() pass the page / folio (which will be the head then)
properly into the different checks.  I think patch 1 is the simplest thing I
can come up with that works mostly like what you said, pending a follow-up
cleanup on top if possible.  It mostly makes the existing runtime check in
gup_must_unshare() more generic.

IIUC it's also a matter of whether you'd want PageAnonExclusive() to take
care of both thp + hugetlb in one shot, rather than letting callers handle
it with things like "if (PageHuge()) ... else ...", which I would try to
avoid.  So far it seems cleaner to allow PageAnonExclusive() to take whatever
tail pages, thp or hugetlb.  But maybe your ultimate patchset can be even
better than that.

Thanks,

Peter Xu April 2, 2024, 5:58 p.m. UTC | #9
On Tue, Apr 02, 2024 at 05:46:57PM +0100, Ryan Roberts wrote:
> I'll leave you to do the testing on this, if that's ok.

Definitely.  I'll test and send formal patches.

> 
> Just to make my config explicit, I have this kernel command line, which reserves
> hugetlbs of all sizes for the tests:
> 
> "transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

This helps, thanks.

David Hildenbrand April 2, 2024, 6:43 p.m. UTC | #10
On 02.04.24 19:57, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:
>> On 02.04.24 18:20, Peter Xu wrote:
>>> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>>> On 02.04.24 16:48, Ryan Roberts wrote:
>>>>> Hi Peter,
>>>
>>> Hey, Ryan,
>>>
>>> Thanks for the report!
>>>
>>>>>
>>>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>>>> From: Peter Xu <peterx@redhat.com>
>>>>>>
>>>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>>>> over all architectures.  Switch to the generic code path.
>>>>>>
>>>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>>>
>>>>>> There may be a slight difference of how the loops run when processing slow
>>>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>>>> cover multiple entries in one loop.
>>>>>>
>>>>>> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
>>>>>> degradation over a tight loop of slow gup after the path switch.  That
>>>>>> shouldn't be a problem because slow-gup should not be a hot path for GUP in
>>>>>> general: when the page is commonly present, fast-gup will already succeed,
>>>>>> while when the page is indeed missing and requires a follow-up page fault,
>>>>>> the slow gup degradation will probably be buried in the fault paths anyway.
>>>>>> It also explains why slow gup for THP used to be very slow before
>>>>>> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
>>>>>> landed, the latter not being part of a performance analysis but a side
>>>>>> benefit.  If the performance becomes a concern, we can consider handling
>>>>>> CONT_PTE in follow_page().
>>>>>>
>>>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>>>
>>>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>
>>>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>>>
>>>>>
>>>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>> [    9.341199] Modules linked in:
>>>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>>>> [    9.344018] sp : ffff8000898b3aa0
>>>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>>>> [    9.351097] Call trace:
>>>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>>>> [    9.353648]  invoke_syscall+0x50/0x128
>>>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>>>> [    9.354488]  do_el0_svc+0x28/0x40
>>>>> [    9.354822]  el0_svc+0x34/0xe0
>>>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>>>> [    9.358033] ------------[ cut here ]------------
>>>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.360157] Modules linked in:
>>>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>>>> [    9.363845] sp : ffff8000801abdc0
>>>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>>>> [    9.372279] Call trace:
>>>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>>>> [    9.373562]  default_idle_call+0x3c/0x160
>>>>> [    9.374055]  do_idle+0x21c/0x280
>>>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>>
>>>>> The oops trigger is at mm/gup.c:778:
>>>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>>>
>>>>>
>>>>> This is the output of gup_longterm (last output is just before oops):
>>>>>
>>>>> # [INFO] detected hugetlb page size: 2048 KiB
>>>>> # [INFO] detected hugetlb page size: 32768 KiB
>>>>> # [INFO] detected hugetlb page size: 64 KiB
>>>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>>>> TAP version 13
>>>>> 1..70
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>>>> ok 1 Should have worked
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>>>> ok 2 Should have failed
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>>>> ok 3 Should have failed
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>>>> ok 4 Should have worked
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>>>
>>>>>
>>>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>>>
>>>> I assume we find the expected tail page, it's just that the check
>>>>
>>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>>
>>>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>>>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>>>> cont-pmd entry", we trigger this check.
>>>>
>>>> Likely this sanity check must also allow for hugetlb folios. Or we should
>>>> just remove it completely.
>>>
>>> Right, IMHO it'll be easier if we remove it; actually I see there's one
>>> more at the end, so I think we need to remove both.
>>>
>>>>
>>>> In the past, we wanted to make sure that we never get tail pages of THP from
>>>> PMD entries, because something would currently be broken (we don't support
>>>> THP > PMD).
>>>
>>> There's probably one more thing we need to do, on allowing
>>> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
>>> warnings and if I read the code right, we can BUG_ON again on checking tail
>>> pages over anon-exclusive for PageHuge.
>>>
>>> So I assume to fix it completely, we may need two changes: Patch 1 to
>>> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
>>> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
>>> Note: not this patch to fixup, as this patch only does the "switchover" to
>>> the new path, the culprit should be the other patch..
>>>
>>> I have them attached below first, before I'll also go and see whether I can
>>> run some arm tests later today or tomorrow.  David, any comments from
>>> anon-exclusive side?
>>
>> I added the PageAnonExclusive checks for hugetlb back then, because calling
>> it on a tail page indicated real trouble for hugetlb.
>>
>> Well, and I didn't want to have runtime-hugetlb checks in PageAnonExclusive
>> code called on certainly-not-hugetlb code paths.
>>
>> Personally, I'd fixup the problematic callsite where we know nothing nasty
>> is happening (like we did for gup_must_unshare(), because we don't expect
>> hugetlb tail pages from arbitrary other code).
>>
>> But as I'm getting closer to a folio_test_anon_exclusive() implementation as
>> we speak (closer, but not done :) ... ), where I'd remove any such hugetlb
>> special handling, I don't particularly care how we handle GUP here in the
>> meantime.
> 
> That's what I was looking for and found missing just now, when I wanted to
> let follow_huge_pmd() pass the page / folio (which will be the head then)
> properly into the different checks.  I think patch 1 is the simplest thing I
> can come up with that works mostly like what you said, pending a follow-up
> cleanup on top if possible.  It mostly makes the existing runtime check in
> gup_must_unshare() more generic.
> 
> IIUC it's also a matter of whether you'd want PageAnonExclusive() to take
> care of both thp + hugetlb in one shot, rather than letting callers handle
> it with things like "if (PageHuge()) ... else ...", which I would try to
> avoid.

I tried to not let the caller pass in things that didn't make any sense.

Getting a tail page on a hugetlb folio in a page table walker except 
GUP-fast was completely bogus before your patch.

PageAnonExclusive was designed to be set on the page that was pointed to 
by a PTE, like having an additional PTE bit. Cont-pte/cont-pmd with the 
hugetlb fuzz around it we all love (huge_pte_offset()) did the right 
thing, because it abstracted the "multiple cont-pte/cont-pmd" PTEs to 
just a single logical PTE, with a single dedicated PageAnonExclusive.

So "conceptually", the caller that knows how the "single logical PTE"
works was the one to handle it. That meant GUP-fast needed to be special,
because it was unaware of the huge_pte_offset() logic.

But that seems to change now as we are changing our page table walkers, 
so I don't particularly care how we handle it.

> So far it seems cleaner to allow PageAnonExclusive() to take whatever
> tail pages, thp or hugetlb.  But maybe your ultimate patchset can be even
> better than that.

At least that part will be much cleaner.