[v2,0/4] Speed up boot with faster linear map creation

Message ID 20240404143308.2224141-1-ryan.roberts@arm.com (mailing list archive)

Message

Ryan Roberts April 4, 2024, 2:33 p.m. UTC
Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full, and most of that time is spent
waiting on superfluous TLB invalidation and memory barriers. This series reworks
the kernel pgtable generation code to significantly reduce the number of those
TLBIs, ISBs and DSBs. See each patch for details.

The table below shows the execution time of map_mem() across a couple of
different systems with different RAM configurations. We measure after applying
each patch and show the improvement relative to base (v6.9-rc2):

               | Apple M2    | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
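
As a rough cross-check, the percentages and overall speedup factors can be
recomputed from the millisecond columns above. The sketch below is illustrative
only: the names STAGES, TIMINGS_MS, relative_change() and speedup() are
hypothetical helpers, not part of the series. The printed millisecond values
are presumably rounded, so a recomputed percentage can differ from the table by
one point (77 ms against 153 ms gives about -49.7%, where the table prints
-49%).

```python
# Stage names in patch order, as they appear in the table's first column.
STAGES = ["base", "no-cont-remap", "batch-barriers", "no-alloc-remap", "lazy-unmap"]

# map_mem() execution time in ms, copied from the table; one list per system,
# in the same order as STAGES.
TIMINGS_MS = {
    "Apple M2 VM, 16G": [153, 77, 13, 11, 6],
    "Ampere Altra VM, 64G": [2227, 431, 162, 109, 61],
    "Ampere Altra VM, 256G": [8798, 1727, 655, 449, 257],
    "Ampere Altra Metal, 512G": [17442, 3796, 1656, 1257, 838],
}


def relative_change(base_ms, ms):
    """Percentage change of ms relative to base_ms (negative means faster)."""
    return (ms - base_ms) / base_ms * 100.0


def speedup(base_ms, ms):
    """How many times faster ms is than base_ms."""
    return base_ms / ms


if __name__ == "__main__":
    for system, times in TIMINGS_MS.items():
        base_ms = times[0]
        print(system)
        for stage, ms in zip(STAGES, times):
            pct = relative_change(base_ms, ms)
            factor = speedup(base_ms, ms)
            print(f"  {stage:<15} {ms:6d} ms  {pct:7.1f}%  {factor:5.1f}x")
```

Expressed as a factor, the final lazy-unmap row corresponds to map_mem()
running roughly 21x to 36x faster than base across the four configurations.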

This series applies on top of v6.9-rc2. All mm selftests pass. I've compile- and
boot-tested various PAGE_SIZE and VA size configs.

---

Changes since v1 [1]
====================

  - Added Tested-by tags (thanks to Eric and Itaru)
  - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
  - Reordered patches (biggest impact & least controversial first)
  - Reordered alloc/map/unmap functions in mmu.c to aid reader
  - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
  - Reverted generic p4d_index() which caused x86 build error. Replaced with
    unconditional p4d_index() define under arm64.


[1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/

Thanks,
Ryan


Ryan Roberts (4):
  arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
  arm64: mm: Batch dsb and isb when populating pgtables
  arm64: mm: Don't remap pgtables for allocate vs populate
  arm64: mm: Lazily clear pte table mappings from fixmap

 arch/arm64/include/asm/fixmap.h  |   5 +-
 arch/arm64/include/asm/mmu.h     |   8 +
 arch/arm64/include/asm/pgtable.h |  13 +-
 arch/arm64/kernel/cpufeature.c   |  10 +-
 arch/arm64/mm/fixmap.c           |  11 +
 arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
 6 files changed, 319 insertions(+), 105 deletions(-)

--
2.25.1

Comments

Itaru Kitayama April 5, 2024, 7:39 a.m. UTC | #1
On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
> [...]

I've built and boot-tested v2 on FVP; the base is taken from your linux-rr
repo. Running run_vmtests.sh on v2 left some gup_longterm tests "not ok"; would
you take a look? The mm kselftests used are from your linux-rr repo too.

Thanks,
Itaru.
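
To narrow down which subtests fail in the log that follows, the TAP result
lines of a single test can be tallied and each "not ok" paired with the [RUN]
line that precedes it. This is a minimal sketch, not part of run_vmtests.sh;
tally(), RESULT_RE, RUN_RE and SAMPLE are hypothetical names, and SAMPLE is a
short excerpt of the gup_longterm output from the log. It assumes the input is
the output of one test, since the outer "ok N <test>" lines would otherwise be
counted too.

```python
import re
from collections import Counter

# A TAP result line after any "# " nesting prefixes, e.g.
# "# # not ok 3 ftruncate() failed" or "# # ok 5 # SKIP need more free huge pages".
RESULT_RE = re.compile(r"^(?:#\s*)*(not ok|ok)\s+(\d+)\s*(.*)$")
# The "[RUN] <description>" diagnostic that gup_longterm prints before each result.
RUN_RE = re.compile(r"^(?:#\s*)*\[RUN\]\s*(.*)$")

SAMPLE = """\
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 1 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # ok 2 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 3 ftruncate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 5 # SKIP need more free huge pages
""".splitlines()


def tally(lines):
    """Count pass/fail/skip and collect (number, description, reason) per failure."""
    counts = Counter()
    failures = []
    last_run = ""
    for line in lines:
        line = line.strip()
        run = RUN_RE.match(line)
        if run:
            last_run = run.group(1)
            continue
        result = RESULT_RE.match(line)
        if not result:
            continue
        status, number, rest = result.groups()
        if rest.startswith("# SKIP"):
            counts["skip"] += 1
        elif status == "ok":
            counts["pass"] += 1
        else:
            counts["fail"] += 1
            failures.append((int(number), last_run, rest))
    return counts, failures


if __name__ == "__main__":
    counts, failures = tally(SAMPLE)
    print(dict(counts))
    for number, description, reason in failures:
        print(f"not ok {number}: {description} ({reason})")
```

In the gup_longterm output below, all eight failures (3, 10, 17, 24, 31, 38, 45
and 52) are the "with local tmpfile" variants, and each reports
"ftruncate() failed".
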
# timeout set to 180
# TAP version 13
# # -unnin--./-u-e---e-mm--
# # running ./hugepage-mmap
# # -unnin--./-u-e---e-mm--
# # TAP version 13
# # 1..1
# # # Returned address is 0xffff92e00000
# # # First hex is 0
# # # First hex is 3020100
# # ok 1 Read same data
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 1 hugepage-mmap
# # -unnin--./-u-e---e-s-m
# # running ./hugepage-shm
# # -unnin--./-u-e---e-s-m
# # shmid: 0x0
# # shmaddr: 0xffffac600000
# # Starting the writes:
# # ................................................................................................................................................................................................................................................................
# # Starting the Check...Done.
# # [PASS]
# ok 2 hugepage-shm
# # -unnin--./m--_-u-etlb
# # running ./map_hugetlb
# # -unnin--./m--_-u-etlb
# # TAP version 13
# # 1..1
# # # Default size hugepages
# # # Mapping 256 Mbytes
# # # Returned address is 0xffff7e400000
# # # First hex is 0
# # # First hex is 3020100
# # ok 1 Read correct data
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 3 map_hugetlb
# # -unnin--./-u-e---e-m-em--
# # running ./hugepage-mremap
# # -unnin--./-u-e---e-m-em--
# # TAP version 13
# # 1..1
# # # Map haddr: Returned address is 0x7eaa40000000
# # # Map daddr: Returned address is 0x7daa40000000
# # # Map vaddr: Returned address is 0x7faa40000000
# # # Address returned by mmap() = 0xffffb4a00000
# # # Mremap: Returned address is 0x7faa40000000
# # # First hex is 0
# # # First hex is 3020100
# # ok 1 Read same data
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 4 hugepage-mremap
# # -unnin--./-u-e---e-vmemm--
# # running ./hugepage-vmemmap
# # -unnin--./-u-e---e-vmemm--
# # Returned address is 0xffff88c00000 whose pfn is 89d000
# # [PASS]
# ok 5 hugepage-vmemmap
# # -unnin--./-u-etlb-m-dvise
# # running ./hugetlb-madvise
# # -unnin--./-u-etlb-m-dvise
# # [PASS]
# ok 6 hugetlb-madvise
# # -unnin--./-u-etlb_f-ult_-fte-_m-dv
# # running ./hugetlb_fault_after_madv
# # -unnin--./-u-etlb_f-ult_-fte-_m-dv
# # [PASS]
# ok 7 hugetlb_fault_after_madv
# # -unnin--./-u-etlb_m-dv_vs_m--
# # running ./hugetlb_madv_vs_map
# # -unnin--./-u-etlb_m-dv_vs_m--
# # [PASS]
# ok 8 hugetlb_madv_vs_map
# # NOTE: These hugetlb tests provide minimal coverage.  Use
# #       https://github.com/libhugetlbfs/libhugetlbfs.git for
# #       hugetlb regression testing.
# # -unnin--./m--_fixed_no-e-l-ce
# # running ./map_fixed_noreplace
# # -unnin--./m--_fixed_no-e-l-ce
# # TAP version 13
# # 1..9
# # ok 1 mmap() @ 0xffff9c341000-0xffff9c346000 p=0xffff9c341000 result=Success
# # ok 2 mmap() @ 0xffff9c342000-0xffff9c345000 p=0xffff9c342000 result=Success
# # ok 3 mmap() @ 0xffff9c341000-0xffff9c346000 p=0xffffffffffffffff result=File exists
# # ok 4 mmap() @ 0xffff9c343000-0xffff9c344000 p=0xffffffffffffffff result=File exists
# # ok 5 mmap() @ 0xffff9c344000-0xffff9c346000 p=0xffffffffffffffff result=File exists
# # ok 6 mmap() @ 0xffff9c341000-0xffff9c343000 p=0xffffffffffffffff result=File exists
# # ok 7 mmap() @ 0xffff9c341000-0xffff9c342000 p=0xffff9c341000 result=File exists
# # ok 8 mmap() @ 0xffff9c345000-0xffff9c346000 p=0xffff9c345000 result=File exists
# # ok 9 Base Address unmap() successful
# # # Totals: pass:9 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 9 map_fixed_noreplace
# # -unnin--./-u-_test--u
# # running ./gup_test -u
# # -unnin--./-u-_test--u
# # TAP version 13
# # 1..1
# # # GUP_FAST_BENCHMARK: Time: get:311641 put:52261 us# 
# # ok 1 ioctl status 0
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 10 gup_test -u
# # -unnin--./-u-_test---
# # running ./gup_test -a
# # -unnin--./-u-_test---
# # TAP version 13
# # 1..1
# # # PIN_FAST_BENCHMARK: Time: get:575618 put:102293 us# 
# # ok 1 ioctl status 0
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 11 gup_test -a
# # -unnin--./-u-_test--ct--F-0x1-0-19-0x1000
# # running ./gup_test -ct -F 0x1 0 19 0x1000
# # -unnin--./-u-_test--ct--F-0x1-0-19-0x1000
# # TAP version 13
# # 1..1
# # # DUMP_USER_PAGES_TEST: done
# # ok 1 ioctl status 0
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 12 gup_test -ct -F 0x1 0 19 0x1000
# # -unnin--./-u-_lon-te-m
# # running ./gup_longterm
# # -unnin--./-u-_lon-te-m
# # # [INFO] detected hugetlb page size: 2048 KiB
# # # [INFO] detected hugetlb page size: 32768 KiB
# # # [INFO] detected hugetlb page size: 64 KiB
# # # [INFO] detected hugetlb page size: 1048576 KiB
# # TAP version 13
# # 1..56
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 1 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # ok 2 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 3 ftruncate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 4 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 5 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 6 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 7 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 8 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # ok 9 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 10 ftruncate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 11 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 12 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 13 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 14 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 15 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # ok 16 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 17 ftruncate() failed
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 18 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 19 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 20 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 21 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 22 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # ok 23 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 24 ftruncate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 25 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 26 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 27 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 28 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 29 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # ok 30 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 31 ftruncate() failed
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 32 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 33 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 34 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 35 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 36 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # ok 37 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 38 ftruncate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 39 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 40 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 41 # SKIP need more free huge pages
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 42 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 43 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # ok 44 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 45 ftruncate() failed
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 46 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 47 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 48 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 49 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 50 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # ok 51 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 52 ftruncate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 53 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 54 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 55 # SKIP need more free huge pages
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 56 # SKIP need more free huge pages
# # Bail out! 8 out of 56 tests failed
# # # Totals: pass:24 fail:8 xfail:0 xpass:0 skip:24 error:0
# # [FAIL]
# not ok 13 gup_longterm # exit=1
# # -unnin--./uffd-unit-tests
# # running ./uffd-unit-tests
# # -unnin--./uffd-unit-tests
# # Testing UFFDIO_API (with syscall)... done
# # Testing UFFDIO_API (with /dev/userfaultfd)... done
# # Testing register-ioctls on anon... skipped [reason: feature missing]
# # Testing register-ioctls on shmem... skipped [reason: feature missing]
# # Testing register-ioctls on shmem-private... skipped [reason: feature missing]
# # Testing register-ioctls on hugetlb... skipped [reason: feature missing]
# # Testing register-ioctls on hugetlb-private... skipped [reason: feature missing]
# # Testing zeropage on anon... done
# # Testing zeropage on shmem... done
# # Testing zeropage on shmem-private... done
# # Testing zeropage on hugetlb... done
# # Testing zeropage on hugetlb-private... done
# # Testing move on anon... done
# # Testing move-pmd on anon... done
# # Testing move-pmd-split on anon... done
# # Testing wp-fork on anon... skipped [reason: feature missing]
# # Testing wp-fork on shmem... skipped [reason: feature missing]
# # Testing wp-fork on shmem-private... skipped [reason: feature missing]
# # Testing wp-fork on hugetlb... skipped [reason: feature missing]
# # Testing wp-fork on hugetlb-private... skipped [reason: feature missing]
# # Testing wp-fork-with-event on anon... skipped [reason: feature missing]
# # Testing wp-fork-with-event on shmem... skipped [reason: feature missing]
# # Testing wp-fork-with-event on shmem-private... skipped [reason: feature missing]
# # Testing wp-fork-with-event on hugetlb... skipped [reason: feature missing]
# # Testing wp-fork-with-event on hugetlb-private... skipped [reason: feature missing]
# # Testing wp-fork-pin on anon... skipped [reason: feature missing]
# # Testing wp-fork-pin on shmem... skipped [reason: feature missing]
# # Testing wp-fork-pin on shmem-private... skipped [reason: feature missing]
# # Testing wp-fork-pin on hugetlb... skipped [reason: feature missing]
# # Testing wp-fork-pin on hugetlb-private... skipped [reason: feature missing]
# # Testing wp-fork-pin-with-event on anon... skipped [reason: feature missing]
# # Testing wp-fork-pin-with-event on shmem... skipped [reason: feature missing]
# # Testing wp-fork-pin-with-event on shmem-private... skipped [reason: feature missing]
# # Testing wp-fork-pin-with-event on hugetlb... skipped [reason: feature missing]
# # Testing wp-fork-pin-with-event on hugetlb-private... skipped [reason: feature missing]
# # Testing wp-unpopulated on anon... skipped [reason: feature missing]
# # Testing minor on shmem... done
# # Testing minor on hugetlb... done
# # Testing minor-wp on shmem... skipped [reason: feature missing]
# # Testing minor-wp on hugetlb... skipped [reason: feature missing]
# # Testing minor-collapse on shmem... done
# # Testing sigbus on anon... done
# # Testing sigbus on shmem... done
# # Testing sigbus on shmem-private... done
# # Testing sigbus on hugetlb... done
# # Testing sigbus on hugetlb-private... done
# # Testing sigbus-wp on anon... skipped [reason: feature missing]
# # Testing sigbus-wp on shmem... skipped [reason: feature missing]
# # Testing sigbus-wp on shmem-private... skipped [reason: feature missing]
# # Testing sigbus-wp on hugetlb... skipped [reason: feature missing]
# # Testing sigbus-wp on hugetlb-private... skipped [reason: feature missing]
# # Testing events on anon... done
# # Testing events on shmem... done
# # Testing events on shmem-private... done
# # Testing events on hugetlb... done
# # Testing events on hugetlb-private... done
# # Testing events-wp on anon... skipped [reason: feature missing]
# # Testing events-wp on shmem... skipped [reason: feature missing]
# # Testing events-wp on shmem-private... skipped [reason: feature missing]
# # Testing events-wp on hugetlb... skipped [reason: feature missing]
# # Testing events-wp on hugetlb-private... skipped [reason: feature missing]
# # Testing poison on anon... done
# # Testing poison on shmem... done
# # Testing poison on shmem-private... done
# # Testing poison on hugetlb... done
# # Testing poison on hugetlb-private... done
# # Userfaults unit tests: pass=28, skip=38, fail=0 (total=66)
# # [PASS]
# ok 14 uffd-unit-tests
# # -unnin--./uffd-st-ess--non-20-16
# # running ./uffd-stress anon 20 16
# # -unnin--./uffd-st-ess--non-20-16
# # nr_pages: 5120, nr_pages_per_cpu: 640
# # bounces: 15, mode: rnd racing ver poll, userfaults: 867 missing (192+160+139+121+97+61+63+34+) 
# # bounces: 14, mode: racing ver poll, userfaults: 667 missing (122+102+103+79+88+69+59+45+) 
# # bounces: 13, mode: rnd ver poll, userfaults: 809 missing (188+159+122+119+66+64+58+33+) 
# # bounces: 12, mode: ver poll, userfaults: 1141 missing (246+208+214+155+190+61+33+34+) 
# # bounces: 11, mode: rnd racing poll, userfaults: 977 missing (200+202+153+132+104+74+62+50+) 
# # bounces: 10, mode: racing poll, userfaults: 448 missing (106+78+75+59+47+30+34+19+) 
# # bounces: 9, mode: rnd poll, userfaults: 1600 missing (860+167+136+105+100+102+70+60+) 
# # bounces: 8, mode: poll, userfaults: 609 missing (153+101+80+72+65+47+52+39+) 
# # bounces: 7, mode: rnd racing ver read, userfaults: 953 missing (208+176+142+123+104+87+70+43+) 
# # bounces: 6, mode: racing ver read, userfaults: 745 missing (126+125+102+104+82+90+65+51+) 
# # bounces: 5, mode: rnd ver read, userfaults: 981 missing (227+191+143+122+109+79+65+45+) 
# # bounces: 4, mode: ver read, userfaults: 780 missing (171+128+115+89+74+72+73+58+) 
# # bounces: 3, mode: rnd racing read, userfaults: 1053 missing (265+212+154+127+110+89+54+42+) 
# # bounces: 2, mode: racing read, userfaults: 747 missing (145+122+110+101+83+46+56+84+) 
# # bounces: 1, mode: rnd read, userfaults: 983 missing (231+175+144+116+101+84+69+63+) 
# # bounces: 0, mode: read, userfaults: 659 missing (148+86+83+73+84+72+65+48+) 
# # [PASS]
# ok 15 uffd-stress anon 20 16
# # -unnin--./uffd-st-ess--u-etlb-128-32
# # running ./uffd-stress hugetlb 128 32
# # -unnin--./uffd-st-ess--u-etlb-128-32
# # nr_pages: 64, nr_pages_per_cpu: 8
# # bounces: 31, mode: rnd racing ver poll, userfaults: 42 missing (15+10+9+7+1+0+0+0+) 
# # bounces: 30, mode: racing ver poll, userfaults: 23 missing (10+3+2+4+2+1+1+0+) 
# # bounces: 29, mode: rnd ver poll, userfaults: 41 missing (15+13+8+3+2+0+0+0+) 
# # bounces: 28, mode: ver poll, userfaults: 27 missing (11+8+6+2+0+0+0+0+) 
# # bounces: 27, mode: rnd racing poll, userfaults: 43 missing (13+12+10+6+2+0+0+0+) 
# # bounces: 26, mode: racing poll, userfaults: 25 missing (11+3+4+4+2+1+0+0+) 
# # bounces: 25, mode: rnd poll, userfaults: 38 missing (13+9+8+5+3+0+0+0+) 
# # bounces: 24, mode: poll, userfaults: 25 missing (25+0+0+0+0+0+0+0+) 
# # bounces: 23, mode: rnd racing ver read, userfaults: 40 missing (13+12+7+4+4+0+0+0+) 
# # bounces: 22, mode: racing ver read, userfaults: 21 missing (8+4+2+2+2+1+2+0+) 
# # bounces: 21, mode: rnd ver read, userfaults: 40 missing (14+11+6+5+4+0+0+0+) 
# # bounces: 20, mode: ver read, userfaults: 20 missing (9+6+4+1+0+0+0+0+) 
# # bounces: 19, mode: rnd racing read, userfaults: 40 missing (14+11+9+3+2+1+0+0+) 
# # bounces: 18, mode: racing read, userfaults: 17 missing (7+4+2+3+1+0+0+0+) 
# # bounces: 17, mode: rnd read, userfaults: 40 missing (13+13+8+3+3+0+0+0+) 
# # bounces: 16, mode: read, userfaults: 16 missing (7+6+2+1+0+0+0+0+) 
# # bounces: 15, mode: rnd racing ver poll, userfaults: 40 missing (12+10+8+6+3+1+0+0+) 
# # bounces: 14, mode: racing ver poll, userfaults: 16 missing (10+2+2+0+0+1+0+1+) 
# # bounces: 13, mode: rnd ver poll, userfaults: 42 missing (15+11+8+6+1+1+0+0+) 
# # bounces: 12, mode: ver poll, userfaults: 18 missing (9+4+1+4+0+0+0+0+) 
# # bounces: 11, mode: rnd racing poll, userfaults: 41 missing (13+13+9+3+3+0+0+0+) 
# # bounces: 10, mode: racing poll, userfaults: 9 missing (9+0+0+0+0+0+0+0+) 
# # bounces: 9, mode: rnd poll, userfaults: 40 missing (14+11+7+6+2+0+0+0+) 
# # bounces: 8, mode: poll, userfaults: 15 missing (6+4+2+1+1+1+0+0+) 
# # bounces: 7, mode: rnd racing ver read, userfaults: 43 missing (13+12+8+6+3+1+0+0+) 
# # bounces: 6, mode: racing ver read, userfaults: 10 missing (5+1+0+0+0+3+1+0+) 
# # bounces: 5, mode: rnd ver read, userfaults: 37 missing (14+8+6+7+1+1+0+0+) 
# # bounces: 4, mode: ver read, userfaults: 29 missing (9+4+5+2+3+3+2+1+) 
# # bounces: 3, mode: rnd racing read, userfaults: 39 missing (14+10+7+5+3+0+0+0+) 
# # bounces: 2, mode: racing read, userfaults: 8 missing (4+1+0+1+0+1+1+0+) 
# # bounces: 1, mode: rnd read, userfaults: 39 missing (12+12+8+4+3+0+0+0+) 
# # bounces: 0, mode: read, userfaults: 25 missing (6+6+5+2+3+3+0+0+) 
# # [PASS]
# ok 16 uffd-stress hugetlb 128 32
# # -unnin--./uffd-st-ess--u-etlb---iv-te-128-32
# # running ./uffd-stress hugetlb-private 128 32
# # -unnin--./uffd-st-ess--u-etlb---iv-te-128-32
# # nr_pages: 64, nr_pages_per_cpu: 8
# # bounces: 31, mode: rnd racing ver poll, userfaults: 42 missing (13+12+9+6+2+0+0+0+) 
# # bounces: 30, mode: racing ver poll, userfaults: 25 missing (12+3+6+0+0+3+1+0+) 
# # bounces: 29, mode: rnd ver poll, userfaults: 41 missing (15+11+8+5+2+0+0+0+) 
# # bounces: 28, mode: ver poll, userfaults: 28 missing (10+9+7+2+0+0+0+0+) 
# # bounces: 27, mode: rnd racing poll, userfaults: 42 missing (15+13+7+5+1+1+0+0+) 
# # bounces: 26, mode: racing poll, userfaults: 22 missing (8+2+4+3+1+1+2+1+) 
# # bounces: 25, mode: rnd poll, userfaults: 39 missing (14+11+9+3+2+0+0+0+) 
# # bounces: 24, mode: poll, userfaults: 24 missing (12+7+4+1+0+0+0+0+) 
# # bounces: 23, mode: rnd racing ver read, userfaults: 44 missing (16+12+9+5+2+0+0+0+) 
# # bounces: 22, mode: racing ver read, userfaults: 21 missing (12+0+4+1+2+1+1+0+) 
# # bounces: 21, mode: rnd ver read, userfaults: 41 missing (12+13+8+6+2+0+0+0+) 
# # bounces: 20, mode: ver read, userfaults: 37 missing (26+7+1+1+2+0+0+0+) 
# # bounces: 19, mode: rnd racing read, userfaults: 46 missing (19+16+11+0+0+0+0+0+) 
# # bounces: 18, mode: racing read, userfaults: 17 missing (7+5+4+1+0+0+0+0+) 
# # bounces: 17, mode: rnd read, userfaults: 41 missing (15+10+9+4+3+0+0+0+) 
# # bounces: 16, mode: read, userfaults: 17 missing (9+6+0+2+0+0+0+0+) 
# # bounces: 15, mode: rnd racing ver poll, userfaults: 43 missing (15+11+8+7+1+1+0+0+) 
# # bounces: 14, mode: racing ver poll, userfaults: 15 missing (10+2+1+1+0+1+0+0+) 
# # bounces: 13, mode: rnd ver poll, userfaults: 43 missing (16+10+8+5+3+1+0+0+) 
# # bounces: 12, mode: ver poll, userfaults: 17 missing (7+4+1+2+1+2+0+0+) 
# # bounces: 11, mode: rnd racing poll, userfaults: 39 missing (16+11+5+4+3+0+0+0+) 
# # bounces: 10, mode: racing poll, userfaults: 13 missing (6+3+1+1+1+0+0+1+) 
# # bounces: 9, mode: rnd poll, userfaults: 40 missing (14+10+9+6+1+0+0+0+) 
# # bounces: 8, mode: poll, userfaults: 19 missing (9+5+2+1+1+1+0+0+) 
# # bounces: 7, mode: rnd racing ver read, userfaults: 39 missing (15+10+8+5+1+0+0+0+) 
# # bounces: 6, mode: racing ver read, userfaults: 8 missing (6+1+0+0+0+0+1+0+) 
# # bounces: 5, mode: rnd ver read, userfaults: 40 missing (15+8+8+5+3+1+0+0+) 
# # bounces: 4, mode: ver read, userfaults: 29 missing (11+3+3+4+4+3+1+0+) 
# # bounces: 3, mode: rnd racing read, userfaults: 40 missing (13+13+8+4+1+1+0+0+) 
# # bounces: 2, mode: racing read, userfaults: 6 missing (4+0+1+0+0+1+0+0+) 
# # bounces: 1, mode: rnd read, userfaults: 42 missing (15+11+9+4+3+0+0+0+) 
# # bounces: 0, mode: read, userfaults: 27 missing (6+6+2+7+0+3+3+0+) 
# # [PASS]
# ok 17 uffd-stress hugetlb-private 128 32
# # -unnin--./uffd-st-ess-s-mem-20-16
# # running ./uffd-stress shmem 20 16
# # -unnin--./uffd-st-ess-s-mem-20-16
# # nr_pages: 5120, nr_pages_per_cpu: 640
# # bounces: 15, mode: rnd racing ver poll, userfaults: 927 missing (199+165+147+137+97+72+65+45+) 
# # bounces: 14, mode: racing ver poll, userfaults: 412 missing (100+68+61+48+43+33+35+24+) 
# # bounces: 13, mode: rnd ver poll, userfaults: 945 missing (173+164+131+122+119+102+70+64+) 
# # bounces: 12, mode: ver poll, userfaults: 658 missing (159+98+96+79+68+56+72+30+) 
# # bounces: 11, mode: rnd racing poll, userfaults: 926 missing (195+146+140+114+112+103+73+43+) 
# # bounces: 10, mode: racing poll, userfaults: 405 missing (88+77+59+59+44+26+32+20+) 
# # bounces: 9, mode: rnd poll, userfaults: 933 missing (177+173+134+114+101+104+68+62+) 
# # bounces: 8, mode: poll, userfaults: 1329 missing (319+271+253+224+85+82+83+12+) 
# # bounces: 7, mode: rnd racing ver read, userfaults: 959 missing (179+163+153+126+100+90+92+56+) 
# # bounces: 6, mode: racing ver read, userfaults: 427 missing (81+80+60+57+43+49+31+26+) 
# # bounces: 5, mode: rnd ver read, userfaults: 954 missing (201+175+152+112+107+98+60+49+) 
# # bounces: 4, mode: ver read, userfaults: 815 missing (171+131+115+115+86+81+63+53+) 
# # bounces: 3, mode: rnd racing read, userfaults: 933 missing (187+152+140+137+105+91+62+59+) 
# # bounces: 2, mode: racing read, userfaults: 415 missing (87+67+70+59+44+35+33+20+) 
# # bounces: 1, mode: rnd read, userfaults: 933 missing (166+165+151+124+95+96+85+51+) 
# # bounces: 0, mode: read, userfaults: 652 missing (117+103+112+65+80+76+56+43+) 
# # [PASS]
# ok 18 uffd-stress shmem 20 16
# # -unnin--./uffd-st-ess-s-mem---iv-te-20-16
# # running ./uffd-stress shmem-private 20 16
# # -unnin--./uffd-st-ess-s-mem---iv-te-20-16
# # nr_pages: 5120, nr_pages_per_cpu: 640
# # bounces: 15, mode: rnd racing ver poll, userfaults: 955 missing (178+173+160+118+90+110+64+62+) 
# # bounces: 14, mode: racing ver poll, userfaults: 400 missing (86+70+48+53+43+38+34+28+) 
# # bounces: 13, mode: rnd ver poll, userfaults: 1037 missing (184+190+163+141+121+73+102+63+) 
# # bounces: 12, mode: ver poll, userfaults: 725 missing (155+104+97+87+89+68+62+63+) 
# # bounces: 11, mode: rnd racing poll, userfaults: 986 missing (187+180+164+154+97+69+74+61+) 
# # bounces: 10, mode: racing poll, userfaults: 372 missing (82+66+68+43+34+28+31+20+) 
# # bounces: 9, mode: rnd poll, userfaults: 891 missing (173+147+152+121+99+95+61+43+) 
# # bounces: 8, mode: poll, userfaults: 670 missing (131+110+96+84+81+67+53+48+) 
# # bounces: 7, mode: rnd racing ver read, userfaults: 1578 missing (866+163+141+106+94+72+70+66+) 
# # bounces: 6, mode: racing ver read, userfaults: 414 missing (79+75+66+55+44+48+22+25+) 
# # bounces: 5, mode: rnd ver read, userfaults: 886 missing (190+151+139+107+97+92+72+38+) 
# # bounces: 4, mode: ver read, userfaults: 928 missing (171+147+114+123+108+100+87+78+) 
# # bounces: 3, mode: rnd racing read, userfaults: 909 missing (181+159+138+139+106+81+57+48+) 
# # bounces: 2, mode: racing read, userfaults: 398 missing (89+67+55+56+42+40+27+22+) 
# # bounces: 1, mode: rnd read, userfaults: 988 missing (160+176+177+126+127+91+68+63+) 
# # bounces: 0, mode: read, userfaults: 614 missing (127+94+79+75+68+67+46+58+) 
# # [PASS]
# ok 19 uffd-stress shmem-private 20 16
# # -unnin--./com--ction_test
# # running ./compaction_test
# # -unnin--./com--ction_test
# # TAP version 13
# # 1..1
# # # Number of huge pages allocated = 848
# # ok 1 check_compaction
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 20 compaction_test
# # SKIP ./on-fault-limit
# # -unnin--./m--_-o-ul-te
# # running ./map_populate
# # -unnin--./m--_-o-ul-te
# # TAP version 13
# # 1..2
# # ok 1 MAP_POPULATE COW private page
# # ok 2 The mapping state
# # # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 21 map_populate
# # -unnin--./mlock---ndom-test
# # running ./mlock-random-test
# # -unnin--./mlock---ndom-test
# # TAP version 13
# # 1..2
# # ok 1 test_mlock_within_limit
# # ok 2 test_mlock_outof_limit
# # # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 22 mlock-random-test
# # -unnin--./mlock2-tests
# # running ./mlock2-tests
# # -unnin--./mlock2-tests
# # TAP version 13
# # 1..13
# # ok 1 test_mlock_lock: Locked
# # ok 2 test_mlock_lock: Locked
# # ok 3 test_mlock_onfault: VMA marked for lock on fault
# # ok 4 VMA open lock after fault
# # ok 5 test_munlockall0: Locked memory area
# # ok 6 test_munlockall0: No locked memory
# # ok 7 test_munlockall1: VMA marked for lock on fault
# # ok 8 test_munlockall1: Unlocked
# # ok 9 test_munlockall1: Locked
# # ok 10 test_munlockall1: No locked memory
# # ok 11 VMA with present pages is not marked lock on fault
# # ok 12 test_vma_management call_mlock 1
# # ok 13 test_vma_management call_mlock 0
# # # Totals: pass:13 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 23 mlock2-tests
# # -unnin--./m-ele-se_test
# # running ./mrelease_test
# # -unnin--./m-ele-se_test
# # TAP version 13
# # 1..1
# # ok 1 Success reaping a child with 1MB of memory allocations
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 24 mrelease_test
# # -unnin--./m-em--_test
# # running ./mremap_test
# # -unnin--./m-em--_test
# # # Test configs:
# # 	threshold_mb=4
# # 	pattern_seed=1712296164
# # 
# # 1..19
# # # mremap failed: Invalid argument
# # ok 1 # XFAIL mremap - Source and Destination Regions Overlapping
# # 	Expected mremap failure
# # # mremap failed: Invalid argument
# # ok 2 # XFAIL mremap - Destination Address Misaligned (1KB-aligned)
# # 	Expected mremap failure
# # # Failed to map source region: Invalid argument
# # ok 3 # XFAIL mremap - Source Address Misaligned (1KB-aligned)
# # 	Expected mremap failure
# # ok 4 8KB mremap - Source PTE-aligned, Destination PTE-aligned
# # 	mremap time:      2646330ns
# # ok 5 2MB mremap - Source 1MB-aligned, Destination PTE-aligned
# # 	mremap time:      3390760ns
# # ok 6 2MB mremap - Source 1MB-aligned, Destination 1MB-aligned
# # 	mremap time:      2931770ns
# # ok 7 4MB mremap - Source PMD-aligned, Destination PTE-aligned
# # 	mremap time:      6104990ns
# # ok 8 4MB mremap - Source PMD-aligned, Destination 1MB-aligned
# # 	mremap time:      4501050ns
# # ok 9 4MB mremap - Source PMD-aligned, Destination PMD-aligned
# # 	mremap time:      2425660ns
# # ok 10 2GB mremap - Source PUD-aligned, Destination PTE-aligned
# # ok 11 2GB mremap - Source PUD-aligned, Destination 1MB-aligned
# # ok 12 2GB mremap - Source PUD-aligned, Destination PMD-aligned
# # ok 13 2GB mremap - Source PUD-aligned, Destination PUD-aligned
# # ok 14 5MB mremap - Source 1MB-aligned, Destination 1MB-aligned
# # ok 15 5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble
# # ok 16 mremap expand merge
# # ok 17 mremap expand merge offset
# # ok 18 mremap mremap move within range
# # ok 19 mremap move 1mb from start at 1MB+256KB aligned src
# # # Totals: pass:16 fail:0 xfail:3 xpass:0 skip:0 error:0
# # [PASS]
# ok 25 mremap_test
# # -unnin--./t-u-e--en
# # running ./thuge-gen
# # -unnin--./t-u-e--en
# # TAP version 13
# # # Found 1024MB
# # # SKIP for size 1024 MB as not enough huge pages, need 4
# # # Found 0MB
# # # SKIP for size 2 MB as not enough huge pages, need 4
# # # Found 0MB
# # # SKIP for size 32 MB as not enough huge pages, need 4
# # # Found 0MB
# # # SKIP for size 0 MB as not enough huge pages, need 4
# # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 26 thuge-gen
# # -unnin--./c----e_-ese-ved_-u-etlb.s---c--ou--v2
# # running ./charge_reserved_hugetlb.sh -cgroup-v2
# # -unnin--./c----e_-ese-ved_-u-etlb.s---c--ou--v2
# # mount: /dev/cgroup/memory: mount point does not exist.
# #        dmesg(1) may have more information after failed mount system call.
# # [FAIL]
# not ok 27 charge_reserved_hugetlb.sh -cgroup-v2 # exit=32
# # -unnin--./-u-etlb_-e---entin-_test.s---c--ou--v2
# # running ./hugetlb_reparenting_test.sh -cgroup-v2
# # -unnin--./-u-etlb_-e---entin-_test.s---c--ou--v2
# # mount: /dev/cgroup/memory: mount point does not exist.
# #        dmesg(1) may have more information after failed mount system call.
# # [FAIL]
# not ok 28 hugetlb_reparenting_test.sh -cgroup-v2 # exit=32
# # -unnin--./vi-tu-l_-dd-ess_--n-e
# # running ./virtual_address_range
# # -unnin--./vi-tu-l_-dd-ess_--n-e
# # TAP version 13
# # 1..1
# # ok 1 Test
# # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 29 virtual_address_range
# # -unnin--b-s--./v-_-i--_-dd-_switc-.s-
# # running bash ./va_high_addr_switch.sh
# # -unnin--b-s--./v-_-i--_-dd-_switc-.s-
# # [SKIP]
# ok 30 va_high_addr_switch.sh # SKIP
# # -unnin--b-s--./test_vm-lloc.s--smoke
# # running bash ./test_vmalloc.sh smoke
# # -unnin--b-s--./test_vm-lloc.s--smoke
# # ./test_vmalloc.sh: You must have the following enabled in your kernel:
# # CONFIG_TEST_VMALLOC=m
# # [SKIP]
# ok 31 test_vmalloc.sh smoke # SKIP
# # -unnin--./m-em--_dontunm--
# # running ./mremap_dontunmap
# # -unnin--./m-em--_dontunm--
# # TAP version 13
# # 1..5
# # ok 1 mremap_dontunmap_simple
# # ok 2 mremap_dontunmap_simple_shmem
# # ok 3 mremap_dontunmap_simple_fixed
# # ok 4 mremap_dontunmap_partial_mapping
# # ok 5 mremap_dontunmap_partial_mapping_overwrite
# # # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [PASS]
# ok 32 mremap_dont
Ryan Roberts April 6, 2024, 8:32 a.m. UTC | #2
Hi Itaru,

On 05/04/2024 08:39, Itaru Kitayama wrote:
> I've built and boot tested v2 on FVP; the base is taken from your
> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm "not ok"
> results; would you take a look? The mm kselftests used are from your
> linux-rr repo too.

Thanks for taking a look at this.

I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:

Config: arm64 defconfig + the following:

# Squashfs for snaps, xfs for large file folios.
./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS

# For general mm debug.
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK

# For mm selftests.
./scripts/config --enable CONFIG_USERFAULTFD
./scripts/config --enable CONFIG_TEST_VMALLOC
./scripts/config --enable CONFIG_GUP_TEST

Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
some mm selftests), with kernel command line to reserve hugetlbs and other
features required by some mm selftests:

"
transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
"

Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
git tree.


Although I don't think any of this config should make a difference to gup_longterm.

Looks like your errors are all "ftruncate() failed". I've seen this problem on
our CI system. There it is due to running the tests from an NFS file system. What
filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
might also be problematic.
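
To check which filesystem the tests are running from, something like the
following could help. This is only a sketch, not part of the series or the
selftests: classify_fs() and enum fs_kind are made-up names, and it assumes the
NFS_SUPER_MAGIC and V9FS_MAGIC constants from <linux/magic.h>.

```c
#include <linux/magic.h>	/* NFS_SUPER_MAGIC, V9FS_MAGIC */
#include <sys/vfs.h>		/* statfs() */

/* Hypothetical classification, used only for this illustration. */
enum fs_kind { FS_UNKNOWN, FS_NFS, FS_9P, FS_OTHER };

/*
 * Report whether 'path' lives on NFS or 9p, the two filesystems
 * suspected above. Returns FS_UNKNOWN if statfs() itself fails.
 */
static enum fs_kind classify_fs(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs))
		return FS_UNKNOWN;

	switch (sfs.f_type) {
	case NFS_SUPER_MAGIC:
		return FS_NFS;
	case V9FS_MAGIC:
		return FS_9P;
	default:
		return FS_OTHER;
	}
}
```

Calling classify_fs(".") from the directory the tests are run in would show
whether NFS or 9p is involved.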

Does this problem reproduce with v6.9-rc2, without my patches? I expect it
probably does?

Thanks,
Ryan

Itaru Kitayama April 6, 2024, 10:31 a.m. UTC | #3
Hi Ryan,

On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
> Looks like your errors are all "ftruncate() failed". I've seen this problem on
> our CI system. There it is due to running the tests from NFS file system. What
> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
> might also be problematic.

That was it. This time I booted up the kernel including your series on
QEMU on my M1 and executed the gup_longterm program without the ftruncate
failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.

Thanks,
Itaru.

Ryan Roberts April 8, 2024, 7:30 a.m. UTC | #4
On 06/04/2024 11:31, Itaru Kitayama wrote:
> That was it. This time I booted up the kernel including your series on
> QEMU on my M1 and executed the gup_longterm program without the ftruncate
> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.

I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
the disk? It might be worth enhancing the error log to provide the errno in
tools/testing/selftests/mm/gup_longterm.c.

Thanks,
Ryan

Itaru Kitayama April 9, 2024, 12:10 a.m. UTC | #5
Hi Ryan,

> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
> the disk? It might be worth enhancing the error log to provide the errno in
> tools/testing/selftests/mm/gup_longterm.c.
> 

Attached is the strace’d gup_longterm execution log on your pgtable-boot-speedup-v2 kernel.

Thanks,
Itaru.
Ryan Roberts April 9, 2024, 10:04 a.m. UTC | #6
On 09/04/2024 01:10, Itaru Kitayama wrote:
> Hi Ryan,
> 
>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>> 
>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>> Hi Ryan,
>>> 
>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>> Hi Itaru,
>>>> 
>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> It turns out that creating the linear map can take a significant proportion of
>>>>>> the total boot time, especially when rodata=full. And most of the time is spent
>>>>>> waiting on superfluous tlb invalidation and memory barriers. This series reworks
>>>>>> the kernel pgtable generation code to significantly reduce the number of those
>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>> 
>>>>>> The below shows the execution time of map_mem() across a couple of different
>>>>>> systems with different RAM configurations. We measure after applying each patch
>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>> 
>>>>>>               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>> 
>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've compiled and
>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> Changes since v1 [1]
>>>>>> ====================
>>>>>> 
>>>>>>  - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>  - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>  - Reordered patches (biggest impact & least controversial first)
>>>>>>  - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>  - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>  - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>    unconditional p4d_index() define under arm64.
>>>>>> 
>>>>>> 
>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/ <https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/>
>>>>>> 
>>>>>> Thanks,
>>>>>> Ryan
>>>>>> 
>>>>>> 
>>>>>> Ryan Roberts (4):
>>>>>>  arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>  arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>  arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>  arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>> 
>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>> 
>>>>>> --
>>>>>> 2.25.1
>>>>>> 
>>>>> 
>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not oks, would you take a look at it? The mm ksefltests used is from your linux-rr repo too.
>>>> 
>>>> Thanks for taking a look at this.
>>>> 
>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>> 
>>>> Config: arm64 defconfig + the following:
>>>> 
>>>> # Squashfs for snaps, xfs for large file folios.
>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>> 
>>>> # For general mm debug.
>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>> 
>>>> # For mm selftests.
>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>> 
>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>> features required by some mm selftests:
>>>> 
>>>> "
>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>> "
>>>> 
>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>> git tree.
>>>> 
>>>> 
>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>> 
>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>> might also be problematic.
>>> 
>>> That was it. This time I booted up the kernel including your series on
>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>> 
>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>> the disk? It might be worth enhancing the error log to provide the errno in
>> tools/testing/selftests/mm/gup_longterm.c.
>> 
> 
> Attached is the strace’d gup_longterm executiong log on your
> pgtable-boot-speedup-v2 kernel.

Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
set applied? I thought we had previously concluded that it was independent of
that. I was under the impression that it was filesystem related, so not
something that I was planning to investigate.

> 
> Thanks,
> Itaru.
> 
>> Thanks,
>> Ryan
>> 
>>> 
>>> Thanks,
>>> Itaru.
>>> 
>>>> 
>>>> Does this problem reproduce with v6.9-rc2, without my patches? I except it
>>>> probably does?
>>>> 
>>>> Thanks,
>>>> Ryan
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Itaru.
> 
>
Itaru Kitayama April 9, 2024, 10:13 a.m. UTC | #7
> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> On 09/04/2024 01:10, Itaru Kitayama wrote:
>> Hi Ryan,
>> 
>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>> 
>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>> Hi Ryan,
>>>> 
>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>> Hi Itaru,
>>>>> 
>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>> Hi All,
>>>>>>> 
>>>>>>> It turns out that creating the linear map can take a significant proportion of
>>>>>>> the total boot time, especially when rodata=full. And most of the time is spent
>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This series reworks
>>>>>>> the kernel pgtable generation code to significantly reduce the number of those
>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>> 
>>>>>>> The below shows the execution time of map_mem() across a couple of different
>>>>>>> systems with different RAM configurations. We measure after applying each patch
>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>> 
>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>> 
>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>> 
>>>>>>> ---
>>>>>>> 
>>>>>>> Changes since v1 [1]
>>>>>>> ====================
>>>>>>> 
>>>>>>>   - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>   - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>   - Reordered patches (biggest impact & least controversial first)
>>>>>>>   - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>   - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>   - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>     unconditional p4d_index() define under arm64.
>>>>>>> 
>>>>>>> 
>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/<https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/>
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>> 
>>>>>>> 
>>>>>>> Ryan Roberts (4):
>>>>>>>   arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>   arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>   arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>   arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>> 
>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>> 
>>>>>>> --
>>>>>>> 2.25.1
>>>>>>> 
>>>>>> 
>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not oks, would you take a look at it? The mm ksefltests used is from your linux-rr repo too.
>>>>> 
>>>>> Thanks for taking a look at this.
>>>>> 
>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>>> 
>>>>> Config: arm64 defconfig + the following:
>>>>> 
>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>> 
>>>>> # For general mm debug.
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>> 
>>>>> # For mm selftests.
>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>> 
>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>> features required by some mm selftests:
>>>>> 
>>>>> "
>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>> "
>>>>> 
>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>> git tree.
>>>>> 
>>>>> 
>>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>>> 
>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>> might also be problematic.
>>>> 
>>>> That was it. This time I booted up the kernel including your series on
>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>>> 
>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>> the disk? It might be worth enhancing the error log to provide the errno in
>>> tools/testing/selftests/mm/gup_longterm.c.
>>> 
>> 
>> Attached is the strace’d gup_longterm executiong log on your
>> pgtable-boot-speedup-v2 kernel.
> 
> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
> set applied? I thought we previously concluded that it was independent of that?
> I was under the impression that it was filesystem related and not something that
> I was planning to investigate.

No, irrespective of the kernel, the test program fails when using 9p on FVP.
It is indeed 9p filesystem related: since I switched to NFS, all the issues are gone.

Thanks,
Itaru.

> 
>> 
>> Thanks,
>> Itaru.
>> 
>>> Thanks,
>>> Ryan
>>> 
>>>> 
>>>> Thanks,
>>>> Itaru.
>>>> 
>>>>> 
>>>>> Does this problem reproduce with v6.9-rc2, without my patches? I except it
>>>>> probably does?
>>>>> 
>>>>> Thanks,
>>>>> Ryan
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Itaru.
David Hildenbrand April 9, 2024, 11:22 a.m. UTC | #8
On 09.04.24 12:13, Itaru Kitayama wrote:
> 
> 
>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>> Hi Ryan,
>>>
>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>> Hi Itaru,
>>>>>>
>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> It turns out that creating the linear map can take a significant proportion of
>>>>>>>> the total boot time, especially when rodata=full. And most of the time is spent
>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This series reworks
>>>>>>>> the kernel pgtable generation code to significantly reduce the number of those
>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>
>>>>>>>> The below shows the execution time of map_mem() across a couple of different
>>>>>>>> systems with different RAM configurations. We measure after applying each patch
>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>
>>>>>>>>                 | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>                 | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>                 |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>
>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Changes since v1 [1]
>>>>>>>> ====================
>>>>>>>>
>>>>>>>>    - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>    - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>    - Reordered patches (biggest impact & least controversial first)
>>>>>>>>    - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>    - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>    - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>>      unconditional p4d_index() define under arm64.
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/<https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan Roberts (4):
>>>>>>>>    arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>    arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>    arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>    arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>
>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not oks, would you take a look at it? The mm ksefltests used is from your linux-rr repo too.
>>>>>>
>>>>>> Thanks for taking a look at this.
>>>>>>
>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>>>>
>>>>>> Config: arm64 defconfig + the following:
>>>>>>
>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>
>>>>>> # For general mm debug.
>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>
>>>>>> # For mm selftests.
>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>
>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>> features required by some mm selftests:
>>>>>>
>>>>>> "
>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>> "
>>>>>>
>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>>> git tree.
>>>>>>
>>>>>>
>>>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>>>>
>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>>> might also be problematic.
>>>>>
>>>>> That was it. This time I booted up the kernel including your series on
>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>>>>
>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>
>>>
>>> Attached is the strace’d gup_longterm executiong log on your
>>> pgtable-boot-speedup-v2 kernel.
>>
>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>> set applied? I thought we previously concluded that it was independent of that?
>> I was under the impression that it was filesystem related and not something that
>> I was planning to investigate.
> 
> No, irrespective of the kernel, if using 9p on FVP the test program fails.
> It is indeed 9p filesystem related, as I switched to using NFS all the issues are gone.

Did it never work on 9p? If so, we might have to SKIP that test.

openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)


fstatfs() fails and makes get_fs_type() simply say "0" -- IOW "I don't know",
which should be fine here, as it will make fs_is_unknown() trigger for relevant
cases where the type matters.

ftruncate() failing with ENOENT seems to be the problem.

But that error is a bit weird.

The man page says "ENOENT The named file does not exist.", which should only apply to
truncate() but not ftruncate().

Sounds weird, but maybe that's the way of saying "not supported" here?
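
For anyone who wants to reproduce the sequence from the strace output outside
the selftest, a minimal sketch follows. It is not taken from gup_longterm.c:
the helper name probe_unlinked_tmpfile() and the temp file name are made up
for illustration. It creates a file, unlinks it, then calls fstatfs() and
ftruncate() on the still-open fd and prints the errno text for each failure,
which is roughly the extra logging Ryan suggested.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of gup_longterm.c): create a temp file in
 * dir, unlink it, then call fstatfs() and ftruncate() on the still-open fd,
 * printing the errno text for each failure. Returns 0 if ftruncate()
 * succeeded, otherwise the errno from ftruncate() (or from mkstemp()).
 */
static int probe_unlinked_tmpfile(const char *dir)
{
	char path[4096];
	struct statfs fs;
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/probe_tmpfile_XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0) {
		ret = errno;
		fprintf(stderr, "mkstemp() failed: %s\n", strerror(ret));
		return ret;
	}

	/* Same order as in the strace log: unlink first, then use the fd. */
	if (unlink(path))
		fprintf(stderr, "unlink() failed: %s\n", strerror(errno));

	if (fstatfs(fd, &fs))
		fprintf(stderr, "fstatfs() failed: %s\n", strerror(errno));
	else
		printf("f_type = 0x%lx\n", (unsigned long)fs.f_type);

	if (ftruncate(fd, 4096)) {
		ret = errno;
		fprintf(stderr, "ftruncate() failed: %s\n", strerror(ret));
	}

	close(fd);
	return ret;
}
```

Calling probe_unlinked_tmpfile() with a directory on the 9p share and again
with a directory on a local filesystem should show whether the ENOENT is
specific to the share.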
David Hildenbrand April 9, 2024, 11:29 a.m. UTC | #9
On 09.04.24 13:22, David Hildenbrand wrote:
> On 09.04.24 12:13, Itaru Kitayama wrote:
>>
>>
>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>> Hi Ryan,
>>>>
>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>> Hi Itaru,
>>>>>>>
>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> It turns out that creating the linear map can take a significant proportion of
>>>>>>>>> the total boot time, especially when rodata=full. And most of the time is spent
>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This series reworks
>>>>>>>>> the kernel pgtable generation code to significantly reduce the number of those
>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>
>>>>>>>>> The below shows the execution time of map_mem() across a couple of different
>>>>>>>>> systems with different RAM configurations. We measure after applying each patch
>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>
>>>>>>>>>                  | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>                  | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>                  |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>
>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Changes since v1 [1]
>>>>>>>>> ====================
>>>>>>>>>
>>>>>>>>>     - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>     - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>     - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>     - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>     - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>     - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>>>       unconditional p4d_index() define under arm64.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/<https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>     arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>     arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>     arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>     arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>
>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.25.1
>>>>>>>>>
>>>>>>>>
>>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not oks, would you take a look at it? The mm ksefltests used is from your linux-rr repo too.
>>>>>>>
>>>>>>> Thanks for taking a look at this.
>>>>>>>
>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>>>>>
>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>
>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>
>>>>>>> # For general mm debug.
>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>
>>>>>>> # For mm selftests.
>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>
>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>> features required by some mm selftests:
>>>>>>>
>>>>>>> "
>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>> "
>>>>>>>
>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>>>> git tree.
>>>>>>>
>>>>>>>
>>>>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>>>>>
>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>>>> might also be problematic.
>>>>>>
>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>>>>>
>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>
>>>>
>>>> Attached is the strace’d gup_longterm executiong log on your
>>>> pgtable-boot-speedup-v2 kernel.
>>>
>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>> set applied? I thought we previously concluded that it was independent of that?
>>> I was under the impression that it was filesystem related and not something that
>>> I was planning to investigate.
>>
>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>> It is indeed 9p filesystem related, as I switched to using NFS all the issues are gone.
> 
> Did it never work on 9p? If so, we might have to SKIP that test.
> 
> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)

Note: I'm wondering if the unlinkat() here is the problem that makes 
ftruncate() on 9p result in weird errors (e.g., the hypervisor 
unlinked the file and cannot reopen it for the fstatfs()/ftruncate(), 
which gives us weird errors here).

Then we should look up the fs type in run_with_local_tmpfile() before 
the unlink() and simply skip the test if it is 9p.
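
A rough sketch of that ordering, not the actual selftest code: the helper
names fd_is_on_9p() and open_local_tmpfile_or_skip() and the temp file name
are hypothetical. V9FS_MAGIC comes from <linux/magic.h>. Whether fstatfs()
succeeds on 9p while the name still exists is an assumption here; the strace
log only shows it failing after the unlink.

```c
#include <linux/magic.h>	/* V9FS_MAGIC */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Hypothetical helper: true if fd is reported as living on 9p. */
static bool fd_is_on_9p(int fd)
{
	struct statfs fs;

	if (fstatfs(fd, &fs))
		return false;	/* type unknown: do not skip on that basis */
	return fs.f_type == V9FS_MAGIC;
}

/*
 * Hypothetical helper: create a temp file in dir, query the filesystem
 * type while the name still exists, and only then unlink it. Returns the
 * open fd, or -1 if the file could not be created or the test should be
 * skipped because the file is on 9p.
 */
static int open_local_tmpfile_or_skip(const char *dir)
{
	char path[4096];
	bool on_9p;
	int fd;

	snprintf(path, sizeof(path), "%s/skip_sketch_XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	on_9p = fd_is_on_9p(fd);	/* look up the type before unlink() */
	unlink(path);

	if (on_9p) {
		printf("skip: tmpfile is on 9p\n");
		close(fd);
		return -1;
	}
	return fd;
}
```

In the real test the skip would presumably go through the kselftest skip
reporting rather than a bare printf().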
David Hildenbrand April 9, 2024, 11:51 a.m. UTC | #10
On 09.04.24 13:29, David Hildenbrand wrote:
> On 09.04.24 13:22, David Hildenbrand wrote:
>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>
>>>
>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>> Hi Ryan,
>>>>>
>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>> Hi Itaru,
>>>>>>>>
>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> It turns out that creating the linear map can take a significant proportion of
>>>>>>>>>> the total boot time, especially when rodata=full. And most of the time is spent
>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This series reworks
>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number of those
>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>
>>>>>>>>>> The below shows the execution time of map_mem() across a couple of different
>>>>>>>>>> systems with different RAM configurations. We measure after applying each patch
>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>
>>>>>>>>>>                   | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>                   | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>                   |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>
>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>> ====================
>>>>>>>>>>
>>>>>>>>>>      - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>      - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>      - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>      - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>      - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>      - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>>>>        unconditional p4d_index() define under arm64.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/<https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>      arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>      arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>      arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>      arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>
>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 2.25.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>> oks, would you take a look at it? The mm kselftests used are from your
>>>>>>>>> linux-rr repo too.
>>>>>>>>
>>>>>>>> Thanks for taking a look at this.
>>>>>>>>
>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>>>>>>
>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>
>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>
>>>>>>>> # For general mm debug.
>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>
>>>>>>>> # For mm selftests.
>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>
>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>>> features required by some mm selftests:
>>>>>>>>
>>>>>>>> "
>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>>> "
>>>>>>>>
>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>>>>> git tree.
>>>>>>>>
>>>>>>>>
>>>>>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>>>>>>
>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>>>>> might also be problematic.
>>>>>>>
>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>>>>>>
>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>
>>>>>
>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>> pgtable-boot-speedup-v2 kernel.
>>>>
>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>> set applied? I thought we previously concluded that it was independent of that?
>>>> I was under the impression that it was filesystem related and not something that
>>>> I was planning to investigate.
>>>
>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>> It is indeed 9p filesystem related, as I switched to using NFS all the issues are gone.
>>
>> Did it never work on 9p? If so, we might have to SKIP that test.
>>
>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
> 
> Note: I'm wondering if the unlinkat here is the problem that makes
> ftruncate() with 9p result in weird errors (e.g., the hypervisor
> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
> which gives us weird errors here).
> 
> Then, we should lookup the fs type in run_with_local_tmpfile() before
> the unlink() and simply skip the test if it is 9p.

The unlink with 9p most certainly was a known issue in the past:

https://gitlab.com/qemu-project/qemu/-/issues/103

Maybe it's still an issue with older hypervisors (QEMU?)? Or it was 
never completely resolved?

According to https://bugs.launchpad.net/qemu/+bug/1336794, QEMU v5.2.0
should contain a fix that is supposed to work with newer kernels.
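
For reference, a minimal sketch of the check suggested above (look up the fs
type before the unlink() and skip on 9p). This is an illustration only, not the
actual gup_longterm.c change: the helper names is_9p_magic() and fd_is_on_9p()
are hypothetical, and V9FS_MAGIC is defined locally as a fallback, using the
value that <linux/magic.h> is understood to carry.

```c
#include <stdbool.h>
#include <sys/vfs.h>

#ifndef V9FS_MAGIC
#define V9FS_MAGIC 0x01021997	/* assumed to match <linux/magic.h> */
#endif

/* Hypothetical helper: does this statfs f_type value identify 9p? */
static bool is_9p_magic(unsigned long f_type)
{
	return f_type == V9FS_MAGIC;
}

/*
 * Hypothetical helper: true if fd lives on a 9p filesystem.
 * Returns false if fstatfs() fails, so callers treat "unknown" as "not 9p".
 */
static bool fd_is_on_9p(int fd)
{
	struct statfs fs;

	if (fstatfs(fd, &fs))
		return false;
	return is_9p_magic((unsigned long)fs.f_type);
}
```

The check would have to run on the fd right after the temporary file is
created and before the unlink(), since the strace above shows fstatfs() itself
failing with EOPNOTSUPP once the file has been unlinked. The test could then be
reported as skipped rather than failed.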
Ryan Roberts April 9, 2024, 2:13 p.m. UTC | #11
On 09/04/2024 12:51, David Hildenbrand wrote:
> On 09.04.24 13:29, David Hildenbrand wrote:
>> On 09.04.24 13:22, David Hildenbrand wrote:
>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>
>>>>
>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>> Hi Itaru,
>>>>>>>>>
>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>
>>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>> oks, would you take a look at it? The mm kselftests used are from your
>>>>>>>>>> linux-rr repo too.
>>>>>>>>>
>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>
>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>> M2 VM:
>>>>>>>>>
>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>
>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>
>>>>>>>>> # For general mm debug.
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>
>>>>>>>>> # For mm selftests.
>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>
>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>>>> features required by some mm selftests:
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>>>>>> git tree.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>> gup_longterm.
>>>>>>>>>
>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>>>>>> might also be problematic.
>>>>>>>>
>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>
>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>
>>>>>>
>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>
>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>>> set applied? I thought we previously concluded that it was independent of that?
>>>>> I was under the impression that it was filesystem related and not something that
>>>>> I was planning to investigate.
>>>>
>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>> issues are gone.
>>>
>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>
>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>
>> Note: I'm wondering if the unlinkat here is the problem that makes
>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>> which gives us weird errors here).
>>
>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>> the unlink() and simply skip the test if it is 9p.
> 
> The unlink with 9p most certainly was a known issue in the past:
> 
> https://gitlab.com/qemu-project/qemu/-/issues/103
> 
> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
> completely resolved?

I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" - Arm's
architecture emulator). So QEMU won't be involved here. The FVP emulates a 9p
device, so perhaps the bug is in there.

Note that I see lots of "fallocate() failed" failures in gup_longterm when
running on our CI system. This is a completely different setup: real HW with
Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
configs. I also see a couple of these failures in the cow tests.

Logs for reference:

# # ----------------------
# # running ./gup_longterm
# # ----------------------
# # # [INFO] detected hugetlb page size: 2048 KiB
# # # [INFO] detected hugetlb page size: 32768 KiB
# # # [INFO] detected hugetlb page size: 64 KiB
# # # [INFO] detected hugetlb page size: 1048576 KiB
# # TAP version 13
# # 1..56
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 1 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 2 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 3 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 4 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 5 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 6 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 7 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 8 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 9 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 10 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 11 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 12 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 13 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 14 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 15 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 16 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 17 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 18 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 19 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 20 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 21 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 22 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 23 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 24 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 25 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 26 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 27 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 28 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 29 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 30 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 31 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 32 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 33 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 34 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 35 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 36 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 37 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 38 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 39 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 40 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 41 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 42 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 43 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 44 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 45 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 46 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 47 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 48 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 49 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 50 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 51 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 52 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 53 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 54 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 55 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 56 Should have worked
# # Bail out! 16 out of 56 tests failed
# # # Totals: pass:40 fail:16 xfail:0 xpass:0 skip:0 error:0
# # [FAIL]
# not ok 13 gup_longterm # exit=1
David Hildenbrand April 9, 2024, 2:29 p.m. UTC | #12
On 09.04.24 16:13, Ryan Roberts wrote:
> On 09/04/2024 12:51, David Hildenbrand wrote:
>> On 09.04.24 13:29, David Hildenbrand wrote:
>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>
>>>>>
>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>> Hi Itaru,
>>>>>>>>>>
>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>
>>>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used are from your
>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>
>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>> M2 VM:
>>>>>>>>>>
>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>
>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>
>>>>>>>>>> # For general mm debug.
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>
>>>>>>>>>> # For mm selftests.
>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>
>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>
>>>>>>>>>> "
>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>>>>> "
>>>>>>>>>>
>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>>>>>>> git tree.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>> gup_longterm.
>>>>>>>>>>
>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>>>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>>>>>>> might also be problematic.
>>>>>>>>>
>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>
>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>>>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>
>>>>>>>
>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>
>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>>>> set applied? I thought we previously concluded that it was independent of that?
>>>>>> I was under the impression that it was filesystem related and not something that
>>>>>> I was planning to investigate.
>>>>>
>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>> issues are gone.
>>>>
>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>
>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>>
>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>> which gives us weird errors here).
>>>
>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>> the unlink() and simply skip the test if it is 9p.
>>
>> The unlink with 9p most certainly was a known issue in the past:
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>
>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>> completely resolved?
> 
> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" - Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates a 9p device, so perhaps the bug is in there.

Very likely.

> 
> Note that I see lots of "fallocate() failed" failures in gup_longterm when
> running on our CI system. This is a completely different setup: real HW with
> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
> configs. I also see a couple of these failures in the cow tests.

What is the fallocate() errno you are getting? strace log would help (to 
see if statfs also fails already)! Likely a similar NFS issue.
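
For anyone who wants to check a filesystem without running the whole selftest, the syscall sequence from the strace excerpt quoted above can be replayed with a small standalone helper. This is only a sketch: probe_unlinked_tmpfile() and the file name template are made up for illustration and are not part of gup_longterm.c.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from gup_longterm.c: create a temporary
 * file in `dir`, unlink it, then call fstatfs() and ftruncate() on the
 * still-open fd. Returns 0 if every step succeeds, otherwise the errno
 * of the first step that failed.
 */
static int probe_unlinked_tmpfile(const char *dir)
{
	char path[PATH_MAX];
	struct statfs fs;
	int fd, err = 0;

	if (snprintf(path, sizeof(path), "%s/probe_tmpfile_XXXXXX", dir) >=
	    (int)sizeof(path))
		return ENAMETOOLONG;

	/* mkstemp() opens with O_RDWR | O_CREAT | O_EXCL and mode 0600. */
	fd = mkstemp(path);
	if (fd < 0) {
		err = errno;
		fprintf(stderr, "mkstemp() failed: %s\n", strerror(err));
		return err;
	}

	/* Same order as in the strace log: unlink, fstatfs, ftruncate. */
	if (unlink(path)) {
		err = errno;
		fprintf(stderr, "unlink() failed: %s\n", strerror(err));
	} else if (fstatfs(fd, &fs)) {
		err = errno;
		fprintf(stderr, "fstatfs() failed: %s\n", strerror(err));
	} else if (ftruncate(fd, 4096)) {
		err = errno;
		fprintf(stderr, "ftruncate() failed: %s\n", strerror(err));
	}

	close(fd);
	return err;
}
```

Calling probe_unlinked_tmpfile(".") from the directory the selftests run in should return 0 on a well-behaved filesystem. Going by the strace log, the 9p mount would presumably trip over the fstatfs() step first.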
Ryan Roberts April 9, 2024, 2:39 p.m. UTC | #13
On 09/04/2024 15:29, David Hildenbrand wrote:
> On 09.04.24 16:13, Ryan Roberts wrote:
>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>
>>>>>>
>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>
>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>> of those
>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>> different
>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>> each patch
>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>
>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>> compile and
>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>
>>>>>>>>>>>>>       - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>       - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>       - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>       - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>       - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>       - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>         unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>       arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>       arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>
>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>
>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>> M2 VM:
>>>>>>>>>>>
>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>
>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>
>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>
>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>
>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>> (needed by
>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>> other
>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>
>>>>>>>>>>> "
>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>> "
>>>>>>>>>>>
>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>> from same
>>>>>>>>>>> git tree.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>
>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>> problem on
>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>> system. What
>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>> 9p? That
>>>>>>>>>>> might also be problematic.
>>>>>>>>>>
>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>
>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>> space on
>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>> errno in
>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>
>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>> patch
>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>> that?
>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>> that
>>>>>>> I was planning to investigate.
>>>>>>
>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>> issues are gone.
>>>>>
>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>
>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>> 0600) = 3
>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>>> supported)
>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or
>>>>> directory)
>>>>
>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>> which gives us weird errors here).
>>>>
>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>> the unlink() and simply skip the test if it is 9p.
>>>
>>> The unlink with 9p most certainly was a known issue in the past:
>>>
>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>
>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>> completely resolved?
>>
>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>> a 9p device, so perhaps the bug is in there.
> 
> Very likely.
> 
>>
>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>> running on our CI system. This is a completely different setup; Real HW with
>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>> configs. I also see a couple of these fails in the cow tests.
> 
> What is the fallocate() errno you are getting? strace log would help (to see if
> statfs also fails already)! Likely a similar NFS issue.

Unfortunately, this is a system I don't have access to. I've requested some of
this triage to be done, but it's fairly low priority.
David Hildenbrand April 9, 2024, 2:45 p.m. UTC | #14
On 09.04.24 16:39, Ryan Roberts wrote:
> On 09/04/2024 15:29, David Hildenbrand wrote:
>> On 09.04.24 16:13, Ryan Roberts wrote:
>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>
>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>        - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>        - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>        - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>        - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>        - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>>          unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>        arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>        arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>        arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>        arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>
>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>
>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>
>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>
>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>
>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>
>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>> (needed by
>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>> other
>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>
>>>>>>>>>>>> "
>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>> "
>>>>>>>>>>>>
>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>> from same
>>>>>>>>>>>> git tree.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>> problem on
>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>> system. What
>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>> 9p? That
>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>
>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>
>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>> space on
>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>> errno in
>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>
>>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>> patch
>>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>>> that?
>>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>>> that
>>>>>>>> I was planning to investigate.
>>>>>>>
>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>>> issues are gone.
>>>>>>
>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>
>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>>> 0600) = 3
>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>>>> supported)
>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or
>>>>>> directory)
>>>>>
>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>>> which gives us weird errors here).
>>>>>
>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>> the unlink() and simply skip the test if it is 9p.
>>>>
>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>
>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>
>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>> completely resolved?
>>>
>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>> a 9p device, so perhaps the bug is in there.
>>
>> Very likely.
>>
>>>
>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>> running on our CI system. This is a completely different setup; Real HW with
>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>> configs. I also see a couple of these fails in the cow tests.
>>
>> What is the fallocate() errno you are getting? strace log would help (to see if
>> statfs also fails already)! Likely a similar NFS issue.
> 
> Unfortunately this is a system I don't have access to. I've requested some of
> this triage to be done, but it's fairly low priority unfortunately.

To work around these BUGs (?) elsewhere, we could simply skip the test 
if get_fs_type() is not able to detect the FS type. Likely that's an 
early indicator that the unlink() messed something up.

... doesn't feel right, though.
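
As a rough sketch of what such a check could look like: probe_fs_type() below is a stand-in written for this example, on the assumption that the selftest's get_fs_type() behaves similarly and returns 0 when fstatfs() fails. fs_type_unknown() is likewise a made-up name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <sys/vfs.h>

/*
 * Stand-in for the selftest's get_fs_type(); assumed here to return 0
 * when fstatfs() fails, i.e. when the FS type cannot be detected.
 */
static long probe_fs_type(int fd)
{
	struct statfs fs;
	int ret;

	/* Retry if the call was interrupted by a signal. */
	do {
		ret = fstatfs(fd, &fs);
	} while (ret && errno == EINTR);

	return ret ? 0 : (long)fs.f_type;
}

/* Hypothetical check: true if the test should be skipped for this fd. */
static bool fs_type_unknown(int fd)
{
	return probe_fs_type(fd) == 0;
}
```

The caller would run this on the open fd and report a skip rather than a failure when it returns true. Whether the check belongs before or after the unlink() is still the open question.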
Itaru Kitayama April 9, 2024, 11:30 p.m. UTC | #15
Hi David,

> On Apr 9, 2024, at 23:45, David Hildenbrand <david@redhat.com> wrote:
> 
> On 09.04.24 16:39, Ryan Roberts wrote:
>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>> 
>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>> 
>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>       - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>       - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>       - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>       - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>       - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>       - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>>>         unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>       arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>       arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I've built and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>> other
>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> "
>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>> "
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>> from same
>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>> problem on
>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>> system. What
>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>> 
>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>> space on
>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>> errno in
>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>> 
>>>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>> patch
>>>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>>>> that?
>>>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>>>> that
>>>>>>>>> I was planning to investigate.
>>>>>>>> 
>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>>>> issues are gone.
>>>>>>> 
>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>> 
>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>>>> 0600) = 3
>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>>>>> supported)
>>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or
>>>>>>> directory)
>>>>>> 
>>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>>>> which gives us weird errors here).
>>>>>> 
>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>> 
>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>> 
>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>> 
>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>> completely resolved?
>>>> 
>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>> a 9p device, so perhaps the bug is in there.
>>> 
>>> Very likely.
>>> 
>>>> 
>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>> running on our CI system. This is a completely different setup; Real HW with
>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>> configs. I also see a couple of these fails in the cow tests.
>>> 
>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>> statfs also fails already)! Likely a similar NFS issue.
>> Unfortunately this is a system I don't have access to. I've requested some of
>> this triage to be done, but it's fairly low priority unfortunately.
> 
> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
> 
> ... doesn't feel right, though.

I think it’s a good idea so that the mm kselftests results look reasonable. Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well like the split_huge_page_test.c does?

Thanks,
Itaru.

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Itaru Kitayama April 10, 2024, 6:47 a.m. UTC | #16
> On Apr 10, 2024, at 8:30, Itaru Kitayama <itaru.kitayama@linux.dev> wrote:
> 
> Hi David,
> 
>> On Apr 9, 2024, at 23:45, David Hildenbrand <david@redhat.com> wrote:
>> 
>> On 09.04.24 16:39, Ryan Roberts wrote:
>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>      - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>>      - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>>      - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>>      - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>>      - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>>      - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>>>>        unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>      arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>      arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>      arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>      arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I've built and boot tested v2 on FVP; the base is taken from your
>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm
>>>>>>>>>>>>>>> "not ok" results; would you take a look? The mm kselftests used are
>>>>>>>>>>>>>>> from your linux-rr repo too.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>>> other
>>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "
>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>> "
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>> problem on
>>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>>> system. What
>>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>> space on
>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>> errno in
>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>> 
>>>>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>> patch
>>>>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>>>>> that?
>>>>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>>>>> that
>>>>>>>>>> I was planning to investigate.
>>>>>>>>> 
>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>>>>> issues are gone.
>>>>>>>> 
>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>> 
>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>>>>> 0600) = 3
>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>>>>>> supported)
>>>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or
>>>>>>>> directory)
>>>>>>> 
>>>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>>>>> which gives us weird errors here).
>>>>>>> 
>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>> 
>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>> 
>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>> 
>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>> completely resolved?
>>>>> 
>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>> a 9p device, so perhaps the bug is in there.
>>>> 
>>>> Very likely.
>>>> 
>>>>> 
>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>> configs. I also see a couple of these fails in the cow tests.
>>>> 
>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>> statfs also fails already)! Likely a similar NFS issue.
>>> Unfortunately this is a system I don't have access to. I've requested some of
>>> this triage to be done, but it's fairly low priority unfortunately.
>> 
>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>> 
>> ... doesn't feel right, though.
> 
> I think it’s a good idea so that the mm kselftests results look reasonable. Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well like the split_huge_page_test.c does?
> 
> Thanks,
> Itaru.

David, attached is the strace log of the gup_longterm kselftest for the NFS case.
I’m running the program on FVP; let me know if you need other logs or test results.

Thanks,
Itaru.
> 
>> 
>> -- 
>> Cheers,
>> 
>> David / dhildenb
David Hildenbrand April 10, 2024, 7:10 a.m. UTC | #17
On 10.04.24 08:47, Itaru Kitayama wrote:
> 
> 
>> On Apr 10, 2024, at 8:30, Itaru Kitayama <itaru.kitayama@linux.dev> wrote:
>>
>> Hi David,
>>
>>> On Apr 9, 2024, at 23:45, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 09.04.24 16:39, Ryan Roberts wrote:
>>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>       - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>>>       - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>>>       - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>>>       - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>>>       - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>>>       - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>>>>>         unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>>       arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>>       arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've built and boot tested v2 on FVP; the base is taken from your
>>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm
>>>>>>>>>>>>>>>> "not ok" results; would you take a look? The mm kselftests used are
>>>>>>>>>>>>>>>> from your linux-rr repo too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>>> problem on
>>>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>>>> system. What
>>>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>>> space on
>>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>>> errno in
>>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>>>
>>>>>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>>> patch
>>>>>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>>>>>> that?
>>>>>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>>>>>> that
>>>>>>>>>>> I was planning to investigate.
>>>>>>>>>>
>>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>>>>>> issues are gone.
>>>>>>>>>
>>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>>>
>>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>>>>>> 0600) = 3
>>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>>>>>>> supported)
>>>>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or
>>>>>>>>> directory)
>>>>>>>>
>>>>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>>>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>>>>>> which gives us weird errors here).
>>>>>>>>
>>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>>>
>>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>>>
>>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>>>
>>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>>> completely resolved?
>>>>>>
>>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>>> a 9p device, so perhaps the bug is in there.
>>>>>
>>>>> Very likely.
>>>>>
>>>>>>
>>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>>> configs. I also see a couple of these fails in the cow tests.
>>>>>
>>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>>> statfs also fails already)! Likely a similar NFS issue.
>>>> Unfortunately this is a system I don't have access to. I've requested some of
>>>> this triage to be done, but it's fairly low priority unfortunately.
>>>
>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>
>>> ... doesn't feel right, though.
>>
>> I think it’s a good idea so that the mm kselftests results look reasonable.

Yeah, but this will hide BUGs elsewhere. I suspect that there is also a BUG
lurking somewhere in the NFS implementation in Ryan's NFS setup. But that's
just a guess until we have more details.

>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well like the split_huge_page_test.c does

While we could, I don't see much value in that for selftests. An strace log is
much more valuable for understanding what is actually happening (e.g., fstatfs
failing), and it is quite easy to obtain.
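
For reference, printing errno next to the failure message would amount to
something like the sketch below. The helper names format_failure() and
truncate_or_report() are hypothetical; they are not taken from gup_longterm.c
or split_huge_page_test.c and only illustrate saving errno and passing it
through strerror().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the selftests): build a message of the form
 * "<what> failed: <strerror(err)> (<err>)" in buf.
 */
int format_failure(char *buf, size_t len, const char *what, int err)
{
	assert(buf != NULL && what != NULL);
	return snprintf(buf, len, "%s failed: %s (%d)", what, strerror(err), err);
}

/*
 * Hypothetical wrapper: call ftruncate() and, on failure, report errno on
 * stderr. Returns 0 on success or the negated errno value on failure.
 */
int truncate_or_report(int fd, off_t size)
{
	char msg[128];
	int err;

	if (ftruncate(fd, size) == 0)
		return 0;

	err = errno;	/* save errno before any further library calls */
	format_failure(msg, sizeof(msg), "ftruncate()", err);
	fprintf(stderr, "%s\n", msg);
	return -err;
}
```

For the 9p trace quoted earlier, this would report the ENOENT text for
ftruncate(), which is the same information strace already shows.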

>> Thanks,
>> Itaru.
> 
> David, attached is the straced execution log of the gup_longterm kselftest over the NFS case.
> I’m running the program on FVP, let me know if you need other logs or test results.

For your run, it all looks good:

openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
fcntl(3, F_GETFL)                       = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
ftruncate(3, 4096)                      = 0
fallocate(3, 0, 0, 4096)                = 0

-> TMPFS/SHMEM, works as expected

openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
ftruncate(3, 4096)                      = 0
fallocate(3, 0, 0, 4096)                = 0

-> NFS, works as expected
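
The f_type values in the two traces are what a check before the unlink() would
look at. A minimal sketch, assuming hypothetical helpers fs_magic_name() and
should_skip_local_tmpfile() (neither exists in gup_longterm.c); the magic
numbers are the ones from <linux/magic.h>, repeated here only so that the
sketch builds without kernel headers:

```c
#include <assert.h>
#include <sys/vfs.h>	/* fstatfs(), struct statfs */

/* Values as in <linux/magic.h>; guarded in case that header is also included. */
#ifndef TMPFS_MAGIC
#define TMPFS_MAGIC	0x01021994
#endif
#ifndef NFS_SUPER_MAGIC
#define NFS_SUPER_MAGIC	0x6969
#endif
#ifndef V9FS_MAGIC
#define V9FS_MAGIC	0x01021997
#endif

/* Hypothetical helper: map an f_type value to a short filesystem name. */
const char *fs_magic_name(unsigned long magic)
{
	switch (magic) {
	case TMPFS_MAGIC:
		return "tmpfs";
	case NFS_SUPER_MAGIC:
		return "nfs";
	case V9FS_MAGIC:
		return "9p";
	default:
		return "unknown";
	}
}

/*
 * Hypothetical helper: returns 1 if the local tmpfile test should be skipped,
 * i.e. the fs type cannot be detected or the file lives on 9p; 0 otherwise.
 */
int should_skip_local_tmpfile(int fd)
{
	struct statfs fs;

	assert(fd >= 0);
	if (fstatfs(fd, &fs) != 0)
		return 1;
	return (unsigned long)fs.f_type == V9FS_MAGIC;
}
```

The check would have to run on the open fd before the unlink(), since in the
9p trace fstatfs() already fails once the file has been unlinked.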

Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.

ok 1 # SKIP gup_test not available
Itaru Kitayama April 10, 2024, 7:37 a.m. UTC | #18
> On Apr 10, 2024, at 16:10, David Hildenbrand <david@redhat.com> wrote:
> 
> On 10.04.24 08:47, Itaru Kitayama wrote:
>>> On Apr 10, 2024, at 8:30, Itaru Kitayama <itaru.kitayama@linux.dev> wrote:
>>> 
>>> Hi David,
>>> 
>>>> On Apr 9, 2024, at 23:45, David Hildenbrand <david@redhat.com> wrote:
>>>> 
>>>> On 09.04.24 16:39, Ryan Roberts wrote:
>>>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>      - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>>>>      - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>>>>      - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>>>>      - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>>>>      - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>>>>      - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>>>>>>>>>>>>        unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>>>      arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>>>      arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>>>      arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>>>      arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I've built and boot-tested v2 on FVP; the base is taken from your
>>>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm
>>>>>>>>>>>>>>>>> "not ok" results; would you take a look? The mm kselftests used are
>>>>>>>>>>>>>>>>> from your linux-rr repo too.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>>>> problem on
>>>>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>>>>> system. What
>>>>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>>>> space on
>>>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>>>> errno in
>>>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>>>> patch set applied? I thought we previously concluded that it was independent
>>>>>>>>>>>> of that? I was under the impression that it was filesystem related and not
>>>>>>>>>>>> something that I was planning to investigate.
>>>>>>>>>>> 
>>>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>>>> It is indeed 9p filesystem related: once I switched to NFS, all the
>>>>>>>>>>> issues were gone.
>>>>>>>>>> 
>>>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>>>> 
>>>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>>>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>>>>>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>>>>>>>> 
>>>>>>>>> Note: I'm wondering if the unlinkat() here is what makes ftruncate() on
>>>>>>>>> 9p fail with weird errors (e.g., the hypervisor unlinked the file and
>>>>>>>>> cannot reopen it for the fstatfs()/ftruncate()).
>>>>>>>>> 
>>>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>>>> 
>>>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>>>> 
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>>>> 
>>>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>>>> completely resolved?
>>>>>>> 
>>>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>>>> a 9p device, so perhaps the bug is in there.
>>>>>> 
>>>>>> Very likely.
>>>>>> 
>>>>>>> 
>>>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>>>> configs. I also see a couple of these fails in the cow tests.
>>>>>> 
>>>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>>>> statfs also fails already)! Likely a similar NFS issue.
>>>>> Unfortunately this is a system I don't have access to. I've requested some of
>>>>> this triage to be done, but it's fairly low priority unfortunately.
>>>> 
>>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>> 
>>>> ... doesn't feel right, though.
>>> 
>>> I think it’s a good idea so that the mm kselftests results look reasonable.
> 
> Yeah, but this will hide BUGs elsewhere. I suspect that there is also a BUG
> lurking somewhere in the NFS implementation in Ryan's setup. But that's just a
> guess until we have more details.
> 

Ok.

>>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?
> 
> While we could, I don't see much value in that for selftests. An strace log is much
> more valuable for understanding what is actually happening (e.g., fstatfs failing), and
> quite easy to obtain.

Ok.

> 
>>> Thanks,
>>> Itaru.
>> David, attached is the straced execution log of the gup_longterm kselftest over the NFS case.
>> I’m running the program on FVP, let me know if you need other logs or test results.
> 
> For your run, it all looks good:
> 
> openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
> fcntl(3, F_GETFL)                       = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
> fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
> ftruncate(3, 4096)                      = 0
> fallocate(3, 0, 0, 4096)                = 0
> 
> -> TMPFS/SHMEM, works as expected
> 
> openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
> fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
> ftruncate(3, 4096)                      = 0
> fallocate(3, 0, 0, 4096)                = 0
> 
> -> NFS, works as expected
> 
> Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.
> 
> ok 1 # SKIP gup_test not available

I rebuilt the v6.9-rc3 kernel with that option enabled. This time the SKIPs are due to “need more free huge pages”; I’ll check whether preparing enough huge pages is possible even on a system with limited memory.

Thanks,
Itaru. 

> 
> -- 
> Cheers,
> 
> David / dhildenb
David Hildenbrand April 10, 2024, 7:45 a.m. UTC | #19
On 10.04.24 09:37, Itaru Kitayama wrote:
> 
> 
>> On Apr 10, 2024, at 16:10, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.04.24 08:47, Itaru Kitayama wrote:
>>>> On Apr 10, 2024, at 8:30, Itaru Kitayama <itaru.kitayama@linux.dev> wrote:
>>>>
>>>> Hi David,
>>>>
>>>>> On Apr 9, 2024, at 23:45, David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 09.04.24 16:39, Ryan Roberts wrote:
>>>>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>>>>>>>>>>>>>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>       - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>>>>>       - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>>>>>       - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>>>>>       - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>>>>>       - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>>>>>       - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>>>>>>>>>>>>>>         unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>>>>       arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>>>>       arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've built and boot-tested v2 on FVP; the base is taken from your
>>>>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm
>>>>>>>>>>>>>>>>>> "not ok" results; would you take a look? The mm kselftests used are
>>>>>>>>>>>>>>>>>> from your linux-rr repo too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>>>>> problem on
>>>>>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>>>>>> system. What
>>>>>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>>>>> space on
>>>>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>>>>> errno in
>>>>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>>>>> patch set applied? I thought we previously concluded that it was independent
>>>>>>>>>>>>> of that? I was under the impression that it was filesystem related and not
>>>>>>>>>>>>> something that I was planning to investigate.
>>>>>>>>>>>>
>>>>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>>>>> It is indeed 9p filesystem related: once I switched to NFS, all the
>>>>>>>>>>>> issues were gone.
>>>>>>>>>>>
>>>>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>>>>>
>>>>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>>>>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>>>>>>>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>>>>>>>>>
>>>>>>>>>> Note: I'm wondering if the unlinkat() here is what makes ftruncate() on
>>>>>>>>>> 9p fail with weird errors (e.g., the hypervisor unlinked the file and
>>>>>>>>>> cannot reopen it for the fstatfs()/ftruncate()).
>>>>>>>>>>
>>>>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>>>>>
>>>>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>>>>>
>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>>>>>
>>>>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>>>>> completely resolved?
>>>>>>>>
>>>>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>>>>> a 9p device, so perhaps the bug is in there.
>>>>>>>
>>>>>>> Very likely.
>>>>>>>
>>>>>>>>
>>>>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>>>>> configs. I also see a couple of these fails in the cow tests.
>>>>>>>
>>>>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>>>>> statfs also fails already)! Likely a similar NFS issue.
>>>>>> Unfortunately this is a system I don't have access to. I've requested some of
>>>>>> this triage to be done, but it's fairly low priority unfortunately.
>>>>>
>>>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>>>
>>>>> ... doesn't feel right, though.
>>>>
>>>> I think it’s a good idea so that the mm kselftests results look reasonable.
>>
>> Yeah, but this will hide BUGs elsewhere. I suspect that there is also a BUG
>> lurking somewhere in the NFS implementation in Ryan's setup. But that's just a
>> guess until we have more details.
>>
> 
> Ok.
> 
>>>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?
>>
>> While we could, I don't see much value in that for selftests. An strace log is much
>> more valuable for understanding what is actually happening (e.g., fstatfs failing), and
>> quite easy to obtain.
> 
> Ok.
> 
>>
>>>> Thanks,
>>>> Itaru.
>>> David, attached is the straced execution log of the gup_longterm kselftest over the NFS case.
>>> I’m running the program on FVP, let me know if you need other logs or test results.
>>
>> For your run, it all looks good:
>>
>> openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
>> fcntl(3, F_GETFL)                       = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
>> fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
>> ftruncate(3, 4096)                      = 0
>> fallocate(3, 0, 0, 4096)                = 0
>>
>> -> TMPFS/SHMEM, works as expected
>>
>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
>> fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
>> ftruncate(3, 4096)                      = 0
>> fallocate(3, 0, 0, 4096)                = 0
>>
>> -> NFS, works as expected
>>
>> Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.
>>
>> ok 1 # SKIP gup_test not available
> 
> I rebuilt the v6.9-rc3 kernel with that option enabled. This time the SKIPs are due to “need more free huge pages”; I’ll check whether preparing enough huge pages is possible even on a system with limited memory.

That's expected, you have to reserve hugetlb pages before running the 
test. But the important thing is that tmpfs/nfs works for you.
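
For reference, the check suggested earlier in the thread (look up the
filesystem type before the unlink() and skip on 9p, or when the type cannot
be detected) might look roughly like the sketch below. This is not the
actual gup_longterm.c code: fs_should_skip() and open_local_tmpfile() are
hypothetical names, and the file name template and the -2 "skip" return
value are made up for illustration. Only fstatfs(), mkstemp(), unlink() and
V9FS_MAGIC from <linux/magic.h> are real.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/*
 * Hypothetical helper: true if tests on this fd should be skipped, either
 * because the FS type cannot be detected or because it is 9p.
 */
static bool fs_should_skip(int fd)
{
	struct statfs fs;

	if (fstatfs(fd, &fs))
		return true;
	return (unsigned long)fs.f_type == V9FS_MAGIC;
}

/*
 * Hypothetical helper: create a temp file in the current directory and
 * unlink it. The FS type is queried *before* unlink(), since fstatfs()
 * on the already-unlinked file is what failed on 9p in the strace above.
 * Returns the fd, -1 on error, or -2 if the test should be skipped.
 */
static int open_local_tmpfile(void)
{
	char name[] = "sketch_tmpfile_XXXXXX";
	bool skip;
	int fd;

	fd = mkstemp(name);
	if (fd < 0)
		return -1;

	skip = fs_should_skip(fd);
	unlink(name);
	if (skip) {
		close(fd);
		return -2;
	}
	return fd;
}
```

As noted above, skipping this way would also hide real bugs in the
underlying filesystem, so it is a workaround rather than a fix.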