[v1,00/18] mm: mapcount for large folios + page_mapcount() cleanups

Message ID: 20240409192301.907377-1-david@redhat.com

David Hildenbrand April 9, 2024, 7:22 p.m. UTC
This series tracks the mapcount of large folios in a single value, so
it can be read efficiently and atomically, just like the mapcount of
small folios.
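
As a rough userspace illustration of the idea (this is not the kernel code
from this series; the struct, the macro and all toy_* helper names are made
up for the sketch), a large folio can keep one atomic counter that is updated
together with the per-page counters. Reading the folio mapcount is then a
single atomic load instead of a sum over all pages:

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_PAGES 16	/* hypothetical order-4 folio */

/* Toy model only: this is not the kernel's struct folio. */
struct toy_folio {
	atomic_int page_mapcount[TOY_NR_PAGES];	/* mappings of each page */
	atomic_int large_mapcount;		/* mappings of the whole folio */
};

static inline void toy_folio_init(struct toy_folio *f)
{
	for (int i = 0; i < TOY_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], 0);
	atomic_init(&f->large_mapcount, 0);
}

/* Map one page: bump the per-page counter and the folio-wide counter. */
static inline void toy_map_page(struct toy_folio *f, int idx)
{
	atomic_fetch_add(&f->page_mapcount[idx], 1);
	atomic_fetch_add(&f->large_mapcount, 1);
}

/* Unmap one page; the assertion catches per-page underflows. */
static inline void toy_unmap_page(struct toy_folio *f, int idx)
{
	int old = atomic_fetch_sub(&f->page_mapcount[idx], 1);

	assert(old > 0);
	atomic_fetch_sub(&f->large_mapcount, 1);
}

/* Fast path: one atomic load, independent of the folio size. */
static inline int toy_folio_mapcount(struct toy_folio *f)
{
	return atomic_load(&f->large_mapcount);
}

/* What a reader has to do without the single value: sum all pages. */
static inline int toy_folio_mapcount_slow(struct toy_folio *f)
{
	int sum = 0;

	for (int i = 0; i < TOY_NR_PAGES; i++)
		sum += atomic_load(&f->page_mapcount[i]);
	return sum;
}
```

In this model both readers return the same number while nothing changes
concurrently, but only toy_folio_mapcount() reads a consistent value in one
step while other threads are mapping and unmapping pages.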

folio_mapcount() is then used in a couple more places, most notably to
reduce false negatives in folio_likely_mapped_shared(), and many users of
page_mapcount() are cleaned up (that's maybe why you got CCed on the
full series, sorry sh+xtensa folks! :) ).

The remaining s390x user and one KSM user of page_mapcount() are getting
removed separately on the list right now. I have patches to handle the
other KSM one, the khugepaged one and the kpagecount one; as they are not
as "obvious", I will send them out separately in the future. Once that is
all in place, I'm planning on moving page_mapcount() into
fs/proc/task_mmu.c, the remaining user for the time being (and we can
discuss the details at LSF/MM :) ).

I originally proposed the mapcount for large folios (previously called total
mapcount) as part of [1] and later included it in [2], where it is a
requirement. The patch has changed a bit since then, so I dropped all RBs.
During the discussion of [1], Peter Xu correctly pointed out that this
additional tracking might affect performance when PMD->PTE remapping THPs.
In the meantime, I addressed that by batching RMAP operations during fork(),
unmap/zap and when PMD->PTE remapping THPs.
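
The effect of batching can be sketched in userspace as follows. This is an
illustrative model with hypothetical names (sketch_folio, sketch_map_range_*),
not the rmap code from this series: when nr consecutive pages of the same
folio are processed together, each per-page counter still needs its own
update, but the folio-wide counter needs only a single atomic add. Each
helper returns how many atomic operations it issued on the folio-wide
counter.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_PAGES 512	/* hypothetical order-9 folio */

/* Toy model only: this is not the kernel's struct folio. */
struct sketch_folio {
	atomic_int page_mapcount[SKETCH_NR_PAGES];
	atomic_int large_mapcount;
};

static inline void sketch_folio_init(struct sketch_folio *f)
{
	for (int i = 0; i < SKETCH_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], 0);
	atomic_init(&f->large_mapcount, 0);
}

/* Unbatched: one atomic on the folio-wide counter per mapped page. */
static inline int sketch_map_range_unbatched(struct sketch_folio *f,
					     int first, int nr)
{
	int folio_atomics = 0;

	assert(first >= 0 && nr > 0 && first + nr <= SKETCH_NR_PAGES);
	for (int i = first; i < first + nr; i++) {
		atomic_fetch_add(&f->page_mapcount[i], 1);
		atomic_fetch_add(&f->large_mapcount, 1);
		folio_atomics++;
	}
	return folio_atomics;
}

/* Batched: per-page updates as before, one atomic add of nr for the folio. */
static inline int sketch_map_range_batched(struct sketch_folio *f,
					   int first, int nr)
{
	assert(first >= 0 && nr > 0 && first + nr <= SKETCH_NR_PAGES);
	for (int i = first; i < first + nr; i++)
		atomic_fetch_add(&f->page_mapcount[i], 1);
	atomic_fetch_add(&f->large_mapcount, nr);
	return 1;
}
```

Both variants leave the counters in the same state; the batched one just
touches the shared folio-wide counter once per range instead of once per
page.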

Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1
GiB of memory backed by folios with the same order, I observe the following
on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for reproducible
results as much as possible:

Standard deviation is mostly < 1%, except for order-9, where it's < 2% for
fork() and munmap().

(1) Small folios are not affected (< 1%) in all 4 microbenchmarks.
(2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit
    weird compared to the other orders ...
(3) PMD->PTE remapping of order-9 THPs is not affected (< 1%)
(4) COW-byte (COWing a single page by writing a single byte) is not
    affected for any order (< 1%). The page copy_fault overhead dominates
    everything.
(5) fork() is mostly not affected (< 1%), except order-2, where we have
    a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown
    of < 1%.
(6) munmap() sees a slowdown by < 3% for some orders (order-5,
    order-6, order-9), but less for others (< 1% for order-4 and order-8,
    < 2% for order-2, order-3, order-7).

Especially the fork() and munmap() benchmarks are sensitive to each added
instruction and other system noise, so I suspect some of the changes and
observed weirdness (order-4) are due to code layout changes and other
factors, but not really due to the added atomics.

So in the common case where we can batch, the added atomics don't really
make a big difference, especially in light of the improvements for large
folios that we recently gained due to batching. Surprisingly, for some
cases where we cannot batch (e.g., COW), the added atomics don't seem to
matter either, because other overhead dominates.

My fork and munmap micro-benchmarks don't cover cases where we cannot
batch-process bigger parts of large folios. As this is not the common case,
I'm not worrying about that right now.

Future work is batching RMAP operations during swapout and folio
migration.

Not CCing everybody recommended by get_maintainer.pl (e.g., cgroups folks
just because of the doc update), to reduce noise. Tested on x86-64,
compile-tested on a bunch of other archs. Will do more testing in the
upcoming days.

[1] https://lore.kernel.org/all/20230809083256.699513-1-david@redhat.com/
[2] https://lore.kernel.org/all/20231124132626.235350-1-david@redhat.com/
[3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Richard Chang <richardycc@google.com>

David Hildenbrand (18):
  mm: allow for detecting underflows with page_mapcount() again
  mm/rmap: always inline anon/file rmap duplication of a single PTE
  mm/rmap: add fast-path for small folios when
    adding/removing/duplicating
  mm: track mapcount of large folios in single value
  mm: improve folio_likely_mapped_shared() using the mapcount of large
    folios
  mm: make folio_mapcount() return 0 for small typed folios
  mm/memory: use folio_mapcount() in zap_present_folio_ptes()
  mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check
  mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()
  mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
  mm/migrate: use folio_likely_mapped_shared() in
    add_page_for_migration()
  sh/mm/cache: use folio_mapped() in copy_from_user_page()
  mm/filemap: use folio_mapcount() in filemap_unaccount_folio()
  mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()
  trace/events/page_ref: trace the raw page mapcount value
  xtensa/mm: convert check_tlb_entry() to sanity check folios
  mm/debug: print only page mapcount (excluding folio entire mapcount)
    in __dump_folio()
  Documentation/admin-guide/cgroup-v1/memory.rst: don't reference
    page_mapcount()

 .../admin-guide/cgroup-v1/memory.rst          |  4 +-
 Documentation/mm/transhuge.rst                | 12 +--
 arch/sh/mm/cache.c                            |  2 +-
 arch/xtensa/mm/tlb.c                          | 11 +--
 include/linux/mm.h                            | 77 +++++++++++--------
 include/linux/mm_types.h                      |  5 +-
 include/linux/rmap.h                          | 40 +++++++++-
 include/trace/events/page_ref.h               |  4 +-
 mm/debug.c                                    | 12 +--
 mm/filemap.c                                  |  2 +-
 mm/huge_memory.c                              |  2 +-
 mm/hugetlb.c                                  |  4 +-
 mm/internal.h                                 |  3 +
 mm/khugepaged.c                               |  2 +-
 mm/memory-failure.c                           |  4 +-
 mm/memory.c                                   |  3 +-
 mm/migrate.c                                  |  2 +-
 mm/migrate_device.c                           | 12 +--
 mm/page_alloc.c                               | 12 ++-
 mm/rmap.c                                     | 60 +++++++--------
 20 files changed, 163 insertions(+), 110 deletions(-)