Message ID: 20210715033704.692967-1-willy@infradead.org (mailing list archive)
Series: Memory folios
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.

Hey Willy,

I must confess I've lost the thread of the plot in terms of how you hope to get the Memory folio work merged upstream. There are some partial patch sets that just have the mm core, and then there were some larger patchsets, including some in the past which, as I recall, would touch ext4 (but which isn't in this set).

I was wondering if you could perhaps post a roadmap for how this patch set might be broken up, and which subsections you were hoping to target for the upcoming merge window versus the following merge windows.

Also I assume that for file systems that aren't converted to use Folios, there won't be any performance regressions --- is that correct? Or is that something we need to watch for? Put another way, if we don't land all of the memory folio patches before the end of the calendar year, and we cut an LTS release with some file systems converted and some file systems not yet converted, are there any potential problems in that eventuality?

Thanks!

- Ted
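To make the cover letter's point concrete for readers new to the thread, here is a simplified sketch (not a quote from the series) of the hidden compound_head() calls it describes, and of the struct folio type that removes them. The helper names are real kernel ones from this era; the bodies are trimmed down for illustration.

/*
 * Pre-folio kernels, simplified: every page flag test and refcount
 * operation must defend against being handed a tail page, so each one
 * hides at least one compound_head() lookup.
 */
static __always_inline int PageDirty(struct page *page)
{
	return test_bit(PG_dirty, &compound_head(page)->flags);
}

static inline void put_page(struct page *page)
{
	page = compound_head(page);	/* another hidden lookup */

	if (put_page_testzero(page))
		__put_page(page);
}

/*
 * The folio idea, reduced to its essence: a struct folio always refers
 * to a head (or order-0) page, never a tail page, so folio operations
 * can drop the check.  The page-to-folio conversion happens once, at
 * the boundary.
 */
struct folio {
	struct page page;	/* never a tail page */
};

static inline struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

static inline void folio_get(struct folio *folio)
{
	page_ref_inc(&folio->page);	/* no hidden compound_head() */
}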
On Thu, Jul 15, 2021 at 11:56:07AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> > Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
> >
> > Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
> >
> > We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
> >
> > This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.
>
> Hey Willy,
>
> I must confess I've lost the thread of the plot in terms of how you hope to get the Memory folio work merged upstream. There are some partial patch sets that just have the mm core, and then there were some larger patchsets, including some in the past which, as I recall, would touch ext4 (but which isn't in this set).
>
> I was wondering if you could perhaps post a roadmap for how this patch set might be broken up, and which subsections you were hoping to target for the upcoming merge window versus the following merge windows.

Hi Ted! Great questions. This particular incarnation of the patch set is the one Linus asked for -- show the performance win of using compound pages in the page cache.

I think of this patchset as having six parts:

1-32:    core; introduce struct folio, get/put, flags
33-50:   memcg
51-89:   page cache, part 1
90-107:  block + iomap
108-124: page cache, part 2
125-138: actually use compound pages in the page cache

I'm hoping to get the first three parts (patches 1-89) into the next merge window. That gets us to the point where filesystems can start to use folios themselves (ie it does the initial Amdahl step and then everything else can happen in parallel).

> Also I assume that for file systems that aren't converted to use Folios, there won't be any performance regressions --- is that correct? Or is that something we need to watch for? Put another way, if we don't land all of the memory folio patches before the end of the calendar year, and we cut an LTS release with some file systems converted and some file systems not yet converted, are there any potential problems in that eventuality?

I suppose I can't guarantee that there will be no performance regressions as a result (eg 5899593f51e6 was a regression that was seen as a result of some of the prep work for folios), but I do not anticipate any for unconverted filesystems. There might be a tiny performance penalty for supporting arbitrary-order pages instead of just orders 0 and 9, but I haven't seen anything to suggest it's noticeable. I would expect to see a tiny performance win from removing all the compound_head() calls in the VFS core.
I have a proposal in to Plumbers filesystem track where I intend to go over all the ways I'm going to want filesystems to change to take advantage of folios. I think that will be a good venue to discuss how to handle buffer_head based filesystems in a multi-page folio world. I wouldn't expect anything to have to change before the end of the year.

I only have four patches in my extended tree which touch ext4, and they're all in the context of making treewide changes to all filesystems:

 - Converting ->set_page_dirty to ->dirty_folio,
 - Converting ->invalidatepage to ->invalidate_folio,
 - Converting ->readpage to ->read_folio,
 - Changing readahead_page() to readahead_folio()

None of those patches are in great shape at this point, and I wouldn't ask anyone to review them. I am anticipating that some filesystems will never be converted to multi-page folios (although all filesystems should be converted to single-page folios so we can remove the folio compat code).
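For orientation, the four conversions listed above eventually settled into address_space_operations shapes roughly like the following in later mainline kernels. At the time of this mail they were still unreviewed drafts, so treat these signatures as illustrative rather than as the patches under discussion.

/*
 * Rough shapes the listed conversions later took in mainline; the
 * drafts in the extended tree at this point may have differed.
 */
struct address_space_operations {
	/* was: int (*set_page_dirty)(struct page *); */
	bool (*dirty_folio)(struct address_space *mapping, struct folio *folio);

	/* was: void (*invalidatepage)(struct page *, unsigned int, unsigned int); */
	void (*invalidate_folio)(struct folio *folio, size_t offset, size_t len);

	/* was: int (*readpage)(struct file *, struct page *); */
	int (*read_folio)(struct file *file, struct folio *folio);

	/* ... other ops unchanged ... */
};

/* was: struct page *readahead_page(struct readahead_control *ractl); */
struct folio *readahead_folio(struct readahead_control *ractl);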
Hi Matthew,

(Sorry for the late response, I could not find time earlier)

On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.

I like the idea of folio, and the first patches I've reviewed look good.

Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?

Also, what is the difference on !x86 builds?

> Git: https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/folio_14
On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?

I did an allnoconfig build and that reduced in size by ~2KiB.

> Also, what is the difference on !x86 builds?

I don't generally do non-x86 builds ... feel free to compare for yourself! I imagine it'll be 2-4 instructions per call to compound_head(). ie something like:

	load page into reg S
	load reg S + 8 into reg T
	test bottom bit of reg T
	cond-move reg T - 1 to reg S

becomes

	load folio into reg S

the exact spelling of those instructions will vary from architecture to architecture; some will take more instructions than others. Possibly it means we end up using one fewer register and so reducing the number of registers spilled to the stack. Probably not, though.
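The pseudo-assembly above corresponds roughly to this C: a simplified sketch of compound_head() as it looked around v5.14, not code from the patch series itself.

/*
 * Simplified sketch of compound_head() circa v5.14: tail pages keep a
 * pointer to their head page, with bit 0 set, in page->compound_head,
 * so every caller pays for a load, a bit test and a conditional fixup.
 */
static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return (struct page *)(head - 1);
	return page;
}

/*
 * A struct folio is never a tail page, so the folio equivalent is just
 * "use the pointer" - the single "load folio into reg S" case above.
 */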
On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
>
> I did an allnoconfig build and that reduced in size by ~2KiB.
>
> > Also, what is the difference on !x86 builds?
>
> I don't generally do non-x86 builds ... feel free to compare for yourself!

I did allnoconfig and defconfig for arm64 and powerpc.

All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes. I didn't dig into objdumps yet.

I also tried to build arm but it failed with:

  CC      fs/remap_range.o
fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
  250 |   flush_dcache_folio(src_folio);
      |   ^~~~~~~~~~~~~~~~~~
      |   flush_cache_louis
cc1: some warnings being treated as errors

> I imagine it'll be 2-4 instructions per call to compound_head(). ie something like:
>
> 	load page into reg S
> 	load reg S + 8 into reg T
> 	test bottom bit of reg T
> 	cond-move reg T - 1 to reg S
>
> becomes
>
> 	load folio into reg S
>
> the exact spelling of those instructions will vary from architecture to architecture; some will take more instructions than others. Possibly it means we end up using one fewer register and so reducing the number of registers spilled to the stack. Probably not, though.
On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
> >
> > I did an allnoconfig build and that reduced in size by ~2KiB.
> >
> > > Also, what is the difference on !x86 builds?
> >
> > I don't generally do non-x86 builds ... feel free to compare for yourself!
>
> I did allnoconfig and defconfig for arm64 and powerpc.
>
> All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes.

Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.

> I didn't dig into objdumps yet.
>
> I also tried to build arm but it failed with:
>
>   CC      fs/remap_range.o
> fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
> fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
>   250 |   flush_dcache_folio(src_folio);
>       |   ^~~~~~~~~~~~~~~~~~
>       |   flush_cache_louis
> cc1: some warnings being treated as errors

Already complained about by the build bot; already fixed. You should maybe look at the git tree if you're doing more than code review.
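The fix itself isn't shown in this thread. One plausible shape for it -- roughly what later mainline kernels provide for architectures such as arm that implement flush_dcache_page() -- is a generic helper that flushes a multi-page folio one constituent page at a time. This is a sketch of that idea, not the actual patch from the folio tree.

/*
 * Sketch of a generic fallback for architectures that only know how to
 * flush individual pages: walk the folio and flush each page.  Uses
 * folio_nr_pages()/folio_page() from the series; details of the real
 * fix may differ.
 */
void flush_dcache_folio(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		flush_dcache_page(folio_page(folio, i));
}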
On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
> > >
> > > I did an allnoconfig build and that reduced in size by ~2KiB.
> > >
> > > > Also, what is the difference on !x86 builds?
> > >
> > > I don't generally do non-x86 builds ... feel free to compare for yourself!
> >
> > I did allnoconfig and defconfig for arm64 and powerpc.
> >
> > All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes.
>
> Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.

I've used the folio_14 tag:

commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Feb 5 11:27:01 2020 -0500

    mm/readahead: Add multi-page folio readahead
On Tue, Jul 20, 2021 at 06:35:50PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> > Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.
>
> I've used the folio_14 tag:
>
> commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Wed Feb 5 11:27:01 2020 -0500
>
>     mm/readahead: Add multi-page folio readahead

Probably worth trying the for-next tag instead to get a meaningful comparison of how much using folios saves over pages. I don't want to give the impression that this is all that can be saved by switching to folios. There are still hundreds of places that call PageFoo(), SetPageFoo(), ClearPageFoo(), put_page(), get_page(), lock_page() and so on. There's probably another 20KB of code that can be removed that way.
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.

I want to thank Michael Larabel for his benchmarking effort:
https://www.phoronix.com/scan.php?page=news_item&px=Folios-v14-Testing-AMD-Linux

I'm not too surprised by the lack of performance change on the majority of benchmarks. This patch series is only going to change things for heavy users of the page cache (ie it'll do nothing for anon memory users), and it's only really a benefit for programs that have good locality.

What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.

Should we accelerate inclusion of this patchset? Right now, I have 89 mm patches queued up for the 5.15 merge window. My plan was to get the 17 iomap + block patches, plus another 18 page cache patches into 5.16 and then get the 14 multi-page folio patches into 5.17. But I'm mindful of the longterm release coming up "soon", and I'm not sure we're best served by multiple distros trying to backport the multi-page folio patches to either 5.15 or 5.16.
On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.

Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.

James
On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
>
> Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.

I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.

Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> >
> > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
>
> I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.

That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.

> Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?

Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

James
Hi,

On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.
>
> > Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).

Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?

Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.

Regards,

Andres
On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.

We're still not communicating well. I'm not talking about writes, I'm talking about reads. Postgres uses the page cache for reads. InnoDB uses O_DIRECT (afaict). See articles like this one:
https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

: The first and most obvious type of IO are pages reads and writes from
: the tablespaces. The pages are most often read one at a time, as 16KB
: random read operations. Writes to the tablespaces are also typically
: 16KB random operations, but they are done in batches. After every batch,
: fsync is called on the tablespace file handle.

(the current folio patch set does not create multi-page folios for writes, only for reads)

I downloaded the mariadb source package that's in Debian, and from what I can glean, it does indeed set O_DIRECT on data files in Linux, through os_file_set_nocache().

> > Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

... or the two benchmarks use Linux in completely different ways such that one sees a huge benefit while the other sees none. Which is what you'd expect for a patchset that improves the page cache and a benchmark that doesn't use the page cache.
On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.
>
> I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).
>
> Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
>
> Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.

It's going to depend substantially on the access pattern. If the 16MB file (oof, that's tiny!) was read in in large chunks or even in small chunks, but consecutively, the folio changes will allocate larger pages (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and start using PMDs, but I've never seen that in my testing.

fdatasync() could indeed have got much faster. If we're writing back a 256kB page as a unit, we're handling 64 times less metadata than writing back 64x4kB pages. We'll track 64x less dirty bits. We'll find only 64 dirty pages per 16MB instead of 4096 dirty pages.

It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.

Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
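As a rough illustration of the "16k, 64k, 256k, ..." progression mentioned above: readahead can step the folio allocation order up as a stream proves itself sequential. The helper name and the order-9 cap below are illustrative assumptions, not the series' actual code, though the step of two orders per ramp-up matches the sizes quoted.

/*
 * Illustrative only: each successful readahead round bumps the folio
 * order by 2, i.e. 4KiB -> 16KiB -> 64KiB -> 256KiB -> ..., capped here
 * at order 9 (2MiB on x86) as an assumed upper bound.
 */
static unsigned int next_folio_order(unsigned int prev_order)
{
	unsigned int order = prev_order + 2;

	return order < 9 ? order : 9;
}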
Hi,

On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > > Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.
> >
> > I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).
> >
> > Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
> >
> > Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.
>
> It's going to depend substantially on the access pattern. If the 16MB file (oof, that's tiny!) was read in in large chunks or even in small chunks, but consecutively, the folio changes will allocate larger pages (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and start using PMDs, but I've never seen that in my testing.

The 16MB files are just for the WAL/journal, and are write only in a benchmark like this. With pgbench it'll be written in small consecutive chunks (a few pages at a time, for each group commit). Each page is only written once, until after a checkpoint the entire file is "recycled" (renamed into the future of the WAL stream) and reused from the start.

The data files are 1GB.

> fdatasync() could indeed have got much faster. If we're writing back a 256kB page as a unit, we're handling 64 times less metadata than writing back 64x4kB pages. We'll track 64x less dirty bits. We'll find only 64 dirty pages per 16MB instead of 4096 dirty pages.

The dirty writes will be 8-32k or so in this workload - the constant commits require the WAL to constantly be flushed.

> It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
>
> Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?

Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...

Michael, where do I find more details about the configuration used during the run?

Regards,

Andres
On Sat, 2021-07-24 at 19:50 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > > >
> > > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> > >
> > > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
> >
> > That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.
>
> We're still not communicating well. I'm not talking about writes, I'm talking about reads. Postgres uses the page cache for reads. InnoDB uses O_DIRECT (afaict). See articles like this one:
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

If it were all about reads, wouldn't the Phoronix pgbench read only test have shown a better improvement than 7%? I think the Phoronix data shows that whatever it is it's to do with writes ... that does imply something in the way the log syncs data.

James
Hi,

On 2021-07-24 12:12:36 -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
> >
> > Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
>
> Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...

I just tried to compare folio-14 vs its baseline, testing commit 8096acd7442e against 480552d0322d. In a VM however (but at least with its memory being backed by huge pages and storage being passed through). I got about a 7% improvement with just some baseline tuning of postgres applied. I think 1-2% of that is potentially runtime variance (I saw slightly different timings around checkpointing that lead to a bit of an "unfair" advantage for the folio run).

That's a *nice* win!

WRT the ~70% improvement:

> Michael, where do I find more details about the configuration used during the run?

After some digging I found https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/install.sh

For one, the test says it's done on ext4, while I used xfs. But I think the bigger thing is the following:

The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).

Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.

Without changing the filesystem, using a configuration more similar to phoronix', I do get a bigger win. But the run-to-run variance is so high (largely due to the short test duration) that I don't trust those results much.

It does look like there's less of a slowdown due to checkpoints (i.e. fsyncing all data files postgres modified since the last checkpoints) on the folio branch, which does make some sense to me and would be a welcome improvement.

Greetings,

Andres Freund
On 7/24/21 4:44 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-24 12:12:36 -0700, Andres Freund wrote:
>> On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
>>> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
>>> It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
>>>
>>> Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
>>
>> Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...
>
> I just tried to compare folio-14 vs its baseline, testing commit 8096acd7442e against 480552d0322d. In a VM however (but at least with its memory being backed by huge pages and storage being passed through). I got about a 7% improvement with just some baseline tuning of postgres applied. I think 1-2% of that is potentially runtime variance (I saw slightly different timings around checkpointing that lead to a bit of an "unfair" advantage for the folio run).
>
> That's a *nice* win!
>
> WRT the ~70% improvement:
>
>> Michael, where do I find more details about the configuration used during the run?
>
> After some digging I found https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/install.sh
>
> For one, the test says it's done on ext4, while I used xfs. But I think the bigger thing is the following:

Yes, that is the run/setup script used. The additional pgbench arguments passed at run-time are outlined in https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/test-definition.xml

Though in this case it is quite straightforward in corresponding to the relevant -s, -c options for pgbench and what is shown in turn on the pgbench graphs.

I have been running some more PostgreSQL tests on other hardware as well as via HammerDB and other databases. Will send that over when wrapped up, likely tomorrow.

Michael

> The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
>
> Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.
>
> Without changing the filesystem, using a configuration more similar to phoronix', I do get a bigger win. But the run-to-run variance is so high (largely due to the short test duration) that I don't trust those results much.
>
> It does look like there's less of a slowdown due to checkpoints (i.e. fsyncing all data files postgres modified since the last checkpoints) on the folio branch, which does make some sense to me and would be a welcome improvement.
>
> Greetings,
>
> Andres Freund
On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
>
> Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.

Hi Andres,

I tend to use the phoronix test suite for my performance runs when testing ext4 changes simply because it's convenient. Can you suggest a better set of configuration settings that I should perhaps use that might give more "real world" numbers that you would find more significant?

Thanks,

- Ted
Hi,

On 2021-07-26 10:19:11 -0400, Theodore Ts'o wrote:
> On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> > The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
> >
> > Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.
>
> I tend to use the phoronix test suite for my performance runs when testing ext4 changes simply because it's convenient. Can you suggest a better set of configuration settings that I should perhaps use that might give more "real world" numbers that you would find more significant?

It depends a bit on what you want to test, obviously... At the very least you should set 'max_wal_size = 32GB' or such (it'll only use that much if enough WAL is generated within the checkpoint timeout, which defaults to 5min). And unfortunately you're not going to get meaningful performance results for a read/write test within 10s; you need to run at least ~11min (so two checkpoints happen).

With the default shared_buffers setting of 128MB you are going to simulate a much-larger-than-postgres's-memory workload, albeit one where the page cache *is* big enough on most current machines, unless you limit the size of the page cache considerably. Doing so can be useful to approximate a workload that would take much longer to initialize due to the size.

I suggest *not* disabling autovacuum as currently done for performance testing - it's not something many real-world setups can afford to do, so benchmarking FS performance with it disabled doesn't seem like a good idea.

FWIW, depending on what kind of thing you want to test, it'd not be hard to come up with a test that takes less time to initialize, e.g. an insert-only workload without an initial dataset or such. As long as you *do* initialize 16GB of data, I think it'd make sense to measure the time that takes. There have definitely been filesystem-level performance changes in that area, and it's often going to be more IO intensive.

Greetings,

Andres Freund