[RFC,v6,00/51] Large pages in the page cache

Message ID: 20200610201345.13273-1-willy@infradead.org

Matthew Wilcox June 10, 2020, 8:12 p.m. UTC
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Another fortnight, another dump of my current large pages work.
I've squished a lot of bugs this time.  xfstests is much happier now,
running for 1631 seconds and getting as far as generic/086.  This patchset
is getting a little big, so I'm going to try to get some bits of it
upstream soon (the bits that make sense regardless of whether the rest
of this is merged).

It's now based on Linus' master (6f630784cc0d), and you can get it from
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache
if you'd rather see it there (this branch is force-pushed frequently).

The primary idea here is that a large part of the overhead in dealing
with individual pages is that there's just so darned many of them.
We would be better off dealing with fewer, larger pages, even if they
don't get to be the size necessary for the CPU to use a larger TLB entry.

The approach taken is to make THPs support arbitrary power-of-two sizes
(instead of just PMDs).  There's probably some tuning to be done to decide
what sizes are worth using, but we're a fair way from doing performance
work with this patchset yet.

TODO:
 - Fix arc/arm/arm64/mips/powerpc/sparc flush_dcache_page() to
   support THPs natively
 - Actually create large pages for sufficiently large writes
 - Copy in larger chunks for write() in iomap
 - More bug fixing

v6:
 - Improved debug output for large pages (will send to Andrew soon)
 - Make compound_nr() more efficient (will send to Andrew soon)
 - Renamed hpage_nr_pages() to thp_nr_pages()
 - Added thp_head()
 - Set the THP_SUPPORT flag in shmfs
 - Change zero_user_segments() to call flush_dcache_page() once for the
   head page instead of once for each subpage.  The architectures listed
   above need to be fixed.
 - Fix shmem & truncate to call zero_user_segment() with the head page
 - Fix page_is_mergeable() for THPs
 - Fix a bug in iomap_iop_set_range_uptodate() where I was assuming that
   the offset was block-aligned
 - Fix a few more places that assume unsigned int is large enough to hold
   offset/length within a page
 - Fix doing writeback of a page after discarding its iop due to a partial
   truncate
 - Convert the iomap write paths more comprehensively.  That's now four
   separate patches
v5:
 - Add a mapping AS_LARGE_PAGES flag to reduce the levels of indirection
   (Dave Chinner)
 - Change iomap_invalidate_page() to handle subpages of a THP being punched
 - Ensure we don't call page_cache_async_readahead() with a tail page
 - Revert to Bill's original patch for thp_get_unmapped_area() to allow
   for hardware page sizes other than PMD to be supported more easily
 - Remove a few more HPAGE_PMD_NR
 - Move shmem_punch_compound() to truncate.c and rename it to punch_thp()
 - Add support for page_private to punch_thp()

v4:
 - Fix thp_size typo
 - Fix the iomap page_mkwrite() path to operate on the head page, even
   though the vm_fault has a pointer to the tail page
 - Fix iomap_finish_ioend() to use bio_for_each_thp_segment_all()
 - Rework PageDoubleMap (see first two patches for details)
 - Fix page_cache_delete() to handle shadow entries being stored to a THP
 - Fix the assertion in pagecache_get_page() to handle tail pages
 - Change PageReadahead from NO_COMPOUND to ONLY_HEAD
 - Handle PageReadahead being set on head pages
 - Handle total_mapcount correctly (Kirill)
 - Pull the FS_LARGE_PAGES check out into mapping_large_pages()
 - Fix page size assumption in truncate_cleanup_page()
 - Avoid splitting large pages unnecessarily on truncate
 - Disable the page cache truncation introduced as part of the read-only
   THP patch set
 - Call compound_head() in iomap buffered write paths -- we retrieve a
   (potentially) tail page from the page cache and need to use that for
   flush_dcache_page(), but we expect to operate on a head page in most
   of the iomap code



Kirill A. Shutemov (1):
  mm: Fix total_mapcount assumption of page size

Matthew Wilcox (Oracle) (49):
  mm: Print head flags in dump_page
  mm: Print the inode number in dump_page
  mm: Print hashed address of struct page
  mm: Move PageDoubleMap bit
  mm: Simplify PageDoubleMap with PF_SECOND policy
  mm: Store compound_nr as well as compound_order
  mm: Move page-flags include to top of file
  mm: Add thp_order
  mm: Add thp_size
  mm: Replace hpage_nr_pages with thp_nr_pages
  mm: Add thp_head
  mm: Introduce offset_in_thp
  mm: Support arbitrary THP sizes
  fs: Add a filesystem flag for THPs
  fs: Do not update nr_thps for mappings which support THPs
  fs: Introduce i_blocks_per_page
  fs: Make page_mkwrite_check_truncate thp-aware
  mm: Support THPs in zero_user_segments
  mm: Zero the head page, not the tail page
  block: Add bio_for_each_thp_segment_all
  block: Support THPs in page_is_mergeable
  iomap: Support arbitrarily many blocks per page
  iomap: Support THPs in iomap_adjust_read_range
  iomap: Support THPs in invalidatepage
  iomap: Support THPs in read paths
  iomap: Convert iomap_write_end types
  iomap: Change calling convention for zeroing
  iomap: Change iomap_write_begin calling convention
  iomap: Support THPs in write paths
  iomap: Inline data shouldn't see THPs
  iomap: Handle tail pages in iomap_page_mkwrite
  xfs: Support THPs
  mm: Make prep_transhuge_page return its argument
  mm: Add __page_cache_alloc_order
  mm: Allow THPs to be added to the page cache
  mm: Allow THPs to be removed from the page cache
  mm: Remove page fault assumption of compound page size
  mm: Remove assumptions of THP size
  mm: Avoid splitting THPs
  mm: Fix truncation for pages of arbitrary size
  mm: Handle truncates that split THPs
  mm: Support storing shadow entries for THPs
  mm: Support retrieving tail pages from the page cache
  mm: Support tail pages in wait_for_stable_page
  mm: Add DEFINE_READAHEAD
  mm: Make page_cache_readahead_unbounded take a readahead_control
  mm: Make __do_page_cache_readahead take a readahead_control
  mm: Allow PageReadahead to be set on head pages
  mm: Add THP readahead

William Kucharski (1):
  mm: Align THP mappings for non-DAX

 block/bio.c                |   2 +-
 drivers/nvdimm/btt.c       |   4 +-
 drivers/nvdimm/pmem.c      |   6 +-
 fs/dax.c                   |  13 +-
 fs/ext4/verity.c           |   4 +-
 fs/f2fs/verity.c           |   4 +-
 fs/inode.c                 |   2 +
 fs/iomap/buffered-io.c     | 250 +++++++++++++++++++------------------
 fs/jfs/jfs_metapage.c      |   2 +-
 fs/xfs/xfs_aops.c          |   4 +-
 fs/xfs/xfs_super.c         |   2 +-
 include/linux/bio.h        |  13 ++
 include/linux/bvec.h       |  23 ++++
 include/linux/dax.h        |   3 +-
 include/linux/fs.h         |  28 +----
 include/linux/highmem.h    |  11 +-
 include/linux/huge_mm.h    |  65 ++++++++--
 include/linux/mm.h         |  46 +++----
 include/linux/mm_inline.h  |   6 +-
 include/linux/mm_types.h   |   1 +
 include/linux/page-flags.h |  46 ++-----
 include/linux/pagemap.h    | 102 ++++++++++++---
 mm/compaction.c            |   2 +-
 mm/debug.c                 |  23 ++--
 mm/filemap.c               | 101 +++++++++------
 mm/gup.c                   |   2 +-
 mm/highmem.c               |  62 ++++++++-
 mm/huge_memory.c           |  38 +++---
 mm/hugetlb.c               |   2 +-
 mm/internal.h              |  17 +--
 mm/memcontrol.c            |  10 +-
 mm/memory.c                |   7 +-
 mm/memory_hotplug.c        |   7 +-
 mm/mempolicy.c             |   2 +-
 mm/migrate.c               |  16 +--
 mm/mlock.c                 |   9 +-
 mm/page-writeback.c        |   1 +
 mm/page_alloc.c            |   5 +-
 mm/page_io.c               |   4 +-
 mm/page_vma_mapped.c       |   6 +-
 mm/readahead.c             | 145 ++++++++++++++++-----
 mm/rmap.c                  |  18 +--
 mm/shmem.c                 |  39 ++----
 mm/swap.c                  |  16 +--
 mm/swap_state.c            |   6 +-
 mm/swapfile.c              |   2 +-
 mm/truncate.c              |  70 ++++++++++-
 mm/vmscan.c                |  12 +-
 48 files changed, 795 insertions(+), 464 deletions(-)

Comments

Christoph Hellwig June 11, 2020, 6:59 a.m. UTC | #1
On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Another fortnight, another dump of my current large pages work.
> I've squished a lot of bugs this time.  xfstests is much happier now,
> running for 1631 seconds and getting as far as generic/086.  This patchset
> is getting a little big, so I'm going to try to get some bits of it
> upstream soon (the bits that make sense regardless of whether the rest
> of this is merged).

At this size a git tree to pull would also be nice..
Matthew Wilcox June 11, 2020, 11:24 a.m. UTC | #2
On Wed, Jun 10, 2020 at 11:59:54PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Another fortnight, another dump of my current large pages work.
> > I've squished a lot of bugs this time.  xfstests is much happier now,
> > running for 1631 seconds and getting as far as generic/086.  This patchset
> > is getting a little big, so I'm going to try to get some bits of it
> > upstream soon (the bits that make sense regardless of whether the rest
> > of this is merged).
> 
> At this size a git tree to pull would also be nice..

That was literally the next paragraph ...

It's now based on Linus' master (6f630784cc0d), and you can get it from
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache
if you'd rather see it there (this branch is force-pushed frequently).

Or are you saying you'd rather see the git URL than the link to gitweb?
Matthew Wilcox June 14, 2020, 4:26 p.m. UTC | #3
On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> Another fortnight, another dump of my current large pages work.

The generic/127 test has pointed out to me that range writeback is
broken by this patchset.  Here's how (may not be exactly what's going on,
but it's close):

page cache allocates an order-2 page covering indices 40-43.
bytes are written, page is dirtied
test then calls fallocate(FALLOC_FL_COLLAPSE_RANGE) for a range which
starts in page 41.
XFS calls filemap_write_and_wait_range() which calls
__filemap_fdatawrite_range() which calls
do_writepages() which calls
iomap_writepages() which calls
write_cache_pages() which calls
tag_pages_for_writeback() which calls
xas_for_each_marked() starting at page 41.  Which doesn't find page
  41 because when we dirtied pages 40-43, we only marked index 40 as
  being dirty.

Annoyingly, the XArray actually handles this just fine ... if we were
using multi-order entries, we'd find it.  But we're still storing 2^N
entries for an order N page.

I can see two ways to fix this.  One is to bite the bullet and do the
conversion of the page cache to use multi-order entries.  The second
is to set and clear the marks on all entries.  I'm concerned about the
performance of the latter solution.  Not so bad for order-2 pages, but for
an order-9 page we have 520 bits to set, spread over 9 non-consecutive
cachelines.  Also, I'm unenthusiastic about writing code that I want to
throw away as quickly as possible.

So unless somebody has a really good alternative idea, I'm going to
convert the page cache over to multi-order entries.  This will have
several positive effects:

 - Get DAX and regular page cache using the xarray in a more similar way
 - Saves about 4.5kB of memory for every 2MB page in tmpfs/shmem
 - Prep work for converting hugetlbfs to use the page cache the same
   way as tmpfs
Christoph Hellwig June 15, 2020, 1:32 p.m. UTC | #4
On Thu, Jun 11, 2020 at 04:24:12AM -0700, Matthew Wilcox wrote:
> On Wed, Jun 10, 2020 at 11:59:54PM -0700, Christoph Hellwig wrote:
> > On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > Another fortnight, another dump of my current large pages work.
> > > I've squished a lot of bugs this time.  xfstests is much happier now,
> > > running for 1631 seconds and getting as far as generic/086.  This patchset
> > > is getting a little big, so I'm going to try to get some bits of it
> > > upstream soon (the bits that make sense regardless of whether the rest
> > > of this is merged).
> > 
> > At this size a git tree to pull would also be nice..
> 
> That was literally the next paragraph ...

Oops.  Next time with more coffee..