
[v6,00/27] Memory Folios

Message ID 20210331184728.1188084-1-willy@infradead.org (mailing list archive)

Message

Matthew Wilcox March 31, 2021, 6:47 p.m. UTC
Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
exist which show the benefits of a larger "page size".  As an example,
an earlier iteration of this idea which used compound pages got a 7%
performance boost when compiling the kernel using kernbench without any
particular tuning.

Using compound pages or THPs exposes a serious weakness in our type
system.  Functions are often unprepared for compound pages to be passed
to them, and may only act on PAGE_SIZE chunks.  Even functions which are
aware of compound pages may expect a head page, and do the wrong thing
if passed a tail page.

There have been efforts to label function parameters as 'head' instead
of 'page' to indicate that the function expects a head page, but this
leaves us with runtime assertions instead of using the compiler to prove
that nobody has mistakenly passed a tail page.  Calling a struct page
'head' is also inaccurate as these functions work perfectly well on base
pages.
The term 'nottail' has not proven popular.

We also waste a lot of instructions ensuring that we're not looking at
a tail page.  Almost every call to PageFoo() contains one or more hidden
calls to compound_head().  This also happens for get_page(), put_page()
and many more functions.  There does not appear to be a way to tell gcc
that it can cache the result of compound_head(), nor is there a way to
tell it that compound_head() is idempotent.
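
To make the cost concrete, here is a minimal userspace sketch.  It is not
code from this series: the toy_* names, the 'head' back-pointer and the two
flag bits are hypothetical simplifications of struct page.  It only shows
that each flag test pays for its own head lookup, and that the lookup is
idempotent, which is the property we cannot describe to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; the real kernel layout is different. */
struct toy_page {
	unsigned long flags;
	struct toy_page *head;	/* NULL for a head or base page */
};

#define TOY_PG_DIRTY	(1UL << 0)
#define TOY_PG_UPTODATE	(1UL << 1)

static unsigned long toy_lookups;	/* counts toy_compound_head() calls */

/* Map any page of a compound page to its head page. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	toy_lookups++;
	return page->head ? page->head : page;
}

/* Each flag test hides its own head lookup. */
static int toy_page_dirty(struct toy_page *page)
{
	return (toy_compound_head(page)->flags & TOY_PG_DIRTY) != 0;
}

static int toy_page_uptodate(struct toy_page *page)
{
	return (toy_compound_head(page)->flags & TOY_PG_UPTODATE) != 0;
}

/* Run two flag tests on one tail page and report the lookups paid. */
static unsigned long toy_count_hidden_lookups(void)
{
	struct toy_page head = { .flags = TOY_PG_DIRTY, .head = NULL };
	struct toy_page tail = { .flags = 0, .head = &head };
	int dirty, uptodate;

	toy_lookups = 0;
	dirty = toy_page_dirty(&tail);
	uptodate = toy_page_uptodate(&tail);
	assert(dirty && !uptodate);

	return toy_lookups;
}

/* Applying the lookup to its own result changes nothing. */
static int toy_head_is_idempotent(void)
{
	struct toy_page head = { .flags = 0, .head = NULL };
	struct toy_page tail = { .flags = 0, .head = &head };
	struct toy_page *once = toy_compound_head(&tail);

	return toy_compound_head(once) == once;
}
```

In this toy, toy_count_hidden_lookups() pays one lookup per flag test even
though both tests are on the same page, and nothing in the source lets the
second test reuse the result of the first.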

This series introduces the 'struct folio' as a replacement for
head-or-base pages.  This initial set reduces the kernel size by
approximately 5kB by removing conversions from tail pages to head pages.
The real purpose of this series is adding infrastructure to enable
further use of the folio.
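
The idea can be sketched in userspace as follows.  This is an illustration
under assumptions, not the definitions from patch 1: the toy_* types and
helpers are hypothetical, and the real struct folio has more to it.  The
point is only that the head lookup happens once, at the conversion, and
that functions taking a folio pointer cannot be handed a tail page without
the compiler complaining about the pointer type.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; the real kernel layout is different. */
struct toy_page {
	unsigned long flags;
	struct toy_page *head;	/* NULL for a head or base page */
};

/* A folio is never a tail page; the type records that guarantee. */
struct toy_folio {
	struct toy_page page;
};

#define TOY_PG_DIRTY	(1UL << 0)
#define TOY_PG_UPTODATE	(1UL << 1)

static unsigned long toy_lookups;	/* counts toy_compound_head() calls */

static struct toy_page *toy_compound_head(struct toy_page *page)
{
	toy_lookups++;
	return page->head ? page->head : page;
}

/* The one explicit conversion: the only place a head lookup is paid. */
static struct toy_folio *toy_page_folio(struct toy_page *page)
{
	return (struct toy_folio *)toy_compound_head(page);
}

/* Folio helpers can trust their argument and skip the lookup. */
static int toy_folio_dirty(const struct toy_folio *folio)
{
	return (folio->page.flags & TOY_PG_DIRTY) != 0;
}

static int toy_folio_uptodate(const struct toy_folio *folio)
{
	return (folio->page.flags & TOY_PG_UPTODATE) != 0;
}

/* Convert a tail page once, test two flags, report the lookups paid. */
static unsigned long toy_count_folio_lookups(void)
{
	struct toy_folio head = {
		.page = { .flags = TOY_PG_DIRTY, .head = NULL },
	};
	struct toy_page tail = { .flags = 0, .head = &head.page };
	struct toy_folio *folio;
	int dirty, uptodate;

	toy_lookups = 0;
	folio = toy_page_folio(&tail);
	dirty = toy_folio_dirty(folio);
	uptodate = toy_folio_uptodate(folio);
	assert(folio == &head && dirty && !uptodate);

	return toy_lookups;
}
```

Calling toy_folio_dirty(&tail) here would not compile cleanly, since a
struct toy_page pointer is not a struct toy_folio pointer; that is the
compile-time check that a 'head' parameter name cannot give us.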

The medium-term goal is to convert all filesystems and some device
drivers to work in terms of folios.  This series contains a lot of
explicit conversions, but it's important to realise it's removing a lot
of implicit conversions in some relatively hot paths.  There will be very
few conversions from folios when this work is completed; filesystems,
the page cache, the LRU and so on will generally only deal with folios.

I analysed the text size reduction using a config based on Oracle UEK
with all modules changed to built-in.  That's obviously not a kernel
which makes sense to run, but it serves to compare the effects on (many
common) filesystems & drivers, not just the core.

add/remove: 34266/34260 grow/shrink: 5220/3206 up/down: 1083860/-1088546 (-4686)

Current tree at:
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio

(contains another ~100 patches on top of this batch, not all of which are
in good shape for submission)

v6:
 - Rebase on next-20210330
   - wait_bit_key patch merged by Linus
   - wait_on_page_writeback_killable() patches merged by Linus
   - Documentation patch merged by Andrew
 - Move folio_next_index() into this series
 - Move folio_offset() and folio_file_offset() into this series
 - Mirror members of struct page (for pagecache / anon) into struct folio,
   so (eg) you can use folio->mapping instead of folio->page.mapping
 - Add folio_ref_* functions, including kernel-doc for folio_ref_count().
 - Add count_memcg_folio_event()
 - Add put_folio_testzero()
 - Add folio_mapcount()
 - Add FolioKsm()
 - Fix afs_page_mkwrite() compilation
 - Fix/improve kernel-doc for
   - struct folio
   - add_folio_wait_queue()
   - wait_for_stable_folio()
   - wait_on_folio_writeback()
   - wait_on_folio_writeback_killable()
v5:
 - Rebase on next-20210319
 - Pull out three bug-fix patches to the front of the series, allowing
   them to be applied earlier.
 - Fix folio_page() against pages being moved between swap & page cache
 - Fix FolioDoubleMap to use the right page flags
 - Rename next_folio() to folio_next() (akpm)
 - Renamed folio stat functions (akpm)
 - Add 'mod' versions of the folio stats for users that already have 'nr'
 - Renamed folio_page to folio_file_page() (akpm)
 - Added kernel-doc for struct folio, folio_next(), folio_index(),
   folio_file_page(), folio_contains(), folio_order(), folio_nr_pages(),
   folio_shift(), folio_size(), page_folio(), get_folio(), put_folio()
 - Make folio_private() work in terms of void * instead of unsigned long
 - Used page_folio() in attach/detach page_private() (hch)
 - Drop afs_page_mkwrite folio conversion from this series
 - Add wait_on_folio_writeback_killable()
 - Convert add_page_wait_queue() to add_folio_wait_queue()
 - Add folio_swap_entry() helper
 - Drop the additions of *FolioFsCache
 - Simplify the addition of lock_folio_memcg() et al
 - Drop test_clear_page_writeback() conversion from this series
 - Add FolioTransHuge() definition
 - Rename __folio_file_mapping() to swapcache_mapping()
 - Added swapcache_index() helper
 - Removed lock_folio_async()
 - Made __lock_folio_async() static to filemap.c
 - Converted unlock_page_private_2() to use a folio internally
v4:
 - Rebase on current Linus tree (including swap fix)
 - Analyse each patch in terms of its effects on kernel text size.
   A few were modified to improve their effect.  In particular, where
   pushing calls to page_folio() into the callers resulted in unacceptable
   size increases, the wrapper was placed in mm/folio-compat.c.  This lets
   us see all the places which are good targets for conversion to folios.
 - Some of the patches were reordered, split or merged in order to make
   more logical sense.
 - Use nth_page() for folio_next() if we're using SPARSEMEM and not
   VMEMMAP (Zi Yan)
 - Increment and decrement page stats in units of pages instead of units
   of folios (Zi Yan)
v3:
 - Rebase on next-20210127.  Two major sources of conflict, the
   generic_file_buffered_read refactoring (in akpm tree) and the
   fscache work (in dhowells tree).
v2:
 - Pare patch series back to just infrastructure and the page waiting
   parts.

Matthew Wilcox (Oracle) (27):
  mm: Introduce struct folio
  mm: Add folio_pgdat and folio_zone
  mm/vmstat: Add functions to account folio statistics
  mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
  mm: Add folio reference count functions
  mm: Add put_folio
  mm: Add get_folio
  mm: Create FolioFlags
  mm: Handle per-folio private data
  mm/filemap: Add folio_index, folio_file_page and folio_contains
  mm/filemap: Add folio_next_index
  mm/filemap: Add folio_offset and folio_file_offset
  mm/util: Add folio_mapping and folio_file_mapping
  mm: Add folio_mapcount
  mm/memcg: Add folio wrappers for various functions
  mm/filemap: Add unlock_folio
  mm/filemap: Add lock_folio
  mm/filemap: Add lock_folio_killable
  mm/filemap: Add __lock_folio_async
  mm/filemap: Add __lock_folio_or_retry
  mm/filemap: Add wait_on_folio_locked
  mm/filemap: Add end_folio_writeback
  mm/writeback: Add wait_on_folio_writeback
  mm/writeback: Add wait_for_stable_folio
  mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit
  mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit
  mm/filemap: Convert page wait queues to be folios

 Documentation/core-api/mm-api.rst |   3 +
 fs/afs/write.c                    |   7 +-
 fs/cachefiles/rdwr.c              |  16 +-
 fs/io_uring.c                     |   2 +-
 include/linux/memcontrol.h        |  30 ++++
 include/linux/mm.h                | 177 ++++++++++++++++----
 include/linux/mm_types.h          |  81 +++++++++
 include/linux/mmdebug.h           |  20 +++
 include/linux/netfs.h             |   2 +-
 include/linux/page-flags.h        | 130 +++++++++++---
 include/linux/page_ref.h          |  88 +++++++++-
 include/linux/pagemap.h           | 270 ++++++++++++++++++++++--------
 include/linux/swap.h              |   6 +
 include/linux/vmstat.h            | 107 ++++++++++++
 mm/Makefile                       |   2 +-
 mm/filemap.c                      | 242 +++++++++++++-------------
 mm/folio-compat.c                 |  37 ++++
 mm/memory.c                       |   8 +-
 mm/page-writeback.c               |  72 +++++---
 mm/swapfile.c                     |   8 +-
 mm/util.c                         |  49 ++++--
 21 files changed, 1051 insertions(+), 306 deletions(-)
 create mode 100644 mm/folio-compat.c

Comments

Christoph Hellwig April 1, 2021, 7:05 a.m. UTC | #1
On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
>  - Mirror members of struct page (for pagecache / anon) into struct folio,
>    so (eg) you can use folio->mapping instead of folio->page.mapping

Eww, why?
Matthew Wilcox April 1, 2021, 11:26 a.m. UTC | #2
On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> >    so (eg) you can use folio->mapping instead of folio->page.mapping
> 
> Eww, why?

So that eventually we can rename page->mapping to page->_mapping and
prevent the bugs from people doing page->mapping on a tail page.  eg
https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/
Jason Gunthorpe April 1, 2021, 12:28 p.m. UTC | #3
On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > >    so (eg) you can use folio->mapping instead of folio->page.mapping
> > 
> > Eww, why?
> 
> So that eventually we can rename page->mapping to page->_mapping and
> prevent the bugs from people doing page->mapping on a tail page.  eg
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/

Is that gcc structure layout randomization stuff going to be a problem
here?

Add some 
  static_assert(offsetof(struct folio,..) == offsetof(struct page,..))

tests to force it?

Jason
Matthew Wilcox April 1, 2021, 12:52 p.m. UTC | #4
On Thu, Apr 01, 2021 at 09:28:03AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > > >    so (eg) you can use folio->mapping instead of folio->page.mapping
> > > 
> > > Eww, why?
> > 
> > So that eventually we can rename page->mapping to page->_mapping and
> > prevent the bugs from people doing page->mapping on a tail page.  eg
> > https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/
> 
> Is that gcc structure layout randomization stuff going to be a problem
> here?
> 
> Add some 
>   static_assert(offsetof(struct folio,..) == offsetof(struct page,..))
> 
> tests to force it?

You sound like the kind of person who hasn't read patch 1.

diff --git a/mm/util.c b/mm/util.c
index 0b6dd9d81da7..521a772f06eb 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -686,6 +686,25 @@ struct anon_vma *page_anon_vma(struct page *page)
 	return __page_rmapping(page);
 }
 
+static inline void folio_build_bug(void)
+{
+#define FOLIO_MATCH(pg, fl)						\
+BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl));
+
+	FOLIO_MATCH(flags, flags);
+	FOLIO_MATCH(lru, lru);
+	FOLIO_MATCH(mapping, mapping);
+	FOLIO_MATCH(index, index);
+	FOLIO_MATCH(private, private);
+	FOLIO_MATCH(_mapcount, _mapcount);
+	FOLIO_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+	FOLIO_MATCH(memcg_data, memcg_data);
+#endif
+#undef FOLIO_MATCH
+	BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio));
+}
+
 struct address_space *page_mapping(struct page *page)
 {
 	struct address_space *mapping;
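
For anyone who wants to try this kind of check outside the kernel, here is
a small userspace sketch.  BUILD_BUG_ON() is kernel-only, so it uses C11
static_assert instead, and the toy_* structs with their three members are
hypothetical stand-ins rather than the real struct page and struct folio.
The anonymous union shown is one possible way to mirror the members; treat
it as an assumption, not a description of patch 1.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */

/* Toy stand-in for struct page with only three members. */
struct toy_page {
	unsigned long flags;
	void *mapping;
	unsigned long index;
};

/* Mirror the members so folio->mapping works as well as folio->page.mapping. */
struct toy_folio {
	union {
		struct {
			unsigned long flags;
			void *mapping;
			unsigned long index;
		};
		struct toy_page page;
	};
};

/* Fail the build if a mirrored member drifts away from its original. */
#define TOY_FOLIO_MATCH(pg, fl)						\
	static_assert(offsetof(struct toy_page, pg) ==			\
		      offsetof(struct toy_folio, fl),			\
		      "toy_page." #pg " and toy_folio." #fl " differ")

TOY_FOLIO_MATCH(flags, flags);
TOY_FOLIO_MATCH(mapping, mapping);
TOY_FOLIO_MATCH(index, index);
static_assert(sizeof(struct toy_page) == sizeof(struct toy_folio),
	      "toy_page and toy_folio differ in size");

/* Runtime view of the same guarantee: both names reach the same storage. */
static int toy_folio_aliases_page(void)
{
	struct toy_folio folio;

	folio.page.flags = 0;
	folio.page.mapping = &folio;
	folio.page.index = 7;

	return folio.mapping == folio.page.mapping && folio.index == 7;
}
```

Reordering the members of one struct but not the other turns the mismatch
into a build failure instead of a runtime surprise.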
Jason Gunthorpe April 1, 2021, 1:30 p.m. UTC | #5
On Thu, Apr 01, 2021 at 01:52:01PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 01, 2021 at 09:28:03AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > > > >    so (eg) you can use folio->mapping instead of folio->page.mapping
> > > > 
> > > > Eww, why?
> > > 
> > > So that eventually we can rename page->mapping to page->_mapping and
> > > prevent the bugs from people doing page->mapping on a tail page.  eg
> > > https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/
> > 
> > Is that gcc structure layout randomization stuff going to be a problem
> > here?
> > 
> > Add some 
> >   static_assert(offsetof(struct folio,..) == offsetof(struct page,..))
> > 
> > tests to force it?
> 
> You sound like the kind of person who hasn't read patch 1.

Yes, I missed this hunk :)

Jason
Christoph Hellwig April 2, 2021, 2:37 p.m. UTC | #6
On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > >    so (eg) you can use folio->mapping instead of folio->page.mapping
> > 
> > Eww, why?
> 
> So that eventually we can rename page->mapping to page->_mapping and
> prevent the bugs from people doing page->mapping on a tail page.  eg
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/

I'm not sure I like this.  This whole concept of structures that do need
the same layout is very problematic, even with the safeguards you've
added.  So if it was up to me I'd prefer the folio as a simple container
as it was in the previous revisions.  At some point members should move
from the page to the folio, but I'd rather do that over a shorter period
and in targeted series.  We need the basics to go in first.
Matthew Wilcox April 2, 2021, 2:49 p.m. UTC | #7
On Fri, Apr 02, 2021 at 03:37:55PM +0100, Christoph Hellwig wrote:
> On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > > >    so (eg) you can use folio->mapping instead of folio->page.mapping
> > > 
> > > Eww, why?
> > 
> > So that eventually we can rename page->mapping to page->_mapping and
> > prevent the bugs from people doing page->mapping on a tail page.  eg
> > https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/
> 
> I'm not sure I like this.  This whole concept of structures that do need
> the same layout is very problematic, even with the safeguards you've
> added.  So if it was up to me I'd prefer the folio as a simple container
> as it was in the previous revisions.  At some point members should move
> from the page to the folio, but I'd rather do that over a shorter period
> and in targeted series.  We need the basics to go in first.

That was my original plan, but it'll be another round of churn, and I'm
not sure there'll be the appetite for it.  There's not a lot of appetite
for this round, and this one has measurable performance gains!
Kent Overstreet April 3, 2021, 12:31 a.m. UTC | #8
On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> The medium-term goal is to convert all filesystems and some device
> drivers to work in terms of folios.  This series contains a lot of
> explicit conversions, but it's important to realise it's removing a lot
> of implicit conversions in some relatively hot paths.  There will be very
> few conversions from folios when this work is completed; filesystems,
> the page cache, the LRU and so on will generally only deal with folios.

I'm pretty excited for this to land - 4k page overhead has been a pain point for
me for quite some time. I know this is going to be a lot of churn but I think
leveraging the type system is exactly the right way to go about this, and I
can't wait to start converting bcachefs.
Jeff Layton April 5, 2021, 7:14 p.m. UTC | #9
On Wed, 2021-03-31 at 19:47 +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
> exist which show the benefits of a larger "page size".  As an example,
> an earlier iteration of this idea which used compound pages got a 7%
> performance boost when compiling the kernel using kernbench without any
> particular tuning.
> 
> Using compound pages or THPs exposes a serious weakness in our type
> system.  Functions are often unprepared for compound pages to be passed
> to them, and may only act on PAGE_SIZE chunks.  Even functions which are
> aware of compound pages may expect a head page, and do the wrong thing
> if passed a tail page.
> 
> There have been efforts to label function parameters as 'head' instead
> of 'page' to indicate that the function expects a head page, but this
> leaves us with runtime assertions instead of using the compiler to prove
> that nobody has mistakenly passed a tail page.  Calling a struct page
> 'head' is also inaccurate as they will work perfectly well on base pages.
> The term 'nottail' has not proven popular.
> 
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page.  Almost every call to PageFoo() contains one or more hidden
> calls to compound_head().  This also happens for get_page(), put_page()
> and many more functions.  There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.
> 
> This series introduces the 'struct folio' as a replacement for
> head-or-base pages.  This initial set reduces the kernel size by
> approximately 5kB by removing conversions from tail pages to head pages.
> The real purpose of this series is adding infrastructure to enable
> further use of the folio.
> 
> The medium-term goal is to convert all filesystems and some device
> drivers to work in terms of folios.  This series contains a lot of
> explicit conversions, but it's important to realise it's removing a lot
> of implicit conversions in some relatively hot paths.  There will be very
> few conversions from folios when this work is completed; filesystems,
> the page cache, the LRU and so on will generally only deal with folios.
> 
> I analysed the text size reduction using a config based on Oracle UEK
> with all modules changed to built-in.  That's obviously not a kernel
> which makes sense to run, but it serves to compare the effects on (many
> common) filesystems & drivers, not just the core.
> 
> add/remove: 34266/34260 grow/shrink: 5220/3206 up/down: 1083860/-1088546 (-4686)
> 
> Current tree at:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio
> 
> (contains another ~100 patches on top of this batch, not all of which are
> in good shape for submission)
> 
> v6:
>  - Rebase on next-20210330
>    - wait_bit_key patch merged by Linus
>    - wait_on_page_writeback_killable() patches merged by Linus
>    - Documentation patch merged by Andrew
>  - Move folio_next_index() into this series
>  - Move folio_offset() and folio_file_offset() into this series
>  - Mirror members of struct page (for pagecache / anon) into struct folio,
>    so (eg) you can use folio->mapping instead of folio->page.mapping
>  - Add folio_ref_* functions, including kernel-doc for folio_ref_count().
>  - Add count_memcg_folio_event()
>  - Add put_folio_testzero()
>  - Add folio_mapcount()
>  - Add FolioKsm()
>  - Fix afs_page_mkwrite() compilation
>  - Fix/improve kernel-doc for
>    - struct folio
>    - add_folio_wait_queue()
>    - wait_for_stable_folio()
>    - wait_on_folio_writeback()
>    - wait_on_folio_writeback_killable()
> v5:
>  - Rebase on next-20210319
>  - Pull out three bug-fix patches to the front of the series, allowing
>    them to be applied earlier.
>  - Fix folio_page() against pages being moved between swap & page cache
>  - Fix FolioDoubleMap to use the right page flags
>  - Rename next_folio() to folio_next() (akpm)
>  - Renamed folio stat functions (akpm)
>  - Add 'mod' versions of the folio stats for users that already have 'nr'
>  - Renamed folio_page to folio_file_page() (akpm)
>  - Added kernel-doc for struct folio, folio_next(), folio_index(),
>    folio_file_page(), folio_contains(), folio_order(), folio_nr_pages(),
>    folio_shift(), folio_size(), page_folio(), get_folio(), put_folio()
>  - Make folio_private() work in terms of void * instead of unsigned long
>  - Used page_folio() in attach/detach page_private() (hch)
>  - Drop afs_page_mkwrite folio conversion from this series
>  - Add wait_on_folio_writeback_killable()
>  - Convert add_page_wait_queue() to add_folio_wait_queue()
>  - Add folio_swap_entry() helper
>  - Drop the additions of *FolioFsCache
>  - Simplify the addition of lock_folio_memcg() et al
>  - Drop test_clear_page_writeback() conversion from this series
>  - Add FolioTransHuge() definition
>  - Rename __folio_file_mapping() to swapcache_mapping()
>  - Added swapcache_index() helper
>  - Removed lock_folio_async()
>  - Made __lock_folio_async() static to filemap.c
>  - Converted unlock_page_private_2() to use a folio internally
> v4:
>  - Rebase on current Linus tree (including swap fix)
>  - Analyse each patch in terms of its effects on kernel text size.
>    A few were modified to improve their effect.  In particular, where
>    pushing calls to page_folio() into the callers resulted in unacceptable
>    size increases, the wrapper was placed in mm/folio-compat.c.  This lets
>    us see all the places which are good targets for conversion to folios.
>  - Some of the patches were reordered, split or merged in order to make
>    more logical sense.
>  - Use nth_page() for folio_next() if we're using SPARSEMEM and not
>    VMEMMAP (Zi Yan)
>  - Increment and decrement page stats in units of pages instead of units
>    of folios (Zi Yan)
> v3:
>  - Rebase on next-20210127.  Two major sources of conflict, the
>    generic_file_buffered_read refactoring (in akpm tree) and the
>    fscache work (in dhowells tree).
> v2:
>  - Pare patch series back to just infrastructure and the page waiting
>    parts.
> 
> Matthew Wilcox (Oracle) (27):
>   mm: Introduce struct folio
>   mm: Add folio_pgdat and folio_zone
>   mm/vmstat: Add functions to account folio statistics
>   mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
>   mm: Add folio reference count functions
>   mm: Add put_folio
>   mm: Add get_folio
>   mm: Create FolioFlags
>   mm: Handle per-folio private data
>   mm/filemap: Add folio_index, folio_file_page and folio_contains
>   mm/filemap: Add folio_next_index
>   mm/filemap: Add folio_offset and folio_file_offset
>   mm/util: Add folio_mapping and folio_file_mapping
>   mm: Add folio_mapcount
>   mm/memcg: Add folio wrappers for various functions
>   mm/filemap: Add unlock_folio
>   mm/filemap: Add lock_folio
>   mm/filemap: Add lock_folio_killable
>   mm/filemap: Add __lock_folio_async
>   mm/filemap: Add __lock_folio_or_retry
>   mm/filemap: Add wait_on_folio_locked
>   mm/filemap: Add end_folio_writeback
>   mm/writeback: Add wait_on_folio_writeback
>   mm/writeback: Add wait_for_stable_folio
>   mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit
>   mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit
>   mm/filemap: Convert page wait queues to be folios
> 
>  Documentation/core-api/mm-api.rst |   3 +
>  fs/afs/write.c                    |   7 +-
>  fs/cachefiles/rdwr.c              |  16 +-
>  fs/io_uring.c                     |   2 +-
>  include/linux/memcontrol.h        |  30 ++++
>  include/linux/mm.h                | 177 ++++++++++++++++----
>  include/linux/mm_types.h          |  81 +++++++++
>  include/linux/mmdebug.h           |  20 +++
>  include/linux/netfs.h             |   2 +-
>  include/linux/page-flags.h        | 130 +++++++++++---
>  include/linux/page_ref.h          |  88 +++++++++-
>  include/linux/pagemap.h           | 270 ++++++++++++++++++++++--------
>  include/linux/swap.h              |   6 +
>  include/linux/vmstat.h            | 107 ++++++++++++
>  mm/Makefile                       |   2 +-
>  mm/filemap.c                      | 242 +++++++++++++-------------
>  mm/folio-compat.c                 |  37 ++++
>  mm/memory.c                       |   8 +-
>  mm/page-writeback.c               |  72 +++++---
>  mm/swapfile.c                     |   8 +-
>  mm/util.c                         |  49 ++++--
>  21 files changed, 1051 insertions(+), 306 deletions(-)
>  create mode 100644 mm/folio-compat.c
> 
> -- 
> 2.30.2
> 
> 
> From 99da34311602826672621c3d69bad13813993c1a Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Tue, 30 Mar 2021 10:47:46 -0400
> Subject: [PATCH v6 00/25] *** SUBJECT HERE ***
> To: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org,
>     linux-fsdevel@vger.kernel.org,
>     linux-cachefs@redhat.com,
>     linux-afs@lists.infradead.org
> 
> *** BLURB HERE ***
> 
> Matthew Wilcox (Oracle) (25):
>   mm: Introduce struct folio
>   mm: Add folio_pgdat and folio_zone
>   mm/vmstat: Add functions to account folio statistics
>   mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
>   mm: Add put_folio
>   mm: Add get_folio
>   mm: Create FolioFlags
>   mm: Handle per-folio private data
>   mm/filemap: Add folio_index, folio_file_page and folio_contains
>   mm/filemap: Add folio_next_index
>   mm/filemap: Add folio_offset and folio_file_offset
>   mm/util: Add folio_mapping and folio_file_mapping
>   mm/memcg: Add folio wrappers for various functions
>   mm/filemap: Add unlock_folio
>   mm/filemap: Add lock_folio
>   mm/filemap: Add lock_folio_killable
>   mm/filemap: Add __lock_folio_async
>   mm/filemap: Add __lock_folio_or_retry
>   mm/filemap: Add wait_on_folio_locked
>   mm/filemap: Add end_folio_writeback
>   mm/writeback: Add wait_on_folio_writeback
>   mm/writeback: Add wait_for_stable_folio
>   mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit
>   mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit
>   mm/filemap: Convert page wait queues to be folios
> 
>  Documentation/core-api/mm-api.rst |   2 +
>  fs/afs/write.c                    |   7 +-
>  fs/cachefiles/rdwr.c              |  16 +-
>  fs/io_uring.c                     |   2 +-
>  include/linux/memcontrol.h        |  21 +++
>  include/linux/mm.h                | 156 +++++++++++++----
>  include/linux/mm_types.h          |  81 +++++++++
>  include/linux/mmdebug.h           |  20 +++
>  include/linux/netfs.h             |   2 +-
>  include/linux/page-flags.h        | 120 ++++++++++---
>  include/linux/pagemap.h           | 270 ++++++++++++++++++++++--------
>  include/linux/swap.h              |   6 +
>  include/linux/vmstat.h            | 107 ++++++++++++
>  mm/Makefile                       |   2 +-
>  mm/filemap.c                      | 242 +++++++++++++-------------
>  mm/folio-compat.c                 |  37 ++++
>  mm/memory.c                       |   8 +-
>  mm/page-writeback.c               |  72 +++++---
>  mm/swapfile.c                     |   8 +-
>  mm/util.c                         |  49 ++++--
>  20 files changed, 926 insertions(+), 302 deletions(-)
>  create mode 100644 mm/folio-compat.c
> 

I too am a little concerned about the amount of churn this is likely to
cause, but this does seem like a fairly promising way forward for
actually using THPs in the pagecache. The set is fairly straightforward.

That said, there are few callers of these new functions in here. Is this
set enough to allow converting some subsystem to use folios? It might be
good to do that if possible, so we can get an idea of how much work
we're in for.
Matthew Wilcox April 5, 2021, 7:31 p.m. UTC | #10
On Mon, Apr 05, 2021 at 03:14:29PM -0400, Jeff Layton wrote:
> On Wed, 2021-03-31 at 19:47 +0100, Matthew Wilcox (Oracle) wrote:
> > Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
> > exist which show the benefits of a larger "page size".  As an example,
> > an earlier iteration of this idea which used compound pages got a 7%
> > performance boost when compiling the kernel using kernbench without any
> > particular tuning.
> > 
> > Using compound pages or THPs exposes a serious weakness in our type
> > system.  Functions are often unprepared for compound pages to be passed
> > to them, and may only act on PAGE_SIZE chunks.  Even functions which are
> > aware of compound pages may expect a head page, and do the wrong thing
> > if passed a tail page.
> > 
> > There have been efforts to label function parameters as 'head' instead
> > of 'page' to indicate that the function expects a head page, but this
> > leaves us with runtime assertions instead of using the compiler to prove
> > that nobody has mistakenly passed a tail page.  Calling a struct page
> > 'head' is also inaccurate as they will work perfectly well on base pages.
> > The term 'nottail' has not proven popular.
> > 
> > We also waste a lot of instructions ensuring that we're not looking at
> > a tail page.  Almost every call to PageFoo() contains one or more hidden
> > calls to compound_head().  This also happens for get_page(), put_page()
> > and many more functions.  There does not appear to be a way to tell gcc
> > that it can cache the result of compound_head(), nor is there a way to
> > tell it that compound_head() is idempotent.
> > 
> > This series introduces the 'struct folio' as a replacement for
> > head-or-base pages.  This initial set reduces the kernel size by
> > approximately 5kB by removing conversions from tail pages to head pages.
> > The real purpose of this series is adding infrastructure to enable
> > further use of the folio.
> > 
> > The medium-term goal is to convert all filesystems and some device
> > drivers to work in terms of folios.  This series contains a lot of
> > explicit conversions, but it's important to realise it's removing a lot
> > of implicit conversions in some relatively hot paths.  There will be very
> > few conversions from folios when this work is completed; filesystems,
> > the page cache, the LRU and so on will generally only deal with folios.
> 
> I too am a little concerned about the amount of churn this is likely to
> cause, but this does seem like a fairly promising way forward for
> actually using THPs in the pagecache. The set is fairly straightforward.
> 
> That said, there are few callers of these new functions in here. Is this
> set enough to allow converting some subsystem to use folios? It might be
> good to do that if possible, so we can get an idea of how much work
> we're in for.

It isn't enough to start converting much.  There needs to be a second set
of patches which add all the infrastructure for converting a filesystem.
Then we can start working on the filesystems.  I have a start at that
here:

https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio

I don't know if it's exactly how I'll arrange it for submission.  It might
be better to convert all the filesystem implementations of readpage
to work on a folio, and then the big bang conversion of ->readpage to
->read_folio will look much more mechanical.

But if I can't convince people that a folio approach is what we need,
then I should stop working on it, and go back to fixing the endless
stream of bugs that the thp-based approach surfaces.
Jeff Layton April 6, 2021, 3:14 p.m. UTC | #11
On Mon, 2021-04-05 at 20:31 +0100, Matthew Wilcox wrote:
> On Mon, Apr 05, 2021 at 03:14:29PM -0400, Jeff Layton wrote:
> > On Wed, 2021-03-31 at 19:47 +0100, Matthew Wilcox (Oracle) wrote:
> > > Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
> > > exist which show the benefits of a larger "page size".  As an example,
> > > an earlier iteration of this idea which used compound pages got a 7%
> > > performance boost when compiling the kernel using kernbench without any
> > > particular tuning.
> > > 
> > > Using compound pages or THPs exposes a serious weakness in our type
> > > system.  Functions are often unprepared for compound pages to be passed
> > > to them, and may only act on PAGE_SIZE chunks.  Even functions which are
> > > aware of compound pages may expect a head page, and do the wrong thing
> > > if passed a tail page.
> > > 
> > > There have been efforts to label function parameters as 'head' instead
> > > of 'page' to indicate that the function expects a head page, but this
> > > leaves us with runtime assertions instead of using the compiler to prove
> > > that nobody has mistakenly passed a tail page.  Calling a struct page
> > > 'head' is also inaccurate as they will work perfectly well on base pages.
> > > The term 'nottail' has not proven popular.
> > > 
> > > We also waste a lot of instructions ensuring that we're not looking at
> > > a tail page.  Almost every call to PageFoo() contains one or more hidden
> > > calls to compound_head().  This also happens for get_page(), put_page()
> > > and many more functions.  There does not appear to be a way to tell gcc
> > > that it can cache the result of compound_head(), nor is there a way to
> > > tell it that compound_head() is idempotent.
> > > 
> > > This series introduces the 'struct folio' as a replacement for
> > > head-or-base pages.  This initial set reduces the kernel size by
> > > approximately 5kB by removing conversions from tail pages to head pages.
> > > The real purpose of this series is adding infrastructure to enable
> > > further use of the folio.
> > > 
> > > The medium-term goal is to convert all filesystems and some device
> > > drivers to work in terms of folios.  This series contains a lot of
> > > explicit conversions, but it's important to realise it's removing a lot
> > > of implicit conversions in some relatively hot paths.  There will be very
> > > few conversions from folios when this work is completed; filesystems,
> > > the page cache, the LRU and so on will generally only deal with folios.
> > 
> > I too am a little concerned about the amount of churn this is likely to
> > cause, but this does seem like a fairly promising way forward for
> > actually using THPs in the pagecache. The set is fairly straightforward.
> > 
> > That said, there are few callers of these new functions in here. Is this
> > set enough to allow converting some subsystem to use folios? It might be
> > good to do that if possible, so we can get an idea of how much work
> > we're in for.
> 
> It isn't enough to start converting much.  There needs to be a second set
> of patches which add all the infrastructure for converting a filesystem.
> Then we can start working on the filesystems.  I have a start at that
> here:
> 
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio
> 
> I don't know if it's exactly how I'll arrange it for submission.  It might
> be better to convert all the filesystem implementations of readpage
> to work on a folio, and then the big bang conversion of ->readpage to
> ->read_folio will look much more mechanical.
> 
> But if I can't convince people that a folio approach is what we need,
> then I should stop working on it, and go back to fixing the endless
> stream of bugs that the thp-based approach surfaces.

Fair enough. I generally prefer to see some callers added at the same
time as new functions, but I understand that the scale of this patchset
makes that difficult. You can add this to the whole series. I don't see
any major show-stoppers here:

Acked-by: Jeff Layton <jlayton@kernel.org>