Message ID: 20210320054104.1300774-1-willy@infradead.org
Series: Memory Folios
On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > Current tree at: > https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio > > (contains another ~100 patches on top of this batch, not all of which are > in good shape for submission) I've fixed the two buildbot bugs. I also resplit the docs work, and did a bunch of other things to the patches that I haven't posted yet. I'll send the first three patches as a separate series tomorrow, and then the next four as their own series, then I'll repost the rest (up to and including "Convert page wait queues to be folios") later in the week.
On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > Managing memory in 4KiB pages is a serious overhead. Many benchmarks > exist which show the benefits of a larger "page size". As an example, > an earlier iteration of this idea which used compound pages got a 7% > performance boost when compiling the kernel using kernbench without any > particular tuning. > > Using compound pages or THPs exposes a serious weakness in our type > system. Functions are often unprepared for compound pages to be passed > to them, and may only act on PAGE_SIZE chunks. Even functions which are > aware of compound pages may expect a head page, and do the wrong thing > if passed a tail page. > > There have been efforts to label function parameters as 'head' instead > of 'page' to indicate that the function expects a head page, but this > leaves us with runtime assertions instead of using the compiler to prove > that nobody has mistakenly passed a tail page. Calling a struct page > 'head' is also inaccurate as they will work perfectly well on base pages. > The term 'nottail' has not proven popular. > > We also waste a lot of instructions ensuring that we're not looking at > a tail page. Almost every call to PageFoo() contains one or more hidden > calls to compound_head(). This also happens for get_page(), put_page() > and many more functions. There does not appear to be a way to tell gcc > that it can cache the result of compound_head(), nor is there a way to > tell it that compound_head() is idempotent. > > This series introduces the 'struct folio' as a replacement for > head-or-base pages. This initial set reduces the kernel size by > approximately 6kB, although its real purpose is adding infrastructure > to enable further use of the folio. > > The intent is to convert all filesystems and some device drivers to work > in terms of folios. 
This series contains a lot of explicit conversions, > but it's important to realise it's removing a lot of implicit conversions > in some relatively hot paths. There will be very few conversions from > folios when this work is completed; filesystems, the page cache, the > LRU and so on will generally only deal with folios. If that is the case, shouldn't there in the long term only be very few, easy to review instances of things like compound_head(), PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) never see tail pages and 2) never assume a compile-time page size? What are the higher-level places that in the long-term should be dealing with tail pages at all? Are there legit ones besides the page allocator, THP splitting internals & pte-mapped compound pages? I do agree that the current confusion around which layer sees which types of pages is a problem. But I also think a lot of it is the result of us being in a transitional period where we've added THP in more places but not all code and data structures are or were fully native yet, and so we had things leak out or into where maybe they shouldn't be to make things work in the short term. But this part is already getting better, and has gotten better, with the page cache (largely?) going native for example. Some compound_head() that are currently in the codebase are already unnecessary. Like the one in activate_page(). And looking at grep, I wouldn't be surprised if only the page table walkers need the page_compound() that mark_page_accessed() does. We would be better off if they did the translation once and explicitly in the outer scope, where it's clear they're dealing with a pte-mapped compound page, instead of having a series of rather low level helpers (page flags testing, refcount operations, LRU operations, stat accounting) all trying to be clever but really just obscuring things and imposing unnecessary costs on the vast majority of cases. 
So I fully agree with the motivation behind this patch. But I do wonder why it's special-casing the common case instead of the rare case. It comes at a huge cost. Short term, the churn of replacing 'page' with 'folio' in pretty much all instances is enormous.

And longer term, I'm not convinced folio is the abstraction we want throughout the kernel. If nobody should be dealing with tail pages in the first place, why are we making everybody think in 'folios'? Why does a filesystem care that huge pages are composed of multiple base pages internally? This feels like an implementation detail leaking out of the MM code. The vast majority of places should be thinking 'page' with a size of 'page_size()'. Including most parts of the MM itself.

The compile-time check is nice, but I'm not sure it would be that much more effective at catching things than a few centrally placed warns inside PageFoo(), get_page() etc. and other things that should not encounter tail pages in the first place (with __helpers for the few instances that do). And given the invasiveness of this change, they ought to be very drastically better at it, and obviously so, IMO.

> Documentation/core-api/mm-api.rst | 7 +
> fs/afs/write.c | 3 +-
> fs/cachefiles/rdwr.c | 19 ++-
> fs/io_uring.c | 2 +-
> include/linux/memcontrol.h | 21 +++
> include/linux/mm.h | 156 +++++++++++++++----
> include/linux/mm_types.h | 52 +++++++
> include/linux/mmdebug.h | 20 +++
> include/linux/netfs.h | 2 +-
> include/linux/page-flags.h | 120 +++++++++++---
> include/linux/pagemap.h | 249 ++++++++++++++++++++++--------
> include/linux/swap.h | 6 +
> include/linux/vmstat.h | 107 +++++++++++++
> mm/Makefile | 2 +-
> mm/filemap.c | 237 ++++++++++++++--------------
> mm/folio-compat.c | 37 +++++
> mm/memory.c | 8 +-
> mm/page-writeback.c | 62 ++++++--
> mm/swapfile.c | 8 +-
> mm/util.c | 30 ++--
> 20 files changed, 857 insertions(+), 291 deletions(-)
> create mode 100644 mm/folio-compat.c
On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > This series introduces the 'struct folio' as a replacement for > > head-or-base pages. This initial set reduces the kernel size by > > approximately 6kB, although its real purpose is adding infrastructure > > to enable further use of the folio. > > > > The intent is to convert all filesystems and some device drivers to work > > in terms of folios. This series contains a lot of explicit conversions, > > but it's important to realise it's removing a lot of implicit conversions > > in some relatively hot paths. There will be very few conversions from > > folios when this work is completed; filesystems, the page cache, the > > LRU and so on will generally only deal with folios. > > If that is the case, shouldn't there in the long term only be very > few, easy to review instances of things like compound_head(), > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > never see tail pages and 2) never assume a compile-time page size? I don't know exactly where we get to eventually. There are definitely some aspects of the filesystem<->mm interface which are page-based (eg ->fault needs to look up the exact page, regardless of its head/tail/base nature), while ->readpage needs to talk in terms of folios. > What are the higher-level places that in the long-term should be > dealing with tail pages at all? Are there legit ones besides the page > allocator, THP splitting internals & pte-mapped compound pages? I can't tell. I think this patch maybe illustrates some of the problems, but maybe it's just an intermediate problem: https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b It deals mostly in terms of folios, but when it needs to kmap() and memcmp(), then it needs to work in terms of pages. 
I don't think it's avoidable (maybe we bury the "dealing with pages" inside a kmap() wrapper somewhere, but I'm not sure that's better). > I do agree that the current confusion around which layer sees which > types of pages is a problem. But I also think a lot of it is the > result of us being in a transitional period where we've added THP in > more places but not all code and data structures are or were fully > native yet, and so we had things leak out or into where maybe they > shouldn't be to make things work in the short term. > > But this part is already getting better, and has gotten better, with > the page cache (largely?) going native for example. Thanks ;-) There's still more work to do on that (ie storing one entry to cover 512 indices instead of 512 identical entries), but it is getting better. What can't be made better is the CPU page tables; they really do need to point to tail pages. One of my longer-term goals is to support largeish pages on ARM (and other CPUs). Instead of these silly config options to have 16KiB or 64KiB pages, support "add PTEs for these 16 consecutive, aligned pages". And I'm not sure how we do that without folios. The notion that a page is PAGE_SIZE is really, really ingrained. I tried the page_size() macro to make things easier, but there's 17000 instances of PAGE_SIZE in the tree, and they just aren't going to go away. > Some compound_head() that are currently in the codebase are already > unnecessary. Like the one in activate_page(). Right! And it's hard to find & remove them without very careful analysis, or particularly deep knowledge. With folios, we can remove them without terribly deep thought. > And looking at grep, I wouldn't be surprised if only the page table > walkers need the page_compound() that mark_page_accessed() does. 
We > would be better off if they did the translation once and explicitly in > the outer scope, where it's clear they're dealing with a pte-mapped > compound page, instead of having a series of rather low level helpers > (page flags testing, refcount operations, LRU operations, stat > accounting) all trying to be clever but really just obscuring things > and imposing unnecessary costs on the vast majority of cases. > > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. Because people (think they) know what a page is. It's PAGE_SIZE bytes long, it occupies one PTE, etc, etc. A folio is new and instead of changing how something familiar (a page) behaves, we're asking them to think about something new instead that behaves a lot like a page, but has differences. > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. I think pages already leaked out of the MM and into filesystems (and most of the filesystem writers seem pretty unknowledgeable about how pages and the page cache work, TBH). That's OK! Or it should be OK. Filesystem authors should be experts on how their filesystem works. Everywhere that they have to learn about the page cache is a distraction and annoyance for them. I mean, I already tried what you're suggesting.
It's hard to do, it's hard to explain, it's hard to know if you got it right. With folios, I've got the compiler working for me, telling me that I got some of the low-level bits right (or wrong), leaving me free to notice "Oh, wait, we got the accounting wrong because writeback assumes that a page is only PAGE_SIZE bytes". I would _never_ have noticed that with the THP tree. I only noticed it because transitioning things to folios made me read the writeback code and wonder about the 'inc_wb_stat' call, see that it's measuring something in 'number of pages' and realise that the wb_stat accounting needs to be fixed. > The compile-time check is nice, but I'm not sure it would be that much > more effective at catching things than a few centrally placed warns > inside PageFoo(), get_page() etc. and other things that should not > encounter tail pages in the first place (with __helpers for the few > instances that do). And given the invasiveness of this change, they > ought to be very drastically better at it, and obviously so, IMO. We should have come up with a new type 15 years ago instead of doing THP. But the second best time to invent a new type for "memory objects which are at least as big as a page" is right now. Because it only gets more painful over time.
On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > If that is the case, shouldn't there in the long term only be very > few, easy to review instances of things like compound_head(), > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > never see tail pages and 2) never assume a compile-time page size? Probably. > But this part is already getting better, and has gotten better, with > the page cache (largely?) going native for example. As long as there is no strong typing it is going to remain a mess. > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. The special case is in the eye of the beholder. I suspect we'll end up using the folio in most FS/VM interaction eventually, which makes it the common case. But I don't see how it is the special case? Yes, changing from page to folio just about everywhere causes more change, but it also allows us to: a) do this gradually, and b) thus actually audit that we do the right thing everywhere. And I think Willy's whole series (the git branch, not just the few patches sent out) very clearly shows the benefit of that. > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. Why does the name matter? While there are arguments both ways, the clean break certainly helps to remind everyone that this is not your grandfather's fixed-size page.
> > The compile-time check is nice, but I'm not sure it would be that much > more effective at catching things than a few centrally placed warns > inside PageFoo(), get_page() etc. and other things that should not > encounter tail pages in the first place (with __helpers for the few > instances that do). Eeek, no. No amount of runtime checks is going to replace compile time type safety.
Johannes Weiner <hannes@cmpxchg.org> wrote: > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. > > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. I like the idea of logically separating individual hardware pages from abstract bundles of pages by using a separate type for them - at least in filesystem code. I'm trying to abstract some of the handling out of the network filesystems and into a common library plus ITER_XARRAY to insulate those filesystems from the VM. David
On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > > This series introduces the 'struct folio' as a replacement for > > > head-or-base pages. This initial set reduces the kernel size by > > > approximately 6kB, although its real purpose is adding infrastructure > > > to enable further use of the folio. > > > > > > The intent is to convert all filesystems and some device drivers to work > > > in terms of folios. This series contains a lot of explicit conversions, > > > but it's important to realise it's removing a lot of implicit conversions > > > in some relatively hot paths. There will be very few conversions from > > > folios when this work is completed; filesystems, the page cache, the > > > LRU and so on will generally only deal with folios. > > > > If that is the case, shouldn't there in the long term only be very > > few, easy to review instances of things like compound_head(), > > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > > never see tail pages and 2) never assume a compile-time page size? > > I don't know exactly where we get to eventually. There are definitely > some aspects of the filesystem<->mm interface which are page-based > (eg ->fault needs to look up the exact page, regardless of its > head/tail/base nature), while ->readpage needs to talk in terms of > folios. I can imagine we'd eventually want fault handlers that can also fill in larger chunks of data if the file is of the right size and the MM is able to (and policy/heuristics determine to) go with a huge page. > > What are the higher-level places that in the long-term should be > > dealing with tail pages at all? Are there legit ones besides the page > > allocator, THP splitting internals & pte-mapped compound pages? > > I can't tell. 
I think this patch maybe illustrates some of the > problems, but maybe it's just an intermediate problem: > > https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b > > It deals mostly in terms of folios, but when it needs to kmap() and > memcmp(), then it needs to work in terms of pages. I don't think it's > avoidable (maybe we bury the "dealing with pages" inside a kmap() > wrapper somewhere, but I'm not sure that's better). Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We may be able to just kmap whole folios too, which are more likely to be small pages on highmem systems anyway. > > Some compound_head() that are currently in the codebase are already > > unnecessary. Like the one in activate_page(). > > Right! And it's hard to find & remove them without very careful analysis, > or particularly deep knowledge. With folios, we can remove them without > terribly deep thought. True. It definitely also helps mark the places that have been converted from the top down and which ones haven't. Without that you need to think harder about the context ("How would a tail page even get here?" vs. "No page can get here, only folios" ;-)) Again, I think that's something that would automatically be better in the long term when compound_page() and PAGE_SIZE themselves would stand out like sore thumbs. But you raise a good point: there is such an overwhelming amount of them right now that it's difficult to do this without a clearer marker and help from the type system. > > And looking at grep, I wouldn't be surprised if only the page table > > walkers need the page_compound() that mark_page_accessed() does. 
We > > would be better off if they did the translation once and explicitly in > > the outer scope, where it's clear they're dealing with a pte-mapped > > compound page, instead of having a series of rather low level helpers > > (page flags testing, refcount operations, LRU operations, stat > > accounting) all trying to be clever but really just obscuring things > > and imposing unnecessary costs on the vast majority of cases. > > > > So I fully agree with the motivation behind this patch. But I do > > wonder why it's special-casing the commmon case instead of the rare > > case. It comes at a huge cost. Short term, the churn of replacing > > 'page' with 'folio' in pretty much all instances is enormous. > > Because people (think they) know what a page is. It's PAGE_SIZE bytes > long, it occupies one PTE, etc, etc. A folio is new and instead of > changing how something familiar (a page) behaves, we're asking them > to think about something new instead that behaves a lot like a page, > but has differences. Yeah, that makes sense. > > And longer term, I'm not convinced folio is the abstraction we want > > throughout the kernel. If nobody should be dealing with tail pages in > > the first place, why are we making everybody think in 'folios'? Why > > does a filesystem care that huge pages are composed of multiple base > > pages internally? This feels like an implementation detail leaking out > > of the MM code. The vast majority of places should be thinking 'page' > > with a size of 'page_size()'. Including most parts of the MM itself. > > I think pages already leaked out of the MM and into filesystems (and > most of the filesystem writers seem pretty unknowledgable about how > pages and the page cache work, TBH). That's OK! Or it should be OK. > Filesystem authors should be experts on how their filesystem works. > Everywhere that they have to learn about the page cache is a distraction > and annoyance for them. > > I mean, I already tried what you're suggesting. 
It's really freaking > hard. It's hard to do, it's hard to explain, it's hard to know if you > got it right. With folios, I've got the compiler working for me, telling > me that I got some of the low-level bits right (or wrong), leaving me > free to notice "Oh, wait, we got the accounting wrong because writeback > assumes that a page is only PAGE_SIZE bytes". I would _never_ have > noticed that with the THP tree. I only noticed it because transitioning > things to folios made me read the writeback code and wonder about the > 'inc_wb_stat' call, see that it's measuring something in 'number of pages' > and realise that the wb_stat accounting needs to be fixed. I agree with all of this whole-heartedly. The reason I asked about who would deal with tail pages in the long term is because I think optimally most places would just think of these things as descriptors for variable lengths of memory. And only the allocator looks behind the curtain and deals with the (current!) reality that they're stitched together from fixed-size objects. To me, folios seem to further highlight this implementation detail, more so than saying a page is now page_size() - although I readily accept that the latter didn't turn out to be a viable mid-term strategy in practice at all, and that a clean break is necessary sooner rather than later (instead of cleaning up the page api now and replacing the backing pages with struct hwpage or something later). The name of the abstraction indicates how we think we're supposed to use it, what behavior stands out as undesirable. For example, you brought up kmap/memcpy/usercopy, which is a pretty common operation. Should they continue to deal with individual tail pages, and thereby perpetuate the exposure of these low-level MM building blocks to drivers and filesystems? It means folio -> page lookups will remain common - and certainly the concept of the folio suggests thinking of it as a couple of pages strung together.
And the more this is the case, the less it stands out when somebody is dealing with low-level pages when really they shouldn't be - the thing this is trying to fix. Granted it's narrowing the channel quite a bit. But it's also so pervasively used that I do wonder if it's possible to keep up with creative new abuses. But I also worry about the longevity of the concept in general. This is one of the most central and fundamental concepts in the kernel. Is this going to make sense in the future? In 5 years even? > > The compile-time check is nice, but I'm not sure it would be that much > > more effective at catching things than a few centrally placed warns > > inside PageFoo(), get_page() etc. and other things that should not > > encounter tail pages in the first place (with __helpers for the few > > instances that do). And given the invasiveness of this change, they > > ought to be very drastically better at it, and obviously so, IMO. > > We should have come up with a new type 15 years ago instead of doing THP. > But the second best time to invent a new type for "memory objects which > are at least as big as a page" is right now. Because it only gets more > painful over time. Yes and no. Yes because I fully agree that too much detail of the pages has leaked into all kinds of places where it shouldn't be, and a new abstraction for what most places interact with is a good idea IMO. But we're also headed in a direction with the VM that gives me pause about the folios-are-multiple-pages abstraction. How long are we going to have multiple pages behind a huge page? Common storage drives are getting fast enough that simple buffered IO workloads are becoming limited by CPU, just because it's too many individual pages to push through the cache. We have pending patches to rewrite the reclaim algorithm because rmap is falling apart with the rate of paging we're doing.
We'll need larger pages in the VM not just for optimizing TLB access, but to cut transaction overhead for paging in general (I know you're already onboard with this, especially on the page cache side, just stating it for completeness). But for that to work, we'll need the allocator to produce huge pages at the necessary rate, too. The current implementation likely won't scale. Compaction is expensive enough that we have to weigh when to allocate huge pages for long-lived anon regions, let alone allocate them for streaming IO cache entries. But if the overwhelming number of requests going to the page allocator are larger than 4k pages - anon regions? check. page cache? likely a sizable share. slub? check. network? check - does it even make sense to have that as the default block size for the page allocator anymore? Or even allocate struct page at this granularity? So I think transitioning away from ye olde page is a great idea. I wonder this: have we mapped out the near future of the VM enough to say that the folio is the right abstraction? What does 'folio' mean when it corresponds to either a single page or some slab-type object with no dedicated page? If we go through with all the churn now anyway, IMO it makes at least sense to ditch all association and conceptual proximity to the hardware page or collections thereof. Simply say it's some length of memory, and keep thing-to-page translations out of the public API from the start. I mean, is there a good reason to keep this baggage? mem_t or something.

	mem = find_get_mem(mapping, offset);
	p = kmap(mem, offset - mem_file_offset(mem), len);
	copy_from_user(p, buf, len);
	kunmap(mem);
	SetMemDirty(mem);
	put_mem(mem);

There are 10k instances of 'page' in mm/ outside the page allocator, a majority of which will be the new thing. 14k in fs. I don't think I have the strength to type shrink_folio_list(), or explain to new people what it means, years after it has stopped making sense.
On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > > > This series introduces the 'struct folio' as a replacement for > > > > head-or-base pages. This initial set reduces the kernel size by > > > > approximately 6kB, although its real purpose is adding infrastructure > > > > to enable further use of the folio. > > > > > > > > The intent is to convert all filesystems and some device drivers to work > > > > in terms of folios. This series contains a lot of explicit conversions, > > > > but it's important to realise it's removing a lot of implicit conversions > > > > in some relatively hot paths. There will be very few conversions from > > > > folios when this work is completed; filesystems, the page cache, the > > > > LRU and so on will generally only deal with folios. > > > > > > If that is the case, shouldn't there in the long term only be very > > > few, easy to review instances of things like compound_head(), > > > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > > > never see tail pages and 2) never assume a compile-time page size? > > > > I don't know exactly where we get to eventually. There are definitely > > some aspects of the filesystem<->mm interface which are page-based > > (eg ->fault needs to look up the exact page, regardless of its > > head/tail/base nature), while ->readpage needs to talk in terms of > > folios. > > I can imagine we'd eventually want fault handlers that can also fill > in larger chunks of data if the file is of the right size and the MM > is able to (and policy/heuristics determine to) go with a huge page. Oh yes, me too! The way I think this works is that the VM asks for the specific page, just as it does today and the ->fault handler returns the page. 
Then the VM looks up the folio for that page, and asks the arch to map the entire folio. How the arch does that is up to the arch -- if it's PMD sized and aligned, it can do that; if the arch knows that it should use 8 consecutive PTE entries to map 32KiB all at once, it can do that. But I think we need the ->fault handler to return the specific page, because that's how we can figure out whether this folio is mapped at the appropriate alignment to make this work. If the fault handler returns the folio, I don't think we can figure out if the alignment is correct. Maybe we can for the page cache, but a device driver might have a compound page allocated for its own purposes, and it might not be amenable to the same rules as the page cache. > > https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b > > > > It deals mostly in terms of folios, but when it needs to kmap() and > > memcmp(), then it needs to work in terms of pages. I don't think it's > > avoidable (maybe we bury the "dealing with pages" inside a kmap() > > wrapper somewhere, but I'm not sure that's better). > > Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We > may be able to just kmap whole folios too, which are more likely to be > small pages on highmem systems anyway. I got told "no" when asking for kmap_local() of a compound page. Maybe that's changeable, but I'm assuming that kmap() space will continue to be tight for the foreseeable future (until we can kill highmem forever). > > > Some compound_head() that are currently in the codebase are already > > > unnecessary. Like the one in activate_page(). > > > > Right! And it's hard to find & remove them without very careful analysis, > > or particularly deep knowledge. With folios, we can remove them without > > terribly deep thought. > > True. It definitely also helps mark the places that have been > converted from the top down and which ones haven't. 
Without that you > need to think harder about the context ("How would a tail page even > get here?" vs. "No page can get here, only folios" ;-)) Exactly! Take a look at page_mkclean(). Its implementation strongly suggests that it expects a head page, but I think it'll unmap a single page if passed a tail page ... and it's not clear to me that isn't the behaviour that pagecache_isize_extended() would prefer. Tricky. > > I mean, I already tried what you're suggesting. It's really freaking > > hard. It's hard to do, it's hard to explain, it's hard to know if you > > got it right. With folios, I've got the compiler working for me, telling > > me that I got some of the low-level bits right (or wrong), leaving me > > free to notice "Oh, wait, we got the accounting wrong because writeback > > assumes that a page is only PAGE_SIZE bytes". I would _never_ have > > noticed that with the THP tree. I only noticed it because transitioning > > things to folios made me read the writeback code and wonder about the > > 'inc_wb_stat' call, see that it's measuring something in 'number of pages' > > and realise that the wb_stat accounting needs to be fixed. > > I agree with all of this whole-heartedly. > > The reason I asked about who would deal with tail pages in the long > term is because I think optimally most places would just think of > these things as descriptors for variable lengths of memory. And only > the allocator looks behind the curtain and deals with the (current!) > reality that they're stitched together from fixed-size objects. > > To me, folios seem to further highlight this implementation detail, > more so than saying a page is now page_size() - although I readily > accept that the latter didn't turn out to be a viable mid-term > strategy in practice at all, and that a clean break is necessary > sooner rather than later (instead of cleaning up the page api now and > replacing the backing pages with struct hwpage or something later). 
> > The name of the abstraction indicates how we think we're supposed to > use it, what behavior stands out as undesirable. > > For example, you brought up kmap/memcpy/usercopy, which is a pretty > common operation. Should they continue to deal with individual tail > pages, and thereby perpetuate the exposure of these low-level MM > building blocks to drivers and filesystems? > > It means folio -> page lookups will remain common - and certainly > the concept of the folio suggests thinking of it as a couple of pages > strung together. And the more this is the case, the less it stands out > when somebody is dealing with low-level pages when really they > shouldn't be - the thing this is trying to fix. Granted it's narrowing > the channel quite a bit. But it's also so pervasively used that I do > wonder if it's possible to keep up with creative new abuses. > > But I also worry about the longevity of the concept in general. This > is one of the most central and fundamental concepts in the kernel. Is > this going to make sense in the future? In 5 years even? One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): mm: Add kmap_local_folio This allows us to map a portion of a folio. Callers can only expect to access up to the next page boundary. 
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index 7902c7d8b55f..55a29c9d562f 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) return __kmap_local_page_prot(page, kmap_prot); } +static inline void *kmap_local_folio(struct folio *folio, size_t offset) +{ + struct page *page = &folio->page + offset / PAGE_SIZE; + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; +} Partly I haven't shared that one because I'm not 100% sure that 'byte offset relative to start of folio' is the correct interface. I'm looking at some users and thinking that maybe 'byte offset relative to start of file' might be better. Or perhaps that's just filesystem-centric thinking. > > > The compile-time check is nice, but I'm not sure it would be that much > > > more effective at catching things than a few centrally placed warns > > > inside PageFoo(), get_page() etc. and other things that should not > > > encounter tail pages in the first place (with __helpers for the few > > > instances that do). And given the invasiveness of this change, they > > > ought to be very drastically better at it, and obviously so, IMO. > > > > We should have come up with a new type 15 years ago instead of doing THP. > > But the second best time to invent a new type for "memory objects which > > are at least as big as a page" is right now. Because it only gets more > > painful over time. > > Yes and no. > > Yes because I fully agree that too much detail of the pages have > leaked into all kinds of places where they shouldn't be, and a new > abstraction for what most places interact with is a good idea IMO. > > But we're also headed in a direction with the VM that give me pause > about the folios-are-multiple-pages abstraction. > > How long are we going to have multiple pages behind a huge page? 
Yes, that's a really good question. I think Muchun Song's patches are an interesting and practical way of freeing up memory _now_, but long-term we'll need something different. Maybe we end up with dynamically allocated pages (perhaps when we break a 2MB page into 1MB pages in the buddy allocator). > Common storage drives are getting fast enough that simple buffered IO > workloads are becoming limited by CPU, just because it's too many > individual pages to push through the cache. We have pending patches to > rewrite the reclaim algorithm because rmap is falling apart with the > rate of paging we're doing. We'll need larger pages in the VM not just > for optimizing TLB access, but to cut transaction overhead for paging > in general (I know you're already onboard with this, especially on the > page cache side, just stating it for completeness). yes, yes, yes and yes. Dave Chinner produced a fantastic perf report for me illustrating how kswapd and the page cache completely fall apart under what must be a common streaming load. Just create a file 2x the size of memory, then cat it to /dev/null. cat tries to allocate memory in readahead and ends up contending on the i_pages lock with kswapd who's trying to free pages from the LRU list one at a time. Larger pages will help with that because more work gets done with each lock acquisition, but I can't help but feel that the real solution is for the page cache to notice that this is a streaming workload and have cat eagerly recycle pages from this file. That's a biggish project; we know how many pages there are in this mapping, but how to know when to switch from "allocate memory from the page allocator" to "just delete a page from early in the file and reuse it at the current position in the file"? > But for that to work, we'll need the allocator to produce huge pages > at the necessary rate, too. The current implementation likely won't > scale. 
Compaction is expensive enough that we have to weigh when to > allocate huge pages for long-lived anon regions, let alone allocate > them for streaming IO cache entries. Heh, I have that as a work item for later this year -- give the page allocator per-cpu lists of compound pages, not just order-0 pages. That'll save us turning compound pages back into buddy pages, only to turn them into compound pages again. I also have a feeling that the page allocator either needs to become a sub-allocator of an allocator that deals in, say, 1GB chunks of memory, or it needs to become reluctant to break up larger orders. eg if the dcache asks for just one more dentry, it should have to go through at least one round of reclaim before we choose to break up a high-order page to satisfy that request. > But if the overwhelming number of requests going to the page allocator > are larger than 4k pages - anon regions? check. page cache? likely a > sizable share. slub? check. network? check - does it even make sense > to have that as the default block size for the page allocator anymore? > Or even allocate struct page at this granularity? Yep, others have talked about that as well. I think I may even have said a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all, my first 386 Linux system was 4-8MB of RAM (it got upgraded). The 16GB laptop that I now have is 2048 times more RAM, so 4x the number of pages that system had. But people seem attached to being able to use smaller page sizes. There's that pesky "compatibility" argument. > So I think transitioning away from ye olde page is a great idea. I > wonder this: have we mapped out the near future of the VM enough to > say that the folio is the right abstraction? > > What does 'folio' mean when it corresponds to either a single page or > some slab-type object with no dedicated page? 
> > If we go through with all the churn now anyway, IMO it makes at least > sense to ditch all association and conceptual proximity to the > hardware page or collections thereof. Simply say it's some length of > memory, and keep thing-to-page translations out of the public API from > the start. I mean, is there a good reason to keep this baggage? > > mem_t or something. > > mem = find_get_mem(mapping, offset); > p = kmap(mem, offset - mem_file_offset(mem), len); > copy_from_user(p, buf, len); > kunmap(mem); > SetMemDirty(mem); > put_mem(mem); I think there's still value to the "new thing" being a power of two in size. I'm not sure you were suggesting otherwise, but it's worth putting on the table as something we explicitly agree on (or not!) I mean what you've written there looks a _lot_ like where I get to in the iomap code. status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap, srcmap); if (unlikely(status)) break; if (mapping_writably_mapped(inode->i_mapping)) flush_dcache_folio(folio); /* We may be part-way through a folio */ offset = offset_in_folio(folio, pos); copied = iov_iter_copy_from_user_atomic(folio, i, offset, bytes); copied = iomap_write_end(inode, pos, bytes, copied, folio, iomap, srcmap); (which eventually calls TestSetFolioDirty) It doesn't copy more than PAGE_SIZE bytes per iteration because iov_iter_copy_from_user_atomic() isn't safe to do that yet. But in *principle*, it should be able to. > There are 10k instances of 'page' in mm/ outside the page allocator, a > majority of which will be the new thing. 14k in fs. I don't think I > have the strength to type shrink_folio_list(), or explain to new > people what it means, years after it has stopped making sense. One of the things I don't like about the current iteration of folio is that getting to things is folio->page.mapping. 
I think it does want to be folio->mapping, and I'm playing around with this: struct folio { - struct page page; + union { + struct page page; + struct { + unsigned long flags; + struct list_head lru; + struct address_space *mapping; + pgoff_t index; + unsigned long private; + atomic_t _mapcount; + atomic_t _refcount; + }; + }; }; +static inline void folio_build_bug(void) +{ +#define FOLIO_MATCH(pg, fl) \ +BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl)); + + FOLIO_MATCH(flags, flags); + FOLIO_MATCH(lru, lru); + FOLIO_MATCH(mapping, mapping); + FOLIO_MATCH(index, index); + FOLIO_MATCH(private, private); + FOLIO_MATCH(_mapcount, _mapcount); + FOLIO_MATCH(_refcount, _refcount); +#undef FOLIO_MATCH + BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio)); +} with the intent of eventually renaming page->mapping to page->__mapping so people can't look at page->mapping on a tail page. If we even have tail pages eventually. I could see a future where we have pte_to_pfn(), pfn_to_folio() and are completely page-free (... the vm_fault would presumably return a pfn instead of a page at that point ...). But that's too ambitious a project to succeed any time soon. There's a lot of transitional stuff in these patches where I do &folio->page. I cringe a little every time I write that. So yes, let's ask the question of "Is this the right short term, medium term or long term approach?" I think it is, at least in broad strokes. Let's keep refining it. Thanks for your contribution here; it's really useful.
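The union-overlay pattern in the struct folio diff above can be tried out in userspace with cut-down mock structs. This is only a sketch: mock_page and mock_folio are illustrative stand-ins for the real (much larger) layouts, and C11's _Static_assert plays the role the kernel's BUILD_BUG_ON plays in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for struct page / struct folio; the real structs
 * carry many more fields and unions. */
struct mock_page {
	unsigned long flags;
	void *mapping;
	unsigned long index;
};

struct mock_folio {
	union {
		struct mock_page page;
		struct {	/* anonymous mirror of the page fields */
			unsigned long flags;
			void *mapping;
			unsigned long index;
		};
	};
};

/* The FOLIO_MATCH idea: prove at compile time that each aliased field
 * sits at the same offset in both views of the union. */
#define FOLIO_MATCH(pg, fl) \
	_Static_assert(offsetof(struct mock_page, pg) == \
		       offsetof(struct mock_folio, fl), #pg)
FOLIO_MATCH(flags, flags);
FOLIO_MATCH(mapping, mapping);
FOLIO_MATCH(index, index);
_Static_assert(sizeof(struct mock_page) == sizeof(struct mock_folio),
	       "page and folio must be the same size");

/* With the offsets proven equal, folio->index can replace
 * folio->page.index. */
static unsigned long folio_index(struct mock_folio *folio)
{
	return folio->index;
}
```

Because the anonymous struct mirrors the page layout field for field, either spelling reads the same storage, and the build bug fires as a compile error the moment the two drift apart.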
On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote: > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): > > mm: Add kmap_local_folio > > This allows us to map a portion of a folio. Callers can only expect > to access up to the next page boundary. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h > index 7902c7d8b55f..55a29c9d562f 100644 > --- a/include/linux/highmem-internal.h > +++ b/include/linux/highmem-internal.h > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) > return __kmap_local_page_prot(page, kmap_prot); > } > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset) > +{ > + struct page *page = &folio->page + offset / PAGE_SIZE; > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; > +} > > Partly I haven't shared that one because I'm not 100% sure that 'byte > offset relative to start of folio' is the correct interface. I'm looking > at some users and thinking that maybe 'byte offset relative to start > of file' might be better. Or perhaps that's just filesystem-centric > thinking. Right, this doesn't seem specific to files just because they would be the primary users of it. > > But for that to work, we'll need the allocator to produce huge pages > > at the necessary rate, too. The current implementation likely won't > > scale. Compaction is expensive enough that we have to weigh when to > > allocate huge pages for long-lived anon regions, let alone allocate > > them for streaming IO cache entries. > > Heh, I have that as a work item for later this year -- give the page > allocator per-cpu lists of compound pages, not just order-0 pages. 
> That'll save us turning compound pages back into buddy pages, only to > turn them into compound pages again. > > I also have a feeling that the page allocator either needs to become a > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > or it needs to become reluctant to break up larger orders. eg if the > dcache asks for just one more dentry, it should have to go through at > least one round of reclaim before we choose to break up a high-order > page to satisfy that request. Slub already allocates higher-order pages for dentries: slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 ^ here and it could avoid even more internal fragmentation with bigger orders. It only doesn't because of the overhead of allocating them. If the default block size in the allocator were 2M, we'd also get slab packing at that granularity, and we wouldn't have to worry about small objects breaking huge pages any more than we worry about slab objects fragmenting 4k pages today. > > But if the overwhelming number of requests going to the page allocator > > are larger than 4k pages - anon regions? check. page cache? likely a > > sizable share. slub? check. network? check - does it even make sense > > to have that as the default block size for the page allocator anymore? > > Or even allocate struct page at this granularity? > > Yep, others have talked about that as well. I think I may even have said > a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all, > my first 386 Linux system was 4-8MB of RAM (it got upgraded). The 16GB > laptop that I now have is 2048 times more RAM, so 4x the number of pages > that system had. > > But people seem attached to being able to use smaller page sizes. > There's that pesky "compatibility" argument. 
Right, that's why I'm NOT saying we should eliminate the support for 4k chunks in the page cache and page tables. That's still useful if you have lots of small files. I'm just saying it doesn't have to be the default that everything is primarily optimized for. We can make the default allocation size of the allocator correspond to a hugepage and have a secondary allocator level for 4k chunks. Like slab, but fixed-size and highmem-aware. It makes sense to make struct page 2M as well. It would save a ton of memory on average and reduce the pressure we have on struct page's size today. And we really don't need struct page at 4k just to support this unit of paging when necessary: page tables don't care, they use pfns and can point to any 4k offset, struct page or no struct page. For the page cache, we can move mapping, index, lru, etc. from today's struct page into an entry descriptor that could either sit in a native 2M struct page (just like today), or be allocated on demand and point into a chunked struct page. Same for <2M anonymous mappings. Hey, didn't you just move EXACTLY those fields into the folio? ;) All this to reiterate, I really do agree with the concept of a new type of object for paging, page cache entries, etc. But I think there are good reasons to assume that this unit of paging needs to support sizes smaller than the standard page size used by the kernel at large, and so 'bundle of pages' is not a good way of defining it. It can easily cause problems down the line again if people continue to assume that there is at least one PAGE_SIZE struct page in a folio. And it's not obvious to me why it really NEEDS to be 'bundle of pages' instead of just 'chunk of memory'. > > So I think transitioning away from ye olde page is a great idea. I > wonder this: have we mapped out the near future of the VM enough to > say that the folio is the right abstraction? 
> > > > What does 'folio' mean when it corresponds to either a single page or > > some slab-type object with no dedicated page? > > > > If we go through with all the churn now anyway, IMO it makes at least > > sense to ditch all association and conceptual proximity to the > > hardware page or collections thereof. Simply say it's some length of > > memory, and keep thing-to-page translations out of the public API from > > the start. I mean, is there a good reason to keep this baggage? > > > > mem_t or something. > > > > mem = find_get_mem(mapping, offset); > > p = kmap(mem, offset - mem_file_offset(mem), len); > > copy_from_user(p, buf, len); > > kunmap(mem); > > SetMemDirty(mem); > > put_mem(mem); > > I think there's still value to the "new thing" being a power of two > in size. I'm not sure you were suggesting otherwise, but it's worth > putting on the table as something we explicitly agree on (or not!) Ha, I wasn't thinking about minimum alignment. I used the byte offsets because I figured that's what's natural to the fs and saw no reason to have it think in terms of page size in this example. From an implementation pov, since anything in the page cache can end up in a page table, it probably doesn't make a whole lot of sense to allow quantities smaller than the smallest unit of paging supported by the processor. But I wonder if that's mostly something the MM would care about when it allocates these objects, not necessarily something that needs to be reflected in the interface or the filesystem. The other point I was trying to make was just the alternate name. As I said above, I think 'bundle of pages' as a concept is a strategic error that will probably come back to haunt us. I also have to admit, I really hate the name. We may want to stop people thinking of PAGE_SIZE, but this term doesn't give people any clue WHAT to think of. 
Ten years down the line, when the possible confusion between folio and page and PAGE_SIZE has been eradicated, people still will have to google what a folio is, and then have a hard time retaining a mental image. I *know* what it is and I still have a hard time reading code that uses it. That's why I drafted around with the above code, to see if it would go down easier. I think it does. It's simple, self-explanatory, but abstract enough as to not make assumptions around its implementation. Filesystem look up cache memory, write data in it, mark memory dirty. Maybe folio makes more sense to native speakers, but I have never heard this term. Of course when you look it up, it's "something to do with pages" :D As a strategy to unseat the obsolete mental model around pages, IMO redirection would be preferable to confusion. > > There are 10k instances of 'page' in mm/ outside the page allocator, a > > majority of which will be the new thing. 14k in fs. I don't think I > > have the strength to type shrink_folio_list(), or explain to new > > people what it means, years after it has stopped making sense. > > One of the things I don't like about the current iteration of folio > is that getting to things is folio->page.mapping. 
I think it does want > to be folio->mapping, and I'm playing around with this: > > struct folio { > - struct page page; > + union { > + struct page page; > + struct { > + unsigned long flags; > + struct list_head lru; > + struct address_space *mapping; > + pgoff_t index; > + unsigned long private; > + atomic_t _mapcount; > + atomic_t _refcount; > + }; > + }; > }; > > +static inline void folio_build_bug(void) > +{ > +#define FOLIO_MATCH(pg, fl) \ > +BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl)); > + > + FOLIO_MATCH(flags, flags); > + FOLIO_MATCH(lru, lru); > + FOLIO_MATCH(mapping, mapping); > + FOLIO_MATCH(index, index); > + FOLIO_MATCH(private, private); > + FOLIO_MATCH(_mapcount, _mapcount); > + FOLIO_MATCH(_refcount, _refcount); > +#undef FOLIO_MATCH > + BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio)); > +} > > with the intent of eventually renaming page->mapping to page->__mapping > so people can't look at page->mapping on a tail page. If we even have > tail pages eventually. I could see a future where we have pte_to_pfn(), > pfn_to_folio() and are completely page-free (... the vm_fault would > presumably return a pfn instead of a page at that point ...). But that's > too ambitious a project to succeed any time soon. > > There's a lot of transitional stuff in these patches where I do > &folio->page. I cringe a little every time I write that. Instead of the union in there, could you do this? struct thing { struct address_space *mapping; pgoff_t index; ... }; struct page { union { struct thing thing; ... } } and use container_of() to get to the page in those places? > So yes, let's ask the question of "Is this the right short term, medium > term or long term approach?" I think it is, at least in broad strokes. > Let's keep refining it. Yes, yes, and yes. :)
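The container_of() suggestion above can likewise be sketched in userspace. Again, struct thing and mock_page here are hypothetical layouts invented just to show the pointer arithmetic, not the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical embedded descriptor, as proposed in the mail above. */
struct thing {
	void *mapping;
	unsigned long index;
};

struct mock_page {
	unsigned long flags;
	union {
		struct thing thing;
		/* ... other overloaded uses of these words ... */
		unsigned long words[2];
	};
};

/* Standard container_of: subtract the member's offset to recover a
 * pointer to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The places that still need a page convert back explicitly. */
static struct mock_page *thing_to_page(struct thing *t)
{
	return container_of(t, struct mock_page, thing);
}
```

The trade-off versus the union-overlay version is that the conversion is spelled out at each call site rather than hidden behind aliased field names.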
I'm going to respond to some points in detail below, but there are a couple of overarching themes that I want to bring out up here. Grand Vision ~~~~~~~~~~~~ I haven't outlined my long-term plan. Partly because it is a _very_ long way off, and partly because I think what I'm doing stands on its own. But some of the points below bear on this, so I'll do it now. Eventually, I want to make struct page optional for allocations. It's too small for some things (allocating page tables, for example), and overly large for others (allocating a 2MB page, networking page_pool). I don't want to change its size in the meantime; having a struct page refer to PAGE_SIZE bytes is something that's quite deeply baked in. In broad strokes, I think that having a Power Of Two Allocator with Descriptor (POTAD) is a useful foundational allocator to have. The specific allocator that we call the buddy allocator is very clever for the 1990s, but touches too many cachelines to be good with today's CPUs. The generalisation of the buddy allocator to the POTAD lets us allocate smaller quantities (eg a 512 byte block) and allocate descriptors which differ in size from a struct page. For an extreme example, see xfs_buf which is 360 bytes and is the descriptor for an allocation between 512 and 65536 bytes. There are times when we need to get from the physical address to the descriptor, eg memory-failure.c or get_user_pages(). This is the equivalent of phys_to_page(), and it's going to have to be a lookup tree. I think this is a role for the Maple Tree, but it's not ready yet. I don't know if it'll be fast enough for this case. There's also the need (particularly for memory-failure) to determine exactly what kind of descriptor we're dealing with, and also its size. Even its owner, so we can notify them of memory failure. There's still a role for the slab allocator, eg allocating objects which aren't a power of two, or allocating things for which the user doesn't need a descriptor of its own. 
We can even keep the 'alloc_page' interface around; it's just a specialisation of the POTAD. Anyway, there's a lot of work here, and I'm sure there are many holes to be poked in it, but eventually I want the concept of tail pages to go away, and for pages to become not-the-unit of memory management in Linux any more. Naming ~~~~~~ The fun thing about the word folio is that it actually has several meanings. Quoting wikipedia, : it is firstly a term for a common method of arranging sheets of paper : into book form, folding the sheet only once, and a term for a book : made in this way; secondly it is a general term for a sheet, leaf or : page in (especially) manuscripts and old books; and thirdly it is an : approximate term for the size of a book, and for a book of this size. So while it is a collection of pages in the first sense, in the second sense it's also its own term for a "sheet, leaf or page". I (still) don't insist on the word folio, but I do insist that it be _a_ word. The word "slab" was a great coin by Bonwick -- it didn't really mean anything in the context of memory before he used it, and now we all know exactly what it means. I just don't want us to end up with struct uma { /* unit of memory allocation */ We could choose another (short, not-used-in-kernel) word almost at random. How about 'kerb'? What I haven't touched on anywhere in this, is whether a folio is the descriptor for all POTA or whether it's specifically the page cache descriptor. I like the idea of having separate descriptors for objects in the page cache from anonymous or other allocations. But I'm not very familiar with the rmap code, and that wants to do things like manipulate the refcount on a descriptor without knowing whether it's a file or anon page. Or neither (eg device driver memory mapped to userspace. Or vmalloc memory mapped to userspace. Or ...) We could get terribly carried away with this ... 
struct mappable { /* any mappable object must be LRU */ struct list_head lru; int refcount; int mapcount; }; struct folio { /* for page cache */ unsigned long flags; struct mappable map; struct address_space *mapping; pgoff_t index; void *private; }; struct quarto { /* for anon pages */ unsigned long flags; struct mappable map; swp_entry_t swp; struct anon_vma *vma; }; but I'm not sure we want to go there. On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote: > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote: > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): > > > > mm: Add kmap_local_folio > > > > This allows us to map a portion of a folio. Callers can only expect > > to access up to the next page boundary. > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h > > index 7902c7d8b55f..55a29c9d562f 100644 > > --- a/include/linux/highmem-internal.h > > +++ b/include/linux/highmem-internal.h > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) > > return __kmap_local_page_prot(page, kmap_prot); > > } > > > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset) > > +{ > > + struct page *page = &folio->page + offset / PAGE_SIZE; > > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; > > +} > > > > Partly I haven't shared that one because I'm not 100% sure that 'byte > > offset relative to start of folio' is the correct interface. I'm looking > > at some users and thinking that maybe 'byte offset relative to start > > of file' might be better. Or perhaps that's just filesystem-centric > > thinking. 
> > Right, this doesn't seem specific to files just because they would be > the primary users of it. Yeah. I think I forgot to cc you on this: https://lore.kernel.org/linux-fsdevel/20210325032202.GS1719932@casper.infradead.org/ and "byte offset relative to the start of the folio" works just fine: + offset = offset_in_folio(folio, diter->pos); + +map: + diter->entry = kmap_local_folio(folio, offset); > > > But for that to work, we'll need the allocator to produce huge pages > > > at the necessary rate, too. The current implementation likely won't > > > scale. Compaction is expensive enough that we have to weigh when to > > > allocate huge pages for long-lived anon regions, let alone allocate > > > them for streaming IO cache entries. > > > > Heh, I have that as a work item for later this year -- give the page > > allocator per-cpu lists of compound pages, not just order-0 pages. > > That'll save us turning compound pages back into buddy pages, only to > > turn them into compound pages again. > > > > I also have a feeling that the page allocator either needs to become a > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > > or it needs to become reluctant to break up larger orders. eg if the > > dcache asks for just one more dentry, it should have to go through at > > least one round of reclaim before we choose to break up a high-order > > page to satisfy that request. > > Slub already allocates higher-order pages for dentries: > > slabinfo - version: 2.1 > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 > > ^ here > > and it could avoid even more internal fragmentation with bigger > orders. It only doesn't because of the overhead of allocating them. Oh, yes. Sorry, I didn't explain myself properly. 
If we have a lightly-loaded system with terabytes of memory (perhaps all the jobs it is running are CPU intensive and don't need much memory), the system has a tendency to clog up with negative dentries. Hundreds of millions of them. We rely on memory pressure to get rid of them, and when there finally is memory pressure, it takes literally hours. If there were a slight amount of pressure to trim the dcache at the point when we'd otherwise break up an order-4 page to get an order-2 page, the system would work much better. Obviously, we do want the dcache to be able to expand to the point where it's useful, but at the point that it's no longer useful, we need to trim it. It'd probably be better to have the dcache realise that its old entries aren't useful any more and age them out instead of relying on memory pressure to remove old entries, so this is probably an unnecessary digression. > If the default block size in the allocator were 2M, we'd also get slab > packing at that granularity, and we wouldn't have to worry about small > objects breaking huge pages any more than we worry about slab objects > fragmenting 4k pages today. Yup. I definitely see the attraction of letting the slab allocator allocate in larger units. On the other hand, you have to start worrying about underutilisation of the memory at _some_ size, and I'd argue the sweet spot is somewhere between 4kB and 2MB today. For example: fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0 That's currently using 20 pages. If slab were only allocating 2MB slabs from the page allocator, I'd have 1.9MB of ram unused in that cache. > > But people seem attached to being able to use smaller page sizes. > > There's that pesky "compatibility" argument. > > Right, that's why I'm NOT saying we should eliminate the support for > 4k chunks in the page cache and page tables. That's still useful if > you have lots of small files. 
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru, etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio? ;)

You say page tables don't actually need a struct page, but we do use it.

        struct {        /* Page table pages */
                unsigned long _pt_pad_1;        /* compound_head */
                pgtable_t pmd_huge_pte; /* protected by page->ptl */
                unsigned long _pt_pad_2;        /* mapping */
                union {
                        struct mm_struct *pt_mm; /* x86 pgds only */
                        atomic_t pt_frag_refcount; /* powerpc */
                };
#if ALLOC_SPLIT_PTLOCKS
                spinlock_t *ptl;
#else
                spinlock_t ptl;
#endif
        };

It's a problem because some architectures would really rather allocate
2KiB page tables (s390) or would like to support 4KiB page tables on a
64KiB base page size kernel (ppc).

[actually i misread your comment initially; you meant that page tables
point to PFNs and don't care what struct backs them ... i'm leaving
this in here because it illustrates a problem with change
struct-page-size-to-2MB]
On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > In broad strokes, I think that having a Power Of Two Allocator > with Descriptor (POTAD) is a useful foundational allocator to have. > The specific allocator that we call the buddy allocator is very clever for > the 1990s, but touches too many cachelines to be good with today's CPUs. > The generalisation of the buddy allocator to the POTAD lets us allocate > smaller quantities (eg a 512 byte block) and allocate descriptors which > differ in size from a struct page. For an extreme example, see xfs_buf > which is 360 bytes and is the descriptor for an allocation between 512 > and 65536 bytes. > > There are times when we need to get from the physical address to > the descriptor, eg memory-failure.c or get_user_pages(). This is the > equivalent of phys_to_page(), and it's going to have to be a lookup tree. > I think this is a role for the Maple Tree, but it's not ready yet. > I don't know if it'll be fast enough for this case. There's also the > need (particularly for memory-failure) to determine exactly what kind > of descriptor we're dealing with, and also its size. Even its owner, > so we can notify them of memory failure. A couple of things I forgot to mention ... I'd like the POTAD to be not necessarily tied to allocating memory. For example, I think it could be used to allocate swap space. eg the swap code could register the space in a swap file as allocatable through the POTAD, and then later ask the POTAD to allocate a POT from the swap space. The POTAD wouldn't need to be limited to MAX_ORDER. It should be perfectly capable of allocating 1TB if your machine has 1.5TB of RAM in it (... and things haven't got too fragmented) I think the POTAD can be used to replace the CMA. The CMA supports weirdo things like "Allocate 8MB of memory at a 1MB alignment", and I think that's doable within the data structures that I'm thinking about for the POTAD. 
It'd first try to allocate an 8MB chunk at 8MB alignment, and then if that's not possible, try to allocate two adjacent 4MB chunks; continuing down until it finds that there aren't 8x1MB chunks, at which point it can give up.
Hi Willy, On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > I'm going to respond to some points in detail below, but there are a > couple of overarching themes that I want to bring out up here. > > Grand Vision > ~~~~~~~~~~~~ > > I haven't outlined my long-term plan. Partly because it is a _very_ > long way off, and partly because I think what I'm doing stands on its > own. But some of the points below bear on this, so I'll do it now. > > Eventually, I want to make struct page optional for allocations. It's too > small for some things (allocating page tables, for example), and overly > large for others (allocating a 2MB page, networking page_pool). I don't > want to change its size in the meantime; having a struct page refer to > PAGE_SIZE bytes is something that's quite deeply baked in. Right, I think it's overloaded and it needs to go away from many contexts it's used in today. I think it describes a real physical thing, though, and won't go away as a concept. More on that below. > In broad strokes, I think that having a Power Of Two Allocator > with Descriptor (POTAD) is a useful foundational allocator to have. > The specific allocator that we call the buddy allocator is very clever for > the 1990s, but touches too many cachelines to be good with today's CPUs. > The generalisation of the buddy allocator to the POTAD lets us allocate > smaller quantities (eg a 512 byte block) and allocate descriptors which > differ in size from a struct page. For an extreme example, see xfs_buf > which is 360 bytes and is the descriptor for an allocation between 512 > and 65536 bytes. I actually disagree with this rather strongly. If anything, the buddy allocator has turned out to be a pretty poor fit for the foundational allocator. On paper, it is elegant and versatile in serving essentially arbitrary memory blocks. In practice, we mostly just need 4k and 2M chunks from it. 
And it sucks at the 2M ones because of the fragmentation caused by the ungrouped 4k blocks. The great thing about the slab allocator isn't just that it manages internal fragmentation of the larger underlying blocks. It also groups related objects by lifetime/age and reclaimability, which dramatically mitigates the external fragmentation of the memory space. The buddy allocator on the other hand has no idea what you want that 4k block for, and whether it pairs up well with the 4k block it just handed to somebody else. But the decision it makes in that moment is crucial for its ability to serve larger blocks later on. We do some mobility grouping based on how reclaimable or migratable the memory is, but it's not the full answer. A variable size allocator without object type grouping will always have difficulties producing anything but the smallest block size after some uptime. It's inherently flawed that way. What HAS proven itself is having the base block size correspond to a reasonable transaction unit for paging and page reclaim, then fill in smaller ranges with lifetime-aware slabbing, larger ranges with vmalloc and SG schemes, and absurdly large requests with CMA. We might be stuck with serving order-1, order-2 etc. for a little while longer for the few users who can't go to kvmalloc(), but IMO it's the wrong direction to expand into. Optimally the foundational allocator would just do one block size. > There are times when we need to get from the physical address to > the descriptor, eg memory-failure.c or get_user_pages(). This is the > equivalent of phys_to_page(), and it's going to have to be a lookup tree. > I think this is a role for the Maple Tree, but it's not ready yet. > I don't know if it'll be fast enough for this case. There's also the > need (particularly for memory-failure) to determine exactly what kind > of descriptor we're dealing with, and also its size. Even its owner, > so we can notify them of memory failure. 
A tree could be more memory efficient in the long term, but for starters a 2M page could have a struct smallpage *smallpages[512]; member that points to any allocated/mapped 4k descriptors. The page table level would tell you what you're looking at: a pmd is simple, a pte would map to a 4k pfn, whose upper bits identify a struct page then a page flag would tell you whether we have a pte-mapped 2M page or whether the lower pfn bits identify an offset in smallpages[]. It's one pointer for every 4k of RAM, which is a bit dumb, but not as dumb as having an entire struct page for each of those ;) > What I haven't touched on anywhere in this, is whether a folio is the > descriptor for all POTA or whether it's specifically the page cache > descriptor. I like the idea of having separate descriptors for objects > in the page cache from anonymous or other allocations. But I'm not very > familiar with the rmap code, and that wants to do things like manipulate > the refcount on a descriptor without knowing whether it's a file or > anon page. Or neither (eg device driver memory mapped to userspace. > Or vmalloc memory mapped to userspace. Or ...) The rmap code is all about the page type specifics, but once you get into mmap, page reclaim, page migration, we're dealing with fully fungible blocks of memory. I do like the idea of using actual language typing for the different things struct page can be today (fs page), but with a common type to manage the fungible block of memory backing it (allocation state, LRU & aging state, mmap state etc.) New types for the former are an easier sell. We all agree that there are too many details of the page - including the compound page implementation detail - inside the cache library, fs code and drivers. It's a slightly tougher sell to say that the core VM code itself (outside the cache library) needs a tighter abstraction for the struct page building block and the compound page structure. 
At least at this time while we're still sorting out how it all may work down the line. Certainly, we need something to describe fungible memory blocks: either a struct page that can be 4k and 2M compound, or a new thing that can be backed by a 2M struct page or a 4k struct smallpage. We don't know yet, so I would table the new abstraction type for this. I generally don't think we want a new type that does everything that the overloaded struct page already does PLUS the compound abstraction. Whatever name we pick for it, it'll always be difficult to wrap your head around such a beast. IMO starting with an explicit page cache descriptor that resolves to struct page inside core VM code (and maybe ->fault) for now makes the most sense: it greatly mitigates the PAGE_SIZE and tail page issue right away, and it's not in conflict with, but rather helps work toward, replacing the fungible memory unit behind it. There isn't too much overlap or generic code between cache and anon pages such that sharing a common descriptor would be a huge win (most overlap is at the fungible memory block level, and the physical struct page layout of course), so I don't think we should aim for a generic abstraction for both. As drivers go, I think there are slightly different requirements to filesystems, too. For filesystems, when the VM can finally do it (and the file range permits it), I assume we want to rather transparently increase the unit of data transfer from 4k to 2M. Most drivers that currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want us to bump their allocation sizes. There ARE instances where drivers allocate pages based on buffer_size / PAGE_SIZE and then interact with virtual memory. Those are true VM objects that could grow transparently if PAGE_SIZE grows, and IMO they should share the "fungible memory block" abstraction the VM uses. 
But there are also many instances where PAGE_SIZE just means 4096 is a
good size for me, and struct page is useful for refcounting. Those just
shouldn't use whatever the VM or the cache layer are using and stop
putting additional burden on an already tricky abstraction.

> On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
> > >
> > >     mm: Add kmap_local_folio
> > >
> > >     This allows us to map a portion of a folio.  Callers can only expect
> > >     to access up to the next page boundary.
> > >
> > >     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > >
> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > > index 7902c7d8b55f..55a29c9d562f 100644
> > > --- a/include/linux/highmem-internal.h
> > > +++ b/include/linux/highmem-internal.h
> > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > >  	return __kmap_local_page_prot(page, kmap_prot);
> > >  }
> > >
> > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > > +{
> > > +	struct page *page = &folio->page + offset / PAGE_SIZE;
> > > +	return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> > > +}
> > >
> > > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > > offset relative to start of folio' is the correct interface.  I'm looking
> > > at some users and thinking that maybe 'byte offset relative to start
> > > of file' might be better.  Or perhaps that's just filesystem-centric
> > > thinking.
> > > > Right, this doesn't seem specific to files just because they would be > > the primary users of it. > > Yeah. I think I forgot to cc you on this: > > https://lore.kernel.org/linux-fsdevel/20210325032202.GS1719932@casper.infradead.org/ > > and "byte offset relative to the start of the folio" works just fine: > > + offset = offset_in_folio(folio, diter->pos); > + > +map: > + diter->entry = kmap_local_folio(folio, offset); Yeah, that looks great to me! > > > > But for that to work, we'll need the allocator to produce huge pages > > > > at the necessary rate, too. The current implementation likely won't > > > > scale. Compaction is expensive enough that we have to weigh when to > > > > allocate huge pages for long-lived anon regions, let alone allocate > > > > them for streaming IO cache entries. > > > > > > Heh, I have that as a work item for later this year -- give the page > > > allocator per-cpu lists of compound pages, not just order-0 pages. > > > That'll save us turning compound pages back into buddy pages, only to > > > turn them into compound pages again. > > > > > > I also have a feeling that the page allocator either needs to become a > > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > > > or it needs to become reluctant to break up larger orders. eg if the > > > dcache asks for just one more dentry, it should have to go through at > > > least one round of reclaim before we choose to break up a high-order > > > page to satisfy that request. > > > > Slub already allocates higher-order pages for dentries: > > > > slabinfo - version: 2.1 > > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> > > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 > > > > ^ here > > > > and it could avoid even more internal fragmentation with bigger > > orders. 
It only doesn't because of the overhead of allocating them. > > Oh, yes. Sorry, I didn't explain myself properly. If we have a > lightly-loaded system with terabytes of memory (perhaps all the jobs > it is running are CPU intensive and don't need much memory), the system > has a tendency to clog up with negative dentries. Hundreds of millions > of them. We rely on memory pressure to get rid of them, and when there > finally is memory pressure, it takes literally hours. > > If there were a slight amount of pressure to trim the dcache at the point > when we'd otherwise break up an order-4 page to get an order-2 page, > the system would work much better. Obviously, we do want the dcache to > be able to expand to the point where it's useful, but at the point that > it's no longer useful, we need to trim it. > > It'd probably be better to have the dcache realise that its old entries > aren't useful any more and age them out instead of relying on memory > pressure to remove old entries, so this is probably an unnecessary > digression. It's difficult to identify a universally acceptable line for usefulness of caches other than physical memory pressure. The good thing about the memory pressure threshold is that you KNOW somebody else has immediate use for the memory, and you're justified in recycling and reallocating caches from the cold end. Without that, you'd either have to set an arbitrary size cutoff or an arbitrary aging cutoff (not used in the last minute e.g.). But optimal settings for either of those depend on the workload, and aren't very intuitive to configure. Such a large gap between the smallest object and the overall size of memory is just inherently difficult to manage. More below. > > If the default block size in the allocator were 2M, we'd also get slab > > packing at that granularity, and we wouldn't have to worry about small > > objects breaking huge pages any more than we worry about slab objects > > fragmenting 4k pages today. > > Yup. 
I definitely see the attraction of letting the slab allocator > allocate in larger units. On the other hand, you have to start worrying > about underutilisation of the memory at _some_ size, and I'd argue the > sweet spot is somewhere between 4kB and 2MB today. For example: > > fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0 > > That's currently using 20 pages. If slab were only allocating 2MB slabs > from the page allocator, I'd have 1.9MB of ram unused in that cache. Right, we'd raise internal fragmentation to a worst case of 2M (minus minimum object size) per slab cache. As a ratio of overall memory, this isn't unprecedented, though: my desktop machine has 32G and my phone has 8G. Divide those by 512 for a 4k base page comparison and you get memory sizes common in the mid to late 90s. Our levels of internal fragmentation are historically low, which of course is nice by itself. But that's also what's causing problems in the form of external fragmentation, and why we struggle to produce 2M blocks. It's multitudes easier to free one 2M slab page of consecutively allocated inodes than it is to free 512 batches of different objects with conflicting lifetimes, ages, or potentially even reclaimability. I don't think we'll have much of a choice when it comes to trading some internal fragmentation to deal with our mounting external fragmentation problem. [ Because of the way fragmentation works I also don't think that 1G would be a good foundational block size. It either wastes a crazy amount of memory on internal fragmentation, or you allow external fragmentation and the big blocks deteriorate with uptime anyway. There really is such a thing as a page: a goldilocks quantity of memory, given the overall amount of RAM in a system, that is optimal as a paging unit and intersection point for the fragmentation axes. This never went away. It just isn't 4k anymore on modern systems. 
And we're creating a bit of a mess by adapting various places (page
allocator, slab, page cache, swap code) to today's goldilocks size
while struct page lags behind and doesn't track reality anymore.

I think there is a lot of value in disconnecting places from struct
page that don't need it, but IMO all in the context of the broader goal
of being able to catch up struct page to what the real page is.

We may be able to get rid of the 4k backward-compatible paging units
eventually when we all have 1TB of RAM. But the concept of a page in a
virtual memory system isn't really going anywhere. ]

> > > But people seem attached to being able to use smaller page sizes.
> > > There's that pesky "compatibility" argument.
> >
> > Right, that's why I'm NOT saying we should eliminate the support for
> > 4k chunks in the page cache and page tables. That's still useful if
> > you have lots of small files.
> >
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru, etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio?
;) > > You say page tables don't actually need a struct page, but we do use it. > > struct { /* Page table pages */ > unsigned long _pt_pad_1; /* compound_head */ > pgtable_t pmd_huge_pte; /* protected by page->ptl */ > unsigned long _pt_pad_2; /* mapping */ > union { > struct mm_struct *pt_mm; /* x86 pgds only */ > atomic_t pt_frag_refcount; /* powerpc */ > }; > #if ALLOC_SPLIT_PTLOCKS > spinlock_t *ptl; > #else > spinlock_t ptl; > #endif > }; > > It's a problem because some architectures would really rather > allocate 2KiB page tables (s390) or would like to support 4KiB page > tables on a 64KiB base page size kernel (ppc). > > [actually i misread your comment initially; you meant that page > tables point to PFNs and don't care what struct backs them ... i'm > leaving this in here because it illustrates a problem with change > struct-page-size-to-2MB] Yes, I meant what page table entries point to. The page table (directories) themselves are still 4k as per the architecture, and they'd also have to use smallpage descriptors. I don't immediately see why they couldn't, though. It's not that many, especially if pmd mappings are common (a 4k pmd can map 1G worth of address space).
On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote: > Hi Willy, > > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > > I'm going to respond to some points in detail below, but there are a > > couple of overarching themes that I want to bring out up here. > > > > Grand Vision > > ~~~~~~~~~~~~ > > > > I haven't outlined my long-term plan. Partly because it is a _very_ > > long way off, and partly because I think what I'm doing stands on its > > own. But some of the points below bear on this, so I'll do it now. > > > > Eventually, I want to make struct page optional for allocations. It's too > > small for some things (allocating page tables, for example), and overly > > large for others (allocating a 2MB page, networking page_pool). I don't > > want to change its size in the meantime; having a struct page refer to > > PAGE_SIZE bytes is something that's quite deeply baked in. > > Right, I think it's overloaded and it needs to go away from many > contexts it's used in today. > > I think it describes a real physical thing, though, and won't go away > as a concept. More on that below. I'm at least 90% with you on this, and we're just quibbling over details at this point, I think. > > In broad strokes, I think that having a Power Of Two Allocator > > with Descriptor (POTAD) is a useful foundational allocator to have. > > The specific allocator that we call the buddy allocator is very clever for > > the 1990s, but touches too many cachelines to be good with today's CPUs. > > The generalisation of the buddy allocator to the POTAD lets us allocate > > smaller quantities (eg a 512 byte block) and allocate descriptors which > > differ in size from a struct page. For an extreme example, see xfs_buf > > which is 360 bytes and is the descriptor for an allocation between 512 > > and 65536 bytes. > > I actually disagree with this rather strongly. If anything, the buddy > allocator has turned out to be a pretty poor fit for the foundational > allocator. 
> > On paper, it is elegant and versatile in serving essentially arbitrary > memory blocks. In practice, we mostly just need 4k and 2M chunks from > it. And it sucks at the 2M ones because of the fragmentation caused by > the ungrouped 4k blocks. That's a very Intel-centric way of looking at it. Other architectures support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then every power of four up to 4GB) to more reasonable options like (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves to thinking in terms of what the hardware currently supports. Google have data showing that for their workloads, 32kB is the goldilocks size. I'm sure for some workloads, it's much higher and for others it's lower. But for almost no workload is 4kB the right choice any more, and probably hasn't been since the late 90s. > The great thing about the slab allocator isn't just that it manages > internal fragmentation of the larger underlying blocks. It also groups > related objects by lifetime/age and reclaimability, which dramatically > mitigates the external fragmentation of the memory space. > > The buddy allocator on the other hand has no idea what you want that > 4k block for, and whether it pairs up well with the 4k block it just > handed to somebody else. But the decision it makes in that moment is > crucial for its ability to serve larger blocks later on. > > We do some mobility grouping based on how reclaimable or migratable > the memory is, but it's not the full answer. I don't think that's entirely true. The vast majority of memory in any machine is either anonymous or page cache. The problem is that right now, all anonymous and page cache allocations are order-0 (... or order-9). 
So the buddy allocator can't know anything useful about the pages and will often allocate one order-0 page to the page cache, then allocate its buddy to the slab cache in order to allocate the radix_tree_node to store the pointer to the page in (ok, radix tree nodes come from an order-2 cache, but it still prevents this order-9 page from being assembled). If the movable allocations suddenly start being order-3 and order-4, the unmovable, unreclaimable allocations are naturally going to group down in the lower orders, and we won't have the problem that a single dentry blocks the allocation of an entire 2MB page. The problem, for me, with the ZONE_MOVABLE stuff is that it requires sysadmin intervention to set up. I don't have a ZONE_MOVABLE on my laptop. The allocator should be automatically handling movability hints without my intervention. > A variable size allocator without object type grouping will always > have difficulties producing anything but the smallest block size after > some uptime. It's inherently flawed that way. I think our buddy allocator is flawed, to be sure, but only because it doesn't handle movable hints more aggressively. For example, at the point that a largeish block gets a single non-movable allocation, all the movable allocations within that block should be migrated out. If the offending allocation is freed quickly, it all collapses into a large, useful chunk, or if not, then it provides a sponge to soak up other non-movable allocations. > > What I haven't touched on anywhere in this, is whether a folio is the > > descriptor for all POTA or whether it's specifically the page cache > > descriptor. I like the idea of having separate descriptors for objects > > in the page cache from anonymous or other allocations. But I'm not very > > familiar with the rmap code, and that wants to do things like manipulate > > the refcount on a descriptor without knowing whether it's a file or > > anon page. 
Or neither (eg device driver memory mapped to userspace. > > Or vmalloc memory mapped to userspace. Or ...) > > The rmap code is all about the page type specifics, but once you get > into mmap, page reclaim, page migration, we're dealing with fully > fungible blocks of memory. > > I do like the idea of using actual language typing for the different > things struct page can be today (fs page), but with a common type to > manage the fungible block of memory backing it (allocation state, LRU > & aging state, mmap state etc.) > > New types for the former are an easier sell. We all agree that there > are too many details of the page - including the compound page > implementation detail - inside the cache library, fs code and drivers. > > It's a slightly tougher sell to say that the core VM code itself > (outside the cache library) needs a tighter abstraction for the struct > page building block and the compound page structure. At least at this > time while we're still sorting out how it all may work down the line. > Certainly, we need something to describe fungible memory blocks: > either a struct page that can be 4k and 2M compound, or a new thing > that can be backed by a 2M struct page or a 4k struct smallpage. We > don't know yet, so I would table the new abstraction type for this. > > I generally don't think we want a new type that does everything that > the overloaded struct page already does PLUS the compound > abstraction. Whatever name we pick for it, it'll always be difficult > to wrap your head around such a beast. > > IMO starting with an explicit page cache descriptor that resolves to > struct page inside core VM code (and maybe ->fault) for now makes the > most sense: it greatly mitigates the PAGE_SIZE and tail page issue > right away, and it's not in conflict with, but rather helps work > toward, replacing the fungible memory unit behind it. Right, and that's what struct folio is today. It eliminates tail pages from consideration in a lot of paths. 
I think it also makes sense for struct folio to be used for anonymous memory. But I think that's where it stops; it isn't for Slab, it isn't for page table pages, and it's not for ZONE_DEVICE pages. > There isn't too much overlap or generic code between cache and anon > pages such that sharing a common descriptor would be a huge win (most > overlap is at the fungible memory block level, and the physical struct > page layout of course), so I don't think we should aim for a generic > abstraction for both. They're both on the LRU list, they use a lot of the same PageFlags, they both have a mapcount and refcount, and they both have memcg_data. The only things they really use differently are mapping, index and private. And then we have to consider shmem which uses both in a pretty eldritch way. > As drivers go, I think there are slightly different requirements to > filesystems, too. For filesystems, when the VM can finally do it (and > the file range permits it), I assume we want to rather transparently > increase the unit of data transfer from 4k to 2M. Most drivers that > currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want > us to bump their allocation sizes. If you take a look at my earlier work, you'll see me using a range of sizes in the page cache, starting at 16kB and gradually increasing to (theoretically) 2MB, although the algorithm tended to top out around 256kB. Doing particularly large reads could see 512kB/1MB reads, but it was very hard to hit 2MB in practice. I wasn't too concerned at the time, but my point is that we do want to automatically tune the size of the allocation unit to the workload. An application which reads in 64kB chunks is giving us a pretty clear signal that they want to manage memory in 64kB chunks. 
> > It'd probably be better to have the dcache realise that its old
> > entries aren't useful any more and age them out instead of relying
> > on memory pressure to remove old entries, so this is probably an
> > unnecessary digression.
>
> It's difficult to identify a universally acceptable line for
> usefulness of caches other than physical memory pressure.
>
> The good thing about the memory pressure threshold is that you KNOW
> somebody else has immediate use for the memory, and you're justified
> in recycling and reallocating caches from the cold end.
>
> Without that, you'd either have to set an arbitrary size cutoff or an
> arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> settings for either of those depend on the workload, and aren't very
> intuitive to configure.

For the dentry cache, I think there is a more useful metric, and
that's length of the hash chain. If it gets too long, we're spending
more time walking it than we're saving by having entries cached.
Starting reclaim based on "this bucket of the dcache has twenty
entries in it" would probably work quite well.

> Our levels of internal fragmentation are historically low, which of
> course is nice by itself. But that's also what's causing problems in
> the form of external fragmentation, and why we struggle to produce 2M
> blocks. It's multitudes easier to free one 2M slab page of
> consecutively allocated inodes than it is to free 512 batches of
> different objects with conflicting lifetimes, ages, or potentially
> even reclaimability.

Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
easy enough to rely on. My actual root filesystem:

xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0

So we'd have to be able to free 2048 of those 143k inodes, and they
all have to be consecutive (and aligned).
I suppose we could model that and try to work out how many we'd have
to be able to free in order to get all 2048 in any page free, but I
bet it's a variant of the Birthday Paradox, and we'd find it's
something crazy like half of them.

Without slab gaining the ability to ask users to relocate allocations,
I think any memory sent to slab is never coming back.

So ... even if I accept every part of your vision as the way things
are going to be, I think the folio patchset I have now is a step in
the right direction. I'm going to send a v6 now and hope it's not too
late for this merge window.
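The guess above is easy to test with a toy Monte Carlo model (my simplifying assumptions: ~143k inodes packed into 70 aligned 2MB regions of 2048 objects each, and objects freed in uniformly random order, which real workloads certainly are not). Under those assumptions it comes out even worse than "half of them": the vast majority must be freed before any single region is entirely empty.

```c
#include <assert.h>

#define REGIONS    70		/* ~143k inodes / 2048 per 2MB region */
#define PER_REGION 2048
#define TOTAL      (REGIONS * PER_REGION)

/* Small deterministic PRNG so the run is reproducible. */
static unsigned long rng_state = 42;
static unsigned long rng(void)
{
	rng_state = rng_state * 6364136223846793005UL
			      + 1442695040888963407UL;
	return rng_state >> 33;
}

/* Free objects in uniformly random order; return the fraction freed
 * before any single 2MB region becomes entirely free. */
double fraction_freed_before_any_region_empty(void)
{
	static int order[TOTAL];
	int freed_in_region[REGIONS] = { 0 };
	int i;

	for (i = 0; i < TOTAL; i++)
		order[i] = i;
	/* Fisher-Yates shuffle: a uniformly random free order. */
	for (i = TOTAL - 1; i > 0; i--) {
		int j = (int)(rng() % (unsigned long)(i + 1));
		int tmp = order[i];

		order[i] = order[j];
		order[j] = tmp;
	}
	for (i = 0; i < TOTAL; i++)
		if (++freed_in_region[order[i] / PER_REGION] == PER_REGION)
			return (double)(i + 1) / TOTAL;
	return 1.0;
}
```

The analytic version agrees: a region is fully free after a fraction t of random frees with probability t^2048, so even with 70 regions to choose from, t has to be extremely close to 1 before any of them empties.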
On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Eventually, I want to make struct page optional for allocations. It's too
> > small for some things (allocating page tables, for example), and overly
> > large for others (allocating a 2MB page, networking page_pool). I don't
> > want to change its size in the meantime; having a struct page refer to
> > PAGE_SIZE bytes is something that's quite deeply baked in.
>
> Right, I think it's overloaded and it needs to go away from many
> contexts it's used in today.

FYI, one unrelated usage is that in many contexts we use a struct page
and an offset to describe locations for I/O (block layer, networking,
DMA API). With huge pages and merged I/O buffers this representation
actually becomes increasingly painful.

And a little bit back to the topic: I think the folio as in the
current patchset is incredibly useful and something we need like
yesterday to help file systems and the block layer to cope with huge
and compound pages of all sorts. Once willy sends out a new version
with the accumulated fixes I'm ready to ACK the whole thing.
On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Hi Willy,
> >
> > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> > > I'm going to respond to some points in detail below, but there are a
> > > couple of overarching themes that I want to bring out up here.
> > >
> > > Grand Vision
> > > ~~~~~~~~~~~~
> > >
> > > I haven't outlined my long-term plan. Partly because it is a _very_
> > > long way off, and partly because I think what I'm doing stands on its
> > > own. But some of the points below bear on this, so I'll do it now.
> > >
> > > Eventually, I want to make struct page optional for allocations. It's too
> > > small for some things (allocating page tables, for example), and overly
> > > large for others (allocating a 2MB page, networking page_pool). I don't
> > > want to change its size in the meantime; having a struct page refer to
> > > PAGE_SIZE bytes is something that's quite deeply baked in.
> >
> > Right, I think it's overloaded and it needs to go away from many
> > contexts it's used in today.
> >
> > I think it describes a real physical thing, though, and won't go away
> > as a concept. More on that below.
>
> I'm at least 90% with you on this, and we're just quibbling over details
> at this point, I think.
>
> > > In broad strokes, I think that having a Power Of Two Allocator
> > > with Descriptor (POTAD) is a useful foundational allocator to have.
> > > The specific allocator that we call the buddy allocator is very clever for
> > > the 1990s, but touches too many cachelines to be good with today's CPUs.
> > > The generalisation of the buddy allocator to the POTAD lets us allocate
> > > smaller quantities (eg a 512 byte block) and allocate descriptors which
> > > differ in size from a struct page. For an extreme example, see xfs_buf
> > > which is 360 bytes and is the descriptor for an allocation between 512
> > > and 65536 bytes.
> > I actually disagree with this rather strongly. If anything, the buddy
> > allocator has turned out to be a pretty poor fit for the foundational
> > allocator.
> >
> > On paper, it is elegant and versatile in serving essentially arbitrary
> > memory blocks. In practice, we mostly just need 4k and 2M chunks from
> > it. And it sucks at the 2M ones because of the fragmentation caused by
> > the ungrouped 4k blocks.
>
> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> then every power of four up to 4GB) to more reasonable options like
> (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> constrain ourselves to thinking in terms of what the hardware
> currently supports. Google have data showing that for their
> workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> it's much higher and for others it's lower. But for almost no
> workload is 4kB the right choice any more, and probably hasn't been
> since the late 90s.

You missed my point entirely. It's not about the exact page sizes,
it's about the fragmentation issue when you mix variable-sized blocks
without lifetime grouping.

Anyway, we digressed quite far here. My argument was simply that it's
conceivable we'll switch to a default allocation block and page size
that is larger than the smallest paging size supported by the CPU and
the kernel. (Various architectures might support multiple page sizes,
but once you pick one, that's the smallest quantity the kernel pages.)
That makes "bundle of pages" a short-sighted abstraction, and folio a
poor name for pageable units.
I might be wrong about what happens to PAGE_SIZE eventually (even
though your broader arguments around allocator behavior and
fragmentation don't seem to line up with my observations from
production systems, or the evolution of how we manage allocations of
different sizes) - but you also haven't made a good argument why the
API *should* continue to imply we're dealing with one or more pages.

Yes, it's a bit bikesheddy. But you're proposing an abstraction for
one of the most fundamental data structures in the operating system,
with tens of thousands of instances in almost all core subsystems.
"Bundle of pages (for now) with filesystem data (and maybe anon data
since it's sort of convenient in terms of data structure, for now)"
just doesn't make me go "Yeah, that's it."

I would understand cache_entry for the cache; mem for cache and file
(that discussion trailed off); pageable if we want to imply sizing
and alignment constraints based on the underlying MMU. I would even
prefer kerb, because at least it wouldn't be misleading if we do have
non-struct page backing in the future.

> > The great thing about the slab allocator isn't just that it manages
> > internal fragmentation of the larger underlying blocks. It also groups
> > related objects by lifetime/age and reclaimability, which dramatically
> > mitigates the external fragmentation of the memory space.
> >
> > The buddy allocator on the other hand has no idea what you want that
> > 4k block for, and whether it pairs up well with the 4k block it just
> > handed to somebody else. But the decision it makes in that moment is
> > crucial for its ability to serve larger blocks later on.
> >
> > We do some mobility grouping based on how reclaimable or migratable
> > the memory is, but it's not the full answer.
>
> I don't think that's entirely true. The vast majority of memory in
> any machine is either anonymous or page cache. The problem is that
> right now, all anonymous and page cache allocations are order-0 (...
> or order-9). So the buddy allocator can't know anything useful about
> the pages and will often allocate one order-0 page to the page cache,
> then allocate its buddy to the slab cache in order to allocate the
> radix_tree_node to store the pointer to the page in (ok, radix tree
> nodes come from an order-2 cache, but it still prevents this order-9
> page from being assembled).
>
> If the movable allocations suddenly start being order-3 and order-4,
> the unmovable, unreclaimable allocations are naturally going to group
> down in the lower orders, and we won't have the problem that a single
> dentry blocks the allocation of an entire 2MB page.

I don't follow what you're saying here.

> > A variable size allocator without object type grouping will always
> > have difficulties producing anything but the smallest block size after
> > some uptime. It's inherently flawed that way.
>
> I think our buddy allocator is flawed, to be sure, but only because
> it doesn't handle movable hints more aggressively. For example, at
> the point that a largeish block gets a single non-movable allocation,
> all the movable allocations within that block should be migrated out.
> If the offending allocation is freed quickly, it all collapses into a
> large, useful chunk, or if not, then it provides a sponge to soak up
> other non-movable allocations.

The object type implies aging rules and typical access patterns that
are not going to be captured purely by migratability. As such, the
migratetype alone will always perform worse than full type grouping.

E.g. a burst of inodes and dentries allocations can claim a large
number of blocks from movable to reclaimable, which will then also be
used to serve concurrent allocations of a different type that may have
much longer lifetimes. After the inodes and dentries disappear again,
you're stuck with very sparsely populated reclaimable blocks.
They can still be reclaimed, but they won't free up as easily as a
contiguous run of bulk-aged inodes and dentries.

You also cannot easily move reclaimable objects out of the block when
an unmovable allocation claims it the same way, so this is sort of a
moot proposal anyway.

The slab allocator isn't a guarantee, but I don't see why you're
arguing we should leave additional lifetime/usage hints on the table.

> > As drivers go, I think there are slightly different requirements to
> > filesystems, too. For filesystems, when the VM can finally do it (and
> > the file range permits it), I assume we want to rather transparently
> > increase the unit of data transfer from 4k to 2M. Most drivers that
> > currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
> > us to bump their allocation sizes.
>
> If you take a look at my earlier work, you'll see me using a range of
> sizes in the page cache, starting at 16kB and gradually increasing to
> (theoretically) 2MB, although the algorithm tended to top out around
> 256kB. Doing particularly large reads could see 512kB/1MB reads, but
> it was very hard to hit 2MB in practice. I wasn't too concerned at the
> time, but my point is that we do want to automatically tune the size
> of the allocation unit to the workload. An application which reads in
> 64kB chunks is giving us a pretty clear signal that they want to
> manage memory in 64kB chunks.

You missed my point here, but it sounds like we agree that drivers who
just want a fixed buffer should not use the same type that filesystems
use for dynamic paging units.

> > > It'd probably be better to have the dcache realise that its old
> > > entries aren't useful any more and age them out instead of relying
> > > on memory pressure to remove old entries, so this is probably an
> > > unnecessary digression.
> >
> > It's difficult to identify a universally acceptable line for
> > usefulness of caches other than physical memory pressure.
> > The good thing about the memory pressure threshold is that you KNOW
> > somebody else has immediate use for the memory, and you're justified
> > in recycling and reallocating caches from the cold end.
> >
> > Without that, you'd either have to set an arbitrary size cutoff or an
> > arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> > settings for either of those depend on the workload, and aren't very
> > intuitive to configure.
>
> For the dentry cache, I think there is a more useful metric, and
> that's length of the hash chain. If it gets too long, we're spending
> more time walking it than we're saving by having entries cached.
> Starting reclaim based on "this bucket of the dcache has twenty
> entries in it" would probably work quite well.

That might work for this cache, but it's not a generic solution to
fragmentation caused by cache positions building in the absence of
memory pressure.

> > Our levels of internal fragmentation are historically low, which of
> > course is nice by itself. But that's also what's causing problems in
> > the form of external fragmentation, and why we struggle to produce 2M
> > blocks. It's multitudes easier to free one 2M slab page of
> > consecutively allocated inodes than it is to free 512 batches of
> > different objects with conflicting lifetimes, ages, or potentially
> > even reclaimability.
>
> Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
> easy enough to rely on. My actual root filesystem:
>
> xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0
>
> So we'd have to be able to free 2048 of those 143k inodes, and they
> all have to be consecutive (and aligned). I suppose we could model
> that and try to work out how many we'd have to be able to free in
> order to get all 2048 in any page free, but I bet it's a variant of
> the Birthday Paradox, and we'd find it's something crazy like half of
> them.

How is it different than freeing a 4k page in 1995?
The descriptor size itself may not have scaled at the same rate as
overall memory size. But that also means the cache position itself is
much less a concern in terms of memory consumed and fragmented. Case
in point, this is 141M. Yes, probably with a mixture of some hot and a
long tail of cold entries. It's not really an interesting reclaim
target.

When slab cache positions become a reclaim concern, it's usually when
they spike due to a change in the workload. And then you tend to get
contiguous runs of objects with a similar age.

> Without slab gaining the ability to ask users to relocate allocations,
> I think any memory sent to slab is never coming back.

Not sure what data you're basing this on.

> So ... even if I accept every part of your vision as the way things
> are going to be, I think the folio patchset I have now is a step in the
> right direction. I'm going to send a v6 now and hope it's not too late
> for this merge window.

I don't think folio as an abstraction is cooked enough to replace such
a major part of the kernel with it. So I'm against merging it now.

I would really like to see a better definition of what it actually
represents, instead of a fluid combination of implementation details
and conveniences.
On Wed, Mar 31, 2021 at 02:14:00PM -0400, Johannes Weiner wrote:
> Anyway, we digressed quite far here. My argument was simply that it's
> conceivable we'll switch to a default allocation block and page size
> that is larger than the smallest paging size supported by the CPU and
> the kernel. (Various architectures might support multiple page sizes,
> but once you pick one, that's the smallest quantity the kernel pages.)

We've had several attempts in the past to make 'struct page' refer to
a different number of bytes than the-size-of-a-single-pte, and they've
all failed in one way or another. I don't think changing PAGE_SIZE to
any other size is reasonable. Maybe we have a larger allocation unit
in the future, maybe we do something else, but that should have its
own name, not 'struct page'.

I think the shortest path to getting what you want is having a
superpage allocator that the current page allocator can allocate from.
When a superpage is allocated from the superpage allocator, we
allocate an array of struct pages for it.

> I don't think folio as an abstraction is cooked enough to replace such
> a major part of the kernel with it. So I'm against merging it now.
>
> I would really like to see a better definition of what it actually
> represents, instead of a fluid combination of implementation details
> and conveniences.

Here's the current kernel-doc for it:

/**
 * struct folio - Represents a contiguous set of bytes.
 * @flags: Identical to the page flags.
 * @lru: Least Recently Used list; tracks how recently this folio was used.
 * @mapping: The file this page belongs to, or refers to the anon_vma for
 *    anonymous pages.
 * @index: Offset within the file, in units of pages. For anonymous pages,
 *    this is the index from the beginning of the mmap.
 * @private: Filesystem per-folio data (see attach_folio_private()).
 *    Used for swp_entry_t if FolioSwapCache().
 * @_mapcount: How many times this folio is mapped to userspace. Use
 *    folio_mapcount() to access it.
 * @_refcount: Number of references to this folio. Use folio_ref_count()
 *    to read it.
 * @memcg_data: Memory Control Group data.
 *
 * A folio is a physically, virtually and logically contiguous set
 * of bytes. It is a power-of-two in size, and it is aligned to that
 * same power-of-two. It is at least as large as %PAGE_SIZE. If it is
 * in the page cache, it is at a file offset which is a multiple of that
 * power-of-two.
 */
struct folio {
	/* private: don't document the anon union */
	union {
		struct {
			/* public: */
			unsigned long flags;
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			unsigned long private;
			atomic_t _mapcount;
			atomic_t _refcount;
#ifdef CONFIG_MEMCG
			unsigned long memcg_data;
#endif
			/* private: the union with struct page is transitional */
		};
		struct page page;
	};
};
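The transitional union above is only sound if every folio field lands at exactly the same offset as its struct page counterpart, and that property can be checked at compile time. Below is a hedged userspace sketch of that idea with simplified stand-in types (the kernel's real struct page is a much larger union, and the `FOLIO_MATCH` macro name here is my own shorthand, not necessarily what the patchset uses):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types -- illustrative only. */
struct list_head { struct list_head *next, *prev; };
typedef struct { int counter; } atomic_t;
typedef unsigned long pgoff_t;

struct page {
	unsigned long flags;
	struct list_head lru;
	void *mapping;
	pgoff_t index;
	unsigned long private;
	atomic_t _mapcount;
	atomic_t _refcount;
};

struct folio {
	union {
		struct {
			unsigned long flags;
			struct list_head lru;
			void *mapping;
			pgoff_t index;
			unsigned long private;
			atomic_t _mapcount;
			atomic_t _refcount;
		};
		struct page page;
	};
};

/* The union is only safe if each folio field sits exactly on its
 * struct page counterpart; let the compiler prove it. */
#define FOLIO_MATCH(f) \
	_Static_assert(offsetof(struct folio, f) == offsetof(struct page, f), \
		       "folio field offset mismatch")

FOLIO_MATCH(flags);
FOLIO_MATCH(lru);
FOLIO_MATCH(mapping);
FOLIO_MATCH(index);
FOLIO_MATCH(private);
FOLIO_MATCH(_mapcount);
FOLIO_MATCH(_refcount);
```

If anyone reorders struct page without updating struct folio, the build fails rather than silently corrupting memory through the union.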
On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> then every power of four up to 4GB) to more reasonable options like
> (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> constrain ourselves to thinking in terms of what the hardware
> currently supports. Google have data showing that for their
> workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> it's much higher and for others it's lower. But for almost no
> workload is 4kB the right choice any more, and probably hasn't been
> since the late 90s.

Out of curiosity I looked at the distribution of file sizes in the
kernel tree:

71455 files total

0--4Kb       36702
4--8Kb       11820
8--16Kb      10066
16--32Kb      6984
32--64Kb      3804
64--128Kb     1498
128--256Kb     393
256--512Kb     108
512Kb--1Mb      35
1--2Mb          25
2--4Mb           5
4--6Mb           7
6--8Mb           4
12Mb             2
14Mb             1
16Mb             1

... incidentally, everything bigger than 1.2Mb lives^Wshambles under
drivers/gpu/drm/amd/include/asic_reg/

Page size    Footprint
4Kb          1128Mb
8Kb          1324Mb
16Kb         1764Mb
32Kb         2739Mb
64Kb         4832Mb
128Kb        9191Mb
256Kb        18062Mb
512Kb        35883Mb
1Mb          71570Mb
2Mb          142958Mb

So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
pages would be... interesting.
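The footprint column is simple internal-fragmentation arithmetic: each file's size rounded up to the page size, summed over all files. A small sketch of that computation (the function name and the sizes in the test are made up for illustration; zero-length files cost nothing in this model):

```c
#include <assert.h>
#include <stddef.h>

/* Total cache footprint of a set of files when each file's tail is
 * padded out to a full page -- the computation behind the table
 * above. */
static unsigned long long cache_footprint(const unsigned long long *file_sizes,
					  size_t nfiles,
					  unsigned long long page_size)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < nfiles; i++)
		total += (file_sizes[i] + page_size - 1) / page_size * page_size;
	return total;
}
```

With half the tree under 4kB, every doubling of the page size nearly doubles the total, which is exactly the ratio progression visible in the table.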
On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> > then every power of four up to 4GB) to more reasonable options like
> > (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> > constrain ourselves to thinking in terms of what the hardware
> > currently supports. Google have data showing that for their
> > workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> > it's much higher and for others it's lower. But for almost no
> > workload is 4kB the right choice any more, and probably hasn't been
> > since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
>
> 71455 files total
>
> 0--4Kb       36702
> 4--8Kb       11820
> 8--16Kb      10066
> 16--32Kb      6984
> 32--64Kb      3804
> 64--128Kb     1498
> 128--256Kb     393
> 256--512Kb     108
> 512Kb--1Mb      35
> 1--2Mb          25
> 2--4Mb           5
> 4--6Mb           7
> 6--8Mb           4
> 12Mb             2
> 14Mb             1
> 16Mb             1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/

I'm just going to edit this table to add a column indicating ratio to
previous size:

> Page size    Footprint
> 4Kb          1128Mb
> 8Kb          1324Mb     1.17
> 16Kb         1764Mb     1.33
> 32Kb         2739Mb     1.55
> 64Kb         4832Mb     1.76
> 128Kb        9191Mb     1.90
> 256Kb        18062Mb    1.96
> 512Kb        35883Mb    1.98
> 1Mb          71570Mb    1.994
> 2Mb          142958Mb   1.997
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
> pages would be... interesting.

Yep, that's why I opted for a "start out slowly and let readahead tell
me when to increase the page size" approach. I think Johannes' real
problem is that slab and page cache / anon pages are getting
intermingled.
We could solve this by having slab allocate 2MB pages from the page allocator and then split them up internally (so not all of that 2MB necessarily goes to a single slab cache, but all of that 2MB goes to some slab cache).
On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> > then every power of four up to 4GB) to more reasonable options like
> > (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> > constrain ourselves to thinking in terms of what the hardware
> > currently supports. Google have data showing that for their
> > workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> > it's much higher and for others it's lower. But for almost no
> > workload is 4kB the right choice any more, and probably hasn't been
> > since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
>
> 71455 files total
>
> 0--4Kb       36702
> 4--8Kb       11820
> 8--16Kb      10066
> 16--32Kb      6984
> 32--64Kb      3804
> 64--128Kb     1498
> 128--256Kb     393
> 256--512Kb     108
> 512Kb--1Mb      35
> 1--2Mb          25
> 2--4Mb           5
> 4--6Mb           7
> 6--8Mb           4
> 12Mb             2
> 14Mb             1
> 16Mb             1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/
>
> Page size    Footprint
> 4Kb          1128Mb
> 8Kb          1324Mb
> 16Kb         1764Mb
> 32Kb         2739Mb
> 64Kb         4832Mb
> 128Kb        9191Mb
> 256Kb        18062Mb
> 512Kb        35883Mb
> 1Mb          71570Mb
> 2Mb          142958Mb
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
> pages would be... interesting.

Right, I don't see us getting rid of 4k cache entries anytime soon.
Even 32k pages would double the footprint here.

The issue is just that at the other end of the spectrum we have IO
devices that do 10GB/s, which corresponds to 2.6 million pages per
second. At such data rates we are currently CPU-limited because of the
pure transaction overhead in page reclaim. Workloads like this tend to
use much larger files, and would benefit from a larger paging unit.
Likewise, most production workloads in cloud servers have enormous
anonymous regions and large executables that greatly benefit from
fewer page table levels and bigger TLB entries.

Today, fragmentation prevents the page allocator from producing 2MB
blocks at a satisfactory rate and allocation latency. It's not
feasible to allocate 2M inside page faults for example; getting huge
page coverage for the page cache will be even more difficult.

I'm not saying we should get rid of 4k cache entries. Rather, I'm
wondering out loud whether longer-term we'd want to change the default
page size to 2M, and implement the 4k cache entries, which we clearly
continue to need, with a slab style allocator on top. The idea being
that it'll do a better job at grouping cache entries with other cache
entries of a similar lifetime than the untyped page allocator does
naturally, and so make fragmentation a whole lot more manageable.

(I'm using x86 page sizes as examples because they matter to me. But
there is an architecture independent discrepancy between the smallest
cache entries we must continue to support, and larger blocks / huge
pages that we increasingly rely on as first class pages.)
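The "slab style allocator for 4k cache entries on top of 2M blocks" idea can be sketched as a toy bitmap suballocator. Everything here is a hypothetical design to illustrate the lifetime-grouping argument, not an existing kernel interface; the payoff is the last function: an emptied block can be handed back to the buddy allocator whole, as one contiguous, aligned 2MB unit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: 4k cache entries carved out of one 2M block, so entries
 * allocated together -- and thus likely to die together -- share one
 * physically contiguous block.  Hypothetical design for illustration. */
#define BLOCK_SHIFT 21		/* 2MB block */
#define ENTRY_SHIFT 12		/* 4kB entry */
#define ENTRIES_PER_BLOCK (1u << (BLOCK_SHIFT - ENTRY_SHIFT))	/* 512 */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct entry_block {
	unsigned char *mem;	/* 2MB of backing memory */
	unsigned long used[ENTRIES_PER_BLOCK / (8 * sizeof(unsigned long))];
	unsigned int nr_used;
};

static void *entry_alloc(struct entry_block *b)
{
	unsigned int i;

	for (i = 0; i < ENTRIES_PER_BLOCK; i++) {
		unsigned long *w = &b->used[i / BITS_PER_WORD];
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(*w & bit)) {
			*w |= bit;
			b->nr_used++;
			return b->mem + ((size_t)i << ENTRY_SHIFT);
		}
	}
	return NULL;		/* block full */
}

static void entry_free(struct entry_block *b, void *p)
{
	size_t i = (size_t)((unsigned char *)p - b->mem) >> ENTRY_SHIFT;

	b->used[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
	b->nr_used--;
}

/* An empty block can go back to the buddy allocator as a whole 2MB
 * unit -- no hunting for 512 scattered, individually-owned pages. */
static int block_reclaimable(const struct entry_block *b)
{
	return b->nr_used == 0;
}
```

A real design would of course need per-block locking, a partial-block freelist, and a policy for which callers share a block; the sketch only shows why whole-block reclaim becomes trivial once entries are grouped.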