[v10,00/33] Memory folios

Message ID 20210511214735.1836149-1-willy@infradead.org (mailing list archive)

Message

Matthew Wilcox May 11, 2021, 9:47 p.m. UTC
Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
benefit from a larger "page size".  As an example, an earlier iteration
of this idea which used compound pages (and wasn't particularly tuned)
got a 7% performance boost when compiling the kernel.

Using compound pages or THPs exposes a weakness of our type system.
Functions are often unprepared for compound pages to be passed to them,
and may only act on PAGE_SIZE chunks.  Even functions which are aware of
compound pages may expect a head page, and do the wrong thing if passed
a tail page.

We also waste a lot of instructions ensuring that we're not looking at
a tail page.  Almost every call to PageFoo() contains one or more hidden
calls to compound_head().  This also happens for get_page(), put_page()
and many more functions.  There does not appear to be a way to tell gcc
that it can cache the result of compound_head(), nor is there a way to
tell it that compound_head() is idempotent.
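
To make the hidden cost concrete, here is a rough sketch of the pattern
(the _sketch names are made up for illustration; the real definitions
live in page-flags.h, page_ref.h and friends):

static inline int PageUptodate_sketch(struct page *page)
{
	page = compound_head(page);	/* hidden call on every flag test */
	return test_bit(PG_uptodate, &page->flags);
}

static inline void get_page_sketch(struct page *page)
{
	page = compound_head(page);	/* ... and again here */
	page_ref_inc(page);
}

Because the compiler can't tell that these calls are redundant, every
one of them reloads page->compound_head and re-tests the tail bit.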

This patch series uses a new type, the struct folio, to manage memory.
It provides some basic infrastructure that's worthwhile in its own right,
shrinking the kernel by about 5kB of text.
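
To give a flavour of the idea, here is a minimal sketch (the real
struct folio added by patch 1 has more to it than this, but this is
the shape of it): a folio can only ever refer to a head page, so a
function that takes a folio never needs to call compound_head().

struct folio {
	struct page page;	/* always a head (or order-0) page */
};

/* roughly what the real page_folio() boils down to */
static inline struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

Everything that takes a struct folio * gets a head page by construction,
which is what lets the hidden compound_head() calls disappear.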

Since v9:
 - Rebase onto mmotm 2021-05-10-21-46
 - Add folio_memcg() definition for !MEMCG (intel lkp)
 - Change folio->private from an unsigned long to a void *
 - Use folio_page() to implement folio_file_page()
 - Add folio_try_get() and folio_try_get_rcu()
 - Trim back down to just the first few patches, which are better-reviewed.
v9: https://lore.kernel.org/linux-mm/20210505150628.111735-1-willy@infradead.org/
v8: https://lore.kernel.org/linux-mm/20210430180740.2707166-1-willy@infradead.org/

Matthew Wilcox (Oracle) (33):
  mm: Introduce struct folio
  mm: Add folio_pgdat and folio_zone
  mm/vmstat: Add functions to account folio statistics
  mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
  mm: Add folio reference count functions
  mm: Add folio_put
  mm: Add folio_get
  mm: Add folio_try_get_rcu
  mm: Add folio flag manipulation functions
  mm: Add folio_young and folio_idle
  mm: Handle per-folio private data
  mm/filemap: Add folio_index, folio_file_page and folio_contains
  mm/filemap: Add folio_next_index
  mm/filemap: Add folio_offset and folio_file_offset
  mm/util: Add folio_mapping and folio_file_mapping
  mm: Add folio_mapcount
  mm/memcg: Add folio wrappers for various functions
  mm/filemap: Add folio_unlock
  mm/filemap: Add folio_lock
  mm/filemap: Add folio_lock_killable
  mm/filemap: Add __folio_lock_async
  mm/filemap: Add __folio_lock_or_retry
  mm/filemap: Add folio_wait_locked
  mm/swap: Add folio_rotate_reclaimable
  mm/filemap: Add folio_end_writeback
  mm/writeback: Add folio_wait_writeback
  mm/writeback: Add folio_wait_stable
  mm/filemap: Add folio_wait_bit
  mm/filemap: Add folio_wake_bit
  mm/filemap: Convert page wait queues to be folios
  mm/filemap: Add folio private_2 functions
  fs/netfs: Add folio fscache functions
  mm: Add folio_mapped

 Documentation/core-api/mm-api.rst           |   4 +
 Documentation/filesystems/netfs_library.rst |   2 +
 fs/afs/write.c                              |   9 +-
 fs/cachefiles/rdwr.c                        |  16 +-
 fs/io_uring.c                               |   2 +-
 include/linux/memcontrol.h                  |  63 ++++
 include/linux/mm.h                          | 174 ++++++++--
 include/linux/mm_types.h                    |  71 ++++
 include/linux/mmdebug.h                     |  20 ++
 include/linux/netfs.h                       |  77 +++--
 include/linux/page-flags.h                  | 230 ++++++++++---
 include/linux/page_idle.h                   |  99 +++---
 include/linux/page_ref.h                    | 158 ++++++++-
 include/linux/pagemap.h                     | 358 ++++++++++++--------
 include/linux/swap.h                        |   7 +-
 include/linux/vmstat.h                      | 107 ++++++
 mm/Makefile                                 |   2 +-
 mm/filemap.c                                | 315 ++++++++---------
 mm/folio-compat.c                           |  43 +++
 mm/internal.h                               |   1 +
 mm/memory.c                                 |   8 +-
 mm/page-writeback.c                         |  72 ++--
 mm/page_io.c                                |   4 +-
 mm/swap.c                                   |  18 +-
 mm/swapfile.c                               |   8 +-
 mm/util.c                                   |  59 ++--
 26 files changed, 1374 insertions(+), 553 deletions(-)
 create mode 100644 mm/folio-compat.c

Comments

Matthew Wilcox May 13, 2021, 2:50 p.m. UTC | #1
On Tue, May 11, 2021 at 10:47:02PM +0100, Matthew Wilcox (Oracle) wrote:
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page.  Almost every call to PageFoo() contains one or more hidden
> calls to compound_head().  This also happens for get_page(), put_page()
> and many more functions.  There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.

I instrumented _compound_head() on a test VM:

+++ b/include/linux/page-flags.h
@@ -179,10 +179,13 @@ enum pageflags {

 #ifndef __GENERATING_BOUNDS_H

+extern atomic_t chcc;
+
 static inline unsigned long _compound_head(const struct page *page)
 {
        unsigned long head = READ_ONCE(page->compound_head);

+       atomic_inc(&chcc);
        if (unlikely(head & 1))
                return head - 1;
        return (unsigned long)page;

so the counter catches the calls made via both compound_head() and
page_folio().  Between patch 8/96 in folio_v9 and patch 96/96, the
number of calls in an idle VM went down from almost 7k/s to just over
5k/s, a drop of about 25%.
William Kucharski May 15, 2021, 10:26 a.m. UTC | #2
I have a nit on part 01/33, but will respond directly there.

For the series:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Matteo Croce June 4, 2021, 1:07 a.m. UTC | #3
On Tue, 11 May 2021 22:47:02 +0100
"Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> We also waste a lot of instructions ensuring that we're not looking at
> a tail page.  Almost every call to PageFoo() contains one or more
> hidden calls to compound_head().  This also happens for get_page(),
> put_page() and many more functions.  There does not appear to be a
> way to tell gcc that it can cache the result of compound_head(), nor
> is there a way to tell it that compound_head() is idempotent.
> 

Maybe it's not effective in all situations, but the following hint to
the compiler seems to have an effect, at least according to bloat-o-meter:


--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -179,7 +179,7 @@ enum pageflags {
 
 struct page;   /* forward declaration */
 
-static inline struct page *compound_head(struct page *page)
+static inline __attribute_const__ struct page *compound_head(struct page *page)
 {
        unsigned long head = READ_ONCE(page->compound_head);
 

$ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)
Function                                     old     new   delta
ntfs_mft_record_alloc                      14414   16627   +2213
migrate_pages                               8891   10819   +1928
ext2_get_page.isra                          1029    2343   +1314
kfence_init                                  180    1331   +1151
page_remove_rmap                             754    1893   +1139
f2fs_fsync_node_pages                       4378    5406   +1028
deferred_split_huge_page                    1279    2286   +1007
relock_page_lruvec_irqsave                     -     975    +975
f2fs_file_write_iter                        3508    4408    +900
__pagevec_lru_add                            704    1311    +607
[...]
pagevec_move_tail_fn                        5333    3215   -2118
__activate_page                             6183    4021   -2162
__unmap_and_move                            2190       -   -2190
__page_cache_release                        4738    2547   -2191
migrate_page_states                         7088    4842   -2246
lru_deactivate_fn                           5925    3652   -2273
move_pages_to_lru                           7259    4980   -2279
check_move_unevictable_pages                7131    4594   -2537
release_pages                               6940    4386   -2554
lru_lazyfree_fn                             6798    4198   -2600
ntfs_mft_record_format                      2940       -   -2940
lru_deactivate_file_fn                      9220    5631   -3589
shrink_page_list                           20653   15749   -4904
page_memcg                                  5149     193   -4956
Total: Before=388863526, After=388686517, chg -0.05%

I don't know whether it breaks anything, though, nor whether it gives
any real improvement.
Matthew Wilcox June 4, 2021, 2:13 a.m. UTC | #4
On Fri, Jun 04, 2021 at 03:07:12AM +0200, Matteo Croce wrote:
> On Tue, 11 May 2021 22:47:02 +0100
> "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> 
> > We also waste a lot of instructions ensuring that we're not looking at
> > a tail page.  Almost every call to PageFoo() contains one or more
> > hidden calls to compound_head().  This also happens for get_page(),
> > put_page() and many more functions.  There does not appear to be a
> > way to tell gcc that it can cache the result of compound_head(), nor
> > is there a way to tell it that compound_head() is idempotent.
> > 
> 
> Maybe it's not effective in all situations but the following hint to
> the compiler seems to have an effect, at least according to bloat-o-meter:

It definitely has an effect ;-)  But the gcc documentation for the
'const' attribute warns:

     Note that a function that has pointer arguments and examines the
     data pointed to must _not_ be declared 'const' if the pointed-to
     data might change between successive invocations of the function.
     In general, since a function cannot distinguish data that might
     change from data that cannot, const functions should never take
     pointer or, in C++, reference arguments.  Likewise, a function that
     calls a non-const function usually must not be const itself.

So that's not going to work because a call to split_huge_page() won't
tell the compiler that it's changed.

Reading the documentation, we might be able to get away with marking the
function as pure:

     The 'pure' attribute imposes similar but looser restrictions on a
     function's definition than the 'const' attribute: 'pure' allows the
     function to read any non-volatile memory, even if it changes in
     between successive invocations of the function.

although that's going to miss opportunities, since taking a lock will
modify the contents of struct page, meaning the compiler won't cache
the results of compound_head().
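
To make the failure mode concrete, here is a userspace toy (entirely
made up; it has nothing to do with the kernel's real definitions) that
can end up using a stale value in exactly the way the documentation
warns about:

#include <stdio.h>

struct page { unsigned long compound_head; };

/* 'const' promises gcc the result depends only on the argument value */
static __attribute__((const, noinline))
unsigned long head_of(const struct page *page)
{
	return page->compound_head;	/* ...but it reads memory */
}

int main(void)
{
	struct page p = { .compound_head = 0 };
	unsigned long before = head_of(&p);

	p.compound_head = 3;	/* stand-in for split_huge_page() */

	unsigned long after = head_of(&p);

	/* with -O2, gcc may reuse 'before' here and print "0 0" */
	printf("%lu %lu\n", before, after);
	return 0;
}

Built with -O2, gcc is entitled to fold the second call into the first
and hand back the stale value, because 'const' promised it that nothing
the function reads can change between calls.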

> $ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
> add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)

I assume this is an allyesconfig kernel?    I think it's a good
indication of how much opportunity there is.
Matteo Croce June 8, 2021, 2:56 p.m. UTC | #5
On Fri, Jun 4, 2021 at 4:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> > $ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
> > add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)
>
> I assume this is an allyesconfig kernel?    I think it's a good
> indication of how much opportunity there is.
>

Yes, it's an allyesconfig kernel.
I did the same with pure:

$ git diff
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..548b72b46eb1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -179,7 +179,7 @@ enum pageflags {
 
 struct page;   /* forward declaration */
 
-static inline struct page *compound_head(struct page *page)
+static inline __pure struct page *compound_head(struct page *page)
 {
        unsigned long head = READ_ONCE(page->compound_head);
 

$ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
add/remove: 3/13 grow/shrink: 63/689 up/down: 20910/-192081 (-171171)
Function                                     old     new   delta
ntfs_mft_record_alloc                      14414   16627   +2213
migrate_pages                               8891   10819   +1928
ext2_get_page.isra                          1029    2343   +1314
kfence_init                                  180    1331   +1151
page_remove_rmap                             754    1893   +1139
f2fs_fsync_node_pages                       4378    5406   +1028
[...]
migrate_page_states                         7088    4842   -2246
ntfs_mft_record_format                      2940       -   -2940
lru_deactivate_file_fn                      9220    6277   -2943
shrink_page_list                           20653   15749   -4904
page_memcg                                  5149     193   -4956
Total: Before=388869713, After=388698542, chg -0.04%

$ ls -l vmlinux.o.orig vmlinux.o
-rw-rw-r-- 1 mcroce mcroce 1295502680 Jun  8 16:47 vmlinux.o
-rw-rw-r-- 1 mcroce mcroce 1295934624 Jun  8 16:28 vmlinux.o.orig

vmlinux.o is ~420 KiB smaller.