
[00/25] Page folios

Message ID: 20201216182335.27227-1-willy@infradead.org

Message

Matthew Wilcox Dec. 16, 2020, 6:23 p.m. UTC
One of the great things about compound pages is that when you try to
do various operations on a tail page, it redirects to the head page and
everything Just Works.  One of the awful things is how much we pay for
that simplicity.  Here's an example, end_page_writeback():

        if (PageReclaim(page)) {
                ClearPageReclaim(page);
                rotate_reclaimable_page(page);
        }
        get_page(page);
        if (!test_clear_page_writeback(page))
                BUG();

        smp_mb__after_atomic();
        wake_up_page(page, PG_writeback);
        put_page(page);

That all looks very straightforward, but if you dive into the disassembly,
you see that there are four calls to compound_head() in this function
(PageReclaim(), ClearPageReclaim(), get_page() and put_page()).  It's
all for nothing, because if anyone does call this routine with a tail
page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
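
For reference, the hidden cost is compound_head() itself, which in
current kernels looks roughly like this (a sketch from memory, not part
of this series):

static inline struct page *compound_head(struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        /* Bit 0 of compound_head is set on tail pages; the rest of
         * the word points at the head page. */
        if (unlikely(head & 1))
                return (struct page *)(head - 1);
        return page;
}

Each of PageReclaim(), ClearPageReclaim(), get_page() and put_page()
inlines a copy of it, which is the mov/lea/and/cmove sequence visible
in the disassembly below.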

I'm not really a CPU person, but I imagine there's some kind of dependency
here that sucks too:

    1fd7:       48 8b 57 08             mov    0x8(%rdi),%rdx
    1fdb:       48 8d 42 ff             lea    -0x1(%rdx),%rax
    1fdf:       83 e2 01                and    $0x1,%edx
    1fe2:       48 0f 44 c7             cmove  %rdi,%rax
    1fe6:       f0 80 60 02 fb          lock andb $0xfb,0x2(%rax)

Sure, it's going to be cache hot, but that cmove has to execute before
the lock andb.

I would like to introduce a new concept that I call a Page Folio.
Or just struct folio to its friends.  Here it is,
struct folio {
        struct page page;
};

A folio is a struct page which is guaranteed not to be a tail page.
So it's either a head page or a base (order-0) page.  That means
we don't have to call compound_head() on it and we save massively.
end_page_writeback() reduces from four calls to compound_head() to just
one (at the beginning of the function) and it shrinks from 213 bytes
to 126 bytes (using distro kernel config options).  I think even that one
can be eliminated, but I'm going slowly at this point and taking the
safe route of transforming a random struct page pointer into a struct
folio pointer by calling page_folio().  By the end of this exercise,
end_page_writeback() will become end_folio_writeback().
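
To give a flavour, here's a rough sketch of where end_page_writeback()
ends up, using the helper names from the patch list below (illustrative
only; the actual patches may differ in detail):

void end_page_writeback(struct page *page)
{
        /* One compound_head()-style lookup, done exactly once. */
        struct folio *folio = page_folio(page);

        if (FolioReclaim(folio)) {
                ClearFolioReclaim(folio);
                rotate_reclaimable_page(&folio->page);
        }
        get_folio(folio);
        if (!test_clear_page_writeback(&folio->page))
                BUG();

        smp_mb__after_atomic();
        wake_up_page(&folio->page, PG_writeback);
        put_folio(folio);
}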

This is going to be a ton of work, and massively disruptive.  It'll touch
every filesystem, and a good few device drivers!  But I think it's worth
it.  Not every routine benefits as much as end_page_writeback(), but it
makes everything a little better.  At 29 bytes per call to lock_page(),
unlock_page(), put_page() and get_page(), that's on the order of 60kB of
text for allyesconfig.  More when you add on all the PageFoo() calls.
With the small amount of work I've done here, mm/filemap.o shrinks its
text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
bytes of data).

But better than that, it's good documentation.  A function which has a
struct page argument might be expecting a head or base page and will
BUG if given a tail page.  It might work with any kind of page and
operate on PAGE_SIZE bytes.  It might work with any kind of page and
operate on page_size() bytes if given a head page but PAGE_SIZE bytes
if given a base or tail page.  It might operate on page_size() bytes
if passed a head or tail page.  We have examples of all of these today.
If a function takes a folio argument, it's operating on the entire folio.
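
A before/after sketch of what that buys at a call site (the completion
handlers below are made-up names, purely for illustration):

/* Today: will end_page_writeback() cope with a tail page?  Does it act
 * on PAGE_SIZE bytes or on the whole compound page?  You have to read
 * it to know. */
static void my_write_endio(struct page *page)
{
        end_page_writeback(page);
}

/* With folios: the signature says it all -- the callee acts on the
 * whole folio, and the single page-to-folio conversion is explicit at
 * the call site. */
static void my_write_endio_folio(struct page *page)
{
        end_folio_writeback(page_folio(page));
}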

This version of the patch series converts the deduplication code from
operating on pages to operating on folios.  Most of the patches are
somewhat generic infrastructure we'll need, then there's a big gulp as
all filesystems are converted to use folios for readahead and readpage.
Finally, we can convert the deduplication code to use page folios.

If you're interested, you can listen to a discussion of page folios
from last week here: https://www.youtube.com/watch?v=iP49_ER1FUM
Git tree version here (against next-20201216):
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio

Matthew Wilcox (Oracle) (25):
  mm: Introduce struct folio
  mm: Add put_folio
  mm: Add get_folio
  mm: Create FolioFlags
  mm: Add unlock_folio
  mm: Add lock_folio
  mm: Add lock_folio_killable
  mm: Add __alloc_folio_node and alloc_folio
  mm: Convert __page_cache_alloc to return a folio
  mm/filemap: Convert end_page_writeback to use a folio
  mm: Convert mapping_get_entry to return a folio
  mm: Add mark_folio_accessed
  mm: Add filemap_get_folio and find_get_folio
  mm/filemap: Add folio_add_to_page_cache
  mm/swap: Convert rotate_reclaimable_page to folio
  mm: Add folio_mapping
  mm: Rename THP_SUPPORT to MULTI_PAGE_FOLIOS
  btrfs: Use readahead_batch_length
  fs: Change page refcount rules for readahead
  fs: Change readpage to take a folio
  mm: Convert wait_on_page_bit to wait_on_folio_bit
  mm: Add wait_on_folio_locked & wait_on_folio_locked_killable
  mm: Add flush_dcache_folio
  mm: Add read_cache_folio and read_mapping_folio
  fs: Convert vfs_dedupe_file_range_compare to folios

 Documentation/core-api/cachetlb.rst   |   6 +
 Documentation/filesystems/locking.rst |   2 +-
 Documentation/filesystems/porting.rst |   8 +
 Documentation/filesystems/vfs.rst     |  35 +-
 fs/9p/vfs_addr.c                      |   9 +-
 fs/adfs/inode.c                       |   4 +-
 fs/affs/file.c                        |   8 +-
 fs/affs/symlink.c                     |   3 +-
 fs/afs/dir.c                          |   2 +-
 fs/afs/file.c                         |   5 +-
 fs/afs/write.c                        |   2 +-
 fs/befs/linuxvfs.c                    |  23 +-
 fs/bfs/file.c                         |   4 +-
 fs/block_dev.c                        |   4 +-
 fs/btrfs/compression.c                |   4 +-
 fs/btrfs/ctree.h                      |   2 +-
 fs/btrfs/extent_io.c                  |  19 +-
 fs/btrfs/file.c                       |  13 +-
 fs/btrfs/free-space-cache.c           |   9 +-
 fs/btrfs/inode.c                      |  16 +-
 fs/btrfs/ioctl.c                      |  11 +-
 fs/btrfs/relocation.c                 |  11 +-
 fs/btrfs/send.c                       |  11 +-
 fs/buffer.c                           |  12 +-
 fs/cachefiles/rdwr.c                  |  17 +-
 fs/ceph/addr.c                        |   8 +-
 fs/ceph/file.c                        |   2 +-
 fs/cifs/file.c                        |   3 +-
 fs/coda/symlink.c                     |   3 +-
 fs/cramfs/inode.c                     |   3 +-
 fs/ecryptfs/mmap.c                    |   3 +-
 fs/efs/inode.c                        |   4 +-
 fs/efs/symlink.c                      |   3 +-
 fs/erofs/data.c                       |  12 +-
 fs/erofs/zdata.c                      |   8 +-
 fs/exfat/inode.c                      |   4 +-
 fs/ext2/inode.c                       |   4 +-
 fs/ext4/ext4.h                        |   2 +-
 fs/ext4/inode.c                       |  10 +-
 fs/ext4/readpage.c                    |  35 +-
 fs/f2fs/data.c                        |  12 +-
 fs/fat/inode.c                        |   4 +-
 fs/freevxfs/vxfs_immed.c              |   7 +-
 fs/freevxfs/vxfs_subr.c               |   7 +-
 fs/fuse/dir.c                         |   8 +-
 fs/fuse/file.c                        |   7 +-
 fs/gfs2/aops.c                        |  13 +-
 fs/hfs/inode.c                        |   4 +-
 fs/hfsplus/inode.c                    |   4 +-
 fs/hpfs/file.c                        |   4 +-
 fs/hpfs/namei.c                       |   3 +-
 fs/inode.c                            |   4 +-
 fs/iomap/buffered-io.c                |  14 +-
 fs/isofs/compress.c                   |   3 +-
 fs/isofs/inode.c                      |   4 +-
 fs/isofs/rock.c                       |   3 +-
 fs/jffs2/file.c                       |  20 +-
 fs/jffs2/os-linux.h                   |   2 +-
 fs/jfs/inode.c                        |   4 +-
 fs/jfs/jfs_metapage.c                 |   3 +-
 fs/libfs.c                            |  10 +-
 fs/minix/inode.c                      |   4 +-
 fs/mpage.c                            |   9 +-
 fs/nfs/file.c                         |   5 +-
 fs/nfs/read.c                         |   7 +-
 fs/nfs/symlink.c                      |  12 +-
 fs/nilfs2/inode.c                     |   4 +-
 fs/ntfs/aops.c                        |   3 +-
 fs/ocfs2/aops.c                       |  14 +-
 fs/ocfs2/refcounttree.c               |   5 +-
 fs/ocfs2/symlink.c                    |   3 +-
 fs/omfs/file.c                        |   4 +-
 fs/orangefs/inode.c                   |   3 +-
 fs/qnx4/inode.c                       |   4 +-
 fs/qnx6/inode.c                       |   4 +-
 fs/reiserfs/inode.c                   |   4 +-
 fs/remap_range.c                      | 109 +++---
 fs/romfs/super.c                      |   3 +-
 fs/squashfs/file.c                    |   3 +-
 fs/squashfs/symlink.c                 |   3 +-
 fs/sysv/itree.c                       |   4 +-
 fs/ubifs/file.c                       |   8 +-
 fs/udf/file.c                         |   8 +-
 fs/udf/inode.c                        |   4 +-
 fs/udf/symlink.c                      |   3 +-
 fs/ufs/inode.c                        |   4 +-
 fs/vboxsf/file.c                      |   3 +-
 fs/xfs/xfs_aops.c                     |   4 +-
 fs/zonefs/super.c                     |   4 +-
 include/asm-generic/cacheflush.h      |  13 +
 include/linux/buffer_head.h           |   2 +-
 include/linux/fs.h                    |   6 +-
 include/linux/gfp.h                   |  11 +
 include/linux/iomap.h                 |   2 +-
 include/linux/mm.h                    |  57 +++-
 include/linux/mm_types.h              |  17 +
 include/linux/mpage.h                 |   2 +-
 include/linux/nfs_fs.h                |   2 +-
 include/linux/page-flags.h            |  80 ++++-
 include/linux/pagemap.h               | 227 +++++++++----
 include/linux/swap.h                  |   9 +-
 mm/filemap.c                          | 466 +++++++++++++-------------
 mm/internal.h                         |   1 +
 mm/page-writeback.c                   |   7 +-
 mm/page_io.c                          |   6 +-
 mm/readahead.c                        |  24 +-
 mm/shmem.c                            |   2 +-
 mm/swap.c                             |  40 ++-
 mm/swapfile.c                         |   6 +-
 mm/util.c                             |  20 +-
 net/ceph/pagelist.c                   |   4 +-
 net/ceph/pagevec.c                    |   2 +-
 112 files changed, 983 insertions(+), 752 deletions(-)

Comments

David Hildenbrand Dec. 17, 2020, 12:47 p.m. UTC | #1
On 16.12.20 19:23, Matthew Wilcox (Oracle) wrote:
> One of the great things about compound pages is that when you try to
> do various operations on a tail page, it redirects to the head page and
> everything Just Works.  One of the awful things is how much we pay for
> that simplicity.  Here's an example, end_page_writeback():
> 
>         if (PageReclaim(page)) {
>                 ClearPageReclaim(page);
>                 rotate_reclaimable_page(page);
>         }
>         get_page(page);
>         if (!test_clear_page_writeback(page))
>                 BUG();
> 
>         smp_mb__after_atomic();
>         wake_up_page(page, PG_writeback);
>         put_page(page);
> 
> That all looks very straightforward, but if you dive into the disassembly,
> you see that there are four calls to compound_head() in this function
> (PageReclaim(), ClearPageReclaim(), get_page() and put_page()).  It's
> all for nothing, because if anyone does call this routine with a tail
> page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
> 
> I'm not really a CPU person, but I imagine there's some kind of dependency
> here that sucks too:
> 
>     1fd7:       48 8b 57 08             mov    0x8(%rdi),%rdx
>     1fdb:       48 8d 42 ff             lea    -0x1(%rdx),%rax
>     1fdf:       83 e2 01                and    $0x1,%edx
>     1fe2:       48 0f 44 c7             cmove  %rdi,%rax
>     1fe6:       f0 80 60 02 fb          lock andb $0xfb,0x2(%rax)
> 
> Sure, it's going to be cache hot, but that cmove has to execute before
> the lock andb.
> 
> I would like to introduce a new concept that I call a Page Folio.
> Or just struct folio to its friends.  Here it is,
> struct folio {
>         struct page page;
> };
> 
> A folio is a struct page which is guaranteed not to be a tail page.
> So it's either a head page or a base (order-0) page.  That means
> we don't have to call compound_head() on it and we save massively.
> end_page_writeback() reduces from four calls to compound_head() to just
> one (at the beginning of the function) and it shrinks from 213 bytes
> to 126 bytes (using distro kernel config options).  I think even that one
> can be eliminated, but I'm going slowly at this point and taking the
> safe route of transforming a random struct page pointer into a struct
> folio pointer by calling page_folio().  By the end of this exercise,
> end_page_writeback() will become end_folio_writeback().
> 
> This is going to be a ton of work, and massively disruptive.  It'll touch
> every filesystem, and a good few device drivers!  But I think it's worth
> it.  Not every routine benefits as much as end_page_writeback(), but it
> makes everything a little better.  At 29 bytes per call to lock_page(),
> unlock_page(), put_page() and get_page(), that's on the order of 60kB of
> text for allyesconfig.  More when you add on all the PageFoo() calls.
> With the small amount of work I've done here, mm/filemap.o shrinks its
> text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
> bytes of data).

Just wondering, as the primary motivation here is "minimizing CPU work",
did you run any benchmarks that revealed a visible performance improvement?

Otherwise, we're left with a concept that's hard to grasp at first (folio -
what?!) and "a ton of work, and massively disruptive", saving some kB of
code - which does not sound too appealing to me.

(I like the idea of abstracting which pages are actually worth looking
at directly instead of going via a tail page - tail pages act somewhat
like a proxy for the head page when accessing flags)
Matthew Wilcox Dec. 17, 2020, 1:55 p.m. UTC | #2
On Thu, Dec 17, 2020 at 01:47:57PM +0100, David Hildenbrand wrote:
> On 16.12.20 19:23, Matthew Wilcox (Oracle) wrote:
> > One of the great things about compound pages is that when you try to
> > do various operations on a tail page, it redirects to the head page and
> > everything Just Works.  One of the awful things is how much we pay for
> > that simplicity.  Here's an example, end_page_writeback():
> > 
> >         if (PageReclaim(page)) {
> >                 ClearPageReclaim(page);
> >                 rotate_reclaimable_page(page);
> >         }
> >         get_page(page);
> >         if (!test_clear_page_writeback(page))
> >                 BUG();
> > 
> >         smp_mb__after_atomic();
> >         wake_up_page(page, PG_writeback);
> >         put_page(page);
> > 
> > That all looks very straightforward, but if you dive into the disassembly,
> > you see that there are four calls to compound_head() in this function
> > (PageReclaim(), ClearPageReclaim(), get_page() and put_page()).  It's
> > all for nothing, because if anyone does call this routine with a tail
> > page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
> > 
> > I'm not really a CPU person, but I imagine there's some kind of dependency
> > here that sucks too:
> > 
> >     1fd7:       48 8b 57 08             mov    0x8(%rdi),%rdx
> >     1fdb:       48 8d 42 ff             lea    -0x1(%rdx),%rax
> >     1fdf:       83 e2 01                and    $0x1,%edx
> >     1fe2:       48 0f 44 c7             cmove  %rdi,%rax
> >     1fe6:       f0 80 60 02 fb          lock andb $0xfb,0x2(%rax)
> > 
> > Sure, it's going to be cache hot, but that cmove has to execute before
> > the lock andb.
> > 
> > I would like to introduce a new concept that I call a Page Folio.
> > Or just struct folio to its friends.  Here it is,
> > struct folio {
> >         struct page page;
> > };
> > 
> > A folio is a struct page which is guaranteed not to be a tail page.
> > So it's either a head page or a base (order-0) page.  That means
> > we don't have to call compound_head() on it and we save massively.
> > end_page_writeback() reduces from four calls to compound_head() to just
> > one (at the beginning of the function) and it shrinks from 213 bytes
> > to 126 bytes (using distro kernel config options).  I think even that one
> > can be eliminated, but I'm going slowly at this point and taking the
> > safe route of transforming a random struct page pointer into a struct
> > folio pointer by calling page_folio().  By the end of this exercise,
> > end_page_writeback() will become end_folio_writeback().
> > 
> > This is going to be a ton of work, and massively disruptive.  It'll touch
> > every filesystem, and a good few device drivers!  But I think it's worth
> > it.  Not every routine benefits as much as end_page_writeback(), but it
> > makes everything a little better.  At 29 bytes per call to lock_page(),
> > unlock_page(), put_page() and get_page(), that's on the order of 60kB of
> > text for allyesconfig.  More when you add on all the PageFoo() calls.
> > With the small amount of work I've done here, mm/filemap.o shrinks its
> > text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
> > bytes of data).
> 
> Just wondering, as the primary motivation here is "minimizing CPU work",
> did you run any benchmarks that revealed a visible performance improvement?
> 
> Otherwise, we're left with a concept that's hard to grasp at first (folio -
> what?!) and "a ton of work, and massively disruptive", saving some kB of
> code - which does not sound too appealing to me.
> 
> (I like the idea of abstracting which pages are actually worth looking
> at directly instead of going via a tail page - tail pages act somewhat
> like a proxy for the head page when accessing flags)

My primary motivation here isn't minimising CPU work at all.  It's trying
to document which interfaces are expected to operate on an entire
compound page and which are expected to operate on a PAGE_SIZE page.
Today, we have a horrible mishmash of

 - This is a head page, I shall operate on 2MB of data
 - This is a tail page, I shall operate on 2MB of data
 - This is not a head page, I shall operate on 4kB of data
 - This is a head page, I shall operate on 4kB of data
 - This is a head|tail page, I shall operate on the size of the compound page.

You might say "Well, why not lead with that?", but I don't know which
advantages people are going to find most compelling.  Even if someone
doesn't believe in the advantages of using folios in the page cache,
looking at the assembler output is, I think, compelling.
David Hildenbrand Dec. 17, 2020, 2:35 p.m. UTC | #3
On 17.12.20 14:55, Matthew Wilcox wrote:
> On Thu, Dec 17, 2020 at 01:47:57PM +0100, David Hildenbrand wrote:
>> On 16.12.20 19:23, Matthew Wilcox (Oracle) wrote:
>>> One of the great things about compound pages is that when you try to
>>> do various operations on a tail page, it redirects to the head page and
>>> everything Just Works.  One of the awful things is how much we pay for
>>> that simplicity.  Here's an example, end_page_writeback():
>>>
>>>         if (PageReclaim(page)) {
>>>                 ClearPageReclaim(page);
>>>                 rotate_reclaimable_page(page);
>>>         }
>>>         get_page(page);
>>>         if (!test_clear_page_writeback(page))
>>>                 BUG();
>>>
>>>         smp_mb__after_atomic();
>>>         wake_up_page(page, PG_writeback);
>>>         put_page(page);
>>>
>>> That all looks very straightforward, but if you dive into the disassembly,
>>> you see that there are four calls to compound_head() in this function
>>> (PageReclaim(), ClearPageReclaim(), get_page() and put_page()).  It's
>>> all for nothing, because if anyone does call this routine with a tail
>>> page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
>>>
>>> I'm not really a CPU person, but I imagine there's some kind of dependency
>>> here that sucks too:
>>>
>>>     1fd7:       48 8b 57 08             mov    0x8(%rdi),%rdx
>>>     1fdb:       48 8d 42 ff             lea    -0x1(%rdx),%rax
>>>     1fdf:       83 e2 01                and    $0x1,%edx
>>>     1fe2:       48 0f 44 c7             cmove  %rdi,%rax
>>>     1fe6:       f0 80 60 02 fb          lock andb $0xfb,0x2(%rax)
>>>
>>> Sure, it's going to be cache hot, but that cmove has to execute before
>>> the lock andb.
>>>
>>> I would like to introduce a new concept that I call a Page Folio.
>>> Or just struct folio to its friends.  Here it is,
>>> struct folio {
>>>         struct page page;
>>> };
>>>
>>> A folio is a struct page which is guaranteed not to be a tail page.
>>> So it's either a head page or a base (order-0) page.  That means
>>> we don't have to call compound_head() on it and we save massively.
>>> end_page_writeback() reduces from four calls to compound_head() to just
>>> one (at the beginning of the function) and it shrinks from 213 bytes
>>> to 126 bytes (using distro kernel config options).  I think even that one
>>> can be eliminated, but I'm going slowly at this point and taking the
>>> safe route of transforming a random struct page pointer into a struct
>>> folio pointer by calling page_folio().  By the end of this exercise,
>>> end_page_writeback() will become end_folio_writeback().
>>>
>>> This is going to be a ton of work, and massively disruptive.  It'll touch
>>> every filesystem, and a good few device drivers!  But I think it's worth
>>> it.  Not every routine benefits as much as end_page_writeback(), but it
>>> makes everything a little better.  At 29 bytes per call to lock_page(),
>>> unlock_page(), put_page() and get_page(), that's on the order of 60kB of
>>> text for allyesconfig.  More when you add on all the PageFoo() calls.
>>> With the small amount of work I've done here, mm/filemap.o shrinks its
>>> text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
>>> bytes of data).
>>
>> Just wondering, as the primary motivation here is "minimizing CPU work",
>> did you run any benchmarks that revealed a visible performance improvement?
>>
>> Otherwise, we're left with a concept that's hard to grasp at first (folio -
>> what?!) and "a ton of work, and massively disruptive", saving some kB of
>> code - which does not sound too appealing to me.
>>
>> (I like the idea of abstracting which pages are actually worth looking
>> at directly instead of going via a tail page - tail pages act somewhat
>> like a proxy for the head page when accessing flags)
> 
> My primary motivation here isn't minimising CPU work at all.  It's trying

Ah, okay, reading about disassembly gave me that impression.

> to document which interfaces are expected to operate on an entire
> compound page and which are expected to operate on a PAGE_SIZE page.
> Today, we have a horrible mishmash of
> 
>  - This is a head page, I shall operate on 2MB of data
>  - This is a tail page, I shall operate on 2MB of data
>  - This is not a head page, I shall operate on 4kB of data
>  - This is a head page, I shall operate on 4kB of data
>  - This is a head|tail page, I shall operate on the size of the compound page.
> 
> You might say "Well, why not lead with that?", but I don't know which
> advantages people are going to find most compelling.  Even if someone
> doesn't believe in the advantages of using folios in the page cache,
> looking at the assembler output is, I think, compelling.

Personally, I think the implicit documentation of which type of pages
functions expect is a clear advantage. Having less code is a nice cherry
on top.