Message ID | 20210511214735.1836149-1-willy@infradead.org (mailing list archive) |
---|---|
Series | Memory folios |
On Tue, May 11, 2021 at 10:47:02PM +0100, Matthew Wilcox (Oracle) wrote:
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions. There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.

I instrumented _compound_head() on a test VM:

```diff
+++ b/include/linux/page-flags.h
@@ -179,10 +179,13 @@ enum pageflags {
 #ifndef __GENERATING_BOUNDS_H
+extern atomic_t chcc;
+
 static inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
+	atomic_inc(&chcc);
 	if (unlikely(head & 1))
 		return head - 1;
 	return (unsigned long)page;
```

which means it catches both calls to compound_head() and page_folio(). Between patch 8/96 in folio_v9 and patch 96/96, the number of calls in an idle VM went down from almost 7k/s to just over 5k/s; about 25%.
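The counting trick in the patch above can be mimicked in user space. This is a minimal sketch that assumes a stripped-down `struct page` holding only `compound_head` (the real structure has many more fields). It mirrors the encoding that `_compound_head()` decodes: a tail page stores its head page's address with bit 0 set. `compound_head_counted()` and `make_tail()` are hypothetical names, and `chcc` plays the role of the counter in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct page: only the field
 * that _compound_head() reads. */
struct page {
	unsigned long compound_head;
};

/* Global call counter, analogous to "chcc" in the instrumented kernel. */
static atomic_ulong chcc;

static unsigned long compound_head_counted(const struct page *page)
{
	unsigned long head = page->compound_head;	/* kernel: READ_ONCE() */

	atomic_fetch_add(&chcc, 1);
	if (head & 1)			/* tail page: bit 0 tags the head address */
		return head - 1;
	return (unsigned long)page;	/* head page or order-0 page */
}

/* Mark 'tail' as a tail page of 'head' using the bit-0 encoding. */
static void make_tail(struct page *tail, struct page *head)
{
	/* struct page alignment guarantees bit 0 of its address is clear */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (unsigned long)head | 1;
}
```

Every lookup bumps the counter, whether or not the page was a tail page. Sampling the counter before and after a conversion therefore shows how many lookups the conversion removed, which is the kind of figure quoted above.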
I have a nit on part 01/33, but will respond directly there.

For the series:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>

> On May 11, 2021, at 3:47 PM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
>
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> benefit from a larger "page size". As an example, an earlier iteration
> of this idea which used compound pages (and wasn't particularly tuned)
> got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system.
> Functions are often unprepared for compound pages to be passed to them,
> and may only act on PAGE_SIZE chunks. Even functions which are aware of
> compound pages may expect a head page, and do the wrong thing if passed
> a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions. There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.
>
> This patch series uses a new type, the struct folio, to manage memory.
> It provides some basic infrastructure that's worthwhile in its own right,
> shrinking the kernel by about 5kB of text.
>
> Since v9:
> - Rebase onto mmotm 2021-05-10-21-46
> - Add folio_memcg() definition for !MEMCG (intel lkp)
> - Change folio->private from an unsigned long to a void *
> - Use folio_page() to implement folio_file_page()
> - Add folio_try_get() and folio_try_get_rcu()
> - Trim back down to just the first few patches, which are better-reviewed.
>
> v9: https://lore.kernel.org/linux-mm/20210505150628.111735-1-willy@infradead.org/
>
> v8: https://lore.kernel.org/linux-mm/20210430180740.2707166-1-willy@infradead.org/
>
> ```
> Matthew Wilcox (Oracle) (33):
>   mm: Introduce struct folio
>   mm: Add folio_pgdat and folio_zone
>   mm/vmstat: Add functions to account folio statistics
>   mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO
>   mm: Add folio reference count functions
>   mm: Add folio_put
>   mm: Add folio_get
>   mm: Add folio_try_get_rcu
>   mm: Add folio flag manipulation functions
>   mm: Add folio_young and folio_idle
>   mm: Handle per-folio private data
>   mm/filemap: Add folio_index, folio_file_page and folio_contains
>   mm/filemap: Add folio_next_index
>   mm/filemap: Add folio_offset and folio_file_offset
>   mm/util: Add folio_mapping and folio_file_mapping
>   mm: Add folio_mapcount
>   mm/memcg: Add folio wrappers for various functions
>   mm/filemap: Add folio_unlock
>   mm/filemap: Add folio_lock
>   mm/filemap: Add folio_lock_killable
>   mm/filemap: Add __folio_lock_async
>   mm/filemap: Add __folio_lock_or_retry
>   mm/filemap: Add folio_wait_locked
>   mm/swap: Add folio_rotate_reclaimable
>   mm/filemap: Add folio_end_writeback
>   mm/writeback: Add folio_wait_writeback
>   mm/writeback: Add folio_wait_stable
>   mm/filemap: Add folio_wait_bit
>   mm/filemap: Add folio_wake_bit
>   mm/filemap: Convert page wait queues to be folios
>   mm/filemap: Add folio private_2 functions
>   fs/netfs: Add folio fscache functions
>   mm: Add folio_mapped
>
>  Documentation/core-api/mm-api.rst | 4 +
>  Documentation/filesystems/netfs_library.rst | 2 +
>  fs/afs/write.c | 9 +-
>  fs/cachefiles/rdwr.c | 16 +-
>  fs/io_uring.c | 2 +-
>  include/linux/memcontrol.h | 63 ++++
>  include/linux/mm.h | 174 ++++++++--
>  include/linux/mm_types.h | 71 ++++
>  include/linux/mmdebug.h | 20 ++
>  include/linux/netfs.h | 77 +++--
>  include/linux/page-flags.h | 230 ++++++++++---
>  include/linux/page_idle.h | 99 +++---
>  include/linux/page_ref.h | 158 ++++++++-
>  include/linux/pagemap.h | 358 ++++++++++++--------
>  include/linux/swap.h | 7 +-
>  include/linux/vmstat.h | 107 ++++++
>  mm/Makefile | 2 +-
>  mm/filemap.c | 315 ++++++++---------
>  mm/folio-compat.c | 43 +++
>  mm/internal.h | 1 +
>  mm/memory.c | 8 +-
>  mm/page-writeback.c | 72 ++--
>  mm/page_io.c | 4 +-
>  mm/swap.c | 18 +-
>  mm/swapfile.c | 8 +-
>  mm/util.c | 59 ++--
>  26 files changed, 1374 insertions(+), 553 deletions(-)
>  create mode 100644 mm/folio-compat.c
>
> --
> 2.30.2
> ```
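The type-system argument in the cover letter can be illustrated with a small user-space sketch. Everything here is hypothetical and heavily simplified: `struct folio` just overlays a head `struct page`, the flag bit numbers are arbitrary, and `PageFlag()` / `folio_test_flag()` are stand-ins, not the kernel's generated `PageFoo()` / folio flag helpers. The point is that the page-based test repeats the head lookup on every call, while the folio-based test relies on the type to guarantee it already has a head page.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct page. */
struct page {
	unsigned long flags;
	unsigned long compound_head;	/* bit 0 set: tail page, rest = head address */
};

/* Hypothetical folio: same layout as a head page, but a distinct type,
 * so a struct folio * is never meant to point at a tail page. */
struct folio {
	struct page page;
};

/* Arbitrary bit numbers, for this sketch only. */
enum { SKETCH_PG_locked = 0, SKETCH_PG_dirty = 1 };

static unsigned int head_lookups;	/* how often a head page was resolved */

static struct page *compound_head(struct page *page)
{
	unsigned long head = page->compound_head;

	head_lookups++;
	if (head & 1)
		return (struct page *)(head - 1);
	return page;
}

/* Resolve the head page once; the result type records that fact. */
static struct folio *page_folio(struct page *page)
{
	struct folio *folio = (struct folio *)compound_head(page);

	assert(!(folio->page.compound_head & 1));	/* never a tail page */
	return folio;
}

/* Page-based flag test: must re-resolve the head on every call. */
static bool PageFlag(struct page *page, int bit)
{
	return compound_head(page)->flags & (1UL << bit);
}

/* Folio-based flag test: no lookup, the type guarantees a head page. */
static bool folio_test_flag(const struct folio *folio, int bit)
{
	return folio->page.flags & (1UL << bit);
}
```

Two page-based tests on a tail page cost two head lookups; converting once with `page_folio()` and then testing the folio any number of times costs one.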
On Tue, 11 May 2021 22:47:02 +0100
"Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more
> hidden calls to compound_head(). This also happens for get_page(),
> put_page() and many more functions. There does not appear to be a
> way to tell gcc that it can cache the result of compound_head(), nor
> is there a way to tell it that compound_head() is idempotent.

Maybe it's not effective in all situations but the following hint to the compiler seems to have an effect, at least according to bloat-o-meter:

```diff
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -179,7 +179,7 @@ enum pageflags {
 struct page;	/* forward declaration */
-static inline struct page *compound_head(struct page *page)
+static inline __attribute_const__ struct page *compound_head(struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
```

```
$ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)
Function old new delta
ntfs_mft_record_alloc 14414 16627 +2213
migrate_pages 8891 10819 +1928
ext2_get_page.isra 1029 2343 +1314
kfence_init 180 1331 +1151
page_remove_rmap 754 1893 +1139
f2fs_fsync_node_pages 4378 5406 +1028
deferred_split_huge_page 1279 2286 +1007
relock_page_lruvec_irqsave - 975 +975
f2fs_file_write_iter 3508 4408 +900
__pagevec_lru_add 704 1311 +607
[...]
pagevec_move_tail_fn 5333 3215 -2118
__activate_page 6183 4021 -2162
__unmap_and_move 2190 - -2190
__page_cache_release 4738 2547 -2191
migrate_page_states 7088 4842 -2246
lru_deactivate_fn 5925 3652 -2273
move_pages_to_lru 7259 4980 -2279
check_move_unevictable_pages 7131 4594 -2537
release_pages 6940 4386 -2554
lru_lazyfree_fn 6798 4198 -2600
ntfs_mft_record_format 2940 - -2940
lru_deactivate_file_fn 9220 5631 -3589
shrink_page_list 20653 15749 -4904
page_memcg 5149 193 -4956
Total: Before=388863526, After=388686517, chg -0.05%
```

I don't know if it breaks something though, nor if it gives some real improvement.
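In the kernel, `__attribute_const__` expands to GCC's `__attribute__((__const__))`, which promises that a function's result depends only on its argument values. A minimal sketch, assuming a simplified `struct page`, of where that promise does hold: a helper that decodes an already-loaded `compound_head` word reads no memory, so marking it const is legitimate, while the load itself stays in an ordinary function. `decode_head()` and `compound_head_split()` are hypothetical names, not kernel functions.

```c
#include <assert.h>

struct page {
	unsigned long compound_head;	/* bit 0 set: tail page */
};

/* const is safe here: the result is a function of the two integer
 * arguments alone and no memory is read, so the compiler may reuse a
 * result computed earlier for the same argument values. */
static inline __attribute__((const)) unsigned long
decode_head(unsigned long word, unsigned long self)
{
	return (word & 1) ? word - 1 : self;
}

/* The memory read stays outside the const helper (in the kernel it
 * would be READ_ONCE(page->compound_head)), so a later store to
 * compound_head is still observed. */
static inline struct page *compound_head_split(struct page *page)
{
	unsigned long word = page->compound_head;

	assert(((unsigned long)page & 1) == 0);	/* bit 0 must be free for tagging */
	return (struct page *)decode_head(word, (unsigned long)page);
}
```

The trade-off is that only the cheap decode can be merged; the load still happens on every call, so the saving is much smaller than what marking the whole function const appears to give.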
On Fri, Jun 04, 2021 at 03:07:12AM +0200, Matteo Croce wrote:
> On Tue, 11 May 2021 22:47:02 +0100
> "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
>
> > We also waste a lot of instructions ensuring that we're not looking at
> > a tail page. Almost every call to PageFoo() contains one or more
> > hidden calls to compound_head(). This also happens for get_page(),
> > put_page() and many more functions. There does not appear to be a
> > way to tell gcc that it can cache the result of compound_head(), nor
> > is there a way to tell it that compound_head() is idempotent.
>
> Maybe it's not effective in all situations but the following hint to
> the compiler seems to have an effect, at least according to bloat-o-meter:

It definitely has an effect ;-)

> Note that a function that has pointer arguments and examines the
> data pointed to must _not_ be declared 'const' if the pointed-to
> data might change between successive invocations of the function.
> In general, since a function cannot distinguish data that might
> change from data that cannot, const functions should never take
> pointer or, in C++, reference arguments. Likewise, a function that
> calls a non-const function usually must not be const itself.

So that's not going to work because a call to split_huge_page() won't tell the compiler that it's changed.

Reading the documentation, we might be able to get away with marking the function as pure:

> The 'pure' attribute imposes similar but looser restrictions on a
> function's definition than the 'const' attribute: 'pure' allows the
> function to read any non-volatile memory, even if it changes in
> between successive invocations of the function.

although that's going to miss opportunities, since taking a lock will modify the contents of struct page, meaning the compiler won't cache the results of compound_head().

> ```
> $ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
> add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)
> ```

I assume this is an allyesconfig kernel? I think it's a good indication of how much opportunity there is.
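A minimal user-space sketch of the `pure` semantics quoted above, assuming a simplified `struct page`; `split_page()` is a hypothetical stand-in for an operation like `split_huge_page()` that rewrites `compound_head`. Because a pure function may read memory, the compiler can merge two calls with the same argument only when nothing in between may have written to memory, so the second lookup below must observe the split.

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	unsigned long compound_head;	/* bit 0 set: tail page */
};

/* pure: no side effects, but it reads memory.  Repeated calls may be
 * merged only while no intervening store could change what it reads. */
static __attribute__((pure)) struct page *compound_head_pure(struct page *page)
{
	unsigned long head = page->compound_head;

	if (head & 1)
		return (struct page *)(head - 1);
	return page;
}

/* Hypothetical stand-in for splitting a compound page: the tail page
 * becomes an independent page, so its head is now itself. */
static void split_page(struct page *tail)
{
	assert(tail->compound_head & 1);	/* only tail pages get split off */
	tail->compound_head = 0;
}

/* Look up the head, split, look up again.  The store made by
 * split_page() prevents the compiler from reusing the first result. */
static bool head_changes_across_split(struct page *tail, struct page *head)
{
	struct page *before = compound_head_pure(tail);

	split_page(tail);
	return before == head && compound_head_pure(tail) == tail;
}
```

The same rule explains the missed opportunities mentioned above: any store the compiler cannot rule out, such as the update to struct page made when taking the page lock, forces the next call to be evaluated again.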
On Fri, Jun 4, 2021 at 4:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> > ```
> > $ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
> > add/remove: 3/13 grow/shrink: 65/689 up/down: 21080/-198089 (-177009)
> > ```
>
> I assume this is an allyesconfig kernel? I think it's a good
> indication of how much opportunity there is.

Yes, it's an allyesconfig kernel. I did the same with pure:

```
$ git diff
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..548b72b46eb1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -179,7 +179,7 @@ enum pageflags {
 struct page;	/* forward declaration */
-static inline struct page *compound_head(struct page *page)
+static inline __pure struct page *compound_head(struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
```

```
$ scripts/bloat-o-meter vmlinux.o.orig vmlinux.o
add/remove: 3/13 grow/shrink: 63/689 up/down: 20910/-192081 (-171171)
Function old new delta
ntfs_mft_record_alloc 14414 16627 +2213
migrate_pages 8891 10819 +1928
ext2_get_page.isra 1029 2343 +1314
kfence_init 180 1331 +1151
page_remove_rmap 754 1893 +1139
f2fs_fsync_node_pages 4378 5406 +1028
[...]
migrate_page_states 7088 4842 -2246
ntfs_mft_record_format 2940 - -2940
lru_deactivate_file_fn 9220 6277 -2943
shrink_page_list 20653 15749 -4904
page_memcg 5149 193 -4956
Total: Before=388869713, After=388698542, chg -0.04%
```

```
$ ls -l vmlinux.o.orig vmlinux.o
-rw-rw-r-- 1 mcroce mcroce 1295502680 Jun 8 16:47 vmlinux.o
-rw-rw-r-- 1 mcroce mcroce 1295934624 Jun 8 16:28 vmlinux.o.orig
```

vmlinux.o is ~420 KiB smaller.
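The headline numbers in this thread can be re-derived with a few lines of arithmetic. This sketch assumes the `chg` figure on a bloat-o-meter `Total:` line is the plain relative change, (after - before) / before, expressed in percent; `percent_change()` and `delta_kib()` are hypothetical helper names.

```c
#include <assert.h>

/* Relative size change in percent, assumed to match the "chg" figure
 * of a bloat-o-meter "Total:" line: (after - before) / before * 100. */
static double percent_change(long long before, long long after)
{
	assert(before > 0);
	return (double)(after - before) * 100.0 / (double)before;
}

/* Byte delta between two file sizes, converted to KiB (1 KiB = 1024 B).
 * Negative means the second file is smaller. */
static double delta_kib(long long before, long long after)
{
	return (double)(after - before) / 1024.0;
}
```

With the figures above, `percent_change(388863526, 388686517)` is about -0.0455 (reported as -0.05%), `percent_change(388869713, 388698542)` is about -0.0440 (reported as -0.04%), and `delta_kib(1295934624, 1295502680)` is about -421.8, i.e. a 431944-byte reduction, which is the ~420 KiB quoted for vmlinux.o.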