
[RFC,1/4] mcpage: add size/mask/shift definition for multiple consecutive page

Message ID 20230109072232.2398464-2-fengwei.yin@intel.com (mailing list archive)
State New
Series Multiple consecutive page for anonymous mapping

Commit Message

Yin, Fengwei Jan. 9, 2023, 7:22 a.m. UTC
Huge pages in the current kernel can bring an obvious performance
improvement for some workloads, through fewer TLB misses and fewer page
faults. But the limited choice of huge page sizes (2M/1G for x86_64)
also brings extra cost, like larger memory consumption and more CPU
cycles spent on page zeroing.

The idea of the multiple consecutive page (abbreviated as "mcpage") is
to use a collection of physically contiguous 4K pages, rather than a
huge page, for anonymous mappings. The target is to have more choices
for trading off the pros and cons of huge pages. Compared to a huge
page, an mcpage does not get as much benefit from reduced TLB misses
and page faults, but it also does not pay as much extra cost in memory
consumption or in the larger latency introduced by page compaction,
page zeroing, etc.

The size of an mcpage is configurable. The default value of 16K was
picked arbitrarily; users should choose a value based on the results
of tuning their workload with different mcpage sizes.

To get physically contiguous pages, a high order page is allocated
(the order is calculated from the mcpage size) and then split. By
doing this, each sub-page of an mcpage is just a normal 4K page, and
the current kernel page management infrastructure applies to "mc"
pages without any change.
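
A minimal sketch of that allocation path (illustrative only, not the
patch code; the helper name and GFP flags here are assumptions):

	static struct page *alloc_mcpage(gfp_t gfp)
	{
		struct page *page;

		/*
		 * Non-compound high order allocation (no __GFP_COMP),
		 * so the result is legal to split.
		 */
		page = alloc_pages(gfp, MCPAGE_ORDER);
		if (!page)
			return NULL;

		/* Turn it into MCPAGE_NR independent order-0 pages. */
		split_page(page, MCPAGE_ORDER);
		return page;
	}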

To reduce the number of page faults, multiple page table entries are
populated in one page fault with the pfns of the mcpage's sub-pages.
This also brings a small extra cost in memory consumption.
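
For illustration, the populate step amounts to something like this
(a sketch only, with the usual fault-path locals vmf/vma/pte/page
assumed; a real handler must also hold the PTE lock, recheck
pte_none() on each entry, and do the rmap/LRU accounting):

	/*
	 * 'page' is the first sub-page of the mcpage; 'pte' points at
	 * the page table entry for 'addr', the MCPAGE_SIZE aligned
	 * start of the range.
	 */
	unsigned long addr = vmf->address & MCPAGE_MASK;
	int i;

	for (i = 0; i < MCPAGE_NR; i++) {
		pte_t entry = mk_pte(page + i, vma->vm_page_prot);

		if (vma->vm_flags & VM_WRITE)
			entry = pte_mkwrite(pte_mkdirty(entry));
		set_pte_at(vma->vm_mm, addr + i * PAGE_SIZE, pte + i, entry);
	}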

Update Kconfig to allow the user to define the mcpage order, and define
the MCPAGE_SIZE/MASK/SHIFT/NR macros.
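
With the default CONFIG_MCPAGE_ORDER of 2 and 4K base pages
(PAGE_SHIFT == 12), these work out to:

	MCPAGE_ORDER = 2
	MCPAGE_SHIFT = 2 + 12 = 14
	MCPAGE_SIZE  = 1 << 14 = 16384	(16K)
	MCPAGE_NR    = 1 << 2  = 4	(sub-pages per mcpage)
	MCPAGE_MASK  = ~(16384 - 1) = ~0x3fff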

In this RFC patch, only Kconfig is used to set the mcpage order, to
show the idea. A runtime parameter will be chosen if this becomes an
official patch in the future.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 mm/Kconfig               | 19 +++++++++++++++++++
 2 files changed, 30 insertions(+)

Comments

Matthew Wilcox Jan. 9, 2023, 1:24 p.m. UTC | #1
On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
> The idea of the multiple consecutive page (abbreviated as "mcpage") is
> to use a collection of physically contiguous 4K pages, rather than a
> huge page, for anonymous mappings.

This is what folios are for.  You have an interesting demonstration
here that shows that moving to larger folios for anonymous memory
is worth doing (thank you!) but you're missing several of the advantages
of folios by going off and doing your own thing.

> The size of an mcpage is configurable. The default value of 16K was
> picked arbitrarily; users should choose a value based on the results
> of tuning their workload with different mcpage sizes.

Uh, no.  We don't do these kinds of config options any more (or boot-time
options as you mention later).  The size of a folio allocated for a given
VMA should be adaptive based on observing how the program is using memory.
There will likely be many different sizes of folio present in a given VMA.

> To get physically contiguous pages, a high order page is allocated
> (the order is calculated from the mcpage size) and then split. By
> doing this, each sub-page of an mcpage is just a normal 4K page, and
> the current kernel page management infrastructure applies to "mc"
> pages without any change.

This is somewhere that you're losing an advantage of folios.  By keeping
all the pages together, they get managed as a single unit.  That shrinks
the length of the LRU list and reduces lock contention.  It also reduces
the number of cache lines which are modified as, eg, we only need to
keep track of one dirty bit for many pages.

> To reduce the number of page faults, multiple page table entries are
> populated in one page fault with the pfns of the mcpage's sub-pages.
> This also brings a small extra cost in memory consumption.

That needs to be done for folios.  It's a long way down my todo list,
so if you wanted to take it on, it would be very much appreciated!
Dave Hansen Jan. 9, 2023, 4:30 p.m. UTC | #2
On 1/9/23 05:24, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
>> The idea of the multiple consecutive page (abbreviated as "mcpage")
>> is to use a collection of physically contiguous 4K pages, rather
>> than a huge page, for anonymous mappings.
> This is what folios are for.  You have an interesting demonstration
> here that shows that moving to larger folios for anonymous memory
> is worth doing (thank you!) but you're missing several of the advantages
> of folios by going off and doing your own thing.

It might not have come across in the changelog and cover letter, but
Fengwei and the rest of us *totally* agree with you on this.  "Doing
your own thing" just isn't going to cut it and if this is going to go
anywhere, it needs to use folios.

This series is _pure_ RFC, and the feedback we're interested in is
whether this demonstration warrants going back and doing it the right
way (with folios).
Matthew Wilcox Jan. 9, 2023, 5:01 p.m. UTC | #3
On Mon, Jan 09, 2023 at 08:30:43AM -0800, Dave Hansen wrote:
> On 1/9/23 05:24, Matthew Wilcox wrote:
> > On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
> >> The idea of the multiple consecutive page (abbreviated as "mcpage")
> >> is to use a collection of physically contiguous 4K pages, rather
> >> than a huge page, for anonymous mappings.
> > This is what folios are for.  You have an interesting demonstration
> > here that shows that moving to larger folios for anonymous memory
> > is worth doing (thank you!) but you're missing several of the advantages
> > of folios by going off and doing your own thing.
> 
> It might not have come across in the changelog and cover letter, but
> Fengwei and the rest of us *totally* agree with you on this.  "Doing
> your own thing" just isn't going to cut it and if this is going to go
> anywhere, it needs to use folios.
> 
> This series is _pure_ RFC, and the feedback we're interested in is
> whether this demonstration warrants going back and doing it the right
> way (with folios).

Ah, yes, that didn't come across.  In fact, the opposite came across
with this paragraph:

: This series is the first step of mcpage. The future work can be to
: enable mcpage for more components like page cache, swapping, etc.
: Finally, most pages in the system will be allocated/freed/reclaimed
: with mcpage order.

Since the page cache has been using multipage folios in mainline since
March (and in various trees of mine since Feb 2020!), that indicated
to me either a lack of knowledge of folios, or a rejection of the folio
approach.  Happy to hear that's not true!

Most of the results here validate my experience and/or assumptions,
which is good.  I'm more than happy for someone else to take on the
hard work of folio-ising the anon VMAs.  I see the problems to be
solved as:

 - Determining (on a page fault) what the correct allocation size
   is for this process at this time.  We have the readahead code to
   leverage for files, but I don't think we have anything similar for
   anon memory
 - Inserting multiple PTEs when a multi-page folio is found.  This also
   needs to be done for file pages, and maybe that's a good place
   to start.
 - Finding all the places in the anon memory code that assume that
   PageCompound() / PageTransHuge() is the same thing as
   folio_test_pmd_mappable().

There are probably other things that are going to come up, but I think
starting is the important part.  Not everything needs to be done
immediately (#3 before #1, I would think ;-).
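
As an illustration of #3 (a sketch; the exact call sites vary), the
conversion is typically from a page-flag test that conflates "compound"
with "PMD-sized" to an explicit folio-size test:

	/* Before: any transhuge anon page is assumed PMD-sized. */
	if (PageTransHuge(page)) {
		/* ... PMD-sized handling ... */
	}

	/* After: test what size the folio actually is. */
	if (folio_test_pmd_mappable(folio)) {
		/* ... PMD-sized handling ... */
	}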
Yin, Fengwei Jan. 10, 2023, 2:53 a.m. UTC | #4
On 1/9/2023 9:24 PM, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
>> The idea of the multiple consecutive page (abbreviated as "mcpage")
>> is to use a collection of physically contiguous 4K pages, rather
>> than a huge page, for anonymous mappings.
> 
> This is what folios are for.  You have an interesting demonstration
> here that shows that moving to larger folios for anonymous memory
> is worth doing (thank you!) but you're missing several of the advantages
> of folios by going off and doing your own thing.
Yes. Folios and mcpage share some advantages.

> 
>> The size of an mcpage is configurable. The default value of 16K was
>> picked arbitrarily; users should choose a value based on the results
>> of tuning their workload with different mcpage sizes.
> 
> Uh, no.  We don't do these kinds of config options any more (or boot-time
> options as you mention later).  The size of a folio allocated for a given
> VMA should be adaptive based on observing how the program is using memory.
> There will likely be many different sizes of folio present in a given VMA.
I had two thoughts about adaptive folio size:
1. Allocating a large folio could have high tail latency, which is not
   acceptable to some workloads. It may be good to allow the user to
   define the size?
2. Could folios of different sizes across the system fragment memory
   as a whole?

> 
>> To get physically contiguous pages, a high order page is allocated
>> (the order is calculated from the mcpage size) and then split. By
>> doing this, each sub-page of an mcpage is just a normal 4K page, and
>> the current kernel page management infrastructure applies to "mc"
>> pages without any change.
> 
> This is somewhere that you're losing an advantage of folios.  By keeping
> all the pages together, they get managed as a single unit.  That shrinks
> the length of the LRU list and reduces lock contention.  It also reduces
> the number of cache lines which are modified as, eg, we only need to
> keep track of one dirty bit for many pages.
Yes. The LRU list/lock benefit is provided by folios.

For the dirty bit, one dirty bit for many pages means that a single
dirty sub-page of a folio requires all sub-pages to be written out,
which puts pressure on storage. But yes, the other bits can benefit
from fewer cache line modifications.


Regards
Yin, Fengwei

> 
>> To reduce the number of page faults, multiple page table entries are
>> populated in one page fault with the pfns of the mcpage's sub-pages.
>> This also brings a small extra cost in memory consumption.
> 
> That needs to be done for folios.  It's a long way down my todo list,
> so if you wanted to take it on, it would be very much appreciated!
>

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..fa561c7b6290 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,17 @@  struct mem_cgroup;
 #define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
+#ifdef CONFIG_MCPAGE_ORDER
+#define MCPAGE_ORDER		CONFIG_MCPAGE_ORDER
+#else
+#define MCPAGE_ORDER		0
+#endif
+
+#define MCPAGE_SIZE		(1 << (MCPAGE_ORDER + PAGE_SHIFT))
+#define MCPAGE_MASK		(~(MCPAGE_SIZE - 1))
+#define MCPAGE_SHIFT		(MCPAGE_ORDER + PAGE_SHIFT)
+#define MCPAGE_NR		(1 << (MCPAGE_ORDER))
+
 struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..c202dc99ab6d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -650,6 +650,25 @@  config HUGETLB_PAGE_SIZE_VARIABLE
 	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
 	  clamped down to MAX_ORDER - 1.
 
+config MCPAGE
+	bool "multiple consecutive page <mcpage>"
+	default n
+	help
+	  Enable multiple consecutive page: an mcpage is a collection of
+	  physically contiguous sub-pages. When mapped to user space, all
+	  sub-pages are mapped in one page fault. mcpage trades off the
+	  pros and cons of huge page: less unnecessary memory zeroing and
+	  lower memory consumption than huge page, but without the TLB
+	  benefit.
+
+config MCPAGE_ORDER
+	int "multiple consecutive page order"
+	default 2
+	depends on X86_64 && MCPAGE
+	help
+	  The order of mcpage. It should be chosen carefully by tuning
+	  your workload.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA