diff mbox series

[RFC,03/26] mm: make pageblock_order 2M per default

Message ID 20230418191313.268131-4-hannes@cmpxchg.org (mailing list archive)
State New
Headers show
Series mm: reliable huge page allocator | expand

Commit Message

Johannes Weiner April 18, 2023, 7:12 p.m. UTC
pageblock_order can be of various sizes, depending on configuration,
but the default is MAX_ORDER-1. Given 4k pages, that comes out to
4M. This is a large chunk for the allocator/reclaim/compaction to try
to keep grouped per migratetype. It's also unnecessary as the majority
of higher order allocations - THP and slab - are smaller than that.

Before subsequent patches increase the effort that goes into
maintaining migratetype isolation, it's important to first set the
defrag block size to what's likely to have common consumers.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pageblock-flags.h | 4 ++--
 mm/page_alloc.c                 | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Kirill A . Shutemov April 19, 2023, 12:01 a.m. UTC | #1
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1.

Note that MAX_ORDER got redefined in -mm tree recently.

> Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

This seems way too x86-specific. Other arches have larger THP sizes. I
believe 16M is common.

Maybe define it as min(MAX_ORDER, PMD_ORDER)?

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c                 | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>  
>  #else /* CONFIG_HUGETLB_PAGE */
>  
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order		(MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order		ilog2(2U << (20 - PAGE_SHIFT))
>  
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>  
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
> -- 
> 2.39.2
> 
>
Johannes Weiner April 19, 2023, 2:55 a.m. UTC | #2
On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > pageblock_order can be of various sizes, depending on configuration,
> > but the default is MAX_ORDER-1.
> 
> Note that MAX_ORDER got redefined in -mm tree recently.
> 
> > Given 4k pages, that comes out to
> > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > to keep grouped per migratetype. It's also unnecessary as the majority
> > of higher order allocations - THP and slab - are smaller than that.
> 
> This seems way too x86-specific.

Hey, that's the machines I have access to ;)

> Other arches have larger THP sizes. I believe 16M is common.
>
> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Hm, let me play around with larger pageblocks.

The thing that gives me pause is that this seems quite aggressive as a
default block size for the allocator and reclaim/compaction - if you
consider the implications for internal fragmentation and the amount of
ongoing defragmentation work it would require.

IOW, it's not just a function of physical page size supported by the
CPU. It's also a function of overall memory capacity. Independent of
architecture, 2MB seems like a more reasonable step up than 16M.

16M is great for TLB coverage, and in our DCs we're getting a lot of
use out of 1G hugetlb pages as well. The question is if those archs
are willing to pay the cost of serving such page sizes quickly and
reliably during runtime; or if that's something better left to setups
with explicit preallocations and stuff like hugetlb_cma reservations.
Johannes Weiner April 19, 2023, 3:44 a.m. UTC | #3
On Tue, Apr 18, 2023 at 10:55:53PM -0400, Johannes Weiner wrote:
> On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > > pageblock_order can be of various sizes, depending on configuration,
> > > but the default is MAX_ORDER-1.
> > 
> > Note that MAX_ORDER got redefined in -mm tree recently.
> > 
> > > Given 4k pages, that comes out to
> > > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > > to keep grouped per migratetype. It's also unnecessary as the majority
> > > of higher order allocations - THP and slab - are smaller than that.
> > 
> > This seems way too x86-specific.
> > Other arches have larger THP sizes. I believe 16M is common.
> >
> > Maybe define it as min(MAX_ORDER, PMD_ORDER)?
> 
> Hm, let me play around with larger pageblocks.
> 
> The thing that gives me pause is that this seems quite aggressive as a
> default block size for the allocator and reclaim/compaction - if you
> consider the implications for internal fragmentation and the amount of
> ongoing defragmentation work it would require.
> 
> IOW, it's not just a function of physical page size supported by the
> CPU. It's also a function of overall memory capacity. Independent of
> architecture, 2MB seems like a more reasonable step up than 16M.

[ Quick addition: on those other archs, these patches would still help
  with other, non-THP sources of compound allocations, such as slub,
  variable-order cache folios, and really any orders up to 2M. So it's
  not like we *have* to raise it to PMD_ORDER for them to benefit. ]
Vlastimil Babka April 19, 2023, 10:36 a.m. UTC | #4
On 4/18/23 21:12, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

Well in my experience the kernel usually has hugetlbfs config-enabled so it
uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
is used instead. But sure, we can set a better default that's not tied to
hugetlbfs.

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c                 | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>  
>  #else /* CONFIG_HUGETLB_PAGE */
>  
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order		(MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order		ilog2(2U << (20 - PAGE_SHIFT))
>  
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>  
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
David Hildenbrand April 19, 2023, 11:09 a.m. UTC | #5
On 19.04.23 12:36, Vlastimil Babka wrote:
> On 4/18/23 21:12, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
> 
> Well in my experience the kernel usually has hugetlbfs config-enabled so it
> uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
> is used instead. But sure, we can set a better default that's not tied to
> hugetlbfs.

As virtio-mem really wants small pageblocks (hot(un)plug granularity), 
I've seen reports from users without HUGETLB configured complaining 
about this (on x86, we'd get 4M instead of 2M).

So having a better default (PMD_SIZE) sounds like a good idea to me (and 
I even recall suggesting to change the !hugetlb default).
David Hildenbrand April 19, 2023, 11:10 a.m. UTC | #6
On 19.04.23 02:01, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1.
> 
> Note that MAX_ORDER got redefined in -mm tree recently.
> 
>> Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
> 
> This seems way too x86-specific. Other arches have larger THP sizes. I
> believe 16M is common.
> 

arm64 with 64k pages has ... 512 MiB IIRC :/ It's the weird one.

> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Sounds good to me.
Mel Gorman April 21, 2023, 12:37 p.m. UTC | #7
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.
> 
> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

This patch may be a distraction in the context of this series. I don't feel
particularly strongly about it but it has strong bikeshed potential. For
configurations that support huge pages of any sort, it should be PMD_ORDER;
for anything else the choice is arbitrary. 2M is as good a guess as any,
because even if it were tied to PAGE_ALLOC_COSTLY_ORDER, the pageblock
bitmap overhead might be annoying.

Patch

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..05b6811f8cee 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@  extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order		(MAX_ORDER-1)
+/* Manage fragmentation at the 2M level */
+#define pageblock_order		ilog2(2U << (20 - PAGE_SHIFT))
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac03571e0532..5e04a69f6a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7634,7 +7634,7 @@  static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order = MAX_ORDER - 1;
+	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)