
[4/4] mm/page_alloc: no need to ClearPageReserved on giving page to buddy system

Message ID 20240629013322.12364-4-richard.weiyang@gmail.com (mailing list archive)
State New
Series [1/4] mm: use zonelist_zone() to get zone

Commit Message

Wei Yang June 29, 2024, 1:33 a.m. UTC
Function __free_pages_core() is only used in the following two cases to
put pages into the buddy system:

* free bootmem
* free hot-add memory

After the above cleanup, there is no case left that frees a page with
PG_reserved set. Let's remove the clear operation.

Page initialization is 6.5% faster on a 6G qemu virtual machine.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: David Hildenbrand <david@redhat.com>
CC: Mike Rapoport (IBM) <rppt@kernel.org>
---
 mm/page_alloc.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Matthew Wilcox June 29, 2024, 3:21 a.m. UTC | #1
On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
> +++ b/mm/page_alloc.c
> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>  	prefetchw(p);
>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>  		prefetchw(p + 1);
> -		__ClearPageReserved(p);
>  		set_page_count(p, 0);
>  	}
> -	__ClearPageReserved(p);
>  	set_page_count(p, 0);

Is the prefetchw() still useful?  Any remotely competent CPU should
be able to figure out this loop ...
Wei Yang June 29, 2024, 8:44 a.m. UTC | #2
On Sat, Jun 29, 2024 at 04:21:59AM +0100, Matthew Wilcox wrote:
>On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
>> +++ b/mm/page_alloc.c
>> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>>  	prefetchw(p);
>>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>  		prefetchw(p + 1);
>> -		__ClearPageReserved(p);
>>  		set_page_count(p, 0);
>>  	}
>> -	__ClearPageReserved(p);
>>  	set_page_count(p, 0);
>
>Is the prefetchw() still useful?  Any remotely competent CPU should
>be able to figure out this loop ...

Hi, Matthew

Thanks for your question. But to be honest, I don't fully understand it.

Per my understanding, prefetchw() tries to load data into the cache before we
actually access it. By doing so, we won't hit a cache miss when we really
need it.

This looks useful to me. So how does the "remotely competent CPU" come into
this process?
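
As an aside for readers following along: on architectures without their own
implementation, the kernel's generic prefetchw() fallback is, as far as I can
tell, just __builtin_prefetch(x, 1), i.e. a prefetch-for-write hint. Below is
a minimal userspace sketch of the same pattern as the loop above (prefetch the
next element for write while working on the current one). The struct, names,
and sizes are made up for illustration only.

    #include <stddef.h>
    #include <string.h>

    /* Stand-in for struct page: padded so each element is ~one cache line. */
    struct item {
            unsigned long flags;
            long count;
            char pad[48];
    };

    /*
     * Mirrors the shape of the loop in __free_pages_core(): hint that the
     * *next* element will be written while we work on the current one.
     * Assumes n >= 1, just as the kernel loop assumes at least one page.
     */
    static void reset_items(struct item *p, size_t n)
    {
            size_t i;

            __builtin_prefetch(p, 1);               /* second arg 1 = prefetch for write */
            for (i = 0; i < n - 1; i++, p++) {
                    __builtin_prefetch(p + 1, 1);
                    p->count = 0;
            }
            p->count = 0;                           /* last element: nothing left to prefetch */
    }

    int main(void)
    {
            static struct item items[1024];

            memset(items, 0xff, sizeof(items));
            reset_items(items, 1024);
            return 0;
    }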
Wei Yang June 29, 2024, 8:48 a.m. UTC | #3
On Sat, Jun 29, 2024 at 08:28:30AM +0200, David Hildenbrand wrote:
>On 29.06.24 08:25, David Hildenbrand wrote:
>> On 29.06.24 03:33, Wei Yang wrote:
>> > Function __free_pages_core() is only used in the following two cases to
>> > put page to buddy system:
>> > 
>> > * free bootmem
>> > * free hot-add memory
>> > 
>> > After the above cleanup, there is no case to free page with PG_reserved
>> > set. Let's remove the clear operation.
>> > 
>> > The page initialization time shows 6.5% faster with a 6G qemu virtual
>> > machine.
>> > 
>> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> > CC: David Hildenbrand <david@redhat.com>
>> > CC: Mike Rapoport (IBM) <rppt@kernel.org>
>> > ---
>> >    mm/page_alloc.c | 2 --
>> >    1 file changed, 2 deletions(-)
>> > 
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 51a47db375b6..bc7316744a34 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>> >    	prefetchw(p);
>> >    	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>> >    		prefetchw(p + 1);
>> > -		__ClearPageReserved(p);
>> >    		set_page_count(p, 0);
>> >    	}
>> > -	__ClearPageReserved(p);
>> >    	set_page_count(p, 0);
>> >    	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>> 
>> Again see mm/mm-stable where that code changed.
>> 
>> I think we can still get reserved pages here, for example via
>> kmsan_memblock_free_pages().
>> 

You mean ZONE_DEVICE pages?

If yes, I think the normal memblock_free_pages() could still get reserved pages.
Am I right?

>
>Sorry, I meant kmsan_memblock_discard().

Yep.

>-- 
>Cheers,
>
>David / dhildenb
Matthew Wilcox June 29, 2024, 4:28 p.m. UTC | #4
On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
> On Sat, Jun 29, 2024 at 04:21:59AM +0100, Matthew Wilcox wrote:
> >On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
> >> +++ b/mm/page_alloc.c
> >> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
> >>  	prefetchw(p);
> >>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> >>  		prefetchw(p + 1);
> >> -		__ClearPageReserved(p);
> >>  		set_page_count(p, 0);
> >>  	}
> >> -	__ClearPageReserved(p);
> >>  	set_page_count(p, 0);
> >
> >Is the prefetchw() still useful?  Any remotely competent CPU should
> >be able to figure out this loop ...
> 
> Hi, Matthew
> 
> Thanks for your question. But to be honest, I am not fully understand it.

Let's try this question:

If you remove the prefetchw() line, does the performance change?

> Per my understanding, prefetchw() is trying to load data to cache before we
> really accessing it. By doing so, we won't hit a cache miss when we really
> need it.

Yes, but the CPU can also do this by itself without needing an explicit
hint from software.  It can notice that we have a loop that's accessing
successive cachelines for write.  This is approximately the easiest
prefetcher to design.
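
To make Matthew's question concrete outside the kernel, here is a rough
userspace sketch one could use to compare the same sequential-store loop with
and without an explicit prefetch-for-write hint. All names and sizes here are
made up for the sketch; results will vary a lot by CPU, and on anything recent
the hardware prefetcher usually makes the two loops indistinguishable, which
is exactly the point being made.

    /* build: cc -O2 prefetch_bench.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (64UL << 20)          /* 64M longs (~512 MiB), well beyond any cache */
    #define AHEAD 8                 /* 8 longs = 64 bytes = one cache line ahead */

    static double ms(struct timespec a, struct timespec b)
    {
            return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void)
    {
            /* +AHEAD so the prefetch hint never points past the allocation. */
            long *buf = malloc((N + AHEAD) * sizeof(*buf));
            struct timespec t0, t1, t2;
            size_t i;

            if (!buf)
                    return 1;
            memset(buf, 1, (N + AHEAD) * sizeof(*buf));     /* fault pages in first */

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < N; i++)                         /* plain sequential stores */
                    buf[i] = 0;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            for (i = 0; i < N; i++) {                       /* same, plus a software hint */
                    __builtin_prefetch(&buf[i + AHEAD], 1);
                    buf[i] = 0;
            }
            clock_gettime(CLOCK_MONOTONIC, &t2);

            printf("no hint:   %.1f ms\n", ms(t0, t1));
            printf("with hint: %.1f ms\n", ms(t1, t2));
            printf("check: %ld\n", buf[0] + buf[N - 1]);    /* keep the stores alive */
            free(buf);
            return 0;
    }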
Matthew Wilcox June 29, 2024, 4:45 p.m. UTC | #5
On Sat, Jun 29, 2024 at 05:28:34PM +0100, Matthew Wilcox wrote:
> On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
> > Per my understanding, prefetchw() is trying to load data to cache before we
> > really accessing it. By doing so, we won't hit a cache miss when we really
> > need it.
> 
> Yes, but the CPU can also do this by itself without needing an explicit
> hint from software.  It can notice that we have a loop that's accessing
> successive cachelines for write.  This is approximately the easiest
> prefetcher to design.

I tracked down prefetchw() being added:

commit 3b901ea58a56
Author: Josh Aas <josha@sgi.com>
Date:   Mon Aug 23 21:26:54 2004 -0700

    [PATCH] improve speed of freeing bootmem

    Attached is a patch that greatly improves the speed of freeing boot memory.
     On ia64 machines with 2GB or more memory (I didn't test with less, but I
    can't imagine there being a problem), the speed improvement is about 75%
    for the function free_all_bootmem_core.  This translates to savings on the
    order of 1 minute / TB of memory during boot time.  That number comes from
    testing on a machine with 512GB, and extrapolating based on profiling of an
    unpatched 4TB machine.  For 4 and 8 TB machines, the time spent in this
    function is about 1 minutes/TB, which is painful especially given that
    there is no indication of what is going on put to the console (this issue
    to possibly be addressed later).

    The basic idea is to free higher order pages instead of going through every
    single one.  Also, some unnecessary atomic operations are done away with
    and replaced with non-atomic equivalents, and prefetching is done where it
    helps the most.  For a more in-depth discusion of this patch, please see
    the linux-ia64 archives (topic is "free bootmem feedback patch").

(quoting the entire commit message because it's buried in linux-fullhistory,
being a pre-git patch).  For the thread he's referring to, see
https://lore.kernel.org/linux-ia64/40F46962.4090604@sgi.com/

Itanium CPUs of this era had no prefetchers.
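
Tangentially, for anyone wondering what "free higher order pages instead of
going through every single one" from that 2004 commit means in practice: the
rough idea is to carve a pfn range into the largest aligned power-of-two
chunks and free each chunk with a single call. A standalone sketch of that
chunking follows; it is illustrative only, not the kernel's actual code, and
MAX_ORDER here is an assumed value.

    #include <stdio.h>

    #define MAX_ORDER 10    /* assumption for the sketch */

    /*
     * Free the range [start, end) in the largest aligned power-of-two
     * chunks possible, instead of one page at a time.
     */
    static void free_range(unsigned long start, unsigned long end)
    {
            while (start < end) {
                    /* Largest order allowed by the alignment of 'start'... */
                    int order = start ? __builtin_ctzl(start) : MAX_ORDER;

                    if (order > MAX_ORDER)
                            order = MAX_ORDER;
                    /* ...and by the number of pages left in the range. */
                    while (start + (1UL << order) > end)
                            order--;

                    printf("free pfn %lu, order %d\n", start, order);
                    start += 1UL << order;
            }
    }

    int main(void)
    {
            free_range(5, 1000);    /* example: an unaligned range */
            return 0;
    }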
Wei Yang June 30, 2024, 7:30 a.m. UTC | #6
On Sat, Jun 29, 2024 at 05:45:53PM +0100, Matthew Wilcox wrote:
>On Sat, Jun 29, 2024 at 05:28:34PM +0100, Matthew Wilcox wrote:
>> On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
>> > Per my understanding, prefetchw() is trying to load data to cache before we
>> > really accessing it. By doing so, we won't hit a cache miss when we really
>> > need it.
>> 
>> Yes, but the CPU can also do this by itself without needing an explicit
>> hint from software.  It can notice that we have a loop that's accessing
>> successive cachelines for write.  This is approximately the easiest
>> prefetcher to design.

This was my first intuition on reading the code too, but I *believed* there was
a reason for the prefetchw() to be put here.

>
>I tracked down prefetchw() being added:
>
>commit 3b901ea58a56
>Author: Josh Aas <josha@sgi.com>
>Date:   Mon Aug 23 21:26:54 2004 -0700
>
>    [PATCH] improve speed of freeing bootmem
>
>    Attached is a patch that greatly improves the speed of freeing boot memory.
>     On ia64 machines with 2GB or more memory (I didn't test with less, but I
>    can't imagine there being a problem), the speed improvement is about 75%
>    for the function free_all_bootmem_core.  This translates to savings on the
>    order of 1 minute / TB of memory during boot time.  That number comes from
>    testing on a machine with 512GB, and extrapolating based on profiling of an
>    unpatched 4TB machine.  For 4 and 8 TB machines, the time spent in this
>    function is about 1 minutes/TB, which is painful especially given that
>    there is no indication of what is going on put to the console (this issue
>    to possibly be addressed later).
>
>    The basic idea is to free higher order pages instead of going through every
>    single one.  Also, some unnecessary atomic operations are done away with
>    and replaced with non-atomic equivalents, and prefetching is done where it
>    helps the most.  For a more in-depth discusion of this patch, please see
>    the linux-ia64 archives (topic is "free bootmem feedback patch").
>
>(quoting the entire commit message because it's buried in linux-fullhistory,
>being a pre-git patch).  For the thread he's referring to, see
>https://lore.kernel.org/linux-ia64/40F46962.4090604@sgi.com/
>

Thanks for digging up the ancient history. Now I know about the linux-fullhistory
tree :-)

>Itanium CPUs of this era had no prefetchers.

Oops, so sad to hear it.

So it looks like the most helpful changes were:

  * freeing higher-order pages
  * replacing atomic operations with non-atomic equivalents (see the
    sketch at the end of this message)

A quick test on my 6G qemu virtual machine shows the prefetchw() here doesn't
seem to help. After removing it, the meminit process is even a little faster:
from 135ms to 121ms, summed over 3 boot-up runs.

If you think it is OK, I will wrap up a patch on top of current mm-stable for a
wider audience to review.
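
On the "non-atomic equivalent" bullet above: as I understand the page-flags
convention, ClearPageReserved() is the atomic clear_bit() flavour while
__ClearPageReserved() is the non-atomic __clear_bit() flavour, which is safe
when nothing else can touch the page yet. Below is a standalone sketch of the
difference between the two kinds of bit clear; it is illustrative only, not
kernel code, and the bit number is arbitrary.

    #include <stdio.h>

    #define PG_reserved 2   /* arbitrary bit number for the sketch */

    /*
     * Atomic clear: a locked read-modify-write, safe against concurrent
     * updates of other bits in the same word (roughly what
     * ClearPageReserved() boils down to, as I understand it).
     */
    static void clear_bit_atomic(unsigned long *flags, int bit)
    {
            __atomic_fetch_and(flags, ~(1UL << bit), __ATOMIC_SEQ_CST);
    }

    /*
     * Non-atomic clear: a plain load/store pair, fine when nobody else can
     * touch the word -- e.g. pages not yet handed to the buddy allocator
     * (the __ClearPageReserved() flavour).
     */
    static void clear_bit_nonatomic(unsigned long *flags, int bit)
    {
            *flags &= ~(1UL << bit);
    }

    int main(void)
    {
            unsigned long flags = 1UL << PG_reserved;

            clear_bit_nonatomic(&flags, PG_reserved);
            clear_bit_atomic(&flags, PG_reserved);  /* already clear; harmless */
            printf("flags: %#lx\n", flags);
            return 0;
    }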

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51a47db375b6..bc7316744a34 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1232,10 +1232,8 @@  void __meminit __free_pages_core(struct page *page, unsigned int order)
 	prefetchw(p);
 	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
 		prefetchw(p + 1);
-		__ClearPageReserved(p);
 		set_page_count(p, 0);
 	}
-	__ClearPageReserved(p);
 	set_page_count(p, 0);
 
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
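
Not part of this patch, but to make the thread's side discussion concrete: if
the prefetchw() hints were also dropped, as proposed above, the remaining code
might collapse to something like the following. This is a hypothetical sketch
only, not a posted patch.

    /*
     * Hypothetical follow-up sketch (not a posted patch): with the
     * prefetchw() hints gone there is no need to special-case the
     * last page, so the whole thing becomes a single plain loop.
     */
    for (loop = 0; loop < nr_pages; loop++, p++)
            set_page_count(p, 0);

    atomic_long_add(nr_pages, &page_zone(page)->managed_pages);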