Message ID | 20240629013322.12364-4-richard.weiyang@gmail.com (mailing list archive) |
State | New |
Series | [1/4] mm: use zonelist_zone() to get zone
On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
> +++ b/mm/page_alloc.c
> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>  	prefetchw(p);
>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>  		prefetchw(p + 1);
> -		__ClearPageReserved(p);
>  		set_page_count(p, 0);
>  	}
> -	__ClearPageReserved(p);
>  	set_page_count(p, 0);

Is the prefetchw() still useful?  Any remotely competent CPU should
be able to figure out this loop ...
On Sat, Jun 29, 2024 at 04:21:59AM +0100, Matthew Wilcox wrote:
>On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
>> +++ b/mm/page_alloc.c
>> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>>  	prefetchw(p);
>>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>  		prefetchw(p + 1);
>> -		__ClearPageReserved(p);
>>  		set_page_count(p, 0);
>>  	}
>> -	__ClearPageReserved(p);
>>  	set_page_count(p, 0);
>
>Is the prefetchw() still useful?  Any remotely competent CPU should
>be able to figure out this loop ...

Hi, Matthew

Thanks for your question, but to be honest I don't fully understand it.

Per my understanding, prefetchw() tries to load data into the cache
before we actually access it. By doing so, we won't hit a cache miss
when we really need it. This looks useful to me. How does a "remotely
competent CPU" come into this?
On Sat, Jun 29, 2024 at 08:28:30AM +0200, David Hildenbrand wrote:
>On 29.06.24 08:25, David Hildenbrand wrote:
>> On 29.06.24 03:33, Wei Yang wrote:
>> > Function __free_pages_core() is only used in the following two cases to
>> > put pages into the buddy system:
>> >
>> >   * free bootmem
>> >   * free hot-added memory
>> >
>> > After the above cleanup, there is no case left that frees a page with
>> > PG_reserved set. Let's remove the clear operation.
>> >
>> > Page initialization is 6.5% faster with a 6G qemu virtual machine.
>> >
>> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> > CC: David Hildenbrand <david@redhat.com>
>> > CC: Mike Rapoport (IBM) <rppt@kernel.org>
>> > ---
>> >  mm/page_alloc.c | 2 --
>> >  1 file changed, 2 deletions(-)
>> >
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 51a47db375b6..bc7316744a34 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
>> >  	prefetchw(p);
>> >  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>> >  		prefetchw(p + 1);
>> > -		__ClearPageReserved(p);
>> >  		set_page_count(p, 0);
>> >  	}
>> > -	__ClearPageReserved(p);
>> >  	set_page_count(p, 0);
>> >  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>>
>> Again see mm/mm-stable where that code changed.
>>
>> I think we can still get reserved pages here, for example via
>> kmsan_memblock_free_pages().
>

You mean ZONE_DEVICE pages? If yes, I think the normal
memblock_free_pages() could still get reserved pages. Am I right?

>
>Sorry, I meant kmsan_memblock_discard().

Yep.

>--
>Cheers,
>
>David / dhildenb
On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
> On Sat, Jun 29, 2024 at 04:21:59AM +0100, Matthew Wilcox wrote:
> >On Sat, Jun 29, 2024 at 01:33:22AM +0000, Wei Yang wrote:
> >> +++ b/mm/page_alloc.c
> >> @@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
> >>  	prefetchw(p);
> >>  	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> >>  		prefetchw(p + 1);
> >> -		__ClearPageReserved(p);
> >>  		set_page_count(p, 0);
> >>  	}
> >> -	__ClearPageReserved(p);
> >>  	set_page_count(p, 0);
> >
> >Is the prefetchw() still useful?  Any remotely competent CPU should
> >be able to figure out this loop ...
>
> Hi, Matthew
>
> Thanks for your question, but to be honest I don't fully understand it.

Let's try this question: if you remove the prefetchw() line, does the
performance change?

> Per my understanding, prefetchw() tries to load data into the cache
> before we actually access it. By doing so, we won't hit a cache miss
> when we really need it.

Yes, but the CPU can also do this by itself without needing an explicit
hint from software.  It can notice that we have a loop that's accessing
successive cachelines for write.  This is approximately the easiest
prefetcher to design.
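For concreteness, here is a rough userspace sketch of the pattern being
discussed (the struct and function names are made up for illustration;
__builtin_prefetch(addr, 1) is the compiler's prefetch-for-write hint,
roughly what a generic prefetchw() reduces to when an architecture has no
dedicated instruction for it):

#include <stddef.h>

struct fake_page {
	unsigned long flags;
	int count;
};

static void init_pages(struct fake_page *p, size_t nr)
{
	size_t i;

	__builtin_prefetch(p, 1);	/* software hint: fetch line for write */
	for (i = 0; i < nr - 1; i++, p++) {
		/*
		 * A hardware stride prefetcher will typically notice this
		 * sequential write pattern on its own, which is the point
		 * being made above.
		 */
		__builtin_prefetch(p + 1, 1);
		p->count = 0;
	}
	p->count = 0;
}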
On Sat, Jun 29, 2024 at 05:28:34PM +0100, Matthew Wilcox wrote:
> On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
> > Per my understanding, prefetchw() tries to load data into the cache
> > before we actually access it. By doing so, we won't hit a cache miss
> > when we really need it.
>
> Yes, but the CPU can also do this by itself without needing an explicit
> hint from software.  It can notice that we have a loop that's accessing
> successive cachelines for write.  This is approximately the easiest
> prefetcher to design.

I tracked down prefetchw() being added:

commit 3b901ea58a56
Author: Josh Aas <josha@sgi.com>
Date:   Mon Aug 23 21:26:54 2004 -0700

    [PATCH] improve speed of freeing bootmem

    Attached is a patch that greatly improves the speed of freeing boot memory.
    On ia64 machines with 2GB or more memory (I didn't test with less, but I
    can't imagine there being a problem), the speed improvement is about 75%
    for the function free_all_bootmem_core.  This translates to savings on the
    order of 1 minute / TB of memory during boot time.  That number comes from
    testing on a machine with 512GB, and extrapolating based on profiling of an
    unpatched 4TB machine.  For 4 and 8 TB machines, the time spent in this
    function is about 1 minutes/TB, which is painful especially given that
    there is no indication of what is going on put to the console (this issue
    to possibly be addressed later).

    The basic idea is to free higher order pages instead of going through every
    single one.  Also, some unnecessary atomic operations are done away with
    and replaced with non-atomic equivalents, and prefetching is done where it
    helps the most.  For a more in-depth discusion of this patch, please see
    the linux-ia64 archives (topic is "free bootmem feedback patch").

(quoting the entire commit message because it's buried in linux-fullhistory,
being a pre-git patch).  For the thread he's referring to, see
https://lore.kernel.org/linux-ia64/40F46962.4090604@sgi.com/

Itanium CPUs of this era had no prefetchers.
On Sat, Jun 29, 2024 at 05:45:53PM +0100, Matthew Wilcox wrote:
>On Sat, Jun 29, 2024 at 05:28:34PM +0100, Matthew Wilcox wrote:
>> On Sat, Jun 29, 2024 at 08:44:11AM +0000, Wei Yang wrote:
>> > Per my understanding, prefetchw() tries to load data into the cache
>> > before we actually access it. By doing so, we won't hit a cache miss
>> > when we really need it.
>>
>> Yes, but the CPU can also do this by itself without needing an explicit
>> hint from software.  It can notice that we have a loop that's accessing
>> successive cachelines for write.  This is approximately the easiest
>> prefetcher to design.

This was my first intuition on reading the code too, but I *believed*
there was a reason for the prefetchw() to be here.

>
>I tracked down prefetchw() being added:
>
>commit 3b901ea58a56
>Author: Josh Aas <josha@sgi.com>
>Date:   Mon Aug 23 21:26:54 2004 -0700
>
>    [PATCH] improve speed of freeing bootmem
>
>    Attached is a patch that greatly improves the speed of freeing boot memory.
>    On ia64 machines with 2GB or more memory (I didn't test with less, but I
>    can't imagine there being a problem), the speed improvement is about 75%
>    for the function free_all_bootmem_core.  This translates to savings on the
>    order of 1 minute / TB of memory during boot time.  That number comes from
>    testing on a machine with 512GB, and extrapolating based on profiling of an
>    unpatched 4TB machine.  For 4 and 8 TB machines, the time spent in this
>    function is about 1 minutes/TB, which is painful especially given that
>    there is no indication of what is going on put to the console (this issue
>    to possibly be addressed later).
>
>    The basic idea is to free higher order pages instead of going through every
>    single one.  Also, some unnecessary atomic operations are done away with
>    and replaced with non-atomic equivalents, and prefetching is done where it
>    helps the most.  For a more in-depth discusion of this patch, please see
>    the linux-ia64 archives (topic is "free bootmem feedback patch").
>
>(quoting the entire commit message because it's buried in linux-fullhistory,
>being a pre-git patch).  For the thread he's referring to, see
>https://lore.kernel.org/linux-ia64/40F46962.4090604@sgi.com/
>

Thanks for digging up the ancient history. And now I know about the
linux-fullhistory tree :-)

>Itanium CPUs of this era had no prefetchers.

Oops, sad to hear it. So it looks like the most helpful changes were:

  * freeing higher-order pages
  * the non-atomic equivalents

A quick test on my 6G qemu virtual machine shows the prefetchw() here
doesn't seem to help. After removing it, the meminit stage is even a
little faster: from 135ms to 121ms, summed over 3 boots.

If you think it is ok, I will wrap up a patch on current mm-stable for a
wider audience to review.
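For reference, the variant being timed above would look roughly like this
(a sketch reconstructed from the hunk quoted earlier in the thread, with
both prefetchw() calls and __ClearPageReserved() dropped; it is not copied
from any tree):

	for (loop = 0; loop < (nr_pages - 1); loop++, p++)
		set_page_count(p, 0);
	set_page_count(p, 0);

	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);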
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51a47db375b6..bc7316744a34 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1232,10 +1232,8 @@ void __meminit __free_pages_core(struct page *page, unsigned int order)
 	prefetchw(p);
 	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
 		prefetchw(p + 1);
-		__ClearPageReserved(p);
 		set_page_count(p, 0);
 	}
-	__ClearPageReserved(p);
 	set_page_count(p, 0);
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
Function __free_pages_core() is only used in the following two cases to
put pages into the buddy system:

  * free bootmem
  * free hot-added memory

After the above cleanup, there is no case left that frees a page with
PG_reserved set. Let's remove the clear operation.

Page initialization is 6.5% faster with a 6G qemu virtual machine.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: David Hildenbrand <david@redhat.com>
CC: Mike Rapoport (IBM) <rppt@kernel.org>
---
 mm/page_alloc.c | 2 --
 1 file changed, 2 deletions(-)
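With the hunk above applied, the initialization loop reads roughly as
follows (context reconstructed from the diff, not copied from a tree; the
prefetchw() question discussed in the thread is a separate, follow-up
change):

	prefetchw(p);
	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
		prefetchw(p + 1);
		set_page_count(p, 0);
	}
	set_page_count(p, 0);
	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);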