
mm, proc: collect percpu free pages into the free pages

Message ID 20240830014453.3070909-1-mawupeng1@huawei.com (mailing list archive)
State New
Series mm, proc: collect percpu free pages into the free pages

Commit Message

mawupeng Aug. 30, 2024, 1:44 a.m. UTC
From: Ma Wupeng <mawupeng1@huawei.com>

The introduction of Per-CPU-Pageset (PCP) per zone aims to enhance the
performance of the page allocator by enabling page allocation without
requiring the zone lock. This kind of memory is free memory; however, it
is not included in MemFree or MemAvailable.

With the support of high-order PCP and PCP auto-tuning, the amount of
memory held on these lists has become a matter of concern due to the
following patches:

  1. Introduction of Order 1~3 and PMD level PCP in commit 44042b449872
  ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
  lists").
  2. Introduction of PCP auto-tuning in commit 90b41691b988 ("mm: add
  framework for PCP high auto-tuning").

As a result, the total amount of PCP pages can no longer be ignored, even
just after booting with no real tasks running, as the results below show:

		   w/o patch	  with patch	      diff	diff/total
MemTotal:	525424652 kB	525424652 kB	      0 kB	        0%
MemFree:	517030396 kB	520134136 kB	3103740 kB	      0.6%
MemAvailable:	515837152 kB	518941080 kB	3103928 kB	      0.6%

On a machine with 16 zones and 600+ CPUs, prior to these commits, the PCP
list contained 274368 pages (1097M) immediately after booting. In the
mainline, this number has increased to 3003M, marking a 173% increase.

Since available memory is used by numerous services to determine memory
pressure, a substantial PCP memory volume leads to an inaccurate estimate
of available memory, significantly impacting service logic.
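
For illustration, a minimal sketch of the kind of service logic affected
(userspace C; the 10% threshold and the alert action are hypothetical,
not taken from any real service):

/* Sketch: derive "memory pressure" from /proc/meminfo the way many
 * monitoring services do. With several GB of free pages parked on PCP
 * lists, MemAvailable underestimates reclaimable memory and logic like
 * this triggers too early. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	unsigned long total = 0, avail = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "MemTotal: %lu kB", &total);
		sscanf(line, "MemAvailable: %lu kB", &avail);
	}
	fclose(f);
	/* hypothetical policy: report pressure below 10% available */
	if (total && avail * 100 / total < 10)
		printf("low memory: %lu kB available\n", avail);
	return 0;
}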

Remove the useless CONFIG_HIGHMEM #ifdef in si_meminfo_node(), since
is_highmem_idx() always returns false when the config is not enabled.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/show_mem.c | 46 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 14 deletions(-)

Comments

Huang, Ying Aug. 30, 2024, 7:53 a.m. UTC | #1
Hi, Wupeng,

Wupeng Ma <mawupeng1@huawei.com> writes:

> From: Ma Wupeng <mawupeng1@huawei.com>
>
> The introduction of Per-CPU-Pageset (PCP) per zone aims to enhance the
> performance of the page allocator by enabling page allocation without
> requiring the zone lock. This kind of memory is free memory; however, it
> is not included in MemFree or MemAvailable.
>
> With the support of high-order PCP and PCP auto-tuning, the amount of
> memory held on these lists has become a matter of concern due to the
> following patches:
>
>   1. Introduction of Order 1~3 and PMD level PCP in commit 44042b449872
>   ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>   lists").
>   2. Introduction of PCP auto-tuning in commit 90b41691b988 ("mm: add
>   framework for PCP high auto-tuning").

With PCP auto-tuning, the idle pages in the PCP will be freed to buddy
after some time (which may be as long as tens of seconds in some cases).
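
As a toy model of that gradual behavior (userspace C, not the kernel
implementation; the 1/8 decay step, the 64-page floor, and the 4096-page
starting value are illustrative assumptions):

/* Model: an idle CPU's PCP high mark shrinks step by step across
 * intervals, so cached pages trickle back to buddy instead of being
 * freed all at once. */
#include <stdio.h>

int main(void)
{
	unsigned int high = 4096;	/* assumed tuned-up high mark */
	const unsigned int min_high = 64;	/* assumed floor */
	int interval = 0;

	while (high > min_high) {
		high -= high / 8;	/* assumed per-interval decay */
		if (high < min_high)
			high = min_high;
		printf("interval %2d: pcp high -> %u pages\n",
		       ++interval, high);
	}
	return 0;
}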

> As a result, the total amount of PCP pages can no longer be ignored, even
> just after booting with no real tasks running, as the results below show:
>
> 		   w/o patch	  with patch	      diff	diff/total
> MemTotal:	525424652 kB	525424652 kB	      0 kB	        0%
> MemFree:	517030396 kB	520134136 kB	3103740 kB	      0.6%
> MemAvailable:	515837152 kB	518941080 kB	3103928 kB	      0.6%
>
> On a machine with 16 zones and 600+ CPUs, prior to these commits, the PCP
> list contained 274368 pages (1097M) immediately after booting. In the
> mainline, this number has increased to 3003M, marking a 173% increase.
>
> Since available memory is used by numerous services to determine memory
> pressure, a substantial PCP memory volume leads to an inaccurate estimate
> of available memory, significantly impacting service logic.
>
> Remove the useless CONFIG_HIGHMEM #ifdef in si_meminfo_node(), since
> is_highmem_idx() always returns false when the config is not enabled.
>
> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>

This has been discussed before in the thread of the previous version,
better to refer to it and summarize it.

[1] https://lore.kernel.org/linux-mm/YwSGqtEICW5AlhWr@dhcp22.suse.cz/

--
Best Regards,
Huang, Ying
mawupeng Sept. 2, 2024, 1:11 a.m. UTC | #2
On 2024/8/30 15:53, Huang, Ying wrote:
> Hi, Wupeng,
> 
> Wupeng Ma <mawupeng1@huawei.com> writes:
> 
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> The introduction of Per-CPU-Pageset (PCP) per zone aims to enhance the
>> performance of the page allocator by enabling page allocation without
>> requiring the zone lock. This kind of memory is free memory; however, it
>> is not included in MemFree or MemAvailable.
>>
>> With the support of high-order PCP and PCP auto-tuning, the amount of
>> memory held on these lists has become a matter of concern due to the
>> following patches:
>>
>>   1. Introduction of Order 1~3 and PMD level PCP in commit 44042b449872
>>   ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>   lists").
>>   2. Introduction of PCP auto-tuning in commit 90b41691b988 ("mm: add
>>   framework for PCP high auto-tuning").
> 
> With PCP auto-tuning, the idle pages in the PCP will be freed to buddy
> after some time (which may be as long as tens of seconds in some cases).

Thank you for the detailed explanation regarding PCP auto-tuning. If the
PCP pages are freed to the buddy after a certain period due to auto-tuning,
there may be no direct association between PCP auto-tuning and the
increase in the PCP count indicated below, especially since no actual
tasks have run after booting. The primary reason for the increase is more
likely the support for higher orders and the large number of CPUs.

> 
>> As a result, the total amount of PCP pages can no longer be ignored, even
>> just after booting with no real tasks running, as the results below show:
>>
>> 		   w/o patch	  with patch	      diff	diff/total
>> MemTotal:	525424652 kB	525424652 kB	      0 kB	        0%
>> MemFree:	517030396 kB	520134136 kB	3103740 kB	      0.6%
>> MemAvailable:	515837152 kB	518941080 kB	3103928 kB	      0.6%

We ran the following experiment, which makes the PCP amount even bigger:
1. allocate 8G of memory on each of the 600+ CPUs
2. kill all of the above user tasks
3. wait for 36h

The PCP amount is then 6161097 pages (24644M), which is 4.6% of the
total 512G memory.
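
For reference, a rough reproduction sketch of the above experiment
(assumptions: one worker per online CPU, plain malloc, no CPU pinning or
NUMA placement):

/* Step 1: every worker faults in 8G; step 2: workers exit, so their
 * pages are freed back through the per-CPU lists; step 3 is just
 * waiting and watching /proc/meminfo. */
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);
	size_t size = 8UL << 30;	/* 8G per worker */

	for (long i = 0; i < cpus; i++) {
		if (fork() == 0) {
			char *buf = malloc(size);

			if (buf)
				memset(buf, 1, size);	/* fault pages in */
			_exit(0);	/* frees everything back via PCP */
		}
	}
	while (wait(NULL) > 0)
		;	/* all workers gone; now wait and observe */
	return 0;
}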


>>
>> On a machine with 16 zones and 600+ CPUs, prior to these commits, the PCP
>> list contained 274368 pages (1097M) immediately after booting. In the
>> mainline, this number has increased to 3003M, marking a 173% increase.
>>
>> Since available memory is used by numerous services to determine memory
>> pressure, a substantial PCP memory volume leads to an inaccurate estimate
>> of available memory, significantly impacting service logic.
>>
>> Remove the useless CONFIG_HIGHMEM #ifdef in si_meminfo_node(), since
>> is_highmem_idx() always returns false when the config is not enabled.
>>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> 
> This has been discussed before in the thread of the previous version,
> better to refer to it and summarize it.
> 
> [1] https://lore.kernel.org/linux-mm/YwSGqtEICW5AlhWr@dhcp22.suse.cz/

As Michal Hocko asked in the previous discussion:
 1. Is this a real problem?
 2. MemAvailable is documented as memory available without swapping;
    however, reclaiming PCP pages requires a drain.

1. Since available memory is used by numerous services to determine memory
pressure, a substantial PCP memory volume leads to an inaccurate estimate
of available memory, significantly impacting service logic.
2. MemAvailable does seem weird here. There is no reason to drain the PCP
just to drop clean page cache; as Michal Hocko already pointed out in this
post, dropping clean page cache is much cheaper than draining a remote
PCP. Any idea on this?

[1] https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/
> 
> --
> Best Regards,
> Huang, Ying
>
Huang, Ying Sept. 2, 2024, 1:29 a.m. UTC | #3
mawupeng <mawupeng1@huawei.com> writes:

> On 2024/8/30 15:53, Huang, Ying wrote:
>> Hi, Wupeng,
>> 
>> Wupeng Ma <mawupeng1@huawei.com> writes:
>> 
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> The introduction of Per-CPU-Pageset (PCP) per zone aims to enhance the
>>> performance of the page allocator by enabling page allocation without
>>> requiring the zone lock. This kind of memory is free memory; however, it
>>> is not included in MemFree or MemAvailable.
>>>
>>> With the support of high-order PCP and PCP auto-tuning, the amount of
>>> memory held on these lists has become a matter of concern due to the
>>> following patches:
>>>
>>>   1. Introduction of Order 1~3 and PMD level PCP in commit 44042b449872
>>>   ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>>   lists").
>>>   2. Introduction of PCP auto-tuning in commit 90b41691b988 ("mm: add
>>>   framework for PCP high auto-tuning").
>> 
>> With PCP auto-tuning, the idle pages in the PCP will be freed to buddy
>> after some time (which may be as long as tens of seconds in some cases).
>
> Thank you for the detailed explanation regarding PCP auto-tuning. If the
> PCP pages are freed to the buddy after a certain period due to auto-tuning,
> there may be no direct association between PCP auto-tuning and the
> increase in the PCP count indicated below, especially since no actual
> tasks have run after booting. The primary reason for the increase is more
> likely the support for higher orders and the large number of CPUs.
>
>> 
>>> As a result, the total amount of PCP pages can no longer be ignored, even
>>> just after booting with no real tasks running, as the results below show:
>>>
>>> 		   w/o patch	  with patch	      diff	diff/total
>>> MemTotal:	525424652 kB	525424652 kB	      0 kB	        0%
>>> MemFree:	517030396 kB	520134136 kB	3103740 kB	      0.6%
>>> MemAvailable:	515837152 kB	518941080 kB	3103928 kB	      0.6%
>
> We ran the following experiment, which makes the PCP amount even bigger:
> 1. allocate 8G of memory on each of the 600+ CPUs
> 2. kill all of the above user tasks
> 3. wait for 36h
>
> The PCP amount is then 6161097 pages (24644M), which is 4.6% of the
> total 512G memory.
>
>
>>>
>>> On a machine with 16 zones and 600+ CPUs, prior to these commits, the PCP
>>> list contained 274368 pages (1097M) immediately after booting. In the
>>> mainline, this number has increased to 3003M, marking a 173% increase.
>>>
>>> Since available memory is used by numerous services to determine memory
>>> pressure, a substantial PCP memory volume leads to an inaccurate estimate
>>> of available memory, significantly impacting service logic.
>>>
>>> Remove the useless CONFIG_HIGHMEM #ifdef in si_meminfo_node(), since
>>> is_highmem_idx() always returns false when the config is not enabled.
>>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>> 
>> This has been discussed before in the thread of the previous version,
>> better to refer to it and summarize it.
>> 
>> [1] https://lore.kernel.org/linux-mm/YwSGqtEICW5AlhWr@dhcp22.suse.cz/
>
> As Michal Hocko asked in the previous discussion:
>  1. Is this a real problem?
>  2. MemAvailable is documented as memory available without swapping;
>     however, reclaiming PCP pages requires a drain.
>
> 1. Since available memory is used by numerous services to determine memory
> pressure, a substantial PCP memory volume leads to an inaccurate estimate
> of available memory, significantly impacting service logic.
> 2. MemAvailable does seem weird here. There is no reason to drain the PCP
> just to drop clean page cache; as Michal Hocko already pointed out in this
> post, dropping clean page cache is much cheaper than draining a remote
> PCP. Any idea on this?

Draining a remote PCP may not be that expensive now, after commit
4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
IPI is needed to drain the remote PCP.
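
A schematic model of that pattern (plain pthreads in userspace, not the
kernel's actual types; pcp_list and drain_remote are made-up names for
illustration):

/* After the commit above, each per-CPU page list has its own spinlock,
 * so any CPU can lock and flush a remote CPU's list directly instead of
 * sending an IPI and asking that CPU to drain itself. */
#include <pthread.h>
#include <stdio.h>

#define NCPU 4

struct pcp_list {
	pthread_spinlock_t lock;	/* stands in for pcp->lock */
	int count;			/* cached free pages */
};

static struct pcp_list pcp[NCPU];

static int drain_remote(int cpu)	/* callable from any CPU */
{
	int freed;

	pthread_spin_lock(&pcp[cpu].lock);
	freed = pcp[cpu].count;
	pcp[cpu].count = 0;		/* pages would return to buddy here */
	pthread_spin_unlock(&pcp[cpu].lock);
	return freed;
}

int main(void)
{
	int total = 0;

	for (int i = 0; i < NCPU; i++) {
		pthread_spin_init(&pcp[i].lock, PTHREAD_PROCESS_PRIVATE);
		pcp[i].count = 100 + i;
	}
	for (int i = 0; i < NCPU; i++)
		total += drain_remote(i);
	printf("drained %d pages, no IPI needed\n", total);
	return 0;
}

The point is only the locking structure: the drain no longer depends on
the owning CPU running anything.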

> [1] https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/

--
Best Regards,
Huang, Ying
mawupeng Sept. 3, 2024, 1:50 a.m. UTC | #4
On 2024/9/2 9:29, Huang, Ying wrote:
> mawupeng <mawupeng1@huawei.com> writes:
> 
>> On 2024/8/30 15:53, Huang, Ying wrote:
>>> Hi, Wupeng,
>>>
>>> Wupeng Ma <mawupeng1@huawei.com> writes:
>>>
>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>
>>>> The introduction of Per-CPU-Pageset (PCP) per zone aims to enhance the
>>>> performance of the page allocator by enabling page allocation without
>>>> requiring the zone lock. This kind of memory is free memory; however, it
>>>> is not included in MemFree or MemAvailable.
>>>>
>>>> With the support of high-order PCP and PCP auto-tuning, the amount of
>>>> memory held on these lists has become a matter of concern due to the
>>>> following patches:
>>>>
>>>>   1. Introduction of Order 1~3 and PMD level PCP in commit 44042b449872
>>>>   ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>>>   lists").
>>>>   2. Introduction of PCP auto-tuning in commit 90b41691b988 ("mm: add
>>>>   framework for PCP high auto-tuning").
>>>
>>> With PCP auto-tuning, the idle pages in the PCP will be freed to buddy
>>> after some time (which may be as long as tens of seconds in some cases).
>>
>> Thank you for the detailed explanation regarding PCP auto-tuning. If the
>> PCP pages are freed to the buddy after a certain period due to auto-tuning,
>> there may be no direct association between PCP auto-tuning and the
>> increase in the PCP count indicated below, especially since no actual
>> tasks have run after booting. The primary reason for the increase is more
>> likely the support for higher orders and the large number of CPUs.
>>
>>>
>>>> As a result, the total amount of PCP pages can no longer be ignored, even
>>>> just after booting with no real tasks running, as the results below show:
>>>>
>>>> 		   w/o patch	  with patch	      diff	diff/total
>>>> MemTotal:	525424652 kB	525424652 kB	      0 kB	        0%
>>>> MemFree:	517030396 kB	520134136 kB	3103740 kB	      0.6%
>>>> MemAvailable:	515837152 kB	518941080 kB	3103928 kB	      0.6%
>>
>> We ran the following experiment, which makes the PCP amount even bigger:
>> 1. allocate 8G of memory on each of the 600+ CPUs
>> 2. kill all of the above user tasks
>> 3. wait for 36h
>>
>> The PCP amount is then 6161097 pages (24644M), which is 4.6% of the
>> total 512G memory.
>>
>>
>>>>
>>>> On a machine with 16 zones and 600+ CPUs, prior to these commits, the PCP
>>>> list contained 274368 pages (1097M) immediately after booting. In the
>>>> mainline, this number has increased to 3003M, marking a 173% increase.
>>>>
>>>> Since available memory is used by numerous services to determine memory
>>>> pressure, a substantial PCP memory volume leads to an inaccurate estimate
>>>> of available memory, significantly impacting service logic.
>>>>
>>>> Remove the useless CONFIG_HIGHMEM #ifdef in si_meminfo_node(), since
>>>> is_highmem_idx() always returns false when the config is not enabled.
>>>>
>>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>>>
>>> This has been discussed before in the thread of the previous version,
>>> better to refer to it and summarize it.
>>>
>>> [1] https://lore.kernel.org/linux-mm/YwSGqtEICW5AlhWr@dhcp22.suse.cz/
>>
>> As Michal Hocko asked in the previous discussion:
>>  1. Is this a real problem?
>>  2. MemAvailable is documented as memory available without swapping;
>>     however, reclaiming PCP pages requires a drain.
>>
>> 1. Since available memory is used by numerous services to determine memory
>> pressure, a substantial PCP memory volume leads to an inaccurate estimate
>> of available memory, significantly impacting service logic.
>> 2. MemAvailable does seem weird here. There is no reason to drain the PCP
>> just to drop clean page cache; as Michal Hocko already pointed out in this
>> post, dropping clean page cache is much cheaper than draining a remote
>> PCP. Any idea on this?
> 
> Draining a remote PCP may not be that expensive now, after commit
> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
> IPI is needed to drain the remote PCP.

This looks really great; we could think of a way to drain the PCP before
falling into the slow path and swapping.

> 
>> [1] https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/
> 
> --
> Best Regards,
> Huang, Ying
>
Michal Hocko Sept. 3, 2024, 8:09 a.m. UTC | #5
On Tue 03-09-24 09:50:48, mawupeng wrote:
> > Draining a remote PCP may not be that expensive now, after commit
> > 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
> > IPI is needed to drain the remote PCP.
> 
> This looks really great; we could think of a way to drain the PCP before
> falling into the slow path and swapping.

We currently drain after the first unsuccessful direct reclaim run. Is that
insufficient? Should we do a less aggressive draining sooner? Ideally
restricted to CPUs on the same NUMA node, maybe? Do you have any specific
workloads that would benefit from this?
mawupeng Sept. 4, 2024, 6:49 a.m. UTC | #6
On 2024/9/3 16:09, Michal Hocko wrote:
> On Tue 03-09-24 09:50:48, mawupeng wrote:
>>> Draining a remote PCP may not be that expensive now, after commit
>>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
>>> IPI is needed to drain the remote PCP.
>>
>> This looks really great; we could think of a way to drain the PCP before
>> falling into the slow path and swapping.
> 
> We currently drain after the first unsuccessful direct reclaim run. Is
> that insufficient?

The reason I said the drain of the PCP is insufficient or expensive is
based on your comment[1] :-). IPIs are no longer required since commit
4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock"), so
this could be much better.

[1]: https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/

> Should we do a less aggressive draining sooner? Ideally
> restricted to CPUs on the same NUMA node, maybe? Do you have any specific
> workloads that would benefit from this?

Currently the problem is the amount of PCP pages, which can increase to
4.6% (24644M) of the total 512G memory.
Michal Hocko Sept. 4, 2024, 7:28 a.m. UTC | #7
On Wed 04-09-24 14:49:20, mawupeng wrote:
> 
> 
> On 2024/9/3 16:09, Michal Hocko wrote:
> > On Tue 03-09-24 09:50:48, mawupeng wrote:
> >>> Draining a remote PCP may not be that expensive now, after commit
> >>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
> >>> IPI is needed to drain the remote PCP.
> >>
> >> This looks really great; we could think of a way to drain the PCP before
> >> falling into the slow path and swapping.
> > 
> > We currently drain after the first unsuccessful direct reclaim run. Is
> > that insufficient?
> 
> The reason I said the drain of the PCP is insufficient or expensive is
> based on your comment[1] :-). IPIs are no longer required since commit
> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock"), so
> this could be much better.
> 
> [1]: https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/

There are other reasons I have mentioned in that reply which play a role
as well.

> > Should we do a less aggressive draining sooner? Ideally
> > restricted to CPUs on the same NUMA node, maybe? Do you have any specific
> > workloads that would benefit from this?
> 
> Currently the problem is the amount of PCP pages, which can increase to
> 4.6% (24644M) of the total 512G memory.

Why is that a problem? Just because some tools are miscalculating memory
pressure because they are based on MemAvailable? Or does this lead to
performance regressions on the kernel side? In other words, would the
same workload have behaved better if the amount of pcp-cache was reduced
without any userspace intervention?
mawupeng Sept. 10, 2024, 12:11 p.m. UTC | #8
On 2024/9/4 15:28, Michal Hocko wrote:
> On Wed 04-09-24 14:49:20, mawupeng wrote:
>>
>>
>> On 2024/9/3 16:09, Michal Hocko wrote:
>>> On Tue 03-09-24 09:50:48, mawupeng wrote:
>>>>> Draining a remote PCP may not be that expensive now, after commit
>>>>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
>>>>> IPI is needed to drain the remote PCP.
>>>>
>>>> This looks really great; we could think of a way to drain the PCP before
>>>> falling into the slow path and swapping.
>>>
>>> We currently drain after the first unsuccessful direct reclaim run. Is
>>> that insufficient?
>>
>> The reason I said the drain of the PCP is insufficient or expensive is
>> based on your comment[1] :-). IPIs are no longer required since commit
>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock"), so
>> this could be much better.
>>
>> [1]: https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/
> 
> There are other reasons I have mentioned in that reply which play a role
> as well.
> 
>>> Should we do a less aggressive draining sooner? Ideally
>>> restricted to CPUs on the same NUMA node, maybe? Do you have any specific
>>> workloads that would benefit from this?
>>
>> Currently the problem is the amount of PCP pages, which can increase to
>> 4.6% (24644M) of the total 512G memory.
> 
> Why is that a problem? 

MemAvailable
              An estimate of how much memory is available for starting new
              applications, without swapping. Calculated from MemFree,
              SReclaimable, the size of the file LRU lists, and the low
              watermarks in each zone.

The PCP memory is essentially available memory and will be reclaimed before OOM.
In essence, it is not fundamentally different from reclaiming file pages, as both
are reclaimed within __alloc_pages_direct_reclaim. Therefore, why shouldn't
it be included in MemAvailable, to avoid confusion?

__alloc_pages_direct_reclaim()
  __perform_reclaim()                  /* run direct reclaim */
  page = get_page_from_freelist(...)   /* retry the allocation */
  if (!page && !drained)
    drain_all_pages(NULL);             /* drain PCP lists, then retry once */


> Just because some tools are miscalculating memory
> pressure because they are based on MemAvailable? Or does this lead to
> performance regressions on the kernel side? In other words, would the
> same workload have behaved better if the amount of pcp-cache was reduced
> without any userspace intervention?
Michal Hocko Sept. 10, 2024, 1:11 p.m. UTC | #9
On Tue 10-09-24 20:11:36, mawupeng wrote:
> 
> 
> On 2024/9/4 15:28, Michal Hocko wrote:
> > On Wed 04-09-24 14:49:20, mawupeng wrote:
[...]
> >> Currently the problem is the amount of PCP pages, which can increase to
> >> 4.6% (24644M) of the total 512G memory.
> > 
> > Why is that a problem? 
> 
> MemAvailable
>               An estimate of how much memory is available for starting new
>               applications, without swapping. Calculated from MemFree,
>               SReclaimable, the size of the file LRU lists, and the low
>               watermarks in each zone.
> 
> The PCP memory is essentially available memory and will be reclaimed before OOM.
> In essence, it is not fundamentally different from reclaiming file pages, as both
> are reclaimed within __alloc_pages_direct_reclaim.

It is not _fundamentally_ different, but the bar that triggers the reclaim
is different (much higher). You might get into swapping while still
keeping the pcp cache available, for example.

MemAvailable has always been an estimate. Time has proven it not a great
one, as it is hard to set expectations around it. You are focusing on pcp
caches, but there are others that are not covered (e.g. the networking
stack can do a lot of caching on its own). MemAvailable will never be
perfect, and if you are hitting the limits of its usefulness I would
recommend finding a different way to achieve your goals (which are still
not really clear to me TBH).
Huang, Ying Sept. 11, 2024, 5:37 a.m. UTC | #10
mawupeng <mawupeng1@huawei.com> writes:

> On 2024/9/4 15:28, Michal Hocko wrote:
>> On Wed 04-09-24 14:49:20, mawupeng wrote:
>>>
>>>
>>> On 2024/9/3 16:09, Michal Hocko wrote:
>>>> On Tue 03-09-24 09:50:48, mawupeng wrote:
>>>>>> Draining a remote PCP may not be that expensive now, after commit
>>>>>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock").  No
>>>>>> IPI is needed to drain the remote PCP.
>>>>>
>>>>> This looks really great; we could think of a way to drain the PCP before
>>>>> falling into the slow path and swapping.
>>>>
>>>> We currently drain after the first unsuccessful direct reclaim run. Is
>>>> that insufficient?
>>>
>>> The reason I said the drain of the PCP is insufficient or expensive is
>>> based on your comment[1] :-). IPIs are no longer required since commit
>>> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock"), so
>>> this could be much better.
>>>
>>> [1]: https://lore.kernel.org/linux-mm/ZWRYZmulV0B-Jv3k@tiehlicka/
>> 
>> There are other reasons I have mentioned in that reply which play a role
>> as well.
>> 
>>>> Should we do a less aggressive draining sooner? Ideally
>>>> restricted to CPUs on the same NUMA node, maybe? Do you have any specific
>>>> workloads that would benefit from this?
>>>
>>> Currently the problem is the amount of PCP pages, which can increase to
>>> 4.6% (24644M) of the total 512G memory.
>> 
>> Why is that a problem? 
>
> MemAvailable
>               An estimate of how much memory is available for starting new
>               applications, without swapping. Calculated from MemFree,
>               SReclaimable, the size of the file LRU lists, and the low
>               watermarks in each zone.
>
> The PCP memory is essentially available memory and will be reclaimed before OOM.
> In essence, it is not fundamentally different from reclaiming file pages, as both
> are reclaimed within __alloc_pages_direct_reclaim. Therefore, why
> shouldn't it be included in MemAvailable, to avoid confusion?
>
> __alloc_pages_direct_reclaim()
>   __perform_reclaim()                  /* run direct reclaim */
>   page = get_page_from_freelist(...)   /* retry the allocation */
>   if (!page && !drained)
>     drain_all_pages(NULL);             /* drain PCP lists, then retry once */
>
>
>> Just because some tools are miscalculating memory
>> pressure because they are based on MemAvailable? Or does this lead to
>> performance regressions on the kernel side? In other words, would the
>> same workload have behaved better if the amount of pcp-cache was reduced
>> without any userspace intervention?

Back to the original PCP cache issue.  I want to make sure whether PCP
auto-tuning works properly on your system.  If so, the total PCP pages
should be less than the sum of the low watermarks of the zones.  Can you
verify that first?
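
A minimal userspace check might look like the sketch below; it assumes
the /proc/zoneinfo layout of recent kernels (per-cpu "count:" entries and
per-zone "low" watermark lines) and is a diagnostic sketch, not a
supported interface:

/* Sum all per-cpu pageset "count" fields and all zone "low" watermarks
 * from /proc/zoneinfo, then compare the two totals. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256];
	unsigned long pcp = 0, low = 0, v;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		const char *p = line;

		while (*p == ' ' || *p == '\t')
			p++;	/* skip indentation */
		if (sscanf(p, "count: %lu", &v) == 1)
			pcp += v;	/* per-cpu pageset count */
		else if (sscanf(p, "low %lu", &v) == 1)
			low += v;	/* zone low watermark */
	}
	fclose(f);
	printf("pcp pages: %lu, sum of zone low watermarks: %lu\n",
	       pcp, low);
	return 0;
}

If the reported PCP total stays below the watermark sum, auto-tuning is
behaving as expected.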

--
Best Regards,
Huang, Ying

Patch

diff --git a/mm/show_mem.c b/mm/show_mem.c
index bdb439551eef..08f566c30b3d 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -29,6 +29,26 @@  static inline void show_node(struct zone *zone)
 		printk("Node %d ", zone_to_nid(zone));
 }
 
+static unsigned long nr_free_zone_pcplist_pages(struct zone *zone)
+{
+	unsigned long sum = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+	return sum;
+}
+
+static unsigned long nr_free_pcplist_pages(void)
+{
+	unsigned long sum = 0;
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		sum += nr_free_zone_pcplist_pages(zone);
+	return sum;
+}
+
 long si_mem_available(void)
 {
 	long available;
@@ -44,7 +64,8 @@  long si_mem_available(void)
 	 * Estimate the amount of memory available for userspace allocations,
 	 * without causing swapping or OOM.
 	 */
-	available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;
+	available = global_zone_page_state(NR_FREE_PAGES) +
+		    nr_free_pcplist_pages() - totalreserve_pages;
 
 	/*
 	 * Not all the page cache can be freed, otherwise the system will
@@ -76,7 +97,8 @@  void si_meminfo(struct sysinfo *val)
 {
 	val->totalram = totalram_pages();
 	val->sharedram = global_node_page_state(NR_SHMEM);
-	val->freeram = global_zone_page_state(NR_FREE_PAGES);
+	val->freeram =
+		global_zone_page_state(NR_FREE_PAGES) + nr_free_pcplist_pages();
 	val->bufferram = nr_blockdev_pages();
 	val->totalhigh = totalhigh_pages();
 	val->freehigh = nr_free_highpages();
@@ -90,30 +112,27 @@  void si_meminfo_node(struct sysinfo *val, int nid)
 {
 	int zone_type;		/* needs to be signed */
 	unsigned long managed_pages = 0;
+	unsigned long free_pages = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 	unsigned long managed_highpages = 0;
 	unsigned long free_highpages = 0;
 	pg_data_t *pgdat = NODE_DATA(nid);
 
-	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
-		managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]);
-	val->totalram = managed_pages;
-	val->sharedram = node_page_state(pgdat, NR_SHMEM);
-	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
-#ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
 		struct zone *zone = &pgdat->node_zones[zone_type];
 
+		managed_pages += zone_managed_pages(zone);
+		free_pages += nr_free_zone_pcplist_pages(zone);
 		if (is_highmem(zone)) {
 			managed_highpages += zone_managed_pages(zone);
 			free_highpages += zone_page_state(zone, NR_FREE_PAGES);
 		}
 	}
+
+	val->totalram = managed_pages;
+	val->sharedram = node_page_state(pgdat, NR_SHMEM);
+	val->freeram = free_pages;
 	val->totalhigh = managed_highpages;
 	val->freehigh = free_highpages;
-#else
-	val->totalhigh = managed_highpages;
-	val->freehigh = free_highpages;
-#endif
 	val->mem_unit = PAGE_SIZE;
 }
 #endif
@@ -196,8 +215,7 @@  static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
 			continue;
 
-		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+		free_pcp += nr_free_zone_pcplist_pages(zone);
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"