diff mbox series

[-next,v2] mm, proc: collect percpu free pages into the free pages

Message ID 20220822033354.952849-1-liushixin2@huawei.com (mailing list archive)
State New
Series [-next,v2] mm, proc: collect percpu free pages into the free pages

Commit Message

Liu Shixin Aug. 22, 2022, 3:33 a.m. UTC
The pages on the pcplists can be used, but are not counted into MemFree or
MemAvailable; for now the pcp free count is only shown by show_mem(). Since commit
d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there has been a
significant decrease in the reported free memory. With a large number
of CPUs and zones, the number of pages in the percpu lists can be very
large, so it is better to let the user know the pcp count.

On a machine with 3 zones and 72 CPUs, before commit d8a759b57035 the pcp
lists could theoretically hold at most 162MB (3*72*768KB). After that patch,
the lists can hold up to 324MB. In practice, 114MB has been observed in the
idle state after system startup (an increase of 80MB).

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

Comments

Andrew Morton Aug. 22, 2022, 9:12 p.m. UTC | #1
On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:

> The page on pcplist could be used, but not counted into memory free or
> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> significant decrease in the display of free memory, with a large number
> of cpus and zones, the number of pages in the percpu list can be very
> large, so it is better to let user to know the pcp count.
> 
> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
> After the patch, the lists can hold 324MB. It has been observed to be 114MB
> in the idle state after system startup in practice(increased 80 MB).
> 

Seems reasonable.

> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 033f1e26d15b..f89928d3ad4e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5853,6 +5853,26 @@ static unsigned long nr_free_zone_pages(int offset)
>  	return sum;
>  }
>  
> +static unsigned long nr_free_zone_pcplist_pages(struct zone *zone)
> +{
> +	unsigned long sum = 0;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> +	return sum;
> +}
> +
> +static unsigned long nr_free_pcplist_pages(void)
> +{
> +	unsigned long sum = 0;
> +	struct zone *zone;
> +
> +	for_each_zone(zone)
> +		sum += nr_free_zone_pcplist_pages(zone);
> +	return sum;
> +}

Prevention of races against zone/node hotplug?
Andrew Morton Aug. 22, 2022, 9:13 p.m. UTC | #2
On Mon, 22 Aug 2022 14:12:07 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Prevention of races against zone/node hotplug?

I meant "cpu/node", of course.
Michal Hocko Aug. 23, 2022, 7:50 a.m. UTC | #3
On Mon 22-08-22 14:12:07, Andrew Morton wrote:
> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
> 
> > The page on pcplist could be used, but not counted into memory free or
> > avaliable, and pcp_free is only showed by show_mem() for now. Since commit
> > d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> > significant decrease in the display of free memory, with a large number
> > of cpus and zones, the number of pages in the percpu list can be very
> > large, so it is better to let user to know the pcp count.
> > 
> > On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
> > maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
> > After the patch, the lists can hold 324MB. It has been observed to be 114MB
> > in the idle state after system startup in practice(increased 80 MB).
> > 
> 
> Seems reasonable.

I have asked in the previous incarnation of the patch but haven't really
received any answer[1]. Is this a _real_ problem? The absolute amount of
memory could be perceived as a lot but is this really noticeable wrt
overall memory on those systems?

Also the patch is accounting these pcp caches as free memory but that
can be misleading as this memory is not readily available for use in
general. E.g. MemAvailable is documented as:
	An estimate of how much memory is available for starting new
	applications, without swapping.
but pcp caches are drained only after direct reclaim fails which can
imply a lot of reclaim and runtime disruption.

[1] http://lkml.kernel.org/r/YwMv1A1rVNZQuuOo@dhcp22.suse.cz

> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 033f1e26d15b..f89928d3ad4e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5853,6 +5853,26 @@ static unsigned long nr_free_zone_pages(int offset)
> >  	return sum;
> >  }
> >  
> > +static unsigned long nr_free_zone_pcplist_pages(struct zone *zone)
> > +{
> > +	unsigned long sum = 0;
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu)
> > +		sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
> > +	return sum;
> > +}
> > +
> > +static unsigned long nr_free_pcplist_pages(void)
> > +{
> > +	unsigned long sum = 0;
> > +	struct zone *zone;
> > +
> > +	for_each_zone(zone)
> > +		sum += nr_free_zone_pcplist_pages(zone);
> > +	return sum;
> > +}
> 
> Prevention of races against zone/node hotplug?

Memory hotplug doesn't remove nodes nor its zones.
Liu Shixin Aug. 23, 2022, 12:46 p.m. UTC | #4
On 2022/8/23 15:50, Michal Hocko wrote:
> On Mon 22-08-22 14:12:07, Andrew Morton wrote:
>> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
>>
>>> The page on pcplist could be used, but not counted into memory free or
>>> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>> significant decrease in the display of free memory, with a large number
>>> of cpus and zones, the number of pages in the percpu list can be very
>>> large, so it is better to let user to know the pcp count.
>>>
>>> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
>>> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
>>> After the patch, the lists can hold 324MB. It has been observed to be 114MB
>>> in the idle state after system startup in practice(increased 80 MB).
>>>
>> Seems reasonable.
> I have asked in the previous incarnation of the patch but haven't really
> received any answer[1]. Is this a _real_ problem? The absolute amount of
> memory could be perceived as a lot but is this really noticeable wrt
> overall memory on those systems?
This may not be obvious when memory is sufficient. However, products monitor
memory usage for capacity planning, and this change has triggered warnings. We have
also considered using /proc/zoneinfo to calculate the total number of pcplist pages.
However, we think it is more appropriate to add the total number of pcplist pages
to the free and available counts. After all, these are also free pages.
> Also the patch is accounting these pcp caches as free memory but that
> can be misleading as this memory is not readily available for use in
> general. E.g. MemAvailable is documented as:
> 	An estimate of how much memory is available for starting new
> 	applications, without swapping.
> but pcp caches are drained only after direct reclaim fails which can
> imply a lot of reclaim and runtime disruption.
Maybe it makes more sense to add it only to free? Or handle it like page cache?
>
> [1] http://lkml.kernel.org/r/YwMv1A1rVNZQuuOo@dhcp22.suse.cz
>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 033f1e26d15b..f89928d3ad4e 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5853,6 +5853,26 @@ static unsigned long nr_free_zone_pages(int offset)
>>>  	return sum;
>>>  }
>>>  
>>> +static unsigned long nr_free_zone_pcplist_pages(struct zone *zone)
>>> +{
>>> +	unsigned long sum = 0;
>>> +	int cpu;
>>> +
>>> +	for_each_online_cpu(cpu)
>>> +		sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
>>> +	return sum;
>>> +}
>>> +
>>> +static unsigned long nr_free_pcplist_pages(void)
>>> +{
>>> +	unsigned long sum = 0;
>>> +	struct zone *zone;
>>> +
>>> +	for_each_zone(zone)
>>> +		sum += nr_free_zone_pcplist_pages(zone);
>>> +	return sum;
>>> +}
>> Prevention of races against zone/node hotplug?
> Memory hotplug doesn't remove nodes nor its zones.
>
Liu Shixin Aug. 23, 2022, 1:12 p.m. UTC | #5
On 2022/8/23 5:13, Andrew Morton wrote:
> On Mon, 22 Aug 2022 14:12:07 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> Prevention of races against zone/node hotplug?
> I meant "cpu/node", of course.
> .
Thanks for your advice, but I didn't quite understand what you meant.
Do you mean that I should use cpus_read_lock() to protect against cpu hotplug, like this?
+       cpus_read_lock();
        for_each_online_cpu(cpu)
                sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+       cpus_read_unlock();

I see there are some areas that are not protected by cpus_{read,write}_lock(), so I was
confused as to whether I should add it.

I want to replace for_each_zone with for_each_populated_zone; what else should I do to
protect against node hotplug?

Thanks.
Michal Hocko Aug. 23, 2022, 1:37 p.m. UTC | #6
On Tue 23-08-22 20:46:43, Liu Shixin wrote:
> On 2022/8/23 15:50, Michal Hocko wrote:
> > On Mon 22-08-22 14:12:07, Andrew Morton wrote:
> >> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
> >>
> >>> The page on pcplist could be used, but not counted into memory free or
> >>> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
> >>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> >>> significant decrease in the display of free memory, with a large number
> >>> of cpus and zones, the number of pages in the percpu list can be very
> >>> large, so it is better to let user to know the pcp count.
> >>>
> >>> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
> >>> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
> >>> After the patch, the lists can hold 324MB. It has been observed to be 114MB
> >>> in the idle state after system startup in practice(increased 80 MB).
> >>>
> >> Seems reasonable.
> > I have asked in the previous incarnation of the patch but haven't really
> > received any answer[1]. Is this a _real_ problem? The absolute amount of
> > memory could be perceived as a lot but is this really noticeable wrt
> > overall memory on those systems?
>
> This may not obvious when the memory is sufficient. However, as products monitor the
> memory to plan it. The change has caused warning.

Is it possible that the said monitor is oversensitive and looking at
the wrong numbers? Overall free memory doesn't really tell much TBH.
MemAvailable is a very rough estimation as well.

In reality what really matters much more is whether the memory is
readily available when it is required and none of MemFree/MemAvailable
gives you that information in general case.

> We have also considered using /proc/zoneinfo to calculate the total
> number of pcplists. However, we think it is more appropriate to add
> the total number of pcplists to free and available pages. After all,
> this part is also free pages.

Those free pages are not generally available as explained. They are
available to a specific CPU, drained under memory pressure and other
events but still there is no guarantee a specific process can harvest
that memory because the pcp caches are replenished all the time.
So in a sense it is a semi-hidden memory.

That being said, I am still not convinced this is actually going to help
all that much. You will see a slightly different numbers which do not
tell much one way or another and if the sole reason for tweaking these
numbers is that some monitor is complaining because X became X-epsilon
then this sounds like a weak justification to me. That epsilon happens
all the time because there are quite some hidden caches that are
released under memory pressure. I am not sure it is maintainable to
consider each one of them and pretend that MemFree/MemAvailable is
somehow precise. It has never been and likely never will be.
Liu Shixin Aug. 24, 2022, 10:05 a.m. UTC | #7
On 2022/8/23 21:37, Michal Hocko wrote:
> On Tue 23-08-22 20:46:43, Liu Shixin wrote:
>> On 2022/8/23 15:50, Michal Hocko wrote:
>>> On Mon 22-08-22 14:12:07, Andrew Morton wrote:
>>>> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
>>>>
>>>>> The page on pcplist could be used, but not counted into memory free or
>>>>> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
>>>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>>>> significant decrease in the display of free memory, with a large number
>>>>> of cpus and zones, the number of pages in the percpu list can be very
>>>>> large, so it is better to let user to know the pcp count.
>>>>>
>>>>> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
>>>>> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
>>>>> After the patch, the lists can hold 324MB. It has been observed to be 114MB
>>>>> in the idle state after system startup in practice(increased 80 MB).
>>>>>
>>>> Seems reasonable.
>>> I have asked in the previous incarnation of the patch but haven't really
>>> received any answer[1]. Is this a _real_ problem? The absolute amount of
>>> memory could be perceived as a lot but is this really noticeable wrt
>>> overall memory on those systems?
>> This may not obvious when the memory is sufficient. However, as products monitor the
>> memory to plan it. The change has caused warning.
> Is it possible that the said monitor is over sensitive and looking at
> wrong numbers? Overall free memory doesn't really tell much TBH.
> MemAvailable is a very rough estimation as well.
>
> In reality what really matters much more is whether the memory is
> readily available when it is required and none of MemFree/MemAvailable
> gives you that information in general case.
>
>> We have also considered using /proc/zoneinfo to calculate the total
>> number of pcplists. However, we think it is more appropriate to add
>> the total number of pcplists to free and available pages. After all,
>> this part is also free pages.
> Those free pages are not generally available as exaplained. They are
> available to a specific CPU, drained under memory pressure and other
> events but still there is no guarantee a specific process can harvest
> that memory because the pcp caches are replenished all the time.
> So in a sense it is a semi-hidden memory.
>
> That being said, I am still not convinced this is actually going to help
> all that much. You will see a slightly different numbers which do not
> tell much one way or another and if the sole reason for tweaking these
> numbers is that some monitor is complaining because X became X-epsilon
> then this sounds like a weak justification to me. That epsilon happens
> all the time because there are quite some hidden caches that are
> released under memory pressure. I am not sure it is maintainable to
> consider each one of them and pretend that MemFree/MemAvailable is
> somehow precise. It has never been and likely never will be.
Thanks for your explanation. As you said, it seems that merging this memory into
MemFree/MemAvailable directly may affect performance under memory pressure.
That sounds reasonable.

But since this memory is also free memory that can be used and is large, I think we
should still provide a statistic for the user. Perhaps adding a new statistic would be better?

Thanks,
Michal Hocko Aug. 24, 2022, 10:12 a.m. UTC | #8
On Wed 24-08-22 18:05:58, Liu Shixin wrote:
[...]
> But since these memory is also free memory that can be uesd and is large, I think we
> should still provide a statistic for the user. Perhaps add a new statistic is better?

Well, if somebody is really interested then /proc/zoneinfo gives a very
detailed insight into pcp usage.
Dmytro Maluka Nov. 24, 2023, 5:54 p.m. UTC | #9
On Tue, Aug 23, 2022 at 03:37:52PM +0200, Michal Hocko wrote:
> On Tue 23-08-22 20:46:43, Liu Shixin wrote:
> > On 2022/8/23 15:50, Michal Hocko wrote:
> > > On Mon 22-08-22 14:12:07, Andrew Morton wrote:
> > >> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
> > >>
> > >>> The page on pcplist could be used, but not counted into memory free or
> > >>> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
> > >>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
> > >>> significant decrease in the display of free memory, with a large number
> > >>> of cpus and zones, the number of pages in the percpu list can be very
> > >>> large, so it is better to let user to know the pcp count.
> > >>>
> > >>> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
> > >>> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
> > >>> After the patch, the lists can hold 324MB. It has been observed to be 114MB
> > >>> in the idle state after system startup in practice(increased 80 MB).
> > >>>
> > >> Seems reasonable.
> > > I have asked in the previous incarnation of the patch but haven't really
> > > received any answer[1]. Is this a _real_ problem? The absolute amount of
> > > memory could be perceived as a lot but is this really noticeable wrt
> > > overall memory on those systems?

Let me provide some other numbers, from the desktop side. On a low-end
chromebook with 4GB RAM and a dual-core CPU, after commit b92ca18e8ca5
(mm/page_alloc: disassociate the pcp->high from pcp->batch) the max
amount of PCP pages increased 56x: from 2.9MB (1.45MB per CPU) to
165MB (82.5MB per CPU).

On such a system, memory pressure conditions are not a rare occurrence,
so several dozen MB make a lot of difference.

(The reason it increased so much is that it now corresponds to the
low watermark, which is 165MB. And the low watermark, in turn, is so
high because of khugepaged, which bumps up min_free_kbytes to 132MB
regardless of the total amount of memory.)

> > This may not obvious when the memory is sufficient. However, as products monitor the
> > memory to plan it. The change has caused warning.
> 
> Is it possible that the said monitor is over sensitive and looking at
> wrong numbers? Overall free memory doesn't really tell much TBH.
> MemAvailable is a very rough estimation as well.
> 
> In reality what really matters much more is whether the memory is
> readily available when it is required and none of MemFree/MemAvailable
> gives you that information in general case.
> 
> > We have also considered using /proc/zoneinfo to calculate the total
> > number of pcplists. However, we think it is more appropriate to add
> > the total number of pcplists to free and available pages. After all,
> > this part is also free pages.
> 
> Those free pages are not generally available as exaplained. They are
> available to a specific CPU, drained under memory pressure and other
> events but still there is no guarantee a specific process can harvest
> that memory because the pcp caches are replenished all the time.
> So in a sense it is a semi-hidden memory.

I was intuitively assuming that per-CPU pages should be always available
for allocation without resorting to paging out allocated pages (and thus
it should be non-controversially a good idea to include per-CPU pages in
MemFree, to make it more accurate).

But looking at the code in __alloc_pages() and around, I see you are
right: we don't try draining other CPUs' PCP lists *before* resorting to
direct reclaim, compaction etc.

BTW, why not? Shouldn't draining PCP lists be cheaper than pageout() in
any case?

> That being said, I am still not convinced this is actually going to help
> all that much. You will see a slightly different numbers which do not
> tell much one way or another and if the sole reason for tweaking these
> numbers is that some monitor is complaining because X became X-epsilon
> then this sounds like a weak justification to me. That epsilon happens
> all the time because there are quite some hidden caches that are
> released under memory pressure. I am not sure it is maintainable to
> consider each one of them and pretend that MemFree/MemAvailable is
> somehow precise. It has never been and likely never will be.
> -- 
> Michal Hocko
> SUSE Labs
Kefeng Wang Nov. 25, 2023, 2:22 a.m. UTC | #10
On 2023/11/25 1:54, Dmytro Maluka wrote:
> On Tue, Aug 23, 2022 at 03:37:52PM +0200, Michal Hocko wrote:
>> On Tue 23-08-22 20:46:43, Liu Shixin wrote:
>>> On 2022/8/23 15:50, Michal Hocko wrote:
>>>> On Mon 22-08-22 14:12:07, Andrew Morton wrote:
>>>>> On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@huawei.com> wrote:
>>>>>
>>>>>> The page on pcplist could be used, but not counted into memory free or
>>>>>> avaliable, and pcp_free is only showed by show_mem() for now. Since commit
>>>>>> d8a759b57035 ("mm, page_alloc: double zone's batchsize"), there is a
>>>>>> significant decrease in the display of free memory, with a large number
>>>>>> of cpus and zones, the number of pages in the percpu list can be very
>>>>>> large, so it is better to let user to know the pcp count.
>>>>>>
>>>>>> On a machine with 3 zones and 72 CPUs. Before commit d8a759b57035, the
>>>>>> maximum amount of pages in the pcp lists was theoretically 162MB(3*72*768KB).
>>>>>> After the patch, the lists can hold 324MB. It has been observed to be 114MB
>>>>>> in the idle state after system startup in practice(increased 80 MB).
>>>>>>
>>>>> Seems reasonable.
>>>> I have asked in the previous incarnation of the patch but haven't really
>>>> received any answer[1]. Is this a _real_ problem? The absolute amount of
>>>> memory could be perceived as a lot but is this really noticeable wrt
>>>> overall memory on those systems?
> 
> Let me provide some other numbers, from the desktop side. On a low-end
> chromebook with 4GB RAM and a dual-core CPU, after commit b92ca18e8ca5
> (mm/page_alloc: disassociate the pcp->high from pcp->batch) the max
> amount of PCP pages increased 56x times: from 2.9MB (1.45 per CPU) to
> 165MB (82.5MB per CPU).
> 
> On such a system, memory pressure conditions are not a rare occurrence,
> so several dozen MB make a lot of difference.

And with the "mm: PCP high auto-tuning" series merged in v6.7, the pcp
lists could be even bigger than before.

> 
> (The reason it increased so much is because it now corresponds to the
> low watermark, which is 165MB. And the low watermark, in turn, is so
> high because of khugepaged, which bumps up min_free_kbytes to 132MB
> regardless of the total amount of memory.)
> 
>>> This may not obvious when the memory is sufficient. However, as products monitor the
>>> memory to plan it. The change has caused warning.
>>
>> Is it possible that the said monitor is over sensitive and looking at
>> wrong numbers? Overall free memory doesn't really tell much TBH.
>> MemAvailable is a very rough estimation as well.
>>
>> In reality what really matters much more is whether the memory is
>> readily available when it is required and none of MemFree/MemAvailable
>> gives you that information in general case.
>>
>>> We have also considered using /proc/zoneinfo to calculate the total
>>> number of pcplists. However, we think it is more appropriate to add
>>> the total number of pcplists to free and available pages. After all,
>>> this part is also free pages.
>>
>> Those free pages are not generally available as exaplained. They are
>> available to a specific CPU, drained under memory pressure and other
>> events but still there is no guarantee a specific process can harvest
>> that memory because the pcp caches are replenished all the time.
>> So in a sense it is a semi-hidden memory.
> 
> I was intuitively assuming that per-CPU pages should be always available
> for allocation without resorting to paging out allocated pages (and thus
> it should be non-controversially a good idea to include per-CPU pages in
> MemFree, to make it more accurate).
> 
> But looking at the code in __alloc_pages() and around, I see you are
> right: we don't try draining other CPUs' PCP lists *before* resorting to
> direct reclaim, compaction etc.
> 
> BTW, why not? Shouldn't draining PCP lists be cheaper than pageout() in
> any case?

Same question here: could we drain the pcp lists before direct reclaim?

> 
>> That being said, I am still not convinced this is actually going to help
>> all that much. You will see a slightly different numbers which do not
>> tell much one way or another and if the sole reason for tweaking these
>> numbers is that some monitor is complaining because X became X-epsilon
>> then this sounds like a weak justification to me. That epsilon happens
>> all the time because there are quite some hidden caches that are
>> released under memory pressure. I am not sure it is maintainable to
>> consider each one of them and pretend that MemFree/MemAvailable is
>> somehow precise. It has never been and likely never will be.
>> -- 
>> Michal Hocko
>> SUSE Labs
Michal Hocko Nov. 27, 2023, 8:50 a.m. UTC | #11
On Fri 24-11-23 18:54:54, Dmytro Maluka wrote:
[...]
> But looking at the code in __alloc_pages() and around, I see you are
> right: we don't try draining other CPUs' PCP lists *before* resorting to
> direct reclaim, compaction etc.
> 
> BTW, why not? Shouldn't draining PCP lists be cheaper than pageout() in
> any case?

My guess would be that draining remote pcp caches is quite expensive on
its own. This requires IPIs, preempting whatever is running there, and
waiting for all the cpus with pcp caches to be done. On the other hand,
reclaiming mostly clean page cache could be much less expensive.

Also consider that refilling those pcp caches is not free either (you
might hit zone lock contention and who knows what else).

Last but not least also consider that many systems could be just on the
edge of low/min watermark with a lot of cached data. If we drained all
pcp caches whenever we reclaim this could just make the cache pointless.

All that being said, I do not remember any actual numbers or research
about this.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 033f1e26d15b..f89928d3ad4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5853,6 +5853,26 @@  static unsigned long nr_free_zone_pages(int offset)
 	return sum;
 }
 
+static unsigned long nr_free_zone_pcplist_pages(struct zone *zone)
+{
+	unsigned long sum = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		sum += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+	return sum;
+}
+
+static unsigned long nr_free_pcplist_pages(void)
+{
+	unsigned long sum = 0;
+	struct zone *zone;
+
+	for_each_zone(zone)
+		sum += nr_free_zone_pcplist_pages(zone);
+	return sum;
+}
+
 /**
  * nr_free_buffer_pages - count number of pages beyond high watermark
  *
@@ -5894,7 +5914,8 @@  long si_mem_available(void)
 	 * Estimate the amount of memory available for userspace allocations,
 	 * without causing swapping or OOM.
 	 */
-	available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;
+	available = global_zone_page_state(NR_FREE_PAGES) +
+		    nr_free_pcplist_pages() - totalreserve_pages;
 
 	/*
 	 * Not all the page cache can be freed, otherwise the system will
@@ -5924,7 +5945,8 @@  void si_meminfo(struct sysinfo *val)
 {
 	val->totalram = totalram_pages();
 	val->sharedram = global_node_page_state(NR_SHMEM);
-	val->freeram = global_zone_page_state(NR_FREE_PAGES);
+	val->freeram = global_zone_page_state(NR_FREE_PAGES) +
+		       nr_free_pcplist_pages();
 	val->bufferram = nr_blockdev_pages();
 	val->totalhigh = totalhigh_pages();
 	val->freehigh = nr_free_highpages();
@@ -5938,30 +5960,28 @@  void si_meminfo_node(struct sysinfo *val, int nid)
 {
 	int zone_type;		/* needs to be signed */
 	unsigned long managed_pages = 0;
+	unsigned long free_pages = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 	unsigned long managed_highpages = 0;
 	unsigned long free_highpages = 0;
 	pg_data_t *pgdat = NODE_DATA(nid);
 
-	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
-		managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]);
-	val->totalram = managed_pages;
-	val->sharedram = node_page_state(pgdat, NR_SHMEM);
-	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
-#ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
 		struct zone *zone = &pgdat->node_zones[zone_type];
 
+		managed_pages += zone_managed_pages(zone);
+		free_pages += nr_free_zone_pcplist_pages(zone);
+#ifdef CONFIG_HIGHMEM
 		if (is_highmem(zone)) {
 			managed_highpages += zone_managed_pages(zone);
 			free_highpages += zone_page_state(zone, NR_FREE_PAGES);
 		}
+#endif
 	}
+	val->totalram = managed_pages;
+	val->sharedram = node_page_state(pgdat, NR_SHMEM);
+	val->freeram = free_pages;
 	val->totalhigh = managed_highpages;
 	val->freehigh = free_highpages;
-#else
-	val->totalhigh = managed_highpages;
-	val->freehigh = free_highpages;
-#endif
 	val->mem_unit = PAGE_SIZE;
 }
 #endif
@@ -6035,8 +6055,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
 			continue;
 
-		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+		free_pcp += nr_free_zone_pcplist_pages(zone);
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -6128,9 +6147,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
 			continue;
 
-		free_pcp = 0;
-		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+		free_pcp = nr_free_zone_pcplist_pages(zone);
 
 		show_node(zone);
 		printk(KERN_CONT