mm: fix draining remote pageset

Message ID 20230811090819.60845-1-ying.huang@intel.com (mailing list archive)
State New
Series mm: fix draining remote pageset

Commit Message

Huang, Ying Aug. 11, 2023, 9:08 a.m. UTC
If there has been no memory allocation/freeing in a remote pageset for
some time (currently 3 seconds), the remote pageset is drained to avoid
wasting memory.

But in the current implementation, the vmstat updater worker may not be
re-queued while we are waiting for the timeout (pcp->expire != 0) if
there are no vmstat changes, for example, when the CPU goes idle.

Fix this by guaranteeing that the vmstat updater worker is always
re-queued while we are waiting for the timeout.

The bug can be reproduced by allocating/freeing pages on a remote node
and then letting the CPU go idle; the patch fixes it.

Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/vmstat.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
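
For context, the updater worker only re-queues itself when
refresh_cpu_vm_stats() reports outstanding work, so a zero return value
while pcp->expire is still counting down stalls the drain.  A simplified
sketch of the caller (based on mm/vmstat.c of this era; details may vary
by kernel version):

    /* Sketch of the vmstat updater worker. */
    static void vmstat_update(struct work_struct *w)
    {
            if (refresh_cpu_vm_stats(true)) {
                    /*
                     * Something was updated (or, with this patch, a
                     * remote pageset is still waiting for its timeout),
                     * so re-queue ourselves for another pass.
                     */
                    queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
                            this_cpu_ptr(&vmstat_work),
                            round_jiffies_relative(sysctl_stat_interval));
            }
            /*
             * Otherwise the worker goes quiet until vmstat_shepherd
             * notices new activity on this CPU.
             */
    }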

Comments

Michal Hocko Aug. 11, 2023, 9:35 a.m. UTC | #1
On Fri 11-08-23 17:08:19, Huang Ying wrote:
> If there is no memory allocation/freeing in the remote pageset after
> some time (3 seconds for now), the remote pageset will be drained to
> avoid memory wastage.
> 
> But in the current implementation, vmstat updater worker may not be
> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> there are no vmstat changes, for example, when CPU goes idle.

Why is that a problem?

> This is fixed via guaranteeing that the vmstat updater worker will
> always be re-queued when we are waiting for the timeout.
> 
> We can reproduce the bug via allocating/freeing pages from remote
> node, then go idle.  And the patch can fix it.
> 
> Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  mm/vmstat.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b731d57996c5..111118741abf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -856,8 +856,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>  				continue;
>  			}
>  
> -			if (__this_cpu_dec_return(pcp->expire))
> +			if (__this_cpu_dec_return(pcp->expire)) {
> +				changes++;
>  				continue;
> +			}
>  
>  			if (__this_cpu_read(pcp->count)) {
>  				drain_zone_pages(zone, this_cpu_ptr(pcp));
> -- 
> 2.39.2
Huang, Ying Aug. 14, 2023, 1:59 a.m. UTC | #2
Hi, Michal,

Michal Hocko <mhocko@suse.com> writes:

> On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> If there is no memory allocation/freeing in the remote pageset after
>> some time (3 seconds for now), the remote pageset will be drained to
>> avoid memory wastage.
>> 
>> But in the current implementation, vmstat updater worker may not be
>> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> there are no vmstat changes, for example, when CPU goes idle.
>
> Why is that a problem?

The pages of the remote zone may be kept in the local per-CPU pageset
for a long time as long as there's no page allocation/freeing on the
logical CPU.  Besides the logical CPU going idle, this can also happen
if the logical CPU stays busy in user space.

I will update the change log to include this.

--
Best Regards,
Huang, Ying

>> This is fixed via guaranteeing that the vmstat updater worker will
>> always be re-queued when we are waiting for the timeout.
>> 
>> We can reproduce the bug via allocating/freeing pages from remote
>> node, then go idle.  And the patch can fix it.
>> 
>> Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> ---
>>  mm/vmstat.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index b731d57996c5..111118741abf 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -856,8 +856,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>>  				continue;
>>  			}
>>  
>> -			if (__this_cpu_dec_return(pcp->expire))
>> +			if (__this_cpu_dec_return(pcp->expire)) {
>> +				changes++;
>>  				continue;
>> +			}
>>  
>>  			if (__this_cpu_read(pcp->count)) {
>>  				drain_zone_pages(zone, this_cpu_ptr(pcp));
>> -- 
>> 2.39.2
Michal Hocko Aug. 16, 2023, 6:49 a.m. UTC | #3
On Mon 14-08-23 09:59:51, Huang, Ying wrote:
> Hi, Michal,
> 
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
> >> If there is no memory allocation/freeing in the remote pageset after
> >> some time (3 seconds for now), the remote pageset will be drained to
> >> avoid memory wastage.
> >> 
> >> But in the current implementation, vmstat updater worker may not be
> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> >> there are no vmstat changes, for example, when CPU goes idle.
> >
> > Why is that a problem?
> 
> The pages of the remote zone may be kept in the local per-CPU pageset
> for long time as long as there's no page allocation/freeing on the
> logical CPU.  In addition to the logical CPU goes idle, this is also
> possible if the logical CPU is busy in the user space.

But why is this a problem? Is the scale of the problem sufficient to
trigger out-of-memory situations or to be otherwise harmful?
Huang, Ying Aug. 16, 2023, 7:08 a.m. UTC | #4
Michal Hocko <mhocko@suse.com> writes:

> On Mon 14-08-23 09:59:51, Huang, Ying wrote:
>> Hi, Michal,
>> 
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> >> If there is no memory allocation/freeing in the remote pageset after
>> >> some time (3 seconds for now), the remote pageset will be drained to
>> >> avoid memory wastage.
>> >> 
>> >> But in the current implementation, vmstat updater worker may not be
>> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> >> there are no vmstat changes, for example, when CPU goes idle.
>> >
>> > Why is that a problem?
>> 
>> The pages of the remote zone may be kept in the local per-CPU pageset
>> for long time as long as there's no page allocation/freeing on the
>> logical CPU.  In addition to the logical CPU goes idle, this is also
>> possible if the logical CPU is busy in the user space.
>
> But why is this a problem? Is the scale of the problem sufficient to
> trigger out of memory situations or be otherwise harmful?

This may trigger premature page reclaim.  The pages held in the PCP for
the remote zone could have been freed back to satisfy allocations from
that zone and so avoid reclaim.  It's quite possible that the local CPU
only allocates from / frees to the remote zone temporarily.  So we
should free the PCP pages of the remote zone if there has been no page
allocation/freeing from/to it for 3 seconds.

This will not trigger OOM, because all PCPs are drained if an
allocation still fails after direct reclaim.

--
Best Regards,
Huang, Ying
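
For reference, the "drained after direct reclaim" behavior mentioned
above corresponds roughly to this fragment of the allocator slow path
(a sketch of __alloc_pages_direct_reclaim() in mm/page_alloc.c; exact
code varies by kernel version):

    retry:
            page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

            /*
             * If the allocation still fails after direct reclaim, pages
             * may be pinned on per-CPU lists or in highatomic reserves.
             * Drain all per-CPU pagesets once and retry, so pages parked
             * in PCPs (including remote ones) cannot push us to OOM.
             */
            if (!page && !drained) {
                    unreserve_highatomic_pageblock(ac, false);
                    drain_all_pages(NULL);
                    drained = true;
                    goto retry;
            }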
Lameter, Christopher Aug. 16, 2023, 8:23 p.m. UTC | #5
On Wed, 16 Aug 2023, Huang, Ying wrote:

> Michal Hocko <mhocko@suse.com> writes:
>>>> Why is that a problem?
>>>
>>> The pages of the remote zone may be kept in the local per-CPU pageset
>>> for long time as long as there's no page allocation/freeing on the
>>> logical CPU.  In addition to the logical CPU goes idle, this is also
>>> possible if the logical CPU is busy in the user space.
>>
>> But why is this a problem? Is the scale of the problem sufficient to
>> trigger out of memory situations or be otherwise harmful?
>
> This may trigger premature page reclaiming.  The pages in the PCP of the
> remote zone would have been freed to satisfy the page allocation for the
> remote zone to avoid page reclaiming.  It's highly possible that the
> local CPU just allocate/free from/to the remote zone temporarily.  So,
> we should free PCP pages of the remote zone if there is no page
> allocation/freeing from/to the remote zone for 3 seconds.
>
> This will not trigger OOM, because all PCP will be drained if allocation
> failed after direct reclaiming.

I think this is a minor issue, but it's cleaner behavior.
vmstat refreshes should continue as long as there are
pages queued in remote pcps.
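
(For context, pcp->expire is armed inside refresh_cpu_vm_stats()
whenever this CPU folds stat updates for a zone, which is what starts
the 3-second countdown under discussion.  A sketch of that fragment,
CONFIG_NUMA only; exact code varies by kernel version:)

            v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
            if (v) {
                    atomic_long_add(v, &zone->vm_stat[i]);
                    global_zone_diff[i] += v;
    #ifdef CONFIG_NUMA
                    /* 3 seconds idle till flush */
                    __this_cpu_write(pcp->expire, 3);
    #endif
            }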
Michal Hocko Aug. 21, 2023, 7:55 a.m. UTC | #6
On Wed 16-08-23 15:08:23, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
> >> Hi, Michal,
> >> 
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
> >> >> If there is no memory allocation/freeing in the remote pageset after
> >> >> some time (3 seconds for now), the remote pageset will be drained to
> >> >> avoid memory wastage.
> >> >> 
> >> >> But in the current implementation, vmstat updater worker may not be
> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> >> >> there are no vmstat changes, for example, when CPU goes idle.
> >> >
> >> > Why is that a problem?
> >> 
> >> The pages of the remote zone may be kept in the local per-CPU pageset
> >> for long time as long as there's no page allocation/freeing on the
> >> logical CPU.  In addition to the logical CPU goes idle, this is also
> >> possible if the logical CPU is busy in the user space.
> >
> > But why is this a problem? Is the scale of the problem sufficient to
> > trigger out of memory situations or be otherwise harmful?
> 
> This may trigger premature page reclaiming.  The pages in the PCP of the
> remote zone would have been freed to satisfy the page allocation for the
> remote zone to avoid page reclaiming.  It's highly possible that the
> local CPU just allocate/free from/to the remote zone temporarily.

I am slightly confused here, but I suspect that by zone you mean the
remote pcp.  More importantly, is this a concern seen in a real
workload?  Can you quantify it in some manner?  E.g. with this patch we
see X less kswapd scanning, or hit direct reclaim much less often.

> So,
> we should free PCP pages of the remote zone if there is no page
> allocation/freeing from/to the remote zone for 3 seconds.

Well, I would argue this depends a lot. There are workloads which really
like to have CPUs idle and yet would like to benefit from the allocator
fast path after a CPU comes out of idle: idling is their power-saving
opportunity, while they still want to act quickly once there is
something to run.

That being said, we really need some numbers (ideally from the real
world) that prove this is not just a theoretical concern.
Huang, Ying Aug. 21, 2023, 8:30 a.m. UTC | #7
Michal Hocko <mhocko@suse.com> writes:

> On Wed 16-08-23 15:08:23, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
>> >> Hi, Michal,
>> >> 
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> >> >> If there is no memory allocation/freeing in the remote pageset after
>> >> >> some time (3 seconds for now), the remote pageset will be drained to
>> >> >> avoid memory wastage.
>> >> >> 
>> >> >> But in the current implementation, vmstat updater worker may not be
>> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> >> >> there are no vmstat changes, for example, when CPU goes idle.
>> >> >
>> >> > Why is that a problem?
>> >> 
>> >> The pages of the remote zone may be kept in the local per-CPU pageset
>> >> for long time as long as there's no page allocation/freeing on the
>> >> logical CPU.  In addition to the logical CPU goes idle, this is also
>> >> possible if the logical CPU is busy in the user space.
>> >
>> > But why is this a problem? Is the scale of the problem sufficient to
>> > trigger out of memory situations or be otherwise harmful?
>> 
>> This may trigger premature page reclaiming.  The pages in the PCP of the
>> remote zone would have been freed to satisfy the page allocation for the
>> remote zone to avoid page reclaiming.  It's highly possible that the
>> local CPU just allocate/free from/to the remote zone temporarily.
>
> I am slightly confused here but I suspect by zone you mean remote pcp.
> But more importantly is this a concern seen in real workload? Can you
> quantify it in some manner? E.g. with this patch we have X more kswapd
> scanning or even hit direct reclaim much less often.
>> So,
>> we should free PCP pages of the remote zone if there is no page
>> allocation/freeing from/to the remote zone for 3 seconds.
>
> Well, I would argue this depends a lot. There are workloads which really
> like to have CPUs idle and yet they would like to benefit from the
> allocator fast path after that CPU goes out of idle because idling is
> their power saving opportunity while workloads want to act quickly after
> there is something to run.
>
> That being said, we really need some numbers (ideally from real world)
> that proves this is not just a theoretical concern.

The behavior of draining the PCP of a remote zone (that is, the remote
PCP) was introduced in commit 4ae7c03943fc ("[PATCH] Periodically drain
non local pagesets").  The goal of the draining is well documented in
that change log.  IIUC, some of your questions may be answered there?

This patch just restores the original behavior, which was changed by
commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").

--
Best Regards,
Huang, Ying
Michal Hocko Aug. 21, 2023, 9:27 a.m. UTC | #8
On Mon 21-08-23 16:30:18, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 16-08-23 15:08:23, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
> >> >> Hi, Michal,
> >> >> 
> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> >> 
> >> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
> >> >> >> If there is no memory allocation/freeing in the remote pageset after
> >> >> >> some time (3 seconds for now), the remote pageset will be drained to
> >> >> >> avoid memory wastage.
> >> >> >> 
> >> >> >> But in the current implementation, vmstat updater worker may not be
> >> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> >> >> >> there are no vmstat changes, for example, when CPU goes idle.
> >> >> >
> >> >> > Why is that a problem?
> >> >> 
> >> >> The pages of the remote zone may be kept in the local per-CPU pageset
> >> >> for long time as long as there's no page allocation/freeing on the
> >> >> logical CPU.  In addition to the logical CPU goes idle, this is also
> >> >> possible if the logical CPU is busy in the user space.
> >> >
> >> > But why is this a problem? Is the scale of the problem sufficient to
> >> > trigger out of memory situations or be otherwise harmful?
> >> 
> >> This may trigger premature page reclaiming.  The pages in the PCP of the
> >> remote zone would have been freed to satisfy the page allocation for the
> >> remote zone to avoid page reclaiming.  It's highly possible that the
> >> local CPU just allocate/free from/to the remote zone temporarily.
> >
> > I am slightly confused here but I suspect by zone you mean remote pcp.
> > But more importantly is this a concern seen in real workload? Can you
> > quantify it in some manner? E.g. with this patch we have X more kswapd
> > scanning or even hit direct reclaim much less often.
> >> So,
> >> we should free PCP pages of the remote zone if there is no page
> >> allocation/freeing from/to the remote zone for 3 seconds.
> >
> > Well, I would argue this depends a lot. There are workloads which really
> > like to have CPUs idle and yet they would like to benefit from the
> > allocator fast path after that CPU goes out of idle because idling is
> > their power saving opportunity while workloads want to act quickly after
> > there is something to run.
> >
> > That being said, we really need some numbers (ideally from real world)
> > that proves this is not just a theoretical concern.
> 
> The behavior to drain the PCP of the remote zone (that is, remote PCP)
> was introduced in commit 4ae7c03943fc ("[PATCH] Periodically drain non
> local pagesets").  The goal of draining was well documented in the
> change log.  IIUC, some of your questions can be answered there?
> 
> This patch just restores the original behavior changed by commit
> 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").

Let me repeat. You need some numbers to show this is needed.
Huang, Ying Aug. 21, 2023, 10:31 p.m. UTC | #9
Michal Hocko <mhocko@suse.com> writes:

> On Mon 21-08-23 16:30:18, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Wed 16-08-23 15:08:23, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
>> >> >> Hi, Michal,
>> >> >> 
>> >> >> Michal Hocko <mhocko@suse.com> writes:
>> >> >> 
>> >> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> >> >> >> If there is no memory allocation/freeing in the remote pageset after
>> >> >> >> some time (3 seconds for now), the remote pageset will be drained to
>> >> >> >> avoid memory wastage.
>> >> >> >> 
>> >> >> >> But in the current implementation, vmstat updater worker may not be
>> >> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> >> >> >> there are no vmstat changes, for example, when CPU goes idle.
>> >> >> >
>> >> >> > Why is that a problem?
>> >> >> 
>> >> >> The pages of the remote zone may be kept in the local per-CPU pageset
>> >> >> for long time as long as there's no page allocation/freeing on the
>> >> >> logical CPU.  In addition to the logical CPU goes idle, this is also
>> >> >> possible if the logical CPU is busy in the user space.
>> >> >
>> >> > But why is this a problem? Is the scale of the problem sufficient to
>> >> > trigger out of memory situations or be otherwise harmful?
>> >> 
>> >> This may trigger premature page reclaiming.  The pages in the PCP of the
>> >> remote zone would have been freed to satisfy the page allocation for the
>> >> remote zone to avoid page reclaiming.  It's highly possible that the
>> >> local CPU just allocate/free from/to the remote zone temporarily.
>> >
>> > I am slightly confused here but I suspect by zone you mean remote pcp.
>> > But more importantly is this a concern seen in real workload? Can you
>> > quantify it in some manner? E.g. with this patch we have X more kswapd
>> > scanning or even hit direct reclaim much less often.
>> >> So,
>> >> we should free PCP pages of the remote zone if there is no page
>> >> allocation/freeing from/to the remote zone for 3 seconds.
>> >
>> > Well, I would argue this depends a lot. There are workloads which really
>> > like to have CPUs idle and yet they would like to benefit from the
>> > allocator fast path after that CPU goes out of idle because idling is
>> > their power saving opportunity while workloads want to act quickly after
>> > there is something to run.
>> >
>> > That being said, we really need some numbers (ideally from real world)
>> > that proves this is not just a theoretical concern.
>> 
>> The behavior to drain the PCP of the remote zone (that is, remote PCP)
>> was introduced in commit 4ae7c03943fc ("[PATCH] Periodically drain non
>> local pagesets").  The goal of draining was well documented in the
>> change log.  IIUC, some of your questions can be answered there?
>> 
>> This patch just restores the original behavior changed by commit
>> 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").
>
> Let me repeat. You need some numbers to show this is needed.

I have done some testing of this patch, as follows:

- Run some workloads, using `numactl` to bind the CPU to node 0 and
  memory to node 1, so that the PCP of the CPU on node 0 for the zone
  on node 1 gets filled.

- After workloads finish, idle for 60s

- Check /proc/zoneinfo

With the original kernel, the number of pages in the PCP of the CPU on
node 0 for the zone on node 1 remains non-zero after idling.  With the
patched kernel, it becomes 0.  So we avoid keeping pages in the remote
PCP while idle.

These are the numbers I have.  If you think they aren't enough to
justify the patch, then I'm OK with that too (although I think they are
enough), because the remote PCP will still be drained later when some
pages are allocated/freed on the CPU.

--
Best Regards,
Huang, Ying
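
(The exact test program was not posted; a minimal reproducer along
these lines, run as e.g. `numactl --cpunodebind=0 --membind=1 ./repro`,
might look like the hypothetical sketch below.  Afterwards one can
watch the per-CPU "count" fields of the node 1 zones in
/proc/zoneinfo.)

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            size_t sz = 512UL << 20;        /* 512 MiB */
            char *buf = malloc(sz);

            if (!buf)
                    return 1;
            memset(buf, 1, sz);     /* fault in pages from the remote node */
            free(buf);              /* the freed pages pass through this
                                       CPU's per-CPU pagesets */
            sleep(60);              /* go idle; the remote PCP should be
                                       drained within a few seconds */
            return 0;
    }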
Michal Hocko Aug. 22, 2023, 8:09 a.m. UTC | #10
On Tue 22-08-23 06:31:42, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 21-08-23 16:30:18, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Wed 16-08-23 15:08:23, Huang, Ying wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> >> 
> >> >> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
> >> >> >> Hi, Michal,
> >> >> >> 
> >> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> >> >> 
> >> >> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
> >> >> >> >> If there is no memory allocation/freeing in the remote pageset after
> >> >> >> >> some time (3 seconds for now), the remote pageset will be drained to
> >> >> >> >> avoid memory wastage.
> >> >> >> >> 
> >> >> >> >> But in the current implementation, vmstat updater worker may not be
> >> >> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> >> >> >> >> there are no vmstat changes, for example, when CPU goes idle.
> >> >> >> >
> >> >> >> > Why is that a problem?
> >> >> >> 
> >> >> >> The pages of the remote zone may be kept in the local per-CPU pageset
> >> >> >> for long time as long as there's no page allocation/freeing on the
> >> >> >> logical CPU.  In addition to the logical CPU goes idle, this is also
> >> >> >> possible if the logical CPU is busy in the user space.
> >> >> >
> >> >> > But why is this a problem? Is the scale of the problem sufficient to
> >> >> > trigger out of memory situations or be otherwise harmful?
> >> >> 
> >> >> This may trigger premature page reclaiming.  The pages in the PCP of the
> >> >> remote zone would have been freed to satisfy the page allocation for the
> >> >> remote zone to avoid page reclaiming.  It's highly possible that the
> >> >> local CPU just allocate/free from/to the remote zone temporarily.
> >> >
> >> > I am slightly confused here but I suspect by zone you mean remote pcp.
> >> > But more importantly is this a concern seen in real workload? Can you
> >> > quantify it in some manner? E.g. with this patch we have X more kswapd
> >> > scanning or even hit direct reclaim much less often.
> >> >> So,
> >> >> we should free PCP pages of the remote zone if there is no page
> >> >> allocation/freeing from/to the remote zone for 3 seconds.
> >> >
> >> > Well, I would argue this depends a lot. There are workloads which really
> >> > like to have CPUs idle and yet they would like to benefit from the
> >> > allocator fast path after that CPU goes out of idle because idling is
> >> > their power saving opportunity while workloads want to act quickly after
> >> > there is something to run.
> >> >
> >> > That being said, we really need some numbers (ideally from real world)
> >> > that proves this is not just a theoretical concern.
> >> 
> >> The behavior to drain the PCP of the remote zone (that is, remote PCP)
> >> was introduced in commit 4ae7c03943fc ("[PATCH] Periodically drain non
> >> local pagesets").  The goal of draining was well documented in the
> >> change log.  IIUC, some of your questions can be answered there?
> >> 
> >> This patch just restores the original behavior changed by commit
> >> 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").
> >
> > Let me repeat. You need some numbers to show this is needed.
> 
> I have done some test for this patch as follows,
> 
> - Run some workloads, use `numactl` to bind CPU to node 0 and memory to
>   node 1.  So the PCP of the CPU on node 0 for zone on node 1 will be
>   filled.
> 
> - After workloads finish, idle for 60s
> 
> - Check /proc/zoneinfo
> 
> With the original kernel, the number of pages in the PCP of the CPU on
> node 0 for zone on node 1 is non-zero after idle.  With the patched
> kernel, that becomes 0 after idle.  We avoid to keep pages in the remote
> PCP during idle.
> 
> This is the number I have.  If you think it isn't enough to justify the
> patch, then I'm OK too (although I think it's enough).  Because the
> remote PCP will be drained later when some pages are allocated/freed on
> the CPU.

Yes, this doesn't really show any actual correctness problem, so I do
not think it is sufficient to justify changing the code. You would need
to show that the existing behavior is actively harmful.
Lameter, Christopher Aug. 25, 2023, 5:06 p.m. UTC | #11
On Tue, 22 Aug 2023, Michal Hocko wrote:

> Yes, this doesn't really show any actual correctness problem so I do not
> think this is sufficient to change the code. You would need to show that
> the existing behavior is actively harmful.

Having some pages from a remote NUMA node stuck in a pcp somewhere
makes that memory unusable.  It is usually rare that these remote pages
are needed again, so they may remain there for a long time if the
situation is right.

And he is right that the intended behavior of freeing the remote pages
was disabled by that earlier commit.

So I think there is sufficient rationale to apply these fixes.
Huang, Ying Aug. 29, 2023, 6:08 a.m. UTC | #12
Hi, Christopher,

"Lameter, Christopher" <cl@os.amperecomputing.com> writes:

> On Tue, 22 Aug 2023, Michal Hocko wrote:
>
>> Yes, this doesn't really show any actual correctness problem so I do not
>> think this is sufficient to change the code. You would need to show that
>> the existing behavior is actively harmful.
>
> Having some pages from a remote NUMA node stuck in a pcp somewhere is
> making that memory unusable. It is usually rate that these remote
> pages are needed again and so they may remain there for a long time if
> the situation is right.
>
> And he is right that the intended behavior of freeing the remote pages
> has been disabled by the patch.
>
> So I think there is sufficient rationale to apply these fixes.

Thanks!  Can I get your "Acked-by" or "Reviewed-by" for the patch?

--
Best Regards,
Huang, Ying
Lameter, Christopher Aug. 29, 2023, 6:05 p.m. UTC | #13
On Tue, 29 Aug 2023, Huang, Ying wrote:

>> So I think there is sufficient rationale to apply these fixes.
>
> Thanks!  Can I get your "Acked-by" or "Reviewed-by" for the patch?

Reviewed-by: Christoph Lameter <cl@linux.com>
Vlastimil Babka Sept. 5, 2023, 4:52 p.m. UTC | #14
On 8/25/23 19:06, Lameter, Christopher wrote:
> On Tue, 22 Aug 2023, Michal Hocko wrote:
> 
>> Yes, this doesn't really show any actual correctness problem so I do not
>> think this is sufficient to change the code. You would need to show that
>> the existing behavior is actively harmful.
> 
> Having some pages from a remote NUMA node stuck in a pcp somewhere is 
> making that memory unusable. It is usually rate that these remote pages 
> are needed again and so they may remain there for a long time if the 
> situation is right.
> 
> And he is right that the intended behavior of freeing the remote pages 
> has been disabled by the patch.
> 
> So I think there is sufficient rationale to apply these fixes.

I wonder if this is the optimal way to handle the NOHZ case? IIUC there
we use quiet_vmstat() to call refresh_cpu_vm_stats(). I'd expect that if
there were pending remote pages to flush, it would be best to do it
immediately, rather than keep requeueing a worker and only flush after
pcp->expire goes to zero.

However, quiet_vmstat() even calls the refresh with do_pagesets ==
false. Why do we refresh the stats at that moment at all if the delayed
update is pending anyway? And could we maybe make sure that in that
case the flush is done on the first delayed update, rather than
expiring like this?
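
(For reference, quiet_vmstat() at the time looked roughly like this; a
sketch of mm/vmstat.c with comments paraphrased:)

    /* Called when a CPU is about to enter the NOHZ idle state. */
    void quiet_vmstat(void)
    {
            if (system_state != SYSTEM_RUNNING)
                    return;

            if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
                    return;

            if (!need_update(smp_processor_id()))
                    return;

            /*
             * Just fold the per-CPU counter diffs.  The pagesets are
             * not touched (do_pagesets == false) and the pending work
             * is not canceled, because canceling would make idle entry
             * more expensive; vmstat_shepherd takes care of it later.
             */
            refresh_cpu_vm_stats(false);
    }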
Huang, Ying Sept. 6, 2023, 4:17 a.m. UTC | #15
Vlastimil Babka <vbabka@suse.cz> writes:

> On 8/25/23 19:06, Lameter, Christopher wrote:
>> On Tue, 22 Aug 2023, Michal Hocko wrote:
>> 
>>> Yes, this doesn't really show any actual correctness problem so I do not
>>> think this is sufficient to change the code. You would need to show that
>>> the existing behavior is actively harmful.
>> 
>> Having some pages from a remote NUMA node stuck in a pcp somewhere is 
>> making that memory unusable. It is usually rate that these remote pages 
>> are needed again and so they may remain there for a long time if the 
>> situation is right.
>> 
>> And he is right that the intended behavior of freeing the remote pages 
>> has been disabled by the patch.
>> 
>> So I think there is sufficient rationale to apply these fixes.
>
> I wonder if this the optimum way to handle the NOHZ case? IIUC there we use
> quiet_vmstat() to call refresh_cpu_vm_stats(). I'd expect if there were
> pending remote pages to flush, it would be best to do it immediately, and
> not keep a worker being requeued and only do that after the pcp->expires
> goes zero.
>
> However quiet_vmstat() even calls the refresh with do_pagesets == false. Why
> do we even refresh the stats at that moment if the delayed update is pending
> anyway?

According to commit f01f17d3705b ("mm, vmstat: make quiet_vmstat
lighter") and the comments in quiet_vmstat(), the pending worker is not
canceled, to avoid adding latency to idle entry.

> And could we maybe make sure that in that case the flush is done on
> the first delayed update in that case and not expiring like this?

This sounds reasonable.  How do we identify whether the current CPU is
in the NOHZ state?  Via tick_get_tick_sched()->tick_stopped?

--
Best Regards,
Huang, Ying

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..111118741abf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,8 +856,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				continue;
 			}
 
-			if (__this_cpu_dec_return(pcp->expire))
+			if (__this_cpu_dec_return(pcp->expire)) {
+				changes++;
 				continue;
+			}
 
 			if (__this_cpu_read(pcp->count)) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
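
With the change applied, the remote-pageset section of
refresh_cpu_vm_stats() reads roughly as follows (context reconstructed
from mm/vmstat.c; only the changes++ in the pcp->expire branch is new):

    if (do_pagesets) {
            cond_resched();

            /* Nothing queued and no countdown pending: nothing to do. */
            if (!__this_cpu_read(pcp->expire) ||
                !__this_cpu_read(pcp->count))
                    continue;

            /* We never drain zones local to this processor. */
            if (zone_to_nid(zone) == numa_node_id()) {
                    __this_cpu_write(pcp->expire, 0);
                    continue;
            }

            /*
             * Still counting down: report a change so that
             * vmstat_update() re-queues the worker and the timeout
             * keeps ticking even on an otherwise idle CPU.
             */
            if (__this_cpu_dec_return(pcp->expire)) {
                    changes++;
                    continue;
            }

            if (__this_cpu_read(pcp->count)) {
                    drain_zone_pages(zone, this_cpu_ptr(pcp));
                    changes++;
            }
    }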