diff mbox series

mm/page_alloc: fix NUMA stats update for cpu-less nodes

Message ID 20241023175037.9125-1-dongjoo.linux.dev@gmail.com (mailing list archive)
State New

Commit Message

Dongjoo Seo Oct. 23, 2024, 5:50 p.m. UTC
This patch corrects this issue by:
1. Checking if the zone or preferred zone is CPU-less before updating
   the NUMA stats.
2. Ensuring NUMA_HIT is only updated if the zone is not CPU-less.
3. Ensuring NUMA_FOREIGN is only updated if the preferred zone is not
   CPU-less.

Example Before and After Patch:
- Before Patch:
                            node0           node1           node2
 numa_hit                86333181       114338269            5108
 numa_miss                5199455               0        56844591
 numa_foreign            32281033        29763013               0
 interleave_hit                91              91               0
 local_node              86326417       114288458               0
 other_node               5206219           49768        56849702

- After Patch:
                            node0           node1           node2
 numa_hit                 2523058         9225528               0
 numa_miss                 150213           10226        21495942
 numa_foreign            17144215         4501270               0
 interleave_hit                91              94               0
 local_node               2493918         9208226               0
 other_node                179351           27528        21495942

In the case of a memoryless node, when a process prefers a node
with no memory (e.g., because it is running on a CPU local to that
node), the kernel treats a nearby node with memory as the
preferred node. As a result, such allocations do not increment the
numa_foreign counter on the memoryless node, leading to skewed
NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN stats for the nearest node.

Similarly, in the context of cpuless nodes, this patch ensures
that NUMA statistics are accurately updated by adding checks to
prevent the miscounting of memory allocations when the involved
nodes have no CPUs. This ensures more precise tracking of memory
access patterns across all nodes, regardless of whether they
have CPUs or not, improving the overall reliability of NUMA stats.
The reason is that page allocations from dev_dax, cpuset, memcg, etc.
come with a preferred zone in a cpuless node, and it's hard
to track the zone info for miss information.

Signed-off-by: Dongjoo Seo <dongjoo.linux.dev@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Fan Ni <nifan@outlook.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Adam Manzanares <a.manzanares@samsung.com>
---
 mm/page_alloc.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Michal Hocko Oct. 23, 2024, 6:03 p.m. UTC | #1
On Wed 23-10-24 10:50:37, Dongjoo Seo wrote:
> This patch corrects this issue by:

What is this issue? Please describe the problem first, ideally describe
the NUMA topology, workload and what kind of misaccounting happens
(expected values vs. really reported values).
Andrew Morton Oct. 23, 2024, 8:41 p.m. UTC | #2
On Wed, 23 Oct 2024 20:03:24 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Wed 23-10-24 10:50:37, Dongjoo Seo wrote:
> > This patch corrects this issue by:
> 
> What is this issue? Please describe the problem first,

Actually, relocating the author's second-last paragraph to
top-of-changelog produced a decent result ;)

> ideally describe
> the NUMA topology, workload and what kind of misaccounting happens
> (expected values vs. really reported values).

I think the changelog covered this adequately?

So with these changelog alterations I've queued this for 6.12-rcX with
a cc:stable.  As far as I can tell this has been there since 2018.

: In the case of memoryless node, when a process prefers a node with no
: memory(e.g., because it is running on a CPU local to that node), the
: kernel treats a nearby node with memory as the preferred node.  As a
: result, such allocations do not increment the numa_foreign counter on the
: memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
: stats for the nearest node.
: 
: This patch corrects this issue by:
: 1. Checking if the zone or preferred zone is CPU-less before updating
:    the NUMA stats.
: 2. Ensuring NUMA_HIT is only updated if the zone is not CPU-less.
: 3. Ensuring NUMA_FOREIGN is only updated if the preferred zone is not
:    CPU-less.
: 
: Example Before and After Patch:
: - Before Patch:
:                             node0           node1           node2
:  numa_hit                86333181       114338269            5108
:  numa_miss                5199455               0        56844591
:  numa_foreign            32281033        29763013               0
:  interleave_hit                91              91               0
:  local_node              86326417       114288458               0
:  other_node               5206219           49768        56849702
: 
: - After Patch:
:                             node0           node1           node2
:  numa_hit                 2523058         9225528               0
:  numa_miss                 150213           10226        21495942
:  numa_foreign            17144215         4501270               0
:  interleave_hit                91              94               0
:  local_node               2493918         9208226               0
:  other_node                179351           27528        21495942
: 
: Similarly, in the context of cpuless nodes, this patch ensures that NUMA
: statistics are accurately updated by adding checks to prevent the
: miscounting of memory allocations when the involved nodes have no CPUs.
: This ensures more precise tracking of memory access patterns across all
: nodes, regardless of whether they have CPUs or not, improving the overall
: reliability of NUMA stats.  The reason is that page allocations from
: dev_dax, cpuset, memcg, etc. come with a preferred zone in a cpuless
: node, and it's hard to track the zone info for miss information.
:
Michal Hocko Oct. 23, 2024, 9:38 p.m. UTC | #3
On Wed 23-10-24 13:41:21, Andrew Morton wrote:
> On Wed, 23 Oct 2024 20:03:24 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Wed 23-10-24 10:50:37, Dongjoo Seo wrote:
> > > This patch corrects this issue by:
> > 
> > What is this issue? Please describe the problem first,
> 
> Actually, relocating the author's second-last paragraph to
> top-of-changelog produced a decent result ;)
> 
> > ideally describe
> > the NUMA topology, workload and what kind of misaccounting happens
> > (expected values vs. really reported values).
> 
> I think the changelog covered this adequately?
> 
> So with these changelog alterations I've queued this for 6.12-rcX with
> a cc:stable.  As far as I can tell this has been there since 2018.
> 
> : In the case of memoryless node, when a process prefers a node with no
> : memory(e.g., because it is running on a CPU local to that node), the
> : kernel treats a nearby node with memory as the preferred node.  As a
> : result, such allocations do not increment the numa_foreign counter on the
> : memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
> : stats for the nearest node.

I am sorry but I still do not understand that. Especially when I do
look at the patch which would like to treat cpuless nodes specially.
Let me be more specific. Why ...

> -     if (zone_to_nid(z) != numa_node_id())
> +     if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
>               local_stat = NUMA_OTHER;
>
> -     if (zone_to_nid(z) == zone_to_nid(preferred_zone))
> +     if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
>               __count_numa_events(z, NUMA_HIT, nr_account);
>       else {
>               __count_numa_events(z, NUMA_MISS, nr_account);
> -             __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
> +             if (!pref_is_cpuless)
> +                     __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);

... a (well?) established meaning of local needs to be changed? Why
should the preferred policy have a different meaning for cpuless policies?
Those are memory specific rather than cpu specific, right?

Quite some questions to have it in linux-next IMHO....
Dongjoo Seo Oct. 23, 2024, 10:15 p.m. UTC | #4
Hi Andrew, Michal,

Thanks for the feedback.

The issue is that CPU-less nodes can lead to incorrect NUMA stats.
For example, NUMA_HIT may incorrectly increase for CPU-less nodes
because the current logic doesn't account for whether a node has CPUs.

Key changes:

- local_stat: CPU-less nodes can't be "local," so allocations are
  counted as NUMA_OTHER.
- preferred_zone: If the preferred zone is CPU-less, NUMA_HIT and
  NUMA_FOREIGN are not updated since no CPU runs there.

This ensures more accurate stats, especially for cases like dev_dax
and cpuset.

Hope that clarifies things.

Thanks,
Dongjoo

On Wed, Oct 23, 2024 at 11:38:40PM +0200, Michal Hocko wrote:
> On Wed 23-10-24 13:41:21, Andrew Morton wrote:
> > On Wed, 23 Oct 2024 20:03:24 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Wed 23-10-24 10:50:37, Dongjoo Seo wrote:
> > > > This patch corrects this issue by:
> > > 
> > > What is this issue? Please describe the problem first,
> > 
> > Actually, relocating the author's second-last paragraph to
> > top-of-changelog produced a decent result ;)
> > 
> > > ideally describe
> > > the NUMA topology, workload and what kind of misaccounting happens
> > > (expected values vs. really reported values).
> > 
> > I think the changelog covered this adequately?
> > 
> > So with these changelog alterations I've queued this for 6.12-rcX with
> > a cc:stable.  As far as I can tell this has been there since 2018.
> > 
> > : In the case of memoryless node, when a process prefers a node with no
> > : memory(e.g., because it is running on a CPU local to that node), the
> > : kernel treats a nearby node with memory as the preferred node.  As a
> > : result, such allocations do not increment the numa_foreign counter on the
> > : memoryless node, leading to skewed NUMA_HIT, NUMA_MISS, and NUMA_FOREIGN
> > : stats for the nearest node.
> 
> I am sorry but I still do not understand that. Especially when I do
> look at the patch which would like to treat cpuless nodes specially.
> Let me be more specific. Why ...
> 
> > -     if (zone_to_nid(z) != numa_node_id())
> > +     if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
> >               local_stat = NUMA_OTHER;
> >
> > -     if (zone_to_nid(z) == zone_to_nid(preferred_zone))
> > +     if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
> >               __count_numa_events(z, NUMA_HIT, nr_account);
> >       else {
> >               __count_numa_events(z, NUMA_MISS, nr_account);
> > -             __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
> > +             if (!pref_is_cpuless)
> > +                     __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
> 
> ... a (well?) established meaning of local needs to be changed? Why
> should the preferred policy have a different meaning for cpuless policies?
> Those are memory specific rather than cpu specific, right?
> 
> Quite some questions to have it in linux-next IMHO....
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 23, 2024, 10:23 p.m. UTC | #5
On Wed 23-10-24 15:15:20, Dongjoo Seo wrote:
> Hi Andrew, Michal,
> 
> Thanks for the feedback.
> 
> The issue is that CPU-less nodes can lead to incorrect NUMA stats.
> For example, NUMA_HIT may incorrectly increase for CPU-less nodes
> because the current logic doesn't account for whether a node has CPUs.

Define incorrect

Current semantic doesn't really care about cpu less NUMA nodes because
current means whatever is required AFAIU. This is certainly a long term
semantic. Why does this need to change and why it makes sense to 
pre-existing users?
Dongjoo Seo Oct. 24, 2024, 4:54 a.m. UTC | #6
On Thu, Oct 24, 2024 at 12:23:56AM +0200, Michal Hocko wrote:
> On Wed 23-10-24 15:15:20, Dongjoo Seo wrote:
> > Hi Andrew, Michal,
> > 
> > Thanks for the feedback.
> > 
> > The issue is that CPU-less nodes can lead to incorrect NUMA stats.
> > For example, NUMA_HIT may incorrectly increase for CPU-less nodes
> > because the current logic doesn't account for whether a node has CPUs.
> 
> Define incorrect
> 
> Current semantic doesn't really care about cpu less NUMA nodes because
> current means whatever is required AFAIU. This is certainly a long term

I agree that, in the long term, special logging for preferred_zone 
and a separate counter might be necessary for CPU-less nodes.

> semantic. Why does this need to change and why it makes sense to 
> pre-existing users?

This patch doesn't change existing logic; the additional logic only 
applies when a CPU-less node is present, so there shouldn't be 
concerns for pre-existing users. Currently, the NUMA stats for 
configurations with CPU-less nodes are incorrect, as allocations
are not properly accounted for.

I believe this approach improves logging accuracy with minimal impact
on the memory allocation path, but I'm open to alternative solutions.
This isn't the only way to address the issue—any suggestions?

> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 24, 2024, 8:24 a.m. UTC | #7
On Wed 23-10-24 21:54:37, Dongjoo Seo wrote:
> On Thu, Oct 24, 2024 at 12:23:56AM +0200, Michal Hocko wrote:
> > On Wed 23-10-24 15:15:20, Dongjoo Seo wrote:
> > > Hi Andrew, Michal,
> > > 
> > > Thanks for the feedback.
> > > 
> > > The issue is that CPU-less nodes can lead to incorrect NUMA stats.
> > > For example, NUMA_HIT may incorrectly increase for CPU-less nodes
> > > because the current logic doesn't account for whether a node has CPUs.
> > 
> > Define incorrect
> > 
> > Current semantic doesn't really care about cpu less NUMA nodes because
> > > current means whatever is required AFAIU. This is certainly a long term
> 
> I agree that, in the long term, special logging for preferred_zone 
> and a separate counter might be necessary for CPU-less nodes.
> 
> > semantic. Why does this need to change and why it makes sense to 
> > pre-existing users?
> 
> This patch doesn't change existing logic; the additional logic only 
> applies when a CPU-less node is present, so there shouldn't be 
> concerns for pre-existing users. Currently, the NUMA stats for 
> configurations with CPU-less nodes are incorrect, as allocations
> are not properly accounted for.
> 
> I believe this approach improves logging accuracy with minimal impact
> on the memory allocation path, but I'm open to alternative solutions.
> This isn't the only way to address the issue—any suggestions?

I still do not understand the actual problem. CPU-less nodes are nothing
really special. They just never have local allocations for obvious
reasons. NUMA_HIT which your patch is special casing has a very well
defined meaning and that is that the memory allocated matches the
preferred node. That doesn't have any notion of CPU at all. Say somebody
explicitly requests to allocate from a CPU less node. Why should you
consider that as NUMA_OTHER just because that node has no CPUs? That
just seems completely wrong.
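
To make the memory-centric meaning concrete, here is a minimal userspace sketch of the accounting decisions visible in the removed and unchanged lines of the diff. It is not kernel code: the struct, the enum, `MAX_NODES`, and `account_alloc()` are hypothetical names chosen for illustration, and node IDs stand in for zones.

```c
/* Hypothetical userspace model of the existing accounting; not kernel API. */
enum numa_stat {
	NUMA_HIT, NUMA_MISS, NUMA_FOREIGN, NUMA_LOCAL, NUMA_OTHER, NR_NUMA_STATS
};

#define MAX_NODES 4	/* assumption: small fixed topology for the example */

struct numa_counters {
	unsigned long stat[MAX_NODES][NR_NUMA_STATS];
};

/*
 * alloc_nid:     node the memory was actually allocated from
 * preferred_nid: node the allocation preferred
 * cpu_nid:       node of the CPU performing the allocation
 */
static void account_alloc(struct numa_counters *c, int alloc_nid,
			  int preferred_nid, int cpu_nid, unsigned long nr)
{
	enum numa_stat local_stat = NUMA_LOCAL;

	/* local/other compares the memory node with the running CPU's node */
	if (alloc_nid != cpu_nid)
		local_stat = NUMA_OTHER;

	/* hit/miss/foreign compares the memory node with the preferred node */
	if (alloc_nid == preferred_nid) {
		c->stat[alloc_nid][NUMA_HIT] += nr;
	} else {
		c->stat[alloc_nid][NUMA_MISS] += nr;
		c->stat[preferred_nid][NUMA_FOREIGN] += nr;
	}
	c->stat[alloc_nid][local_stat] += nr;
}
```

In this model, a task on node 0 that explicitly prefers a CPU-less node 2 and gets memory there is counted on node 2 as a hit and as other_node. No check of whether node 2 has CPUs is involved in either decision.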
Dongjoo Seo Oct. 24, 2024, 6:13 p.m. UTC | #8
On Thu, Oct 24, 2024 at 10:24:56AM +0200, Michal Hocko wrote:
> On Wed 23-10-24 21:54:37, Dongjoo Seo wrote:
> > On Thu, Oct 24, 2024 at 12:23:56AM +0200, Michal Hocko wrote:
> > > On Wed 23-10-24 15:15:20, Dongjoo Seo wrote:
> > > > Hi Andrew, Michal,
> > > > 
> > > > Thanks for the feedback.
> > > > 
> > > > The issue is that CPU-less nodes can lead to incorrect NUMA stats.
> > > > For example, NUMA_HIT may incorrectly increase for CPU-less nodes
> > > > because the current logic doesn't account for whether a node has CPUs.
> > > 
> > > Define incorrect
> > > 
> > > Current semantic doesn't really care about cpu less NUMA nodes because
> > > > current means whatever is required AFAIU. This is certainly a long term
> > 
> > I agree that, in the long term, special logging for preferred_zone 
> > and a separate counter might be necessary for CPU-less nodes.
> > 
> > > semantic. Why does this need to change and why it makes sense to 
> > > pre-existing users?
> > 
> > This patch doesn't change existing logic; the additional logic only 
> > applies when a CPU-less node is present, so there shouldn't be 
> > concerns for pre-existing users. Currently, the NUMA stats for 
> > configurations with CPU-less nodes are incorrect, as allocations
> > are not properly accounted for.
> > 
> > I believe this approach improves logging accuracy with minimal impact
> > on the memory allocation path, but I'm open to alternative solutions.
> > This isn't the only way to address the issue—any suggestions?
> 
> I still do not understand the actual problem. CPU-less nodes are nothing
> really special. They just never have local allocations for obvious
> reasons. NUMA_HIT which your patch is special casing has a very well
> defined meaning and that is that the memory allocated matches the
> preferred node. That doesn't have any notion of CPU at all. Say somebody
> explicitly requests to allocate from a CPU less node. Why should you
> consider that as NUMA_OTHER just because that node has no CPUs? That
> just seems completely wrong.

Thank you for your feedback. After reviewing your reply and [1], I realize
that I misunderstood the numa_* stats. I mistakenly assumed "node" referred
to CPU locality. The current logic is indeed memory-centric and operates
correctly as it is.
I appreciate the clarification, and I now understand that no changes are
needed to special case CPU-less nodes in this context.

Thanks again for pointing this out.

[1] https://docs.kernel.org/admin-guide/numastat.html
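
For readers who want to inspect these counters themselves, the following is a rough sketch of parsing the six "name value" lines that the tables in this thread show per node. It assumes the text was read from a per-node file such as /sys/devices/system/node/node<N>/numastat; the struct and the function `parse_numastat()` are hypothetical helpers, not an existing library API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one node's counters. */
struct node_numastat {
	unsigned long numa_hit, numa_miss, numa_foreign;
	unsigned long interleave_hit, local_node, other_node;
};

/*
 * Parse whitespace-separated "name value" pairs from text.
 * Returns the number of recognised counters; unknown names are skipped.
 */
static int parse_numastat(const char *text, struct node_numastat *out)
{
	char name[32];
	unsigned long value;
	int consumed, found = 0;

	memset(out, 0, sizeof(*out));
	while (sscanf(text, " %31s %lu%n", name, &value, &consumed) == 2) {
		if (!strcmp(name, "numa_hit"))
			out->numa_hit = value, found++;
		else if (!strcmp(name, "numa_miss"))
			out->numa_miss = value, found++;
		else if (!strcmp(name, "numa_foreign"))
			out->numa_foreign = value, found++;
		else if (!strcmp(name, "interleave_hit"))
			out->interleave_hit = value, found++;
		else if (!strcmp(name, "local_node"))
			out->local_node = value, found++;
		else if (!strcmp(name, "other_node"))
			out->other_node = value, found++;
		text += consumed;	/* advance past the pair just read */
	}
	return found;
}
```

A caller would read the file into a buffer (for example with fopen() and fread()), NUL-terminate it, and pass it to the helper. On a CPU-less node, local_node staying at 0 while numa_hit grows is the expected, memory-centric result.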

> -- 
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f33dab6d344..2981466e8e1a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2894,19 +2894,21 @@  static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 {
 #ifdef CONFIG_NUMA
 	enum numa_stat_item local_stat = NUMA_LOCAL;
+	bool z_is_cpuless = !node_state(zone_to_nid(z), N_CPU);
+	bool pref_is_cpuless = !node_state(zone_to_nid(preferred_zone), N_CPU);
 
-	/* skip numa counters update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return;
 
-	if (zone_to_nid(z) != numa_node_id())
+	if (zone_to_nid(z) != numa_node_id() || z_is_cpuless)
 		local_stat = NUMA_OTHER;
 
-	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
+	if (zone_to_nid(z) == zone_to_nid(preferred_zone) && !z_is_cpuless)
 		__count_numa_events(z, NUMA_HIT, nr_account);
 	else {
 		__count_numa_events(z, NUMA_MISS, nr_account);
-		__count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
+		if (!pref_is_cpuless)
+			__count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
 	}
 	__count_numa_events(z, local_stat, nr_account);
 #endif
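
For comparison with the discussion above, here is a small userspace sketch of what the added conditions in this diff would do. It is a model only: `struct node_model`, its `has_cpu` flag (standing in for the N_CPU node state check), and `account_alloc_patched()` are hypothetical names. The proposal was withdrawn in the thread, so this illustrates the rejected behavior, not current kernel semantics.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the proposed (withdrawn) logic. */
enum numa_stat {
	NUMA_HIT, NUMA_MISS, NUMA_FOREIGN, NUMA_LOCAL, NUMA_OTHER, NR_NUMA_STATS
};

struct node_model {
	bool has_cpu;				/* stands in for the N_CPU check */
	unsigned long stat[NR_NUMA_STATS];
};

static void account_alloc_patched(struct node_model nodes[], int alloc_nid,
				  int preferred_nid, int cpu_nid,
				  unsigned long nr)
{
	bool z_is_cpuless = !nodes[alloc_nid].has_cpu;
	bool pref_is_cpuless = !nodes[preferred_nid].has_cpu;
	enum numa_stat local_stat = NUMA_LOCAL;

	if (alloc_nid != cpu_nid || z_is_cpuless)
		local_stat = NUMA_OTHER;

	/* A CPU-less node can never record a hit under the proposal. */
	if (alloc_nid == preferred_nid && !z_is_cpuless) {
		nodes[alloc_nid].stat[NUMA_HIT] += nr;
	} else {
		nodes[alloc_nid].stat[NUMA_MISS] += nr;
		/* A CPU-less preferred node never records a foreign event. */
		if (!pref_is_cpuless)
			nodes[preferred_nid].stat[NUMA_FOREIGN] += nr;
	}
	nodes[alloc_nid].stat[local_stat] += nr;
}
```

With node 2 marked CPU-less, an allocation that prefers node 2 and is satisfied from node 2 is counted as a miss and as other_node. That matches the shape of the "After Patch" table, where node2 shows numa_hit 0 and equal numa_miss and other_node values.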