mm: vmstat: Use zeroed stats for unpopulated zones

Message ID 20200504070304.127361-1-sandipan@linux.ibm.com (mailing list archive)
State New, archived
Series mm: vmstat: Use zeroed stats for unpopulated zones

Commit Message

Sandipan Das May 4, 2020, 7:03 a.m. UTC
For unpopulated zones, the pagesets point to the common
boot_pageset, which can have non-zero vm_numa_stat counts.
Because of this, memory-less nodes end up having non-zero
NUMA statistics. This can be observed on any architecture
that supports memory-less NUMA nodes.

E.g.

  $ numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 8131 MB
  node 1 free: 6980 MB
  node distances:
  node   0   1
    0:  10  40
    1:  40  10

  $ numastat
                             node0           node1
  numa_hit                     108           56495
  numa_miss                      0               0
  numa_foreign                   0               0
  interleave_hit                 0            4537
  local_node                   108           31547
  other_node                     0           24948

Hence, return zero explicitly for all the stats of an
unpopulated zone.

Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
---
 include/linux/vmstat.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Michal Hocko May 4, 2020, 10:26 a.m. UTC | #1
On Mon 04-05-20 12:33:04, Sandipan Das wrote:
> For unpopulated zones, the pagesets point to the common
> boot_pageset which can have non-zero vm_numa_stat counts.
> Because of this memory-less nodes end up having non-zero
> NUMA statistics. This can be observed on any architecture
> that supports memory-less NUMA nodes.
> 
> E.g.
> 
>   $ numactl -H
>   available: 2 nodes (0-1)
>   node 0 cpus: 0 1 2 3
>   node 0 size: 0 MB
>   node 0 free: 0 MB
>   node 1 cpus: 4 5 6 7
>   node 1 size: 8131 MB
>   node 1 free: 6980 MB
>   node distances:
>   node   0   1
>     0:  10  40
>     1:  40  10
> 
>   $ numastat
>                              node0           node1
>   numa_hit                     108           56495
>   numa_miss                      0               0
>   numa_foreign                   0               0
>   interleave_hit                 0            4537
>   local_node                   108           31547
>   other_node                     0           24948
> 
> Hence, return zero explicitly for all the stats of an
> unpopulated zone.

I hope I am not just confused but I would expect at least
numa_foreign and other_node to be non-zero.
Vlastimil Babka May 6, 2020, 1:33 p.m. UTC | #2
On 5/4/20 12:26 PM, Michal Hocko wrote:
> On Mon 04-05-20 12:33:04, Sandipan Das wrote:
>> For unpopulated zones, the pagesets point to the common
>> boot_pageset which can have non-zero vm_numa_stat counts.
>> Because of this memory-less nodes end up having non-zero
>> NUMA statistics. This can be observed on any architecture
>> that supports memory-less NUMA nodes.
>> 
>> E.g.
>> 
>>   $ numactl -H
>>   available: 2 nodes (0-1)
>>   node 0 cpus: 0 1 2 3
>>   node 0 size: 0 MB
>>   node 0 free: 0 MB
>>   node 1 cpus: 4 5 6 7
>>   node 1 size: 8131 MB
>>   node 1 free: 6980 MB
>>   node distances:
>>   node   0   1
>>     0:  10  40
>>     1:  40  10
>> 
>>   $ numastat
>>                              node0           node1
>>   numa_hit                     108           56495
>>   numa_miss                      0               0
>>   numa_foreign                   0               0
>>   interleave_hit                 0            4537
>>   local_node                   108           31547
>>   other_node                     0           24948
>> 
>> Hence, return zero explicitly for all the stats of an
>> unpopulated zone.
> 
> I hope I am not just confused but I would expect that at least
> numa_foreign and other_node to be non zero.
Hmm, checking zone_statistics():

The NUMA_FOREIGN increment uses the preferred zone, which is the first zone in the
zonelist, so it will be a zone from node 1 even for allocations on a CPU
associated with node 0 - assuming node 0's unpopulated zones are not included in
node 0's zonelist.

NUMA_OTHER uses numa_node_id(), which would mean that node 0's CPUs have node 1
as their numa_node_id()? Is that correct?
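
For reference, the zone_statistics() logic being discussed looks roughly like
this (a simplified paraphrase of the mm/page_alloc.c code of that era, not the
exact source; the static-key check and #ifdefs are omitted):

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
	/* NUMA_LOCAL/NUMA_OTHER are decided against the CPU's local node */
	enum numa_stat_item local_stat = NUMA_LOCAL;

	if (zone_to_nid(z) != numa_node_id())
		local_stat = NUMA_OTHER;

	/*
	 * NUMA_HIT/MISS/FOREIGN are decided against the preferred zone,
	 * i.e. the first zone of the zonelist used for the allocation.
	 */
	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
		__inc_numa_state(z, NUMA_HIT);
	else {
		__inc_numa_state(z, NUMA_MISS);
		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
	}
	__inc_numa_state(z, local_stat);
}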

So the 108 comes from where exactly, some early allocations until all was
initialized?
Michal Hocko May 6, 2020, 2:02 p.m. UTC | #3
On Wed 06-05-20 15:33:36, Vlastimil Babka wrote:
> On 5/4/20 12:26 PM, Michal Hocko wrote:
> > On Mon 04-05-20 12:33:04, Sandipan Das wrote:
> >> For unpopulated zones, the pagesets point to the common
> >> boot_pageset which can have non-zero vm_numa_stat counts.
> >> Because of this memory-less nodes end up having non-zero
> >> NUMA statistics. This can be observed on any architecture
> >> that supports memory-less NUMA nodes.
> >> 
> >> E.g.
> >> 
> >>   $ numactl -H
> >>   available: 2 nodes (0-1)
> >>   node 0 cpus: 0 1 2 3
> >>   node 0 size: 0 MB
> >>   node 0 free: 0 MB
> >>   node 1 cpus: 4 5 6 7
> >>   node 1 size: 8131 MB
> >>   node 1 free: 6980 MB
> >>   node distances:
> >>   node   0   1
> >>     0:  10  40
> >>     1:  40  10
> >> 
> >>   $ numastat
> >>                              node0           node1
> >>   numa_hit                     108           56495
> >>   numa_miss                      0               0
> >>   numa_foreign                   0               0
> >>   interleave_hit                 0            4537
> >>   local_node                   108           31547
> >>   other_node                     0           24948
> >> 
> >> Hence, return zero explicitly for all the stats of an
> >> unpopulated zone.
> > 
> > I hope I am not just confused but I would expect that at least
> > numa_foreign and other_node to be non zero.
> Hmm, checking zone_statistics():
> 
> NUMA_FOREIGN increment uses preferred zone, which is the first in zone in
> zonelist, so it will be a zone from node 1 even for allocations on cpu
> associated to node 0 - assuming node 0's unpopulated zones are not included in
> node 0's zonelist.

But the allocation could have been requested for node 0 regardless of
the amount of memory the node has.
 
> NUMA_OTHER uses numa_node_id(), which would mean the node 0's cpus have node 1
> in their numa_node_id() ? Is that correct?

numa_node_id should reflect the real node the CPU is associated with.
Vlastimil Babka May 6, 2020, 3:09 p.m. UTC | #4
On 5/6/20 4:02 PM, Michal Hocko wrote:
> On Wed 06-05-20 15:33:36, Vlastimil Babka wrote:
>> On 5/4/20 12:26 PM, Michal Hocko wrote:
>> > On Mon 04-05-20 12:33:04, Sandipan Das wrote:
>> >> For unpopulated zones, the pagesets point to the common
>> >> boot_pageset which can have non-zero vm_numa_stat counts.
>> >> Because of this memory-less nodes end up having non-zero
>> >> NUMA statistics. This can be observed on any architecture
>> >> that supports memory-less NUMA nodes.
>> >> 
>> >> E.g.
>> >> 
>> >>   $ numactl -H
>> >>   available: 2 nodes (0-1)
>> >>   node 0 cpus: 0 1 2 3
>> >>   node 0 size: 0 MB
>> >>   node 0 free: 0 MB
>> >>   node 1 cpus: 4 5 6 7
>> >>   node 1 size: 8131 MB
>> >>   node 1 free: 6980 MB
>> >>   node distances:
>> >>   node   0   1
>> >>     0:  10  40
>> >>     1:  40  10
>> >> 
>> >>   $ numastat
>> >>                              node0           node1
>> >>   numa_hit                     108           56495
>> >>   numa_miss                      0               0
>> >>   numa_foreign                   0               0
>> >>   interleave_hit                 0            4537
>> >>   local_node                   108           31547
>> >>   other_node                     0           24948
>> >> 
>> >> Hence, return zero explicitly for all the stats of an
>> >> unpopulated zone.
>> > 
>> > I hope I am not just confused but I would expect that at least
>> > numa_foreign and other_node to be non zero.
>> Hmm, checking zone_statistics():
>> 
>> NUMA_FOREIGN increment uses preferred zone, which is the first in zone in
>> zonelist, so it will be a zone from node 1 even for allocations on cpu
>> associated to node 0 - assuming node 0's unpopulated zones are not included in
>> node 0's zonelist.
> 
> But the allocation could have been requested for node 0 regardless of
> the amount of memory the node has.

Yes, if we allocate from cpus 0-3 then it should be a miss on node 0. But the
zonelists are optimized in a way that they don't include empty zones -
build_zonerefs_node() checks managed_zone(). As a result, node 0's zonelist has no
node 0 zones, which confuses the stats code. We should probably document that
NUMA stats are bogus on systems with memoryless nodes. This patch makes it
somewhat more obvious by presenting nice zeroes on the memoryless node itself,
but node 1 now includes stats from node 0.
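
The zonelist construction in question, roughly (simplified from
build_zonerefs_node() in mm/page_alloc.c; check_highest_zone() and some details
are omitted, and the exact code varies by version):

static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
{
	enum zone_type zone_type = MAX_NR_ZONES;
	int nr_zones = 0;

	do {
		struct zone *zone = pgdat->node_zones + --zone_type;

		/*
		 * Zones without managed pages are skipped entirely, so a
		 * memoryless node contributes nothing to its own zonelist.
		 */
		if (managed_zone(zone))
			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
	} while (zone_type);

	return nr_zones;
}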

>> NUMA_OTHER uses numa_node_id(), which would mean the node 0's cpus have node 1
>> in their numa_node_id() ? Is that correct?
> 
> numa_node_id should reflect the real node the CPU is associated with.

You're right, numa_node_id() is probably fine. But NUMA_OTHER is actually
incremented on the zone where the allocation succeeds. This probably doesn't
match Documentation/admin-guide/numastat.rst, even on non-memoryless-node systems:

other_node      A process ran on this node and got memory from another node.
Michal Hocko May 6, 2020, 3:24 p.m. UTC | #5
On Wed 06-05-20 17:09:51, Vlastimil Babka wrote:
> On 5/6/20 4:02 PM, Michal Hocko wrote:
> > On Wed 06-05-20 15:33:36, Vlastimil Babka wrote:
> >> On 5/4/20 12:26 PM, Michal Hocko wrote:
> >> > On Mon 04-05-20 12:33:04, Sandipan Das wrote:
> >> >> For unpopulated zones, the pagesets point to the common
> >> >> boot_pageset which can have non-zero vm_numa_stat counts.
> >> >> Because of this memory-less nodes end up having non-zero
> >> >> NUMA statistics. This can be observed on any architecture
> >> >> that supports memory-less NUMA nodes.
> >> >> 
> >> >> E.g.
> >> >> 
> >> >>   $ numactl -H
> >> >>   available: 2 nodes (0-1)
> >> >>   node 0 cpus: 0 1 2 3
> >> >>   node 0 size: 0 MB
> >> >>   node 0 free: 0 MB
> >> >>   node 1 cpus: 4 5 6 7
> >> >>   node 1 size: 8131 MB
> >> >>   node 1 free: 6980 MB
> >> >>   node distances:
> >> >>   node   0   1
> >> >>     0:  10  40
> >> >>     1:  40  10
> >> >> 
> >> >>   $ numastat
> >> >>                              node0           node1
> >> >>   numa_hit                     108           56495
> >> >>   numa_miss                      0               0
> >> >>   numa_foreign                   0               0
> >> >>   interleave_hit                 0            4537
> >> >>   local_node                   108           31547
> >> >>   other_node                     0           24948
> >> >> 
> >> >> Hence, return zero explicitly for all the stats of an
> >> >> unpopulated zone.
> >> > 
> >> > I hope I am not just confused but I would expect that at least
> >> > numa_foreign and other_node to be non zero.
> >> Hmm, checking zone_statistics():
> >> 
> >> NUMA_FOREIGN increment uses preferred zone, which is the first in zone in
> >> zonelist, so it will be a zone from node 1 even for allocations on cpu
> >> associated to node 0 - assuming node 0's unpopulated zones are not included in
> >> node 0's zonelist.
> > 
> > But the allocation could have been requested for node 0 regardless of
> > the amount of memory the node has.
> 
> Yes, if we allocate from cpu 0-3 then it should be a miss on node 0. But the
> zonelists are optimized in a way that they don't include empty zones -
> build_zonerefs_node() checks managed_zone(). As a result, node 0 zonelist has no
> node 0 zones, which confuses the stats code. We should probably document that
> numa stats are bogus on systems with memoryless nodes. This patch makes it
> somewhat more obvious by presenting nice zeroes on the memoryless node itself,
> but node 1 now include stats from node 0.

Thanks for the clarification. So the underlying problem is that zone_statistics
operates on a preferred zone rather than a node. This would be fixable but
I am not sure whether this is something worth bothering with. Maybe it would
just be more convenient to document the unfortunate memory-less node
stats situation and be done with it. Or do we have any consumers that
really do care?

> >> NUMA_OTHER uses numa_node_id(), which would mean the node 0's cpus have node 1
> >> in their numa_node_id() ? Is that correct?
> > 
> > numa_node_id should reflect the real node the CPU is associated with.
> 
> You're right, numa_node_id() is probably fine. But NUMA_OTHER is actually
> incremented at the zone where the allocation succeeds. This probably doesn't
> match Documentation/admin-guide/numastat.rst, even on a non-memoryless-node systems:
> 
> other_node      A process ran on this node and got memory from another node.

Yeah, the documentation doesn't match the implementation. Maybe we
should just fix the documentation because this has been the case for
ages.
Vlastimil Babka May 6, 2020, 3:50 p.m. UTC | #6
On 5/6/20 5:24 PM, Michal Hocko wrote:
>> 
>> Yes, if we allocate from cpu 0-3 then it should be a miss on node 0. But the
>> zonelists are optimized in a way that they don't include empty zones -
>> build_zonerefs_node() checks managed_zone(). As a result, node 0 zonelist has no
>> node 0 zones, which confuses the stats code. We should probably document that
>> numa stats are bogus on systems with memoryless nodes. This patch makes it
>> somewhat more obvious by presenting nice zeroes on the memoryless node itself,
>> but node 1 now include stats from node 0.
> 
> Thanks for the clarification. So the underlying problem is that zone_statistics
> operates on a preferred zone rather than node. This would be fixable but
> I am not sure whether this is something worth bothering. Maybe it would
> just be more convenient to document the unfortunate memory less nodes
> stats situation and be done with it. Or do we have any consumers that
> really do care?
> 
>> >> NUMA_OTHER uses numa_node_id(), which would mean the node 0's cpus have node 1
>> >> in their numa_node_id() ? Is that correct?
>> > 
>> > numa_node_id should reflect the real node the CPU is associated with.
>> 
>> You're right, numa_node_id() is probably fine. But NUMA_OTHER is actually
>> incremented at the zone where the allocation succeeds. This probably doesn't
>> match Documentation/admin-guide/numastat.rst, even on a non-memoryless-node systems:
>> 
>> other_node      A process ran on this node and got memory from another node.
> 
> Yeah, the documentation doesn't match the implementation. Maybe we
> should just fix the documentation because this has been the case for
> ages.
> 

How about something like this:

diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
index aaf1667489f8..08ec2c2bdce3 100644
--- a/Documentation/admin-guide/numastat.rst
+++ b/Documentation/admin-guide/numastat.rst
@@ -6,6 +6,21 @@ Numa policy hit/miss statistics
 
 All units are pages. Hugepages have separate counters.
 
+The numa_hit, numa_miss and numa_foreign counters reflect how well processes
+are able to allocate memory from nodes they prefer. If they succeed, numa_hit
+is incremented on the preferred node, otherwise numa_foreign is incremented on
+the preferred node and numa_miss on the node where allocation succeeded.
+
+Usually preferred node is the one local to the CPU where the process executes,
+but restrictions such as mempolicies can change that, so there are also two
+counters based on CPU local node. local_node is similar to numa_hit and is
+incremented on allocation from a node by CPU on the same node. other_node is
+similar to numa_miss and is incremented on the node where allocation succeeds
+from a CPU from a different node. Note there is no counter analogical to
+numa_foreign.
+
+In more detail:
+
 =============== ============================================================
 numa_hit	A process wanted to allocate memory from this node,
 		and succeeded.
@@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
 		but ended up with memory from this node.
 
 numa_foreign	A process wanted to allocate on this node,
-		but ended up with memory from another one.
+		but ended up with memory from another node.
 
-local_node	A process ran on this node and got memory from it.
+local_node	A process ran on this node's CPU,
+		and got memory from this node.
 
-other_node	A process ran on this node and got memory from another node.
+other_node	A process ran on a different node's CPU
+		and got memory from this node.
 
 interleave_hit 	Interleaving wanted to allocate from this node
 		and succeeded.
@@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
 (http://oss.sgi.com/projects/libnuma/). Note that it only works
 well right now on machines with a small number of CPUs.
 
+Note that on systems with memoryless nodes (where a node has CPUs but no
+memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
+heavily. In the current kernel implementation, if a process prefers a
+memoryless node (i.e.  because it is running on one of its local CPU), the
+implementation actually treats one of the nearest nodes with memory as the
+preferred node. As a result, such allocation will not increase the numa_foreign
+counter on the memoryless node, and will skew the numa_hit, numa_miss and
+numa_foreign statistics of the nearest node.
Michal Hocko May 7, 2020, 7:09 a.m. UTC | #7
On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
> On 5/6/20 5:24 PM, Michal Hocko wrote:
> >> 
> >> Yes, if we allocate from cpu 0-3 then it should be a miss on node 0. But the
> >> zonelists are optimized in a way that they don't include empty zones -
> >> build_zonerefs_node() checks managed_zone(). As a result, node 0 zonelist has no
> >> node 0 zones, which confuses the stats code. We should probably document that
> >> numa stats are bogus on systems with memoryless nodes. This patch makes it
> >> somewhat more obvious by presenting nice zeroes on the memoryless node itself,
> >> but node 1 now include stats from node 0.
> > 
> > Thanks for the clarification. So the underlying problem is that zone_statistics
> > operates on a preferred zone rather than node. This would be fixable but
> > I am not sure whether this is something worth bothering. Maybe it would
> > just be more convenient to document the unfortunate memory less nodes
> > stats situation and be done with it. Or do we have any consumers that
> > really do care?
> > 
> >> >> NUMA_OTHER uses numa_node_id(), which would mean the node 0's cpus have node 1
> >> >> in their numa_node_id() ? Is that correct?
> >> > 
> >> > numa_node_id should reflect the real node the CPU is associated with.
> >> 
> >> You're right, numa_node_id() is probably fine. But NUMA_OTHER is actually
> >> incremented at the zone where the allocation succeeds. This probably doesn't
> >> match Documentation/admin-guide/numastat.rst, even on a non-memoryless-node systems:
> >> 
> >> other_node      A process ran on this node and got memory from another node.
> > 
> > Yeah, the documentation doesn't match the implementation. Maybe we
> > should just fix the documentation because this has been the case for
> > ages.
> > 
> 
> How about something like this:
> 
> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
> index aaf1667489f8..08ec2c2bdce3 100644
> --- a/Documentation/admin-guide/numastat.rst
> +++ b/Documentation/admin-guide/numastat.rst
> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>  
>  All units are pages. Hugepages have separate counters.
>  
> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
> +is incremented on the preferred node, otherwise numa_foreign is incremented on
> +the preferred node and numa_miss on the node where allocation succeeded.
> +
> +Usually preferred node is the one local to the CPU where the process executes,
> +but restrictions such as mempolicies can change that, so there are also two
> +counters based on CPU local node. local_node is similar to numa_hit and is
> +incremented on allocation from a node by CPU on the same node. other_node is
> +similar to numa_miss and is incremented on the node where allocation succeeds
> +from a CPU from a different node. Note there is no counter analogical to
> +numa_foreign.
> +
> +In more detail:
> +
>  =============== ============================================================
>  numa_hit	A process wanted to allocate memory from this node,
>  		and succeeded.
> @@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
>  		but ended up with memory from this node.
>  
>  numa_foreign	A process wanted to allocate on this node,
> -		but ended up with memory from another one.
> +		but ended up with memory from another node.
>  
> -local_node	A process ran on this node and got memory from it.
> +local_node	A process ran on this node's CPU,
> +		and got memory from this node.
>  
> -other_node	A process ran on this node and got memory from another node.
> +other_node	A process ran on a different node's CPU
> +		and got memory from this node.
>  
>  interleave_hit 	Interleaving wanted to allocate from this node
>  		and succeeded.
> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>  well right now on machines with a small number of CPUs.
>  
> +Note that on systems with memoryless nodes (where a node has CPUs but no
> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
> +heavily. In the current kernel implementation, if a process prefers a
> +memoryless node (i.e.  because it is running on one of its local CPU), the
> +implementation actually treats one of the nearest nodes with memory as the
> +preferred node. As a result, such allocation will not increase the numa_foreign
> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
> +numa_foreign statistics of the nearest node.

This is certainly an improvement. Thanks! The question of whether we can
identify where the bogus numbers came from would be interesting as well.
Maybe those are not worth fixing but it would be great to understand
them at least. I have to say that the explanation via boot_pageset is
not really clear to me.
Sandipan Das May 7, 2020, 9:05 a.m. UTC | #8
On 07/05/20 12:39 pm, Michal Hocko wrote:
> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>> [...]
>>
>> How about something like this:
>>
>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>> index aaf1667489f8..08ec2c2bdce3 100644
>> --- a/Documentation/admin-guide/numastat.rst
>> +++ b/Documentation/admin-guide/numastat.rst
>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>  
>>  All units are pages. Hugepages have separate counters.
>>  
>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
>> +is incremented on the preferred node, otherwise numa_foreign is incremented on
>> +the preferred node and numa_miss on the node where allocation succeeded.
>> +
>> +Usually preferred node is the one local to the CPU where the process executes,
>> +but restrictions such as mempolicies can change that, so there are also two
>> +counters based on CPU local node. local_node is similar to numa_hit and is
>> +incremented on allocation from a node by CPU on the same node. other_node is
>> +similar to numa_miss and is incremented on the node where allocation succeeds
>> +from a CPU from a different node. Note there is no counter analogical to
>> +numa_foreign.
>> +
>> +In more detail:
>> +
>>  =============== ============================================================
>>  numa_hit	A process wanted to allocate memory from this node,
>>  		and succeeded.
>> @@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
>>  		but ended up with memory from this node.
>>  
>>  numa_foreign	A process wanted to allocate on this node,
>> -		but ended up with memory from another one.
>> +		but ended up with memory from another node.
>>  
>> -local_node	A process ran on this node and got memory from it.
>> +local_node	A process ran on this node's CPU,
>> +		and got memory from this node.
>>  
>> -other_node	A process ran on this node and got memory from another node.
>> +other_node	A process ran on a different node's CPU
>> +		and got memory from this node.
>>  
>>  interleave_hit 	Interleaving wanted to allocate from this node
>>  		and succeeded.
>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>  well right now on machines with a small number of CPUs.
>>  
>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>> +heavily. In the current kernel implementation, if a process prefers a
>> +memoryless node (i.e.  because it is running on one of its local CPU), the
>> +implementation actually treats one of the nearest nodes with memory as the
>> +preferred node. As a result, such allocation will not increase the numa_foreign
>> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
>> +numa_foreign statistics of the nearest node.
> 
> This is certainly an improvement. Thanks! The question whether we can
> identify where bogus numbers came from would be interesting as well.
> Maybe those are not worth fixing but it would be great to understand
> them at least. I have to say that the explanation via boot_pageset is
> not really clear to me.
> 

The documentation update will definitely help. Thanks for that.
I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
in the case of zones that are still using the boot_pageset, and most of them
originate from kmem_cache_init(), with eventual calls to allocate_slab().

[    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0                                                                                                
[    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120                                                                                                 
[    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100                                                                                        
[    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0                                                                                         
[    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580                                                                                                 
[    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610                                                                                                   
...

In the remaining cases, the sources are ftrace_init() and early_trace_init().

Unless these early stats are useful, can we avoid incrementing them inside
zone_statistics() for zones that are still using the boot_pageset?
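
As a hypothetical sketch of that suggestion (this is not the posted patch, and
whether comparing against &boot_pageset inside zone_statistics() is acceptable
would need checking; shown only to convey the idea):

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
	/*
	 * Hypothetical: ignore allocations that are still accounted via the
	 * shared boot_pageset, i.e. before setup_zone_pageset() has run.
	 */
	if (z->pageset == &boot_pageset ||
	    preferred_zone->pageset == &boot_pageset)
		return;

	/* ... existing NUMA_HIT/MISS/FOREIGN/LOCAL/OTHER accounting ... */
}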


- Sandipan
Sandipan Das May 7, 2020, 9:08 a.m. UTC | #9
On 07/05/20 2:35 pm, Sandipan Das wrote:
> 
> On 07/05/20 12:39 pm, Michal Hocko wrote:
>> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>>> [...]
>>>
>>> How about something like this:
>>>
>>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>>> index aaf1667489f8..08ec2c2bdce3 100644
>>> --- a/Documentation/admin-guide/numastat.rst
>>> +++ b/Documentation/admin-guide/numastat.rst
>>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>>  
>>>  All units are pages. Hugepages have separate counters.
>>>  
>>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>>> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
>>> +is incremented on the preferred node, otherwise numa_foreign is incremented on
>>> +the preferred node and numa_miss on the node where allocation succeeded.
>>> +
>>> +Usually preferred node is the one local to the CPU where the process executes,
>>> +but restrictions such as mempolicies can change that, so there are also two
>>> +counters based on CPU local node. local_node is similar to numa_hit and is
>>> +incremented on allocation from a node by CPU on the same node. other_node is
>>> +similar to numa_miss and is incremented on the node where allocation succeeds
>>> +from a CPU from a different node. Note there is no counter analogical to
>>> +numa_foreign.
>>> +
>>> +In more detail:
>>> +
>>>  =============== ============================================================
>>>  numa_hit	A process wanted to allocate memory from this node,
>>>  		and succeeded.
>>> @@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
>>>  		but ended up with memory from this node.
>>>  
>>>  numa_foreign	A process wanted to allocate on this node,
>>> -		but ended up with memory from another one.
>>> +		but ended up with memory from another node.
>>>  
>>> -local_node	A process ran on this node and got memory from it.
>>> +local_node	A process ran on this node's CPU,
>>> +		and got memory from this node.
>>>  
>>> -other_node	A process ran on this node and got memory from another node.
>>> +other_node	A process ran on a different node's CPU
>>> +		and got memory from this node.
>>>  
>>>  interleave_hit 	Interleaving wanted to allocate from this node
>>>  		and succeeded.
>>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>>  well right now on machines with a small number of CPUs.
>>>  
>>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>>> +heavily. In the current kernel implementation, if a process prefers a
>>> +memoryless node (i.e.  because it is running on one of its local CPU), the
>>> +implementation actually treats one of the nearest nodes with memory as the
>>> +preferred node. As a result, such allocation will not increase the numa_foreign
>>> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
>>> +numa_foreign statistics of the nearest node.
>>
>> This is certainly an improvement. Thanks! The question whether we can
>> identify where bogus numbers came from would be interesting as well.
>> Maybe those are not worth fixing but it would be great to understand
>> them at least. I have to say that the explanation via boot_pageset is
>> not really clear to me.
>>
> 
> The documentation update will definitely help. Thanks for that.
> I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
> in case of zones that are using the boot_pageset and most of them originate
> from kmem_cache_init() with eventual calls to allocate_slab().
> 
> [    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0                                                                                                
> [    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120                                                                                                 
> [    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100                                                                                        
> [    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0                                                                                         
> [    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580                                                                                                 
> [    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610                                                                                                   
> ...
> 
> In the remaining cases, the sources are ftrace_init() and early_trace_init().
> 

Forgot to add that this happens during the period between zone_pcp_init() and setup_zone_pageset().
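
For context, zone_pcp_init() is where every zone initially gets pointed at the
shared boot_pageset (roughly, from mm/page_alloc.c; the debug printk is omitted):

static __meminit void zone_pcp_init(struct zone *zone)
{
	/*
	 * The per-cpu allocator is not up yet, so all zones, populated or
	 * not, start out sharing the static boot_pageset. Only populated
	 * zones get real pagesets later, via setup_zone_pageset().
	 */
	zone->pageset = &boot_pageset;
}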

- Sandipan
Vlastimil Babka May 7, 2020, 11:07 a.m. UTC | #10
On 5/7/20 11:08 AM, Sandipan Das wrote:
>>>
>>> This is certainly an improvement. Thanks! The question whether we can
>>> identify where bogus numbers came from would be interesting as well.
>>> Maybe those are not worth fixing but it would be great to understand
>>> them at least. I have to say that the explanation via boot_pageset is
>>> not really clear to me.
>>>
>> 
>> The documentation update will definitely help. Thanks for that.

Thanks both, will send a proper patch.

>> I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
>> in case of zones that are using the boot_pageset and most of them originate
>> from kmem_cache_init() with eventual calls to allocate_slab().
>> 
>> [    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0                                                                                                
>> [    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120                                                                                                 
>> [    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100                                                                                        
>> [    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0                                                                                         
>> [    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580                                                                                                 
>> [    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610                                                                                                   
>> ...
>> 
>> In the remaining cases, the sources are ftrace_init() and early_trace_init().
>> 
> 
> Forgot to add that this happens during the period between zone_pcp_init() and setup_zone_pageset().

If you can identify a good moment during the boot when the boot_pageset stops
being used, maybe we could just reset its stats to zero once at that point and
be done with it.

> - Sandipan
>
Sandipan Das May 7, 2020, 11:17 a.m. UTC | #11
On 07/05/20 4:37 pm, Vlastimil Babka wrote:
> On 5/7/20 11:08 AM, Sandipan Das wrote:
>>>>
>>>> This is certainly an improvement. Thanks! The question whether we can
>>>> identify where bogus numbers came from would be interesting as well.
>>>> Maybe those are not worth fixing but it would be great to understand
>>>> them at least. I have to say that the explanation via boot_pageset is
>>>> not really clear to me.
>>>>
>>>
>>> The documentation update will definitely help. Thanks for that.
> 
> Thanks both, will send a proper patch.
> 
>>> I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
>>> in case of zones that are using the boot_pageset and most of them originate
>>> from kmem_cache_init() with eventual calls to allocate_slab().
>>>
>>> [    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0                                                                                                
>>> [    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120                                                                                                 
>>> [    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100                                                                                        
>>> [    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0                                                                                         
>>> [    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580                                                                                                 
>>> [    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610                                                                                                   
>>> ...
>>>
>>> In the remaining cases, the sources are ftrace_init() and early_trace_init().
>>>
>>
>> Forgot to add that this happens during the period between zone_pcp_init() and setup_zone_pageset().
> 
> If you can identify a good moment during the boot when the boot_pageset stops
> being used, maybe we could just reset its stats to zero once at that point and
> be done with it.
> 

I think that would be in setup_per_cpu_pageset(), after the pagesets for the
populated zones have been allocated.
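
As a rough sketch of what such a one-time reset could look like (hypothetical
helper, not a posted patch; field and type names follow the layout of that era
and would need re-checking):

static void __init reset_boot_pageset_numa_stats(void)
{
	int cpu, item;

	/*
	 * Discard the NUMA deltas accumulated in the shared boot_pageset,
	 * e.g. at the end of setup_per_cpu_pageset() once the populated
	 * zones have their real pagesets.
	 */
	for_each_possible_cpu(cpu) {
		struct per_cpu_pageset *p = &per_cpu(boot_pageset, cpu);

		for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
			p->vm_numa_stat_diff[item] = 0;
	}
}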

- Sandipan

Patch

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..55a68b379a2c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -159,6 +159,21 @@  static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
 	long x = atomic_long_read(&zone->vm_numa_stat[item]);
 	int cpu;
 
+	/*
+	 * Initially, the pageset of all zones are set to point to the
+	 * boot_pageset. The real pagesets are allocated later but only
+	 * for the populated zones. Unpopulated zones still continue
+	 * using the boot_pageset.
+	 *
+	 * Before the real pagesets are allocated, the boot_pageset's
+	 * vm_numa_stat counters can get incremented. This affects the
+	 * unpopulated zones which end up with non-zero stats despite
+	 * having no memory associated with them. For such cases,
+	 * return zero explicitly.
+	 */
+	if (!populated_zone(zone))
+		return 0;
+
 	for_each_online_cpu(cpu)
 		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];