Message ID | 20210830121603.1081-3-bharata@amd.com (mailing list archive)
State      | New
Series     | Fix NUMA nodes fallback list ordering
On Mon, Aug 30, 2021 at 05:46:03PM +0530, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
>
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with same distances occupying the same slot in different
> node fallback lists rather than appearing in the intended round-
> robin manner. This results in one node getting picked for allocation
> more compared to other nodes with the same distance.
>
> As an example, consider a 4 node system with the following distance
> matrix.
>
> Node  0   1   2   3
> --------------------
> 0     10  12  32  32
> 1     12  10  32  32
> 2     32  32  10  12
> 3     32  32  12  10
>
> For this case, the node fallback list gets built like this:
>
> Node  Fallback list
> ---------------------
> 0     0 1 2 3
> 1     1 0 3 2
> 2     2 3 0 1
> 3     3 2 0 1 <-- Unexpected fallback order
>
> In the fallback list for nodes 2 and 3, the nodes 0 and 1
> appear in the same order which results in more allocations
> getting satisfied from node 0 compared to node 1.
>
> The effect of this on remote memory bandwidth as seen by stream
> benchmark is shown below:
>
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
>         (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
>         (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
>
> ----------------------------------------
>            BANDWIDTH (MB/s)
> TEST      Case 1       Case 2
> ----------------------------------------
> COPY      57479.6      110791.8
> SCALE     55372.9      105685.9
> ADD       50460.6      96734.2
> TRIADD    50397.6      97119.1
> ----------------------------------------
>
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0 as it appears first in the fallback order
> for both nodes 2 and 3.
>
> This can be fixed by accumulating the node load in build_zonelists()
> rather than reinitializing it during each iteration. With this the
> nodes with the same distance rightly get assigned in the round robin
> manner. In fact this was how it was originally until the
> commit f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic") dropped the load accumulation and resorted to initializing
> the load during each iteration. While zonelist ordering was removed by
> commit c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"),
> the change to the node load accumulation in build_zonelists() remained.
> So essentially this patch reverts back to the accumulated node load
> logic.
>
> After this fix, the fallback order gets built like this:
>
> Node  Fallback list
> ---------------------
> 0     0 1 2 3
> 1     1 0 3 2
> 2     2 3 0 1
> 3     3 2 1 0 <-- Note the change here
>
> The bandwidth in Case 1 improves and matches Case 2 as shown below.
>
> ----------------------------------------
>            BANDWIDTH (MB/s)
> TEST      Case 1       Case 2
> ----------------------------------------
> COPY      110438.9     110107.2
> SCALE     105930.5     105817.5
> ADD       97005.1      96159.8
> TRIADD    97441.5      96757.1
> ----------------------------------------
>
> The correctness of the fallback list generation has been verified
> for the above node configuration where the node 3 starts as
> memory-less node and comes up online only during memory hotplug.
>
> [bharata@amd.com: Added changelog, review, test validation]
>
> Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic")
> Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>

Minor note: the Fixes tag should be all on one line, even if it goes over the
line-length limit.

Otherwise, I can confirm that 2-socket Intel machines with SNC enabled suffer
from a similar problem -- the fallback lists for equal-distance nodes end up
overloading one node. Aside from the problems you mention, the overloaded node
could trigger premature reclaim and inconsistent behaviour depending on where
the task executes, so

Acked-by: Mel Gorman <mgorman@suse.de>
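For reference, the tag rejoined onto a single line as suggested would read:

    Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection logic")

The kernel's submitting-patches documentation exempts tags from the usual
wrap-at-75-columns rule precisely so that they are never split this way.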
On 8/30/21 5:46 PM, Bharata B Rao wrote:
> As an example, consider a 4 node system with the following distance
> matrix.
>
> Node  0   1   2   3
> --------------------
> 0     10  12  32  32
> 1     12  10  32  32
> 2     32  32  10  12
> 3     32  32  12  10
>
> For this case, the node fallback list gets built like this:
>
> Node  Fallback list
> ---------------------
> 0     0 1 2 3
> 1     1 0 3 2
> 2     2 3 0 1
> 3     3 2 0 1 <-- Unexpected fallback order
>
> In the fallback list for nodes 2 and 3, the nodes 0 and 1
> appear in the same order which results in more allocations
> getting satisfied from node 0 compared to node 1.
>
> The effect of this on remote memory bandwidth as seen by stream
> benchmark is shown below:
>
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
>         (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
>         (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
>
> ----------------------------------------
>            BANDWIDTH (MB/s)
> TEST      Case 1       Case 2
> ----------------------------------------
> COPY      57479.6      110791.8
> SCALE     55372.9      105685.9
> ADD       50460.6      96734.2
> TRIADD    50397.6      97119.1
> ----------------------------------------
>
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0 as it appears first in the fallback order
> for both nodes 2 and 3.

I am wondering what causes this performance drop here? Would not the memory
access latency be similar between {2, 3} ---> { 0 } and {2, 3} ---> { 1 },
given that both nodes {0, 1} are at the same distance (32 in the above
distance matrix) from {2, 3}? Even if the preferred node order changes from
{ 0 } to { 1 } for the accessing node { 3 }, that should not change the
latency as such.

Or is the performance drop here instead caused by excessive allocation on
node { 0 } and the resulting page allocation latency?
The bandwidth is limited by underutilization of cross socket links and not
the latency. Hotspotting on one node will not engage all hardware resources
based on our routing protocol which results in the lower bandwidth.
Distributing equally across nodes 0 and 1 will yield the best results as it
stresses the full system capabilities.

Thanks
Krupa Ramakrishnan

-----Original Message-----
From: Anshuman Khandual <anshuman.khandual@arm.com>
Sent: 31 August, 2021 4:58
To: Rao, Bharata Bhasker <bharata@amd.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org; kamezawa.hiroyu@jp.fujitsu.com; lee.schermerhorn@hp.com; mgorman@suse.de; Ramakrishnan, Krupa <Krupa.Ramakrishnan@amd.com>; Srinivasan, Sadagopan <Sadagopan.Srinivasan@amd.com>
Subject: Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list

[...]
On 8/31/21 8:56 PM, Ramakrishnan, Krupa wrote:
> The bandwidth is limited by underutilization of cross socket links and not
> the latency. Hotspotting on one node will not engage all hardware resources
> based on our routing protocol which results in the lower bandwidth.
> Distributing equally across nodes 0 and 1 will yield the best results as it
> stresses the full system capabilities.

Makes sense. Nonetheless this patch clearly solves a problem.

> [...]
On 8/30/21 5:46 PM, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
>
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with same distances occupying the same slot in different
> node fallback lists rather than appearing in the intended round-
> robin manner. This results in one node getting picked for allocation
> more compared to other nodes with the same distance.
>
> [...]
>
> [bharata@amd.com: Added changelog, review, test validation]
>
> Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection
> logic")
> Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
> Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 22f7ad6ec11c..47f4d160971e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6268,7 +6268,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  	 */
>  	if (node_distance(local_node, node) !=
>  	    node_distance(local_node, prev_node))
> -		node_load[node] = load;
> +		node_load[node] += load;
>
>  	node_order[nr_nodes++] = node;
>  	prev_node = node;
>

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
On 8/30/2021 5:46 PM, Bharata B Rao wrote:
> From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
>
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with same distances occupying the same slot in different
> node fallback lists rather than appearing in the intended round-
> robin manner. This results in one node getting picked for allocation
> more compared to other nodes with the same distance.
>
> [...]
>
> For this case, the node fallback list gets built like this:
>
> Node  Fallback list
> ---------------------
> 0     0 1 2 3
> 1     1 0 3 2
> 2     2 3 0 1
> 3     3 2 0 1 <-- Unexpected fallback order

FWIW, for a dual-socket 8-node system with the following distance matrix,

node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

the fallback lists look like this:

Before
======
Fallback order for Node 0: 0 1 2 3 4 5 6 7
Fallback order for Node 1: 1 2 3 0 5 6 7 4
Fallback order for Node 2: 2 3 0 1 6 7 4 5
Fallback order for Node 3: 3 0 1 2 7 4 5 6
Fallback order for Node 4: 4 5 6 7 0 1 2 3
Fallback order for Node 5: 5 6 7 4 0 1 2 3
Fallback order for Node 6: 6 7 4 5 0 1 2 3
Fallback order for Node 7: 7 4 5 6 0 1 2 3

After the fix
=============
Fallback order for Node 0: 0 1 2 3 4 5 6 7
Fallback order for Node 1: 1 2 3 0 5 6 7 4
Fallback order for Node 2: 2 3 0 1 6 7 4 5
Fallback order for Node 3: 3 0 1 2 7 4 5 6
Fallback order for Node 4: 4 5 6 7 0 1 2 3
Fallback order for Node 5: 5 6 7 4 1 2 3 0
Fallback order for Node 6: 6 7 4 5 2 3 0 1
Fallback order for Node 7: 7 4 5 6 3 0 1 2

So the problem becomes more pronounced for bigger NUMA systems.

Regards,
Bharata.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22f7ad6ec11c..47f4d160971e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6268,7 +6268,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	 */
 	if (node_distance(local_node, node) !=
 	    node_distance(local_node, prev_node))
-		node_load[node] = load;
+		node_load[node] += load;
 
 	node_order[nr_nodes++] = node;
 	prev_node = node;
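To see why accumulating the load restores the round-robin rotation, the
selection logic can be modelled outside the kernel. The following is a minimal
userspace sketch, not the kernel code: next_best_node() only approximates
find_next_best_node() (the distance scaling, the CPU penalty and the
preference for higher-numbered nodes are simplifications, and every name and
constant below is the sketch's own). It builds the fallback lists for the
4-node distance matrix from the changelog, once with the old assignment
(node_load[node] = load) and once with the patched accumulation
(node_load[node] += load).

#include <stdio.h>
#include <string.h>
#include <limits.h>

#define NR_NODES    4
#define CPU_PENALTY 1                      /* all four nodes have CPUs */
#define LOAD_SCALE  (NR_NODES * NR_NODES)  /* keep node_load a tie-breaker only */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 32, 32 },
	{ 12, 10, 32, 32 },
	{ 32, 32, 10, 12 },
	{ 32, 32, 12, 10 },
};

static int node_load[NR_NODES];

/* Crude stand-in for find_next_best_node(): lowest penalised distance wins. */
static int next_best_node(int local, int used[])
{
	int n, best = -1, min_val = INT_MAX;

	if (!used[local]) {            /* the local node always comes first */
		used[local] = 1;
		return local;
	}
	for (n = 0; n < NR_NODES; n++) {
		int val;

		if (used[n])
			continue;
		val = dist[local][n];
		val += (n < local);                    /* prefer nodes after the local one */
		val += CPU_PENALTY;
		val = val * LOAD_SCALE + node_load[n]; /* load only breaks distance ties */
		if (val < min_val) {
			min_val = val;
			best = n;
		}
	}
	if (best >= 0)
		used[best] = 1;
	return best;
}

static void build_fallback_lists(int accumulate)
{
	int local;

	memset(node_load, 0, sizeof(node_load));   /* fresh rebuild of all lists */
	printf("%s\n", accumulate ? "node_load[node] += load (patched):"
				  : "node_load[node] = load (before):");
	for (local = 0; local < NR_NODES; local++) {
		int used[NR_NODES] = { 0 };
		int load = NR_NODES, prev = local, node;

		printf("  Node %d:", local);
		while ((node = next_best_node(local, used)) >= 0) {
			/* penalise the first node of each new distance group */
			if (dist[local][node] != dist[local][prev]) {
				if (accumulate)
					node_load[node] += load;
				else
					node_load[node] = load;
			}
			printf(" %d", node);
			prev = node;
			load--;
		}
		printf("\n");
	}
}

int main(void)
{
	build_fallback_lists(0);   /* old behaviour: node 3 falls back 3 2 0 1 */
	build_fallback_lists(1);   /* patched:       node 3 falls back 3 2 1 0 */
	return 0;
}

Run this way, the assignment variant produces the fallback order 3 2 0 1 for
node 3 while the accumulating variant produces 3 2 1 0, matching the
before/after orders quoted in the changelog. The key point is that node_load
persists across the per-node builds within one rebuild, so a node that has
already been handed out as a preferred fallback carries a larger tie-breaking
penalty the next time two candidates sit at the same distance.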