Message ID | 20250319081432.18130-1-nikhil.dhama@amd.com (mailing list archive)
---|---
State | New
Series | [-V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
On 3/19/2025 1:44 PM, Nikhil Dhama wrote:
[...]
>> And, do you run network related workloads on one machine? If so, please
>> try to run them on two machines instead, with clients and servers run on
>> different machines. At least, please use different sockets for clients
>> and servers. Because a larger pcp->free_count will make it easier to
>> trigger the free_high heuristics. If that is the case, please try to
>> optimize the free_high heuristics directly too.
>
> I agree with Ying Huang, the above change is not the best possible fix for
> the issue. On further analysis I figured that the root cause of the issue
> is the frequent pcp high-order flushes. During a 20-second iperf3 run I
> observed on average 5 pcp high-order flushes in kernel v6.6, whereas in
> v6.7 I observed about 170 pcp high-order flushes.
> Tracing pcp->free_count, I figured that with patch v1 (the patch I
> suggested earlier) free_count goes negative, which reduces the number of
> times the free_high heuristics are triggered and hence reduces the
> high-order flushes.
>
> As Ying Huang suggested, increasing the batch size for the free_high
> heuristics helps performance. I tried different scaling factors to find
> the best batch value for the free_high heuristics:
>
>                   score   # free_high
>                   -----   -----------
> v6.6 (base)         100             4
> v6.12 (batch*1)      69           170
> batch*2              69           150
> batch*4              74           101
> batch*5             100            53
> batch*6             100            36
> batch*8             100             3
>
> Scaling the batch for the free_high heuristics by a factor of 5 restores
> the performance.

Hello Nikhil,

Thanks for looking further into this. But from a design standpoint, it is
not clear how a batch scale factor of 5 is helping here (Andrew's original
question).

In any case, can you post the patch set in a new email so that the below
patch is not lost in the discussion thread?
> On an AMD 2-node machine, scores for other benchmarks with patch v2 are
> as follows:
>
>                       iperf3   lmbench3    netperf             kbuild
>                                (AF_UNIX)   (SCTP_STREAM_MANY)
>                       ------   ---------   -----------------   ------
> v6.6 (base)              100         100                 100      100
> v6.12                     69         113                98.5     98.8
> v6.12 with patch v2      100       112.5               100.1     99.6
>
> For the network workloads, clients and server are running on different
> machines connected via a Mellanox ConnectX-7 NIC.
>
> Number of free_high:
>
>                       iperf3   lmbench3    netperf             kbuild
>                                (AF_UNIX)   (SCTP_STREAM_MANY)
>                       ------   ---------   -----------------   ------
> v6.6 (base)                5          12                   6        2
> v6.12                    170          11                  92        2
> v6.12 with patch v2       58          11                  34        2
>
> Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ying Huang <huang.ying.caritas@gmail.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Bharata B Rao <bharata@amd.com>
> Cc: Raghavendra <raghavendra.kodsarathimmappa@amd.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..326d5fbae353 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	 * stops will be drained from vmstat refresh context.
>  	 */
>  	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> -		free_high = (pcp->free_count >= batch &&
> +		free_high = (pcp->free_count >= (batch*5) &&
>  			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
>  			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
>  			      pcp->count >= READ_ONCE(batch)));
On 3/25/2025 1:30 PM, Raghavendra K T wrote:
> On 3/19/2025 1:44 PM, Nikhil Dhama wrote:
> [...]
>
> Hello Nikhil,
>
> Thanks for looking further into this. But from a design standpoint, it is
> not clear how a batch scale factor of 5 is helping here (Andrew's
> original question).
>
> In any case, can you post the patch set in a new email so that the below
> patch is not lost in the discussion thread?
Hi Raghavendra,

Thanks, I have posted the patch set in a new email:
https://lore.kernel.org/linux-mm/20250325171915.14384-1-nikhil.dhama@amd.com/
with a better explanation of how scaling the batch helps here.

Thanks,
Nikhil
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..326d5fbae353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_count >= batch &&
+		free_high = (pcp->free_count >= (batch*5) &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= READ_ONCE(batch)));