Message ID | 20231012121707.51368-4-yangyicong@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Series | sched/fair: Scan cluster before scanning LLC in wake-up path |
On Thu, 12 Oct 2023 at 14:19, Yicong Yang <yangyicong@huawei.com> wrote:
>
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
>
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
>
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
>   candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
>   they're still idle
>
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
> Reported-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4039f9b348ec..f1d94668bd71 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7392,7 +7392,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         bool has_idle_core = false;
>         struct sched_domain *sd;
>         unsigned long task_util, util_min, util_max;
> -       int i, recent_used_cpu;
> +       int i, recent_used_cpu, prev_aff = -1;
>
>         /*
>          * On asymmetric system, update task utilization because we will check
> @@ -7425,6 +7425,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>                 if (cpus_share_resources(prev, target))
>                         return prev;
> +
> +               prev_aff = prev;
>         }
>
>         /*
> @@ -7457,6 +7459,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>                 if (cpus_share_resources(recent_used_cpu, target))
>                         return recent_used_cpu;
> +       } else {
> +               recent_used_cpu = -1;
>         }
>
>         /*
> @@ -7497,6 +7501,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> +       /*
> +        * For cluster machines which have lower sharing cache like L2 or
> +        * LLC Tag, we tend to find an idle CPU in the target's cluster
> +        * first. But prev_cpu or recent_used_cpu may also be a good candidate,
> +        * use them if possible when no idle CPU found in select_idle_cpu().
> +        */
> +       if ((unsigned int)prev_aff < nr_cpumask_bits &&
> +           (available_idle_cpu(prev_aff) || sched_idle_cpu(prev_aff)))

Hasn't prev_aff (i.e. prev) been already tested as idle ?

> +               return prev_aff;
> +       if ((unsigned int)recent_used_cpu < nr_cpumask_bits &&
> +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
> +               return recent_used_cpu;

same here

> +
>         return target;
>  }
>
> --
> 2.24.0
>
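To make the review point above easier to follow, here is a minimal, self-contained
userspace sketch of the control flow under discussion. It is not the kernel code:
cpu_is_idle(), same_cluster(), scan_target_cluster() and the toy two-cluster
topology are made-up stand-ins for available_idle_cpu()/sched_idle_cpu(),
cpus_share_resources() and select_idle_cpu(). It illustrates why prev is tested
for idleness early yet only returned immediately when it also shares the cluster
with target, leaving prev_aff as the fallback after a failed scan.

#include <stdbool.h>
#include <stdio.h>

/* Toy topology: CPUs 0-3 form cluster 0, CPUs 4-7 form cluster 1;
 * only CPU 5 is idle. All helpers are illustrative stand-ins. */
static bool cpu_is_idle(int cpu)       { return cpu == 5; }
static bool same_cluster(int a, int b) { return a / 4 == b / 4; }

static int scan_target_cluster(int target)
{
        int base = target / 4 * 4;

        for (int cpu = base; cpu < base + 4; cpu++)
                if (cpu_is_idle(cpu))
                        return cpu;
        return -1;                      /* busy system: the scan fails */
}

/* Simplified shape of the wakeup path in this patch: prev is checked for
 * idleness early, but returned right away only when it also shares the
 * cluster with target; otherwise it is remembered in prev_aff and used
 * as a fallback when the scan finds nothing. */
static int pick_cpu(int prev, int target)
{
        int prev_aff = -1;
        int i;

        if (cpu_is_idle(prev)) {
                if (same_cluster(prev, target))
                        return prev;
                prev_aff = prev;        /* idle, but outside target's cluster */
        }

        i = scan_target_cluster(target);
        if (i >= 0)
                return i;

        /* Scan failed: prefer the remembered idle prev over target.
         * The extra idle re-check debated in this thread would sit here. */
        if (prev_aff >= 0)
                return prev_aff;

        return target;
}

int main(void)
{
        /* prev = 5 (idle, cluster 1), target = 0 (cluster 0, all busy):
         * the task goes back to CPU 5 instead of piling onto CPU 0. */
        printf("picked CPU %d\n", pick_cpu(5, 0));
        return 0;
}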
On 2023-10-13 at 17:04:32 +0200, Vincent Guittot wrote:
> On Thu, 12 Oct 2023 at 14:19, Yicong Yang <yangyicong@huawei.com> wrote:
> >
> > From: Yicong Yang <yangyicong@hisilicon.com>
> >
> > [...]
> >
> > +       /*
> > +        * For cluster machines which have lower sharing cache like L2 or
> > +        * LLC Tag, we tend to find an idle CPU in the target's cluster
> > +        * first. But prev_cpu or recent_used_cpu may also be a good candidate,
> > +        * use them if possible when no idle CPU found in select_idle_cpu().
> > +        */
> > +       if ((unsigned int)prev_aff < nr_cpumask_bits &&
> > +           (available_idle_cpu(prev_aff) || sched_idle_cpu(prev_aff)))
>
> Hasn't prev_aff (i.e. prev) been already tested as idle ?

It aims to shrink the race window in which the previously idle prev CPU becomes
non-idle during the select_idle_cpu() scan above (which may be time consuming),
so it makes a double check here.

thanks,
Chenyu

> > +               return prev_aff;
> > +       if ((unsigned int)recent_used_cpu < nr_cpumask_bits &&
> > +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
> > +               return recent_used_cpu;
>
> same here
>
> > +
> >         return target;
> >  }
> >
> > --
> > 2.24.0
> >
Hi Vincent,

On 2023/10/13 23:04, Vincent Guittot wrote:
> On Thu, 12 Oct 2023 at 14:19, Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> From: Yicong Yang <yangyicong@hisilicon.com>
>>
>> [...]
>>
>> +       /*
>> +        * For cluster machines which have lower sharing cache like L2 or
>> +        * LLC Tag, we tend to find an idle CPU in the target's cluster
>> +        * first. But prev_cpu or recent_used_cpu may also be a good candidate,
>> +        * use them if possible when no idle CPU found in select_idle_cpu().
>> +        */
>> +       if ((unsigned int)prev_aff < nr_cpumask_bits &&
>> +           (available_idle_cpu(prev_aff) || sched_idle_cpu(prev_aff)))
>
> Hasn't prev_aff (i.e. prev) been already tested as idle ?
>
>> +               return prev_aff;
>> +       if ((unsigned int)recent_used_cpu < nr_cpumask_bits &&
>> +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
>> +               return recent_used_cpu;
>
> same here
>

It was thought that there may be a small potential race window here in which the
prev/recent_used CPU becomes non-idle after the scan, as discussed in [1]. I
think the check here won't be expensive, so I added it. It should be redundant
and can be removed.

[1] https://lore.kernel.org/all/ZIams6s+qShFWhfQ@BLR-5CG11610CF.amd.com/

Thanks.
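As a side note on the race being described: it can be modelled in a few lines of
ordinary C. The sketch below is a toy userspace illustration only (samples_taken
and cpu_idle_now() are invented for the example; nothing here is kernel API). It
shows how a CPU observed idle before a long scan can be busy again by the time
the fallback runs, which is the window the re-check tries to narrow without being
able to close it entirely.

#include <stdbool.h>
#include <stdio.h>

static int samples_taken;

/* Idleness is re-sampled on every query; pretend the CPU goes busy
 * right after the first observation. */
static bool cpu_idle_now(int cpu)
{
        (void)cpu;
        return samples_taken++ == 0;
}

int main(void)
{
        int prev = 5, target = 0, prev_aff = -1;

        if (cpu_idle_now(prev))
                prev_aff = prev;        /* early check: prev looked idle */

        /* ... a potentially long select_idle_cpu()-style scan happens here,
         * during which prev may be picked up by other work ... */

        if (prev_aff >= 0 && cpu_idle_now(prev_aff))
                printf("fall back to prev %d\n", prev_aff);
        else
                printf("prev went busy, stay with target %d\n", target);

        return 0;
}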
On Mon, 16 Oct 2023 at 14:55, Yicong Yang <yangyicong@huawei.com> wrote:
>
> Hi Vincent,
>
> On 2023/10/13 23:04, Vincent Guittot wrote:
> > On Thu, 12 Oct 2023 at 14:19, Yicong Yang <yangyicong@huawei.com> wrote:
> >>
> >> From: Yicong Yang <yangyicong@hisilicon.com>
> >>
> >> [...]
> >>
> >> +       if ((unsigned int)prev_aff < nr_cpumask_bits &&
> >> +           (available_idle_cpu(prev_aff) || sched_idle_cpu(prev_aff)))
> >
> > Hasn't prev_aff (i.e. prev) been already tested as idle ?
> >
> >> +               return prev_aff;
> >> +       if ((unsigned int)recent_used_cpu < nr_cpumask_bits &&
> >> +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
> >> +               return recent_used_cpu;
> >
> > same here
> >
>
> It was thought that there may be a small potential race window here in which the
> prev/recent_used CPU becomes non-idle after the scan, as discussed in [1]. I
> think the check here won't be expensive, so I added it. It should be redundant
> and can be removed.

I agree that there is a race, but the whole function select_idle_sibling() is
made of possible races, because by the time it selects a CPU that CPU can
become non-idle. It would be good to have some figures showing that these
redundant checks make a difference.

>
> [1] https://lore.kernel.org/all/ZIams6s+qShFWhfQ@BLR-5CG11610CF.amd.com/
>
> Thanks.
On 2023/10/17 23:23, Vincent Guittot wrote:
> On Mon, 16 Oct 2023 at 14:55, Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> Hi Vincent,
>>
>> [...]
>>
>> It was thought that there may be a small potential race window here in which the
>> prev/recent_used CPU becomes non-idle after the scan, as discussed in [1]. I
>> think the check here won't be expensive, so I added it. It should be redundant
>> and can be removed.
>>
>> [1] https://lore.kernel.org/all/ZIams6s+qShFWhfQ@BLR-5CG11610CF.amd.com/
>
> I agree that there is a race, but the whole function select_idle_sibling() is
> made of possible races, because by the time it selects a CPU that CPU can
> become non-idle. It would be good to have some figures showing that these
> redundant checks make a difference.
>

Got it. Actually I see no difference from these two checks on my machine, so I
would like to remove them.

Thanks.
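For readers following along, dropping the two idle re-checks would presumably
leave the fallback looking something like the excerpt below. This is only a
sketch of a possible next revision, derived from the hunk quoted earlier in the
thread, not an actual posted patch.

        /*
         * For cluster machines which have lower sharing cache like L2 or
         * LLC Tag, we tend to find an idle CPU in the target's cluster
         * first. But prev_cpu or recent_used_cpu may also be a good candidate,
         * use them if possible when no idle CPU found in select_idle_cpu().
         */
        if ((unsigned int)prev_aff < nr_cpumask_bits)
                return prev_aff;
        if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
                return recent_used_cpu;

        return target;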
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4039f9b348ec..f1d94668bd71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7392,7 +7392,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         bool has_idle_core = false;
         struct sched_domain *sd;
         unsigned long task_util, util_min, util_max;
-        int i, recent_used_cpu;
+        int i, recent_used_cpu, prev_aff = -1;
 
         /*
          * On asymmetric system, update task utilization because we will check
@@ -7425,6 +7425,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
                 if (cpus_share_resources(prev, target))
                         return prev;
+
+                prev_aff = prev;
         }
 
         /*
@@ -7457,6 +7459,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
                 if (cpus_share_resources(recent_used_cpu, target))
                         return recent_used_cpu;
+        } else {
+                recent_used_cpu = -1;
         }
 
         /*
@@ -7497,6 +7501,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         if ((unsigned)i < nr_cpumask_bits)
                 return i;
 
+        /*
+         * For cluster machines which have lower sharing cache like L2 or
+         * LLC Tag, we tend to find an idle CPU in the target's cluster
+         * first. But prev_cpu or recent_used_cpu may also be a good candidate,
+         * use them if possible when no idle CPU found in select_idle_cpu().
+         */
+        if ((unsigned int)prev_aff < nr_cpumask_bits &&
+            (available_idle_cpu(prev_aff) || sched_idle_cpu(prev_aff)))
+                return prev_aff;
+        if ((unsigned int)recent_used_cpu < nr_cpumask_bits &&
+            (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
+                return recent_used_cpu;
+
         return target;
 }