diff mbox series

[RFC] psi : calc psi memstall time more precisely

Message ID 1631188824-25623-1-git-send-email-huangzhaoyang@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] psi : calc psi memstall time more precisely | expand

Commit Message

Zhaoyang Huang Sept. 9, 2021, noon UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

psi's memstall time is counted as simple as exit - entry so far, which ignore
the task's off cpu time. Fix it by calc the percentage of off time via task and
rq's util and runq load.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 kernel/sched/psi.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Vlastimil Babka Sept. 9, 2021, 1:07 p.m. UTC | #1
On 9/9/21 14:00, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> psi's memstall time is counted as simple as exit - entry so far, which ignore
> the task's off cpu time. Fix it by calc the percentage of off time via task and
> rq's util and runq load.

Wouldn't this also filter out IO wait time as that means sleeping, thus
again defeat the purpose of observing real stalls due to memory shortage?
If cpu starvation (due to overcommited system?) affects pci memstall
accuracy then that's suboptimal, but IMHO fixing it this way would do more
harm than good?

> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  kernel/sched/psi.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3c..6812c7e 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
>  
>  static void psi_avgs_work(struct work_struct *work);
>  
> +static unsigned long psi_memtime_fixup(u32 growth);
> +
>  static void group_init(struct psi_group *group)
>  {
>  	int cpu;
> @@ -492,6 +494,23 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>  	return growth;
>  }
>  
> +static unsigned long psi_memtime_fixup(u32 growth)
> +{
> +	struct rq *rq = task_rq(current);
> +	unsigned long growth_fixed = (unsigned long)growth;
> +
> +	if !(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH)
> +		return growth_fixed;
> +
> +	if ((current->in_memstall)
> +		&& (current->se.avg.util_avg <= rq->cfs.avg.util_avg)
> +		&& current->se.avg.util_avg != 0)
> +		growth_fixed = div64_ul((current->se.avg.util_avg + 1) * growth,
> +			rq->cfs.avg.util_avg + rq->avg_rt.util_avg + 1);
> +
> +	return growth_fixed;
> +}
> +
>  static void init_triggers(struct psi_group *group, u64 now)
>  {
>  	struct psi_trigger *t;
> @@ -658,6 +677,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>  	}
>  
>  	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> +		delta = psi_memtime_fixup(delta);
>  		groupc->times[PSI_MEM_SOME] += delta;
>  		if (groupc->state_mask & (1 << PSI_MEM_FULL))
>  			groupc->times[PSI_MEM_FULL] += delta;
> @@ -928,8 +948,8 @@ void psi_memstall_leave(unsigned long *flags)
>  	 */
>  	rq = this_rq_lock_irq(&rf);
>  
> -	current->in_memstall = 0;
>  	psi_task_change(current, TSK_MEMSTALL, 0);
> +	current->in_memstall = 0;
>  
>  	rq_unlock_irq(rq, &rf);
>  }
>
Johannes Weiner Sept. 9, 2021, 3:56 p.m. UTC | #2
On Thu, Sep 09, 2021 at 08:00:24PM +0800, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> psi's memstall time is counted as simple as exit - entry so far, which ignore
> the task's off cpu time. Fix it by calc the percentage of off time via task and
> rq's util and runq load.
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Can you please explain what practical problem you are trying to solve?

If a reclaimer gets preempted and has to wait for CPU, should that
stall be attributed to a lack of memory? Some of it should, since page
reclaim consumed CPU budget that would've otherwise been available for
doing real work. The application of course may still have experienced
a CPU wait outside of reclaim, but potentially a shorter one. Memory
pressure can definitely increase CPU pressure (as it can IO pressure).

Proportional and transitive accounting - how much of total CPU load is
page reclaim, and thus how much of each runq wait is due to memory
pressure - would give more precise answers. But generally discounting
off-CPU time in a stall is not any more correct than including it all.

This is doable, but I think there needs to be better justification for
providing this level of precision, since it comes with code complexity
that has performance and maintenance overhead.
Zhaoyang Huang Sept. 14, 2021, 1:24 a.m. UTC | #3
On Thu, Sep 9, 2021 at 11:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Sep 09, 2021 at 08:00:24PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > psi's memstall time is counted as simple as exit - entry so far, which ignore
> > the task's off cpu time. Fix it by calc the percentage of off time via task and
> > rq's util and runq load.
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Can you please explain what practical problem you are trying to solve?
>
> If a reclaimer gets preempted and has to wait for CPU, should that
> stall be attributed to a lack of memory? Some of it should, since page
> reclaim consumed CPU budget that would've otherwise been available for
> doing real work. The application of course may still have experienced
> a CPU wait outside of reclaim, but potentially a shorter one. Memory
> pressure can definitely increase CPU pressure (as it can IO pressure).
The preempted time which is mentioned here can be separated into 2
categories. First one is cfs task preempted because running out of the
share of schedule_lantency. The second one is preempted by RT,DL and
IRQs. IMO, the previous is reasonable to be counted in stall time,
while the second one NOT.
>
> Proportional and transitive accounting - how much of total CPU load is
> page reclaim, and thus how much of each runq wait is due to memory
> pressure - would give more precise answers. But generally discounting
> off-CPU time in a stall is not any more correct than including it all.
>
> This is doable, but I think there needs to be better justification for
> providing this level of precision, since it comes with code complexity
> that has performance and maintenance overhead.
The rq's utilization of load tracking provides an easy way to compute
such proportion. A new commit has been given out which mainly deals
with the 2nd scenario described above. The statistics of the precision
are provided together.
diff mbox series

Patch

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3c..6812c7e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -182,6 +182,8 @@  struct psi_group psi_system = {
 
 static void psi_avgs_work(struct work_struct *work);
 
+static unsigned long psi_memtime_fixup(u32 growth);
+
 static void group_init(struct psi_group *group)
 {
 	int cpu;
@@ -492,6 +494,23 @@  static u64 window_update(struct psi_window *win, u64 now, u64 value)
 	return growth;
 }
 
+static unsigned long psi_memtime_fixup(u32 growth)
+{
+	struct rq *rq = task_rq(current);
+	unsigned long growth_fixed = (unsigned long)growth;
+
+	if !(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH)
+		return growth_fixed;
+
+	if ((current->in_memstall)
+		&& (current->se.avg.util_avg <= rq->cfs.avg.util_avg)
+		&& current->se.avg.util_avg != 0)
+		growth_fixed = div64_ul((current->se.avg.util_avg + 1) * growth,
+			rq->cfs.avg.util_avg + rq->avg_rt.util_avg + 1);
+
+	return growth_fixed;
+}
+
 static void init_triggers(struct psi_group *group, u64 now)
 {
 	struct psi_trigger *t;
@@ -658,6 +677,7 @@  static void record_times(struct psi_group_cpu *groupc, u64 now)
 	}
 
 	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
+		delta = psi_memtime_fixup(delta);
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
@@ -928,8 +948,8 @@  void psi_memstall_leave(unsigned long *flags)
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->in_memstall = 0;
 	psi_task_change(current, TSK_MEMSTALL, 0);
+	current->in_memstall = 0;
 
 	rq_unlock_irq(rq, &rf);
 }