
[RFC,10/10] psi: aggregate ongoing stall events when somebody reads pressure

Message ID 20180712172942.10094-11-hannes@cmpxchg.org (mailing list archive)
State New, archived

Commit Message

Johannes Weiner July 12, 2018, 5:29 p.m. UTC
Right now, psi reports pressure and stall times of already concluded
stall events. For most use cases this is current enough, but certain
highly latency-sensitive applications, like the Android OOM killer,
might want to know about and react to stall states before they have
even concluded (e.g. a prolonged reclaim cycle).

This patches the procfs/cgroupfs interface such that when the pressure
metrics are read, the current per-cpu states, if any, are taken into
account as well.

Any ongoing states are concluded, their time snapshotted, and then
restarted. This requires holding the rq lock to avoid corruption. It
could use some form of rq lock ratelimiting or avoidance.

Requested-by: Suren Baghdasaryan <surenb@google.com>
Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 56 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 10 deletions(-)

Comments

Andrew Morton July 12, 2018, 11:45 p.m. UTC | #1
On Thu, 12 Jul 2018 13:29:42 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Right now, psi reports pressure and stall times of already concluded
> stall events. For most use cases this is current enough, but certain
> highly latency-sensitive applications, like the Android OOM killer,
> might want to know about and react to stall states before they have
> even concluded (e.g. a prolonged reclaim cycle).
> 
> This patches the procfs/cgroupfs interface such that when the pressure
> metrics are read, the current per-cpu states, if any, are taken into
> account as well.
> 
> Any ongoing states are concluded, their time snapshotted, and then
> restarted. This requires holding the rq lock to avoid corruption. It
> could use some form of rq lock ratelimiting or avoidance.
> 
> Requested-by: Suren Baghdasaryan <surenb@google.com>
> Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

What-does-that-mean:?
Suren Baghdasaryan July 13, 2018, 10:13 p.m. UTC | #2
On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Right now, psi reports pressure and stall times of already concluded
> stall events. For most use cases this is current enough, but certain
> highly latency-sensitive applications, like the Android OOM killer,

To be more precise, it's the Android LMKD (low memory killer daemon),
not to be confused with the kernel OOM killer.

> might want to know about and react to stall states before they have
> even concluded (e.g. a prolonged reclaim cycle).
>
> This patches the procfs/cgroupfs interface such that when the pressure
> metrics are read, the current per-cpu states, if any, are taken into
> account as well.
>
> Any ongoing states are concluded, their time snapshotted, and then
> restarted. This requires holding the rq lock to avoid corruption. It
> could use some form of rq lock ratelimiting or avoidance.
>
> Requested-by: Suren Baghdasaryan <surenb@google.com>
> Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---

IMHO this description is a little difficult to understand. In essence,
PSI information is being updated periodically every 2secs and without
this patch the data can be stale at the time when we read it (because
it was last updated up to 2secs ago). To avoid this we update the PSI
"total" values when data is being read.

>  kernel/sched/psi.c | 56 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 53e0b7b83e2e..5a6c6057f775 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
>         }
>  }
>
> -static bool psi_update_stats(struct psi_group *group)
> +static bool psi_update_stats(struct psi_group *group, bool ondemand)
>  {
>         u64 some[NR_PSI_RESOURCES] = { 0, };
>         u64 full[NR_PSI_RESOURCES] = { 0, };
> @@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_group *group)
>         int cpu;
>         int r;
>
> -       mutex_lock(&group->stat_lock);
> -
>         /*
>          * Collect the per-cpu time buckets and average them into a
>          * single time sample that is normalized to wallclock time.
> @@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group)
>         for_each_online_cpu(cpu) {
>                 struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
>                 unsigned long nonidle;
> +               struct rq_flags rf;
> +               struct rq *rq;
> +               u64 now;
>
> -               if (!groupc->nonidle_time)
> +               if (!groupc->nonidle_time && !groupc->nonidle)
>                         continue;
>
> +               /*
> +                * We come here for two things: 1) periodic per-cpu
> +                * bucket flushing and averaging and 2) when the user
> +                * wants to read a pressure file. For flushing and
> +                * averaging, which is relatively infrequent, we can
> +                * be lazy and tolerate some raciness with concurrent
> +                * updates to the per-cpu counters. However, if a user
> +                * polls the pressure state, we want to give them the
> +                * most uptodate information we have, including any
> +                * currently active state which hasn't been timed yet,
> +                * because in case of an iowait or a reclaim run, that
> +                * can be significant.
> +                */
> +               if (ondemand) {
> +                       rq = cpu_rq(cpu);
> +                       rq_lock_irq(rq, &rf);
> +
> +                       now = cpu_clock(cpu);
> +
> +                       groupc->nonidle_time += now - groupc->nonidle_start;
> +                       groupc->nonidle_start = now;
> +               }
> +
>                 nonidle = nsecs_to_jiffies(groupc->nonidle_time);
>                 groupc->nonidle_time = 0;
>                 nonidle_total += nonidle;
> @@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_group *group)
>                 for (r = 0; r < NR_PSI_RESOURCES; r++) {
>                         struct psi_resource *res = &groupc->res[r];
>
> +                       if (ondemand && res->state != PSI_NONE) {
> +                               bool is_full = res->state == PSI_FULL;
> +
> +                               res->times[is_full] += now - res->state_start;
> +                               res->state_start = now;
> +                       }
> +
>                         some[r] += (res->times[0] + res->times[1]) * nonidle;
>                         full[r] += res->times[1] * nonidle;
>
> -                       /* It's racy, but we can tolerate some error */
>                         res->times[0] = 0;
>                         res->times[1] = 0;
>                 }
> +
> +               if (ondemand)
> +                       rq_unlock_irq(rq, &rf);
> +       }
> +
> +       for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +               do_div(some[r], max(nonidle_total, 1UL));
> +               do_div(full[r], max(nonidle_total, 1UL));
>         }
>
>         /*
> @@ -249,12 +287,10 @@ static bool psi_update_stats(struct psi_group *group)
>          * activity, thus no data, and clock ticks are sporadic. The
>          * below handles both.
>          */
> +       mutex_lock(&group->stat_lock);
>
>         /* total= */
>         for (r = 0; r < NR_PSI_RESOURCES; r++) {
> -               do_div(some[r], max(nonidle_total, 1UL));
> -               do_div(full[r], max(nonidle_total, 1UL));
> -
>                 group->some[r] += some[r];
>                 group->full[r] += full[r];
>         }
> @@ -301,7 +337,7 @@ static void psi_clock(struct work_struct *work)
>          * go - see calc_avgs() and missed_periods.
>          */
>
> -       nonidle = psi_update_stats(group);
> +       nonidle = psi_update_stats(group, false);
>
>         if (nonidle) {
>                 unsigned long delay = 0;
> @@ -570,7 +606,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>         if (psi_disabled)
>                 return -EOPNOTSUPP;
>
> -       psi_update_stats(group);
> +       psi_update_stats(group, true);
>
>         for (w = 0; w < 3; w++) {
>                 avg[0][w] = group->avg_some[res][w];
> --
> 2.18.0
>
Johannes Weiner July 13, 2018, 10:17 p.m. UTC | #3
On Thu, Jul 12, 2018 at 04:45:37PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2018 13:29:42 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Right now, psi reports pressure and stall times of already concluded
> > stall events. For most use cases this is current enough, but certain
> > highly latency-sensitive applications, like the Android OOM killer,
> > might want to know about and react to stall states before they have
> > even concluded (e.g. a prolonged reclaim cycle).
> > 
> > This patches the procfs/cgroupfs interface such that when the pressure
> > metrics are read, the current per-cpu states, if any, are taken into
> > account as well.
> > 
> > Any ongoing states are concluded, their time snapshotted, and then
> > restarted. This requires holding the rq lock to avoid corruption. It
> > could use some form of rq lock ratelimiting or avoidance.
> > 
> > Requested-by: Suren Baghdasaryan <surenb@google.com>
> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> What-does-that-mean:?

I didn't think this patch was ready for upstream yet, hence the RFC
and the lack of a proper sign-off.

But Suren has been testing this and found it useful in his specific
low-latency application, so I included it for completeness, for other
testers to find, and for possible suggestions on how to improve it.
Johannes Weiner July 13, 2018, 10:49 p.m. UTC | #4
On Fri, Jul 13, 2018 at 03:13:07PM -0700, Suren Baghdasaryan wrote:
> On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > might want to know about and react to stall states before they have
> > even concluded (e.g. a prolonged reclaim cycle).
> >
> > This patches the procfs/cgroupfs interface such that when the pressure
> > metrics are read, the current per-cpu states, if any, are taken into
> > account as well.
> >
> > Any ongoing states are concluded, their time snapshotted, and then
> > restarted. This requires holding the rq lock to avoid corruption. It
> > could use some form of rq lock ratelimiting or avoidance.
> >
> > Requested-by: Suren Baghdasaryan <surenb@google.com>
> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> 
> IMHO this description is a little difficult to understand. In essence,
> PSI information is being updated periodically every 2secs and without
> this patch the data can be stale at the time when we read it (because
> it was last updated up to 2secs ago). To avoid this we update the PSI
> "total" values when data is being read.

That fix I actually folded into the main patch. We now always update
the total= field at the time the user reads to include all concluded
events, even if we sampled less than 2s ago. Only the running averages
are still bound to the 2s sampling window.

What this patch adds on top is for total= to include any *ongoing*
stall events that might be happening on a CPU at the time of reading
from the interface, like a reclaim cycle that hasn't finished yet.
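
A rough way to see that effect, sketched under the assumption that the
pressure files expose a cumulative total= field on the "some" line as
described above (path, line format and time units are assumptions, not
guaranteed by this RFC): sample total= twice and look at the delta,
which with this patch also covers stalls still in progress at the
second read.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse the first total= value (the "some" line) out of a pressure file. */
static unsigned long long read_some_total(const char *path)
{
	char buf[256];
	ssize_t len;
	char *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return 0;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return 0;
	buf[len] = '\0';

	p = strstr(buf, "total=");
	return p ? strtoull(p + strlen("total="), NULL, 10) : 0;
}

int main(void)
{
	unsigned long long t0, t1;

	t0 = read_some_total("/proc/pressure/memory");
	sleep(1);
	t1 = read_some_total("/proc/pressure/memory");

	/* With this patch, the delta includes any still-ongoing stall. */
	printf("memory some stall over ~1s: %llu (interface time units)\n",
	       t1 - t0);
	return 0;
}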
Suren Baghdasaryan July 13, 2018, 11:34 p.m. UTC | #5
On Fri, Jul 13, 2018 at 3:49 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Jul 13, 2018 at 03:13:07PM -0700, Suren Baghdasaryan wrote:
>> On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > might want to know about and react to stall states before they have
>> > even concluded (e.g. a prolonged reclaim cycle).
>> >
>> > This patches the procfs/cgroupfs interface such that when the pressure
>> > metrics are read, the current per-cpu states, if any, are taken into
>> > account as well.
>> >
>> > Any ongoing states are concluded, their time snapshotted, and then
>> > restarted. This requires holding the rq lock to avoid corruption. It
>> > could use some form of rq lock ratelimiting or avoidance.
>> >
>> > Requested-by: Suren Baghdasaryan <surenb@google.com>
>> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> > ---
>>
>> IMHO this description is a little difficult to understand. In essence,
>> PSI information is being updated periodically every 2secs and without
>> this patch the data can be stale at the time when we read it (because
>> it was last updated up to 2secs ago). To avoid this we update the PSI
>> "total" values when data is being read.
>
> That fix I actually folded into the main patch. We now always update
> the total= field at the time the user reads to include all concluded
> events, even if we sampled less than 2s ago. Only the running averages
> are still bound to the 2s sampling window.
>
> What this patch adds on top is for total= to include any *ongoing*
> stall events that might be happening on a CPU at the time of reading
> from the interface, like a reclaim cycle that hasn't finished yet.

Ok, I see now what you mean. So the ondemand flag controls whether
*ongoing* stall events are accounted for or not. Nit: maybe rename
that flag to better explain its function?
Peter Zijlstra July 17, 2018, 3:13 p.m. UTC | #6
On Thu, Jul 12, 2018 at 01:29:42PM -0400, Johannes Weiner wrote:
> @@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group)
>  	for_each_online_cpu(cpu) {
>  		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
>  		unsigned long nonidle;
> +		struct rq_flags rf;
> +		struct rq *rq;
> +		u64 now;
>  
> -		if (!groupc->nonidle_time)
> +		if (!groupc->nonidle_time && !groupc->nonidle)
>  			continue;
>  
> +		/*
> +		 * We come here for two things: 1) periodic per-cpu
> +		 * bucket flushing and averaging and 2) when the user
> +		 * wants to read a pressure file. For flushing and
> +		 * averaging, which is relatively infrequent, we can
> +		 * be lazy and tolerate some raciness with concurrent
> +		 * updates to the per-cpu counters. However, if a user
> +		 * polls the pressure state, we want to give them the
> +		 * most uptodate information we have, including any
> +		 * currently active state which hasn't been timed yet,
> +		 * because in case of an iowait or a reclaim run, that
> +		 * can be significant.
> +		 */
> +		if (ondemand) {
> +			rq = cpu_rq(cpu);
> +			rq_lock_irq(rq, &rf);

That's a DoS right there..
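
The commit message itself anticipates this ("some form of rq lock
ratelimiting or avoidance"). One minimal sketch of such a ratelimit,
purely illustrative and not part of the patch (the psi_last_ondemand
variable and the 2ms threshold are made up, and the racy read/update
of the timestamp is ignored for brevity), would gate the on-demand
path so that back-to-back reads cannot hammer every CPU's rq lock:

/* Hypothetical guard, checked before taking any rq lock in the ondemand path. */
static u64 psi_last_ondemand;

static bool psi_ondemand_allowed(void)
{
	u64 now = sched_clock();

	/* Back-to-back readers get slightly stale data instead of rq locks. */
	if (now - psi_last_ondemand < 2 * NSEC_PER_MSEC)
		return false;

	psi_last_ondemand = now;
	return true;
}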

Patch

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 53e0b7b83e2e..5a6c6057f775 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -190,7 +190,7 @@  static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
 	}
 }
 
-static bool psi_update_stats(struct psi_group *group)
+static bool psi_update_stats(struct psi_group *group, bool ondemand)
 {
 	u64 some[NR_PSI_RESOURCES] = { 0, };
 	u64 full[NR_PSI_RESOURCES] = { 0, };
@@ -200,8 +200,6 @@  static bool psi_update_stats(struct psi_group *group)
 	int cpu;
 	int r;
 
-	mutex_lock(&group->stat_lock);
-
 	/*
 	 * Collect the per-cpu time buckets and average them into a
 	 * single time sample that is normalized to wallclock time.
@@ -218,10 +216,36 @@  static bool psi_update_stats(struct psi_group *group)
 	for_each_online_cpu(cpu) {
 		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
 		unsigned long nonidle;
+		struct rq_flags rf;
+		struct rq *rq;
+		u64 now;
 
-		if (!groupc->nonidle_time)
+		if (!groupc->nonidle_time && !groupc->nonidle)
 			continue;
 
+		/*
+		 * We come here for two things: 1) periodic per-cpu
+		 * bucket flushing and averaging and 2) when the user
+		 * wants to read a pressure file. For flushing and
+		 * averaging, which is relatively infrequent, we can
+		 * be lazy and tolerate some raciness with concurrent
+		 * updates to the per-cpu counters. However, if a user
+		 * polls the pressure state, we want to give them the
+		 * most uptodate information we have, including any
+		 * currently active state which hasn't been timed yet,
+		 * because in case of an iowait or a reclaim run, that
+		 * can be significant.
+		 */
+		if (ondemand) {
+			rq = cpu_rq(cpu);
+			rq_lock_irq(rq, &rf);
+
+			now = cpu_clock(cpu);
+
+			groupc->nonidle_time += now - groupc->nonidle_start;
+			groupc->nonidle_start = now;
+		}
+
 		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
 		groupc->nonidle_time = 0;
 		nonidle_total += nonidle;
@@ -229,13 +253,27 @@  static bool psi_update_stats(struct psi_group *group)
 		for (r = 0; r < NR_PSI_RESOURCES; r++) {
 			struct psi_resource *res = &groupc->res[r];
 
+			if (ondemand && res->state != PSI_NONE) {
+				bool is_full = res->state == PSI_FULL;
+
+				res->times[is_full] += now - res->state_start;
+				res->state_start = now;
+			}
+
 			some[r] += (res->times[0] + res->times[1]) * nonidle;
 			full[r] += res->times[1] * nonidle;
 
-			/* It's racy, but we can tolerate some error */
 			res->times[0] = 0;
 			res->times[1] = 0;
 		}
+
+		if (ondemand)
+			rq_unlock_irq(rq, &rf);
+	}
+
+	for (r = 0; r < NR_PSI_RESOURCES; r++) {
+		do_div(some[r], max(nonidle_total, 1UL));
+		do_div(full[r], max(nonidle_total, 1UL));
 	}
 
 	/*
@@ -249,12 +287,10 @@  static bool psi_update_stats(struct psi_group *group)
 	 * activity, thus no data, and clock ticks are sporadic. The
 	 * below handles both.
 	 */
+	mutex_lock(&group->stat_lock);
 
 	/* total= */
 	for (r = 0; r < NR_PSI_RESOURCES; r++) {
-		do_div(some[r], max(nonidle_total, 1UL));
-		do_div(full[r], max(nonidle_total, 1UL));
-
 		group->some[r] += some[r];
 		group->full[r] += full[r];
 	}
@@ -301,7 +337,7 @@  static void psi_clock(struct work_struct *work)
 	 * go - see calc_avgs() and missed_periods.
 	 */
 
-	nonidle = psi_update_stats(group);
+	nonidle = psi_update_stats(group, false);
 
 	if (nonidle) {
 		unsigned long delay = 0;
@@ -570,7 +606,7 @@  int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	if (psi_disabled)
 		return -EOPNOTSUPP;
 
-	psi_update_stats(group);
+	psi_update_stats(group, true);
 
 	for (w = 0; w < 3; w++) {
 		avg[0][w] = group->avg_some[res][w];