
[v3,3/6] sched: Change wait_task_inactive()'s match_state

Message ID: 20220822114648.856734578@infradead.org (mailing list archive)
State: Handled Elsewhere, archived
Series: Freezer Rewrite

Commit Message

Peter Zijlstra Aug. 22, 2022, 11:18 a.m. UTC
Make wait_task_inactive()'s @match_state work like ttwu()'s @state.

That is, instead of an equal comparison, use it as a mask. This allows
matching multiple block conditions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
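
For a feel of what the mask semantics buy, here is a minimal userspace
sketch (not part of the patch; the TASK_* values mirror
include/linux/sched.h) contrasting the old equality match with the new
mask match:

	#include <stdio.h>

	#define TASK_INTERRUPTIBLE	0x0001
	#define TASK_UNINTERRUPTIBLE	0x0002

	static int match_old(unsigned int state, unsigned int match_state)
	{
		return state == match_state;		/* exactly one state can match */
	}

	static int match_new(unsigned int state, unsigned int match_state)
	{
		return (state & match_state) != 0;	/* any set bit matches */
	}

	int main(void)
	{
		unsigned int mask = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE;

		/* Old semantics: a multi-bit mask can never equal a single state. */
		printf("old: %d\n", match_old(TASK_UNINTERRUPTIBLE, mask));	/* 0 */
		/* New semantics: a task blocked in either state matches. */
		printf("new: %d\n", match_new(TASK_UNINTERRUPTIBLE, mask));	/* 1 */
		return 0;
	}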

Comments

Ingo Molnar Sept. 4, 2022, 10:44 a.m. UTC | #1
* Peter Zijlstra <peterz@infradead.org> wrote:

> Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> 
> That is, instead of an equal comparison, use it as a mask. This allows
> matching multiple block conditions.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
>  		 * is actually now running somewhere else!
>  		 */
>  		while (task_running(rq, p)) {
> -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> +			if (match_state && !(READ_ONCE(p->__state) & match_state))
>  				return 0;

We lose the unlikely annotation there - but I guess it probably never 
really mattered anyway?

Suggestion #1:

- Shouldn't we rename task_running() to something like task_on_cpu()? The 
  task_running() primitive is similar to TASK_RUNNING but is not based off 
  any TASK_FLAGS.

Suggestion #2:

- Shouldn't we eventually standardize on task->on_cpu on UP kernels too? 
  They don't really matter anymore, and doing so removes #ifdefs and makes 
  the code easier to read.
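
  A sketch of the helper this would allow (hypothetical -- it assumes UP
  kernels also maintain p->on_cpu, and uses the task_on_cpu() name from
  suggestion #1; today the UP side returns task_current() instead):

	static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
	{
		return p->on_cpu;	/* no CONFIG_SMP #ifdef needed */
	}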


>  			cpu_relax();
>  		}
> @@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
>  		running = task_running(rq, p);
>  		queued = task_on_rq_queued(p);
>  		ncsw = 0;
> -		if (!match_state || READ_ONCE(p->__state) == match_state)
> +		if (!match_state || (READ_ONCE(p->__state) & match_state))
>  			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
>  		task_rq_unlock(rq, p, &rf);

Suggestion #3:

- Couldn't the following users with a 0 mask:

    drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
    fs/coredump.c:                  wait_task_inactive(ptr->task, 0);

  Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
  !match_state special case?

  They'd do something like:

    drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
    fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);

  It's not an entirely 100% equivalent transformation though, but looks OK 
  at first sight: ->__state will be some nonzero mask for genuine tasks 
  waiting to schedule out, so any match will be functionally the same as a 
  0 flag telling us not to check any of the bits, right? I might be missing 
  something though.
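
  Spelled out (a sketch, not kernel code), the one case where the two
  checks differ is TASK_RUNNING, whose value is 0:

	#define TASK_RUNNING	0x0000
	#define TASK_STATE_MAX	0x2000
	#define TASK_ANY	(TASK_STATE_MAX - 1)	/* 0x1fff: all state bits */

	static int old_check(unsigned int state, unsigned int match_state)
	{
		return !match_state || (state & match_state);
	}

	/* old_check(TASK_RUNNING, 0)  -> 1: a 0 mask skips the test entirely */
	/* TASK_RUNNING & TASK_ANY     -> 0: a running task never matches     */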

Thanks,

	Ingo
Peter Zijlstra Sept. 6, 2022, 10:54 a.m. UTC | #2
On Sun, Sep 04, 2022 at 12:44:36PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> > 
> > That is, instead of an equal comparison, use it as a mask. This allows
> > matching multiple block conditions.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/core.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
> >  		 * is actually now running somewhere else!
> >  		 */
> >  		while (task_running(rq, p)) {
> > -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> > +			if (match_state && !(READ_ONCE(p->__state) & match_state))
> >  				return 0;
> 
> We lose the unlikely annotation there - but I guess it probably never 
> really mattered anyway?

So any wait_task_inactive() caller does want that case to be true, but
the whole match_state precondition mostly wrecks things anyway. If
anything it should've been:

		if (likely(match_state && !(READ_ONCE(p->__state) & match_state)))
			return 0;

but I can't find it in me to care too much here.
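
(For reference, the hints are thin wrappers around __builtin_expect();
simplified from include/linux/compiler.h, which layers instrumentation
on top in some configs:)

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)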

> Suggestion #1:
> 
> - Shouldn't we rename task_running() to something like task_on_cpu()? The 
>   task_running() primitive is similar to TASK_RUNNING but is not based off 
>   any TASK_FLAGS.

That looks like a simple enough patch, lemme go do that.

> Suggestion #2:
> 
> - Shouldn't we eventually standardize on task->on_cpu on UP kernels too? 
>   They don't really matter anymore, and doing so removes #ifdefs and makes 
>   the code easier to read.

Probably, but that sounds like something that'll spiral out of control
real quick, so I'll leave that on the TODO list somewhere.

> >  			cpu_relax();
> >  		}
> > @@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
> >  		running = task_running(rq, p);
> >  		queued = task_on_rq_queued(p);
> >  		ncsw = 0;
> > -		if (!match_state || READ_ONCE(p->__state) == match_state)
> > +		if (!match_state || (READ_ONCE(p->__state) & match_state))
> >  			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
> >  		task_rq_unlock(rq, p, &rf);
> 
> Suggestion #3:
> 
> - Couldn't the following users with a 0 mask:
> 
>     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
>     fs/coredump.c:                  wait_task_inactive(ptr->task, 0);
> 
>   Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
>   !match_state special case?
> 
>   They'd do something like:
> 
>     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
>     fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);
> 
>   It's not an entirely 100% equivalent transformation though, but looks OK 
>   at first sight: ->__state will be some nonzero mask for genuine tasks 
>   waiting to schedule out, so any match will be functionally the same as a 
>   0 flag telling us not to check any of the bits, right? I might be missing 
>   something though.

I too am thinking that should work. Added patch for that.
Ingo Molnar Sept. 7, 2022, 7:23 a.m. UTC | #3
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 04, 2022 at 12:44:36PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> > > 
> > > That is, instead of an equal comparison, use it as a mask. This allows
> > > matching multiple block conditions.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/sched/core.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
> > >  		 * is actually now running somewhere else!
> > >  		 */
> > >  		while (task_running(rq, p)) {
> > > -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> > > +			if (match_state && !(READ_ONCE(p->__state) & match_state))
> > >  				return 0;
> > 
> > We lose the unlikely annotation there - but I guess it probably never 
> > really mattered anyway?
> 
> So any wait_task_inactive() caller does want that case to be true, but
> the whole match_state precondition mostly wrecks things anyway. If
> anything it should've been:
> 
> 		if (likely(match_state && !(READ_ONCE(p->__state) & match_state)))
> 			return 0;
> 
> but I can't find it in me to care too much here.

Yeah, I agree that this is probably the most likely branch - and default 
compiler code generation behavior should be pretty close to that to begin 
with.

Ie. ack on dropping the unlikely() annotation. :-)

Might make sense to add a sentence to the changelog though, in case anyone 
(like me) is wondering about whether the dropped annotation was intended.

Thanks,

	Ingo
Peter Zijlstra Sept. 7, 2022, 9:29 a.m. UTC | #4
On Tue, Sep 06, 2022 at 12:54:34PM +0200, Peter Zijlstra wrote:

> > Suggestion #1:
> > 
> > - Shouldn't we rename task_running() to something like task_on_cpu()? The 
> >   task_running() primitive is similar to TASK_RUNNING but is not based off 
> >   any TASK_FLAGS.
> 
> That looks like a simple enough patch, lemme go do that.

---
Subject: sched: Rename task_running() to task_on_cpu()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:33:04 CEST 2022

There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
to task_on_cpu().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c       |   10 +++++-----
 kernel/sched/core_sched.c |    2 +-
 kernel/sched/deadline.c   |    6 +++---
 kernel/sched/fair.c       |    2 +-
 kernel/sched/rt.c         |    6 +++---
 kernel/sched/sched.h      |    2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2778,7 +2778,7 @@ static int affine_move_task(struct rq *r
 		return -EINVAL;
 	}
 
-	if (task_running(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
+	if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
 		/*
 		 * MIGRATE_ENABLE gets here because 'p == current', but for
 		 * anything else we cannot do is_migration_disabled(), punt
@@ -3290,11 +3290,11 @@ unsigned long wait_task_inactive(struct
 		 *
 		 * NOTE! Since we don't hold any locks, it's not
 		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
+		 * But we don't care, since "task_on_cpu()" will
 		 * return false if the runqueue has changed and p
 		 * is actually now running somewhere else!
 		 */
-		while (task_running(rq, p)) {
+		while (task_on_cpu(rq, p)) {
 			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
 				return 0;
 			cpu_relax();
@@ -3307,7 +3307,7 @@ unsigned long wait_task_inactive(struct
 		 */
 		rq = task_rq_lock(p, &rf);
 		trace_sched_wait_task(p);
-		running = task_running(rq, p);
+		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
 		if (!match_state || READ_ONCE(p->__state) == match_state)
@@ -8649,7 +8649,7 @@ int __sched yield_to(struct task_struct
 	if (curr->sched_class != p->sched_class)
 		goto out_unlock;
 
-	if (task_running(p_rq, p) || !task_is_running(p))
+	if (task_on_cpu(p_rq, p) || !task_is_running(p))
 		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p);
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -88,7 +88,7 @@ static unsigned long sched_core_update_c
 	 * core has now entered/left forced idle state. Defer accounting to the
 	 * next scheduling edge, rather than always forcing a reschedule here.
 	 */
-	if (task_running(rq, p))
+	if (task_on_cpu(rq, p))
 		resched_curr(rq);
 
 	task_rq_unlock(rq, p, &rf);
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2087,7 +2087,7 @@ static void task_fork_dl(struct task_str
 
 static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    cpumask_test_cpu(cpu, &p->cpus_mask))
 		return 1;
 	return 0;
@@ -2241,7 +2241,7 @@ static struct rq *find_lock_later_rq(str
 		if (double_lock_balance(rq, later_rq)) {
 			if (unlikely(task_rq(task) != rq ||
 				     !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
-				     task_running(rq, task) ||
+				     task_on_cpu(rq, task) ||
 				     !dl_task(task) ||
 				     !task_on_rq_queued(task))) {
 				double_unlock_balance(rq, later_rq);
@@ -2475,7 +2475,7 @@ static void pull_dl_task(struct rq *this
  */
 static void task_woken_dl(struct rq *rq, struct task_struct *p)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
 	    dl_task(rq->curr) &&
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7938,7 +7938,7 @@ int can_migrate_task(struct task_struct
 	/* Record that we found at least one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-	if (task_running(env->src_rq, p)) {
+	if (task_on_cpu(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1849,7 +1849,7 @@ static void put_prev_task_rt(struct rq *
 
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    cpumask_test_cpu(cpu, &p->cpus_mask))
 		return 1;
 
@@ -2004,7 +2004,7 @@ static struct rq *find_lock_lowest_rq(st
 			 */
 			if (unlikely(task_rq(task) != rq ||
 				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
-				     task_running(rq, task) ||
+				     task_on_cpu(rq, task) ||
 				     !rt_task(task) ||
 				     !task_on_rq_queued(task))) {
 
@@ -2462,7 +2462,7 @@ static void pull_rt_task(struct rq *this
  */
 static void task_woken_rt(struct rq *rq, struct task_struct *p)
 {
-	bool need_to_push = !task_running(rq, p) &&
+	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
 			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2060,7 +2060,7 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-static inline int task_running(struct rq *rq, struct task_struct *p)
+static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
 	return p->on_cpu;
Peter Zijlstra Sept. 7, 2022, 9:30 a.m. UTC | #5
On Tue, Sep 06, 2022 at 12:54:34PM +0200, Peter Zijlstra wrote:

> > Suggestion #3:
> > 
> > - Couldn't the following users with a 0 mask:
> > 
> >     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
> >     fs/coredump.c:                  wait_task_inactive(ptr->task, 0);
> > 
> >   Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
> >   !match_state special case?
> > 
> >   They'd do something like:
> > 
> >     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
> >     fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);
> > 
> >   It's not an entirely 100% equivalent transformation though, but looks OK 
> >   at first sight: ->__state will be some nonzero mask for genuine tasks 
> >   waiting to schedule out, so any match will be functionally the same as a 
> >   0 flag telling us not to check any of the bits, right? I might be missing 
> >   something though.
> 
> I too am thinking that should work. Added patch for that.

---
Subject: sched: Add TASK_ANY for wait_task_inactive()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:39:55 CEST 2022

Now that wait_task_inactive()'s @match_state argument is a mask (like
ttwu()) it is possible to replace the special !match_state case with
an 'all-states' value such that any blocked state will match.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/powercap/idle_inject.c |    2 +-
 fs/coredump.c                  |    2 +-
 include/linux/sched.h          |    2 ++
 kernel/sched/core.c            |   16 ++++++++--------
 4 files changed, 12 insertions(+), 10 deletions(-)

--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -254,7 +254,7 @@ void idle_inject_stop(struct idle_inject
 		iit = per_cpu_ptr(&idle_inject_thread, cpu);
 		iit->should_run = 0;
 
-		wait_task_inactive(iit->tsk, 0);
+		wait_task_inactive(iit->tsk, TASK_ANY);
 	}
 
 	cpu_hotplug_enable();
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -412,7 +412,7 @@ static int coredump_wait(int exit_code,
 		 */
 		ptr = core_state->dumper.next;
 		while (ptr != NULL) {
-			wait_task_inactive(ptr->task, 0);
+			wait_task_inactive(ptr->task, TASK_ANY);
 			ptr = ptr->next;
 		}
 	}
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -101,6 +101,8 @@ struct task_group;
 #define TASK_RTLOCK_WAIT		0x1000
 #define TASK_STATE_MAX			0x2000
 
+#define TASK_ANY			(TASK_STATE_MAX-1)
+
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
 #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3254,12 +3254,12 @@ int migrate_swap(struct task_struct *cur
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
- * If @match_state is nonzero, it's the @p->state value just checked and
- * not expected to change.  If it changes, i.e. @p might have woken up,
- * then return zero.  When we succeed in waiting for @p to be off its CPU,
- * we return a positive number (its total switch count).  If a second call
- * a short while later returns the same number, the caller can be sure that
- * @p has remained unscheduled the whole time.
+ * Wait for the thread to block in any of the states set in @match_state.
+ * If it changes, i.e. @p might have woken up, then return zero.  When we
+ * succeed in waiting for @p to be off its CPU, we return a positive number
+ * (its total switch count).  If a second call a short while later returns the
+ * same number, the caller can be sure that @p has remained unscheduled the
+ * whole time.
  *
  * The caller must ensure that the task *will* unschedule sometime soon,
  * else this function might spin for a *long* time. This function can't
@@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (match_state && !(READ_ONCE(p->__state) & match_state))
+			if (!(READ_ONCE(p->__state) & match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || (READ_ONCE(p->__state) & match_state))
+		if (READ_ONCE(p->__state) & match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
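
A usage sketch of the two-call contract documented above (hypothetical
caller, not part of the series; example_wait() is an invented name):

	static void example_wait(struct task_struct *p)
	{
		unsigned long ncsw, ncsw2;

		ncsw = wait_task_inactive(p, TASK_ANY);
		if (!ncsw)
			return;		/* task was runnable, not blocked */

		/* ... inspect @p while it is known to be off its CPU ... */

		ncsw2 = wait_task_inactive(p, TASK_ANY);
		if (ncsw == ncsw2)
			pr_debug("task remained unscheduled throughout\n");
	}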

Patch

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_running(rq, p)) {
-			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
+			if (match_state && !(READ_ONCE(p->__state) & match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
 		running = task_running(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || READ_ONCE(p->__state) == match_state)
+		if (!match_state || (READ_ONCE(p->__state) & match_state))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);