[RFC,1/6] sched: Add nice value change notifier

Message ID 20210930171552.501553-2-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Series CPU + GPU synchronised priority scheduling

Commit Message

Tvrtko Ursulin Sept. 30, 2021, 5:15 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Implement a simple notifier chain via which interested parties can track
when a process's nice value changes. It is simple in that the chain is
global, so each user has to track which tasks it is interested in itself.

The register_user_nice_notifier and unregister_user_nice_notifier
functions are provided for hooking into the chain, and the callbacks
receive the new nice value and a pointer to the task_struct being
modified.
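
For illustration, a minimal sketch of what a user of the chain could look
like (the foo_* names are made up and not part of this series):

static int foo_nice_cb(struct notifier_block *nb, unsigned long nice,
                       void *data)
{
        struct task_struct *tsk = data; /* task whose nice value changed */

        /* e.g. adjust scheduling of GPU work owned by tsk here */

        return NOTIFY_DONE;
}

static struct notifier_block foo_nice_nb = {
        .notifier_call = foo_nice_cb,
};

/* driver init: */
register_user_nice_notifier(&foo_nice_nb);

/* driver exit: */
unregister_user_nice_notifier(&foo_nice_nb);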

Opens:
 * Security. Would some sort of a per-process mechanism be better, and
   feasible?
 * Put it all behind kconfig to be selected by interested drivers?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  5 +++++
 kernel/sched/core.c   | 37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Sept. 30, 2021, 6:33 p.m. UTC | #1
On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>  void set_user_nice(struct task_struct *p, long nice)
>  {
>  	bool queued, running;
> -	int old_prio;
> +	int old_prio, ret;
>  	struct rq_flags rf;
>  	struct rq *rq;
>  
> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long nice)
>  	 */
>  	p->sched_class->prio_changed(rq, p, old_prio);
>  
> +	ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> +	WARN_ON_ONCE(ret != NOTIFY_DONE);
> +
>  out_unlock:
>  	task_rq_unlock(rq, p, &rf);
>  }

No, we're not going to call out to exported, and potentially unbounded,
functions under scheduler locks.
Tvrtko Ursulin Oct. 1, 2021, 9:04 a.m. UTC | #2
Hi Peter,

On 30/09/2021 19:33, Peter Zijlstra wrote:
> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>>   void set_user_nice(struct task_struct *p, long nice)
>>   {
>>   	bool queued, running;
>> -	int old_prio;
>> +	int old_prio, ret;
>>   	struct rq_flags rf;
>>   	struct rq *rq;
>>   
>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long nice)
>>   	 */
>>   	p->sched_class->prio_changed(rq, p, old_prio);
>>   
>> +	ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
>> +	WARN_ON_ONCE(ret != NOTIFY_DONE);
>> +
>>   out_unlock:
>>   	task_rq_unlock(rq, p, &rf);
>>   }
> 
> No, we're not going to call out to exported, and potentially unbounded,
> functions under scheduler locks.

Agreed, that's another good point about why it is even more hairy, as I 
generally alluded to in the cover letter.

Do you have any immediate thoughts on possible alternatives?

Like, for instance, if I did a queue_work from set_user_nice and then ran 
the notifier chain async from a worker? I haven't yet looked at what 
repercussions that would have in terms of having to cancel pending 
workers when tasks exit. I can try to prototype that and see how it 
would look.
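
Roughly something like this, I imagine (just a sketch, with made up names
and hand-waved error handling):

struct nice_notify_work {
        struct work_struct work;
        struct task_struct *task; /* holds a task reference */
        long nice;
};

static void nice_notify_worker(struct work_struct *work)
{
        struct nice_notify_work *nnw =
                container_of(work, struct nice_notify_work, work);

        atomic_notifier_call_chain(&user_nice_notifier_list, nnw->nice,
                                   nnw->task);
        put_task_struct(nnw->task);
        kfree(nnw);
}

/* in set_user_nice(), instead of calling the chain directly: */
nnw = kmalloc(sizeof(*nnw), GFP_ATOMIC); /* still under the rq lock */
if (nnw) {
        INIT_WORK(&nnw->work, nice_notify_worker);
        nnw->task = get_task_struct(p);
        nnw->nice = nice;
        queue_work(system_wq, &nnw->work);
}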

There is of course the example of ioprio, which solves runtime 
adjustments via a dedicated system call. But I don't currently feel that 
a third such call would be a good solution. At least I don't see a case 
for being able to decouple the priority of CPU and GPU computations.

Have I opened a large can of worms? :)

Regards,

Tvrtko
Tvrtko Ursulin Oct. 1, 2021, 10:32 a.m. UTC | #3
On 01/10/2021 10:04, Tvrtko Ursulin wrote:
> 
> Hi Peter,
> 
> On 30/09/2021 19:33, Peter Zijlstra wrote:
>> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>>>   void set_user_nice(struct task_struct *p, long nice)
>>>   {
>>>       bool queued, running;
>>> -    int old_prio;
>>> +    int old_prio, ret;
>>>       struct rq_flags rf;
>>>       struct rq *rq;
>>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long 
>>> nice)
>>>        */
>>>       p->sched_class->prio_changed(rq, p, old_prio);
>>> +    ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, 
>>> p);
>>> +    WARN_ON_ONCE(ret != NOTIFY_DONE);
>>> +
>>>   out_unlock:
>>>       task_rq_unlock(rq, p, &rf);
>>>   }
>>
>> No, we're not going to call out to exported, and potentially unbounded,
>> functions under scheduler locks.
> 
> Agreed, that's another good point about why it is even more hairy, as I 
> generally alluded to in the cover letter.
> 
> Do you have any immediate thoughts on possible alternatives?
> 
> Like, for instance, if I did a queue_work from set_user_nice and then ran 
> the notifier chain async from a worker? I haven't yet looked at what 
> repercussions that would have in terms of having to cancel pending 
> workers when tasks exit. I can try to prototype that and see how it 
> would look.

Hm, or I simply move the notifier chain call to after task_rq_unlock? 
That would still leave it running under the tasklist lock, so probably 
still quite bad.

Or another option - I stash aside the tasks on a private list (adding a 
new list_head to task_struct), with an elevated task ref count, and run 
the notifier chain outside any locked sections, at the end of the 
setpriority syscall. This way only the syscall caller pays the cost of 
any misbehaving notifiers in the chain. A further improvement could be 
per-task notifiers, but that would grow the task_struct more.
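
Something like this, roughly (again just a sketch - nice_notify_entry
would be the new list_head in task_struct, and the list head itself would
have to be plumbed down from the syscall):

/* in sys_setpriority(), before taking any locks: */
LIST_HEAD(nice_notify_list);

/* in set_user_nice(), under task_rq_lock(): */
get_task_struct(p);
list_add_tail(&p->nice_notify_entry, &nice_notify_list);

/* back in sys_setpriority(), after read_unlock(&tasklist_lock): */
list_for_each_entry_safe(p, next, &nice_notify_list, nice_notify_entry) {
        list_del(&p->nice_notify_entry);
        atomic_notifier_call_chain(&user_nice_notifier_list,
                                   task_nice(p), p);
        put_task_struct(p);
}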

Regards,

Tvrtko

> There is of course the example of ioprio, which solves runtime 
> adjustments via a dedicated system call. But I don't currently feel that 
> a third such call would be a good solution. At least I don't see a case 
> for being able to decouple the priority of CPU and GPU computations.
> 
> Have I opened a large can of worms? :)
> 
> Regards,
> 
> Tvrtko
Peter Zijlstra Oct. 1, 2021, 3:48 p.m. UTC | #4
On Fri, Oct 01, 2021 at 11:32:16AM +0100, Tvrtko Ursulin wrote:
> 
> On 01/10/2021 10:04, Tvrtko Ursulin wrote:
> > 
> > Hi Peter,
> > 
> > On 30/09/2021 19:33, Peter Zijlstra wrote:
> > > On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
> > > >   void set_user_nice(struct task_struct *p, long nice)
> > > >   {
> > > >       bool queued, running;
> > > > -    int old_prio;
> > > > +    int old_prio, ret;
> > > >       struct rq_flags rf;
> > > >       struct rq *rq;
> > > > @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p,
> > > > long nice)
> > > >        */
> > > >       p->sched_class->prio_changed(rq, p, old_prio);
> > > > +    ret = atomic_notifier_call_chain(&user_nice_notifier_list,
> > > > nice, p);
> > > > +    WARN_ON_ONCE(ret != NOTIFY_DONE);
> > > > +
> > > >   out_unlock:
> > > >       task_rq_unlock(rq, p, &rf);
> > > >   }
> > > 
> > > No, we're not going to call out to exported, and potentially unbounded,
> > > functions under scheduler locks.
> > 
> > Agreed, that's another good point about why it is even more hairy, as I
> > generally alluded to in the cover letter.
> > 
> > Do you have any immediate thoughts on possible alternatives?
> > 
> > Like, for instance, if I did a queue_work from set_user_nice and then ran
> > the notifier chain async from a worker? I haven't yet looked at what
> > repercussions that would have in terms of having to cancel pending
> > workers when tasks exit. I can try to prototype that and see how it
> > would look.
> 
> Hm, or I simply move the notifier chain call to after task_rq_unlock? That
> would still leave it running under the tasklist lock, so probably still quite bad.

Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
call the notifier in that special case (that's a magic sysrq thing
anyway).
Tvrtko Ursulin Oct. 4, 2021, 8:12 a.m. UTC | #5
On 01/10/2021 16:48, Peter Zijlstra wrote:
> On Fri, Oct 01, 2021 at 11:32:16AM +0100, Tvrtko Ursulin wrote:
>>
>> On 01/10/2021 10:04, Tvrtko Ursulin wrote:
>>>
>>> Hi Peter,
>>>
>>> On 30/09/2021 19:33, Peter Zijlstra wrote:
>>>> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>>>>>    void set_user_nice(struct task_struct *p, long nice)
>>>>>    {
>>>>>        bool queued, running;
>>>>> -    int old_prio;
>>>>> +    int old_prio, ret;
>>>>>        struct rq_flags rf;
>>>>>        struct rq *rq;
>>>>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p,
>>>>> long nice)
>>>>>         */
>>>>>        p->sched_class->prio_changed(rq, p, old_prio);
>>>>> +    ret = atomic_notifier_call_chain(&user_nice_notifier_list,
>>>>> nice, p);
>>>>> +    WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>> +
>>>>>    out_unlock:
>>>>>        task_rq_unlock(rq, p, &rf);
>>>>>    }
>>>>
>>>> No, we're not going to call out to exported, and potentially unbounded,
>>>> functions under scheduler locks.
>>>
>>> Agreed, that's another good point about why it is even more hairy, as I
>>> generally alluded to in the cover letter.
>>>
>>> Do you have any immediate thoughts on possible alternatives?
>>>
>>> Like, for instance, if I did a queue_work from set_user_nice and then ran
>>> the notifier chain async from a worker? I haven't yet looked at what
>>> repercussions that would have in terms of having to cancel pending
>>> workers when tasks exit. I can try to prototype that and see how it
>>> would look.
>>
>> Hm, or I simply move the notifier chain call to after task_rq_unlock? That
>> would still leave it running under the tasklist lock, so probably still quite bad.
> 
> Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
> call the notifier in that special case (that's a magic sysrq thing
> anyway).

You mean my talk about the tasklist_lock? No, it is also held on the 
syscall path, which is the part I am interested in. The call chain looks 
like this:

sys_setpriority()
{
   ...
   rcu_read_lock();
   read_lock(&tasklist_lock);
   ...
   set_one_prio()
     set_user_nice()
     {
       ...
       task_rq_lock();
         -> my notifier from this RFC [1]
       task_rq_unlock();
         -> I can move the notifier here for _some_ improvement [2]
     }
   ...
   read_unlock(&tasklist_lock);
   rcu_read_unlock();
}

So this RFC had the notifier call chain at [1], which I understood was 
the thing you initially pointed out as horrible, it being under a 
scheduler lock.

I can trivially move it to [2], but that still leaves it under the 
tasklist lock. I don't have a good feel for how much better that would 
be. If it is not good enough then I will look for a smarter solution 
with less opportunity for global impact.

Regards,

Tvrtko
Peter Zijlstra Oct. 4, 2021, 8:39 a.m. UTC | #6
On Mon, Oct 04, 2021 at 09:12:37AM +0100, Tvrtko Ursulin wrote:
> On 01/10/2021 16:48, Peter Zijlstra wrote:

> > Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
> > call the notifier in that special case (that's a magic sysrq thing
> > anyway).
> 
> You mean my talk about the tasklist_lock? No, it is also held on the
> syscall path, which is the part I am interested in. The call chain looks
> like this:

Urgh, I always miss that because it lives outside of sched.. :/

> sys_setpriority()
> {
>   ...
>   rcu_read_lock();
>   read_lock(&tasklist_lock);
>   ...
>   set_one_prio()
>     set_user_nice()
>     {
>       ...
>       task_rq_lock();
>         -> my notifier from this RFC [1]
>       task_rq_unlock();
>         -> I can move the notifier here for _some_ improvement [2]
>     }
>   ...
>   read_unlock(&tasklist_lock);
>   rcu_read_unlock();
> }
> 
> So this RFC had the notifier call chain at [1], which I understood was the
> thing you initially pointed out as horrible, it being under a scheduler
> lock.
> 
> I can trivially move it to [2], but that still leaves it under the tasklist
> lock. I don't have a good feel for how much better that would be. If it is
> not good enough then I will look for a smarter solution with less
> opportunity for global impact.

So the tasklist_lock is pretty terrible and effectively unbounded already
(just create more tasks etc.), so adding a notifier call there shouldn't
really make it much worse.

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 39039ce8ac4c..45ae9eca38c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2309,4 +2309,9 @@  static inline void sched_core_free(struct task_struct *tsk) { }
 static inline void sched_core_fork(struct task_struct *p) { }
 #endif
 
+struct notifier_block;
+
+extern int register_user_nice_notifier(struct notifier_block *);
+extern int unregister_user_nice_notifier(struct notifier_block *);
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..26ff75d6fe00 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6864,10 +6864,42 @@  static inline int rt_effective_prio(struct task_struct *p, int prio)
 }
 #endif
 
+ATOMIC_NOTIFIER_HEAD(user_nice_notifier_list);
+
+/**
+ * register_user_nice_notifier - Register function to be called when task nice changes
+ * @nb: Info about notifier function to be called
+ *
+ * Registers a function with the list of functions to be called when task nice
+ * value changes.
+ *
+ * Currently always returns zero, as atomic_notifier_chain_register()
+ * always returns zero.
+ */
+int register_user_nice_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(register_user_nice_notifier);
+
+/**
+ * unregister_user_nice_notifier - Unregister previously registered user nice notifier
+ * @nb: Hook to be unregistered
+ *
+ * Unregisters a previously registered user nice notifier function.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_user_nice_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(unregister_user_nice_notifier);
+
 void set_user_nice(struct task_struct *p, long nice)
 {
 	bool queued, running;
-	int old_prio;
+	int old_prio, ret;
 	struct rq_flags rf;
 	struct rq *rq;
 
@@ -6913,6 +6945,9 @@  void set_user_nice(struct task_struct *p, long nice)
 	 */
 	p->sched_class->prio_changed(rq, p, old_prio);
 
+	ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
+	WARN_ON_ONCE(ret != NOTIFY_DONE);
+
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }