diff mbox series

[v6,1/2] sched/uclamp: Add a new sysctl to control RT default boost value

Message ID 20200706142839.26629-2-qais.yousef@arm.com (mailing list archive)
State New, archived
Headers show
Series sched/uclamp: new sysctl for default RT boost value | expand

Commit Message

Qais Yousef July 6, 2020, 2:28 p.m. UTC
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To let system admins and device integrators control the default behavior
globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---

Peter

I didn't do the

	read_lock(&tasklist_lock);
	smp_mb__after_spinlock();
	read_unlock(&tasklist_lock);

dance you suggested on IRC as it didn't seem necessary. But maybe I missed
something.

 include/linux/sched/sysctl.h |   1 +
 include/linux/sched/task.h   |   1 +
 kernel/fork.c                |   1 +
 kernel/sched/core.c          | 132 +++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   7 ++
 5 files changed, 135 insertions(+), 7 deletions(-)

Comments

Valentin Schneider July 6, 2020, 3:49 p.m. UTC | #1
On 06/07/20 15:28, Qais Yousef wrote:
> CC: linux-fsdevel@vger.kernel.org
> ---
>
> Peter
>
> I didn't do the
>
>       read_lock(&taslist_lock);
>       smp_mb__after_spinlock();
>       read_unlock(&tasklist_lock);
>
> dance you suggested on IRC as it didn't seem necessary. But maybe I missed
> something.
>

So the annoying bit with just uclamp_fork() is that it happens *before* the
task is appended to the tasklist. This means without too much care we
would have (if we'd do a sync at uclamp_fork()):

  CPU0 (sysctl write)                                CPU1 (concurrent forker)

                                                       copy_process()
                                                         uclamp_fork()
                                                           p.uclamp_min = state
    state = foo

    for_each_process_thread(p, t)
      update_state(t);
                                                         list_add(p)

i.e. that newly forked process would entirely sidestep the update. Now,
with Peter's suggested approach we can be in a much better situation. If we
have this in the sysctl update:

  state = foo;

  read_lock(&tasklist_lock);
  smp_mb__after_spinlock();
  read_unlock(&tasklist_lock);

  for_each_process_thread(p, t)
    update_state(t);

While having this in the fork:

  write_lock(&tasklist_lock);
  list_add(p);
  write_unlock(&tasklist_lock);

  sched_post_fork(p); // state re-read here; probably wants an mb first

Then we can no longer miss an update. If the forked p doesn't see the new
value, it *must* have been added to the tasklist before the updater loops
over it, so the loop will catch it. If it sees the new value, we're done.

AIUI, the above strategy doesn't require any use of RCU. The update_state()
and sched_post_fork() can race, but as per the above they should both be
writing the same value.
Qais Yousef July 7, 2020, 9:34 a.m. UTC | #2
On 07/06/20 16:49, Valentin Schneider wrote:
> 
> On 06/07/20 15:28, Qais Yousef wrote:
> > CC: linux-fsdevel@vger.kernel.org
> > ---
> >
> > Peter
> >
> > I didn't do the
> >
> >       read_lock(&taslist_lock);
> >       smp_mb__after_spinlock();
> >       read_unlock(&tasklist_lock);
> >
> > dance you suggested on IRC as it didn't seem necessary. But maybe I missed
> > something.
> >
> 
> So the annoying bit with just uclamp_fork() is that it happens *before* the
> task is appended to the tasklist. This means without too much care we
> would have (if we'd do a sync at uclamp_fork()):
> 
>   CPU0 (sysctl write)                                CPU1 (concurrent forker)
> 
>                                                        copy_process()
>                                                          uclamp_fork()
>                                                            p.uclamp_min = state
>     state = foo
> 
>     for_each_process_thread(p, t)
>       update_state(t);
>                                                          list_add(p)
> 
> i.e. that newly forked process would entirely sidestep the update. Now,
> with Peter's suggested approach we can be in a much better situation. If we
> have this in the sysctl update:
> 
>   state = foo;
> 
>   read_lock(&taslist_lock);
>   smp_mb__after_spinlock();
>   read_unlock(&tasklist_lock);
> 
>   for_each_process_thread(p, t)
>     update_state(t);
> 
> While having this in the fork:
> 
>   write_lock(&tasklist_lock);
>   list_add(p);
>   write_unlock(&tasklist_lock);
> 
>   sched_post_fork(p); // state re-read here; probably wants an mb first
> 
> Then we can no longer miss an update. If the forked p doesn't see the new
> value, it *must* have been added to the tasklist before the updater loops
> over it, so the loop will catch it. If it sees the new value, we're done.

uclamp_fork() has nothing to do with the race. If copy_process() duplicates the
task_struct of an RT task, it'll copy the old value.

I'd expect the newly introduced sched_post_fork() (also in copy_process() after
the list update) to prevent this race altogether.

Now we could end up with a problem if for_each_process_thread() doesn't see the
newly forked task _after_ sched_post_fork(). Hence my question to Peter.

> 
> AIUI, the above strategy doesn't require any use of RCU. The update_state()
> and sched_post_fork() can race, but as per the above they should both be
> writing the same value.

for_each_process_thread() must be protected by either tasklist_lock or
rcu_read_lock().

The other RCU logic I added is not to protect against the race above. I
describe the other race condition in a comment. Basically another updater on a
different CPU, via fork() or sched_setattr(), might read an old value and get
preempted. The RCU synchronization will ensure concurrent updaters have
finished before iterating the list.

Thanks

--
Qais Yousef
Valentin Schneider July 7, 2020, 11:30 a.m. UTC | #3
On 07/07/20 10:34, Qais Yousef wrote:
> On 07/06/20 16:49, Valentin Schneider wrote:
>>
>> On 06/07/20 15:28, Qais Yousef wrote:
>> > CC: linux-fsdevel@vger.kernel.org
>> > ---
>> >
>> > Peter
>> >
>> > I didn't do the
>> >
>> >       read_lock(&taslist_lock);
>> >       smp_mb__after_spinlock();
>> >       read_unlock(&tasklist_lock);
>> >
>> > dance you suggested on IRC as it didn't seem necessary. But maybe I missed
>> > something.
>> >
>>
>> So the annoying bit with just uclamp_fork() is that it happens *before* the
>> task is appended to the tasklist. This means without too much care we
>> would have (if we'd do a sync at uclamp_fork()):
>>
>>   CPU0 (sysctl write)                                CPU1 (concurrent forker)
>>
>>                                                        copy_process()
>>                                                          uclamp_fork()
>>                                                            p.uclamp_min = state
>>     state = foo
>>
>>     for_each_process_thread(p, t)
>>       update_state(t);
>>                                                          list_add(p)
>>
>> i.e. that newly forked process would entirely sidestep the update. Now,
>> with Peter's suggested approach we can be in a much better situation. If we
>> have this in the sysctl update:
>>
>>   state = foo;
>>
>>   read_lock(&taslist_lock);
>>   smp_mb__after_spinlock();
>>   read_unlock(&tasklist_lock);
>>
>>   for_each_process_thread(p, t)
>>     update_state(t);
>>
>> While having this in the fork:
>>
>>   write_lock(&tasklist_lock);
>>   list_add(p);
>>   write_unlock(&tasklist_lock);
>>
>>   sched_post_fork(p); // state re-read here; probably wants an mb first
>>
>> Then we can no longer miss an update. If the forked p doesn't see the new
>> value, it *must* have been added to the tasklist before the updater loops
>> over it, so the loop will catch it. If it sees the new value, we're done.
>
> uclamp_fork() has nothing to do with the race. If copy_process() duplicates the
> task_struct of an RT task, it'll copy the old value.
>

Quite so; my point was about what would happen if we were to use uclamp_fork() to re-read the value.

> I'd expect the newly introduced sched_post_fork() (also in copy_process() after
> the list update) to prevent this race altogether.
>
> Now we could end up with a problem if for_each_process_thread() doesn't see the
> newly forked task _after_ sched_post_fork(). Hence my question to Peter.
>


>>
>> AIUI, the above strategy doesn't require any use of RCU. The update_state()
>> and sched_post_fork() can race, but as per the above they should both be
>> writing the same value.
>
> for_each_process_thread() must be protected by either tasklist_lock or
> rcu_read_lock().
>

Right

> The other RCU logic I added is not to protect against the race above. I
> describe the other race condition in a comment.

I take it that's the one in uclamp_sync_util_min_rt_default()?

__setscheduler_uclamp() can't be preempted as we hold task_rq_lock(). It
can indeed race with the sync though, but again with the above suggested
setup it would either:
- see the old value, but be guaranteed to be iterated over later by the
  updater
- see the new value

sched_post_fork() being preempted out is a bit more annoying, but what
prevents us from making that bit preempt-disabled?

I have to point out I'm assuming here updaters are serialized, which does
seem to be the case (cf. uclamp_mutex).


> Basically another updater on a
> different cpu via fork() and sched_setattr() might read an old value and get
> preempted. The rcu synchronization will ensure concurrent updaters have
> finished before iterating the list.
>

> Thanks
Valentin Schneider July 7, 2020, 11:39 a.m. UTC | #4
On 06/07/20 15:28, Qais Yousef wrote:
> RT tasks by default run at the highest capacity/performance level. When
> uclamp is selected this default behavior is retained by enforcing the
> requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> value.
>
> This is also referred to as 'the default boost value of RT tasks'.
>
> See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
>
> On battery powered devices, it is desired to control this default
> (currently hardcoded) behavior at runtime to reduce energy consumed by
> RT tasks.
>
> For example, a mobile device manufacturer where big.LITTLE architecture
> is dominant, the performance of the little cores varies across SoCs, and
> on high end ones the big cores could be too power hungry.
>
> Given the diversity of SoCs, the new knob allows manufactures to tune
> the best performance/power for RT tasks for the particular hardware they
> run on.
>
> They could opt to further tune the value when the user selects
> a different power saving mode or when the device is actively charging.
>
> The runtime aspect of it further helps in creating a single kernel image
> that can be run on multiple devices that require different tuning.
>
> Keep in mind that a lot of RT tasks in the system are created by the
> kernel. On Android for instance I can see over 50 RT tasks, only
> a handful of which created by the Android framework.
>
> To control the default behavior globally by system admins and device
> integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
> to change the default boost value of the RT tasks.
>
> I anticipate this to be mostly in the form of modifying the init script
> of a particular device.
>

Sorry for going at this again, but I feel like I can squeeze more juice out
of this.

This being mainly tweaked in init scripts makes me question why we should
harden this for runtime tweaking. Yes, there's the whole single image
thing, but there's other ways to have different init-time values on a
single image (e.g. cmdline, although that one is usually a bit
controversial).

For instance, Android could set the min to 0 and then go about its life
tweaking the clamp of individual / all RT tasks at runtime, using the
existing uclamp API.
Qais Yousef July 7, 2020, 12:36 p.m. UTC | #5
On 07/07/20 12:30, Valentin Schneider wrote:
> 
> On 07/07/20 10:34, Qais Yousef wrote:
> > On 07/06/20 16:49, Valentin Schneider wrote:
> >>
> >> On 06/07/20 15:28, Qais Yousef wrote:
> >> > CC: linux-fsdevel@vger.kernel.org
> >> > ---
> >> >
> >> > Peter
> >> >
> >> > I didn't do the
> >> >
> >> >       read_lock(&taslist_lock);
> >> >       smp_mb__after_spinlock();
> >> >       read_unlock(&tasklist_lock);
> >> >
> >> > dance you suggested on IRC as it didn't seem necessary. But maybe I missed
> >> > something.
> >> >
> >>
> >> So the annoying bit with just uclamp_fork() is that it happens *before* the
> >> task is appended to the tasklist. This means without too much care we
> >> would have (if we'd do a sync at uclamp_fork()):
> >>
> >>   CPU0 (sysctl write)                                CPU1 (concurrent forker)
> >>
> >>                                                        copy_process()
> >>                                                          uclamp_fork()
> >>                                                            p.uclamp_min = state
> >>     state = foo
> >>
> >>     for_each_process_thread(p, t)
> >>       update_state(t);
> >>                                                          list_add(p)
> >>
> >> i.e. that newly forked process would entirely sidestep the update. Now,
> >> with Peter's suggested approach we can be in a much better situation. If we
> >> have this in the sysctl update:
> >>
> >>   state = foo;
> >>
> >>   read_lock(&taslist_lock);
> >>   smp_mb__after_spinlock();
> >>   read_unlock(&tasklist_lock);
> >>
> >>   for_each_process_thread(p, t)
> >>     update_state(t);
> >>
> >> While having this in the fork:
> >>
> >>   write_lock(&tasklist_lock);
> >>   list_add(p);
> >>   write_unlock(&tasklist_lock);
> >>
> >>   sched_post_fork(p); // state re-read here; probably wants an mb first
> >>
> >> Then we can no longer miss an update. If the forked p doesn't see the new
> >> value, it *must* have been added to the tasklist before the updater loops
> >> over it, so the loop will catch it. If it sees the new value, we're done.
> >
> > uclamp_fork() has nothing to do with the race. If copy_process() duplicates the
> > task_struct of an RT task, it'll copy the old value.
> >
> 
> Quite so; my point was if we were to use uclamp_fork() as to re-read the value.
> 
> > I'd expect the newly introduced sched_post_fork() (also in copy_process() after
> > the list update) to prevent this race altogether.
> >
> > Now we could end up with a problem if for_each_process_thread() doesn't see the
> > newly forked task _after_ sched_post_fork(). Hence my question to Peter.
> >
> 
> 
> >>
> >> AIUI, the above strategy doesn't require any use of RCU. The update_state()
> >> and sched_post_fork() can race, but as per the above they should both be
> >> writing the same value.
> >
> > for_each_process_thread() must be protected by either tasklist_lock or
> > rcu_read_lock().
> >
> 
> Right
> 
> > The other RCU logic I added is not to protect against the race above. I
> > describe the other race condition in a comment.
> 
> I take it that's the one in uclamp_sync_util_min_rt_default()?

Correct.

> 
> __setscheduler_uclamp() can't be preempted as we hold task_rq_lock(). It
> can indeed race with the sync though, but again with the above suggested
> setup it would either:
> - see the old value, but be guaranteed to be iterated over later by the
>   updater
> - see the new value

AFAIU rcu_read_lock() is lightweight. So having the protection applied is more
robust against future changes.

> 
> sched_post_fork() being preempted out is a bit more annoying, but what
> prevents us from making that bit preempt-disabled?

preempt_disable() is not friendly to RT and heavy handed approach IMO.

> 
> I have to point out I'm assuming here updaters are serialized, which does
> seem to be see the case (cf. uclamp_mutex).

Correct.

Thanks

--
Qais Yousef
Qais Yousef July 7, 2020, 12:58 p.m. UTC | #6
On 07/07/20 12:39, Valentin Schneider wrote:
> 
> On 06/07/20 15:28, Qais Yousef wrote:
> > RT tasks by default run at the highest capacity/performance level. When
> > uclamp is selected this default behavior is retained by enforcing the
> > requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> > uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> > value.
> >
> > This is also referred to as 'the default boost value of RT tasks'.
> >
> > See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
> >
> > On battery powered devices, it is desired to control this default
> > (currently hardcoded) behavior at runtime to reduce energy consumed by
> > RT tasks.
> >
> > For example, a mobile device manufacturer where big.LITTLE architecture
> > is dominant, the performance of the little cores varies across SoCs, and
> > on high end ones the big cores could be too power hungry.
> >
> > Given the diversity of SoCs, the new knob allows manufactures to tune
> > the best performance/power for RT tasks for the particular hardware they
> > run on.
> >
> > They could opt to further tune the value when the user selects
> > a different power saving mode or when the device is actively charging.
> >
> > The runtime aspect of it further helps in creating a single kernel image
> > that can be run on multiple devices that require different tuning.
> >
> > Keep in mind that a lot of RT tasks in the system are created by the
> > kernel. On Android for instance I can see over 50 RT tasks, only
> > a handful of which created by the Android framework.
> >
> > To control the default behavior globally by system admins and device
> > integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
> > to change the default boost value of the RT tasks.
> >
> > I anticipate this to be mostly in the form of modifying the init script
> > of a particular device.
> >
> 
> Sorry for going at this again, but I feel like I can squeeze more juice out
> of this.
> 
> This being mainly tweaked in init scripts makes me question why we should
> harden this for runtime tweaking. Yes, there's the whole single image
> thing, but there's other ways to have different init-time values on a
> single image (e.g. cmdline, although that one is usually a bit
> controversial).

cmdline is not user-friendly. A person who installs a new distro on their laptop
can easily modify a sysctl knob to squeeze maximum battery life, but can't be
expected to play with grub and the bootloader to add a cmdline option.

> 
> For instance, Android could set the min to 0 and then go about its life
> tweaking the clamp of individual / all RT tasks at runtime, using the
> existing uclamp API.

This was addressed before. This is not flexible and having a daemon that
monitors RT tasks as they come and go is terrible IMO.

If you're suggesting to hardcode it to 0, then this is not what we're trying to
achieve here.

I don't think we've hit a wall where things look ugly enough to stop pursuing
this flexible and more robust solution.

Thanks

--
Qais Yousef
Valentin Schneider July 8, 2020, 11:05 a.m. UTC | #7
On 07/07/20 13:36, Qais Yousef wrote:
> On 07/07/20 12:30, Valentin Schneider wrote:
>>
>> On 07/07/20 10:34, Qais Yousef wrote:
>> > On 07/06/20 16:49, Valentin Schneider wrote:
>> >>
>> >> On 06/07/20 15:28, Qais Yousef wrote:
>> >> > CC: linux-fsdevel@vger.kernel.org
>> >> > ---
>> >> >
>> >> > Peter
>> >> >
>> >> > I didn't do the
>> >> >
>> >> >       read_lock(&taslist_lock);
>> >> >       smp_mb__after_spinlock();
>> >> >       read_unlock(&tasklist_lock);
>> >> >
>> >> > dance you suggested on IRC as it didn't seem necessary. But maybe I missed
>> >> > something.
>> >> >
>> >>
>> >> So the annoying bit with just uclamp_fork() is that it happens *before* the
>> >> task is appended to the tasklist. This means without too much care we
>> >> would have (if we'd do a sync at uclamp_fork()):
>> >>
>> >>   CPU0 (sysctl write)                                CPU1 (concurrent forker)
>> >>
>> >>                                                        copy_process()
>> >>                                                          uclamp_fork()
>> >>                                                            p.uclamp_min = state
>> >>     state = foo
>> >>
>> >>     for_each_process_thread(p, t)
>> >>       update_state(t);
>> >>                                                          list_add(p)
>> >>
>> >> i.e. that newly forked process would entirely sidestep the update. Now,
>> >> with Peter's suggested approach we can be in a much better situation. If we
>> >> have this in the sysctl update:
>> >>
>> >>   state = foo;
>> >>
>> >>   read_lock(&taslist_lock);
>> >>   smp_mb__after_spinlock();
>> >>   read_unlock(&tasklist_lock);
>> >>
>> >>   for_each_process_thread(p, t)
>> >>     update_state(t);
>> >>
>> >> While having this in the fork:
>> >>
>> >>   write_lock(&tasklist_lock);
>> >>   list_add(p);
>> >>   write_unlock(&tasklist_lock);
>> >>
>> >>   sched_post_fork(p); // state re-read here; probably wants an mb first
>> >>
>> >> Then we can no longer miss an update. If the forked p doesn't see the new
>> >> value, it *must* have been added to the tasklist before the updater loops
>> >> over it, so the loop will catch it. If it sees the new value, we're done.
>> >
>> > uclamp_fork() has nothing to do with the race. If copy_process() duplicates the
>> > task_struct of an RT task, it'll copy the old value.
>> >
>>
>> Quite so; my point was if we were to use uclamp_fork() as to re-read the value.
>>
>> > I'd expect the newly introduced sched_post_fork() (also in copy_process() after
>> > the list update) to prevent this race altogether.
>> >
>> > Now we could end up with a problem if for_each_process_thread() doesn't see the
>> > newly forked task _after_ sched_post_fork(). Hence my question to Peter.
>> >
>>
>>
>> >>
>> >> AIUI, the above strategy doesn't require any use of RCU. The update_state()
>> >> and sched_post_fork() can race, but as per the above they should both be
>> >> writing the same value.
>> >
>> > for_each_process_thread() must be protected by either tasklist_lock or
>> > rcu_read_lock().
>> >
>>
>> Right
>>
>> > The other RCU logic I added is not to protect against the race above. I
>> > describe the other race condition in a comment.
>>
>> I take it that's the one in uclamp_sync_util_min_rt_default()?
>
> Correct.
>
>>
>> __setscheduler_uclamp() can't be preempted as we hold task_rq_lock(). It
>> can indeed race with the sync though, but again with the above suggested
>> setup it would either:
>> - see the old value, but be guaranteed to be iterated over later by the
>>   updater
>> - see the new value
>
> AFAIU rcu_read_lock() is light weight. So having the protection applied is more
> robust against future changes.

So I think the one thing you win by having this dance with mb's and the
suggested handling of the task list is that you do not need any
rcu_synchronize() anymore. Both approaches have merit, it's just that the
way I understood the suggestion to add sched_post_fork() was to simplify
the ordering of the update with the aforementioned scheme.

>
>>
>> sched_post_fork() being preempted out is a bit more annoying, but what
>> prevents us from making that bit preempt-disabled?
>
> preempt_disable() is not friendly to RT and heavy handed approach IMO.
>

True, but this is both an infrequent and slow sysctl path, so I don't think
RT would care much.

>>
>> I have to point out I'm assuming here updaters are serialized, which does
>> seem to be see the case (cf. uclamp_mutex).
>
> Correct.
>
> Thanks
Qais Yousef July 8, 2020, 1:08 p.m. UTC | #8
On 07/08/20 12:05, Valentin Schneider wrote:
> > AFAIU rcu_read_lock() is light weight. So having the protection applied is more
> > robust against future changes.
> 
> So I think the one thing you win by having this dance with mb's and the
> suggested handling of the task list is that you do not need any
> rcu_synchronize() anymore. Both approaches have merit, it's just that the
> way I understood the suggestion to add sched_post_fork() was to simplify
> the ordering of the update with the aforementioned scheme.

The synchronize_rcu() is not for sched_post_fork(). It is to deal with the
preemption problem.

> 
> >
> >>
> >> sched_post_fork() being preempted out is a bit more annoying, but what
> >> prevents us from making that bit preempt-disabled?
> >
> > preempt_disable() is not friendly to RT and heavy handed approach IMO.
> >
> 
> True, but this is both an infrequent and slow sysctl path, so I don't think
> RT would care much.

There's an easy answer for that. But first I'm not sure what problem we are
discussing here.

What is the problem with rcu? And how does preempt_disable() fix it or improve
on it?

Thanks

--
Qais Yousef
Valentin Schneider July 8, 2020, 9:45 p.m. UTC | #9
On 08/07/20 14:08, Qais Yousef wrote:
> On 07/08/20 12:05, Valentin Schneider wrote:
>> > AFAIU rcu_read_lock() is light weight. So having the protection applied is more
>> > robust against future changes.
>>
>> So I think the one thing you win by having this dance with mb's and the
>> suggested handling of the task list is that you do not need any
>> rcu_synchronize() anymore. Both approaches have merit, it's just that the
>> way I understood the suggestion to add sched_post_fork() was to simplify
>> the ordering of the update with the aforementioned scheme.
>
> The synchronize_rcu() is not for sched_post_fork(). It is to deal with the
> preemption problem.
>
>>
>> >
>> >>
>> >> sched_post_fork() being preempted out is a bit more annoying, but what
>> >> prevents us from making that bit preempt-disabled?
>> >
>> > preempt_disable() is not friendly to RT and heavy handed approach IMO.
>> >
>>
>> True, but this is both an infrequent and slow sysctl path, so I don't think
>> RT would care much.
>
> There's an easy answer for that. But first I'm not sure what problem are we
> discussing here.
>
> What is the problem with rcu? And how is preempt_disable() fixes it or improves
> on it?
>

Minimizing the races we have to think about and take care of would make this
easier to review and maintain. That's what I was suggesting with getting rid
of the preemption+update issue entirely and keeping only the update+forkee
race to take care of, which is IMO fairly straightforward to reason about on
its own.

That said, you're driving the thing, and I'm not, so I'll trust your
judgement here.

> Thanks
Peter Zijlstra July 13, 2020, 11:21 a.m. UTC | #10
On Mon, Jul 06, 2020 at 03:28:38PM +0100, Qais Yousef wrote:

> +static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
> +{
> +	unsigned int default_util_min;
> +	struct uclamp_se *uc_se;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	if (!rt_task(p))
> +		return;
> +
> +	uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> +	/* Only sync if user didn't override the default */
> +	if (uc_se->user_defined)
> +		return;
> +
> +	/* Sync with smp_wmb() in uclamp_sync_util_min_rt_default() */
> +	smp_rmb();
> +	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> +	uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
> +static void uclamp_sync_util_min_rt_default(void)
> +{
> +	struct task_struct *g, *p;
> +
> +	/*
> +	 * Make sure the updated sysctl_sched_uclamp_util_min_rt_default which
> +	 * was just written is synchronized against any future read on another
> +	 * cpu.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Wait for all updaters to observe the new change.
> +	 *
> +	 * There are 2 races to deal with here:
> +	 *
> +	 * 1. fork()->copy_process()
> +	 *
> +	 *	If a task was concurrently forking, for_each_process_thread()
> +	 *	will not see it, hence it could have copied the old value and
> +	 *	we missed the opportunity to update it.
> +	 *
> +	 *	This should be handled by sched_post_fork() where it'll ensure
> +	 *	it performs the sync after the fork.
> +	 *
> +	 * 2. fork()->sched_post_fork()
> +	 *    __setscheduler_uclamp()
> +	 *
> +	 *	Both of these functions could read the old value but then get
> +	 *	preempted, during which a user might write new value to
> +	 *	sysctl_sched_uclamp_util_min_rt_default.
> +	 *
> +	 *	// read sysctl_sched_uclamp_util_min_rt_default;
> +	 *	// PREEMPT-OUT
> +	 *	.
> +	 *	.                  <-- sync happens here
> +	 *	.
> +	 *	// PREEMPT-IN
> +	 *	// write p->uclamp_req[UCLAMP_MIN]
> +	 *
> +	 *	That section is protected with rcu_read_lock(), so
> +	 *	synchronize_rcu() will guarantee it has finished before we
> +	 *	perform the update. Hence ensure that this sync happens after
> +	 *	any concurrent sync which should guarantee correctness.
> +	 */
> +	synchronize_rcu();
> +
> +	rcu_read_lock();
> +	for_each_process_thread(g, p)
> +		__uclamp_sync_util_min_rt_default(p);
> +	rcu_read_unlock();
> +}

It's Monday, and I cannot get my brain working.. I cannot decipher the
comments you have with the smp_[rw]mb(), what actual ordering do they
enforce?

Also, your synchronize_rcu() relies on write_lock() being
non-preemptible, which isn't true on PREEMPT_RT.

The below seems simpler...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1013,8 +1013,6 @@ static void __uclamp_sync_util_min_rt_de
 	unsigned int default_util_min;
 	struct uclamp_se *uc_se;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
 	if (!rt_task(p))
 		return;
 
@@ -1024,8 +1022,6 @@ static void __uclamp_sync_util_min_rt_de
 	if (uc_se->user_defined)
 		return;
 
-	/* Sync with smp_wmb() in uclamp_sync_util_min_rt_default() */
-	smp_rmb();
 	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
 	uclamp_se_set(uc_se, default_util_min, false);
 }
@@ -1035,47 +1031,21 @@ static void uclamp_sync_util_min_rt_defa
 	struct task_struct *g, *p;
 
 	/*
-	 * Make sure the updated sysctl_sched_uclamp_util_min_rt_default which
-	 * was just written is synchronized against any future read on another
-	 * cpu.
-	 */
-	smp_wmb();
-
-	/*
-	 * Wait for all updaters to observe the new change.
-	 *
-	 * There are 2 races to deal with here:
-	 *
-	 * 1. fork()->copy_process()
-	 *
-	 *	If a task was concurrently forking, for_each_process_thread()
-	 *	will not see it, hence it could have copied the old value and
-	 *	we missed the opportunity to update it.
-	 *
-	 *	This should be handled by sched_post_fork() where it'll ensure
-	 *	it performs the sync after the fork.
-	 *
-	 * 2. fork()->sched_post_fork()
-	 *    __setscheduler_uclamp()
-	 *
-	 *	Both of these functions could read the old value but then get
-	 *	preempted, during which a user might write new value to
-	 *	sysctl_sched_uclamp_util_min_rt_default.
-	 *
-	 *	// read sysctl_sched_uclamp_util_min_rt_default;
-	 *	// PREEMPT-OUT
-	 *	.
-	 *	.                  <-- sync happens here
-	 *	.
-	 *	// PREEMPT-IN
-	 *	// write p->uclamp_req[UCLAMP_MIN]
-	 *
-	 *	That section is protected with rcu_read_lock(), so
-	 *	synchronize_rcu() will guarantee it has finished before we
-	 *	perform the update. Hence ensure that this sync happens after
-	 *	any concurrent sync which should guarantee correctness.
-	 */
-	synchronize_rcu();
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
+	 *
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
+	 */
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
 	for_each_process_thread(g, p)
@@ -1408,6 +1378,9 @@ int sysctl_sched_uclamp_handler(struct c
 		uclamp_update_root_tg();
 	}
 
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default)
+		uclamp_sync_util_min_rt_default();
+
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
 	 * Otherwise, keep it simple and do just a lazy update at each next
@@ -1466,9 +1439,7 @@ static void __setscheduler_uclamp(struct
 		 * at runtime.
 		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) {
-			rcu_read_lock();
 			__uclamp_sync_util_min_rt_default(p);
-			rcu_read_unlock();
 		} else {
 			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
 		}
@@ -1521,6 +1492,11 @@ static void __init init_uclamp_rq(struct
 	rq->uclamp_flags = 0;
 }
 
+static void uclamp_post_fork(struct task_struct *p)
+{
+	__uclamp_sync_util_min_rt_default(p);
+}
+
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
Peter Zijlstra July 13, 2020, 11:36 a.m. UTC | #11
On Mon, Jul 13, 2020 at 01:21:25PM +0200, Peter Zijlstra wrote:
> +	 * copy_process()			sysctl_uclamp
> +	 *					  uclamp_min_rt = X;
> +	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
> +	 *   // link thread			  smp_mb__after_spinlock()
> +	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
> +	 *   sched_post_fork()			  for_each_process_thread()
> +	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
> +	 *
> +	 * Ensures that either sched_post_fork() will observe the new
> +	 * uclamp_min_rt or for_each_process_thread() will observe the new
> +	 * task.
> +	 */

more specifically this has the cases:


A)

	   copy_process()			sysctl_uclamp
						  uclamp_min_rt = X;
	     write_lock(&tasklist_lock)
	     // link thread
	     write_unlock(&tasklist_lock)
	     sched_post_fork()			  read_lock(&tasklist_lock)
	       __uclamp_sync_rt()		  smp_mb__after_spinlock()
                                                  read_unlock(&tasklist_lock);
                                                  for_each_process_thread()
                                                    __uclamp_sync_rt()


Where write_unlock()'s RELEASE matches read_lock()'s ACQUIRE and
guarantees for_each_process_thread() must observe the new thread.


B)


	   copy_process()			sysctl_uclamp
						  uclamp_min_rt = X;
						  read_lock(&tasklist_lock)
						  smp_mb__after_spinlock()
						  read_unlock(&tasklist_lock);
	     write_lock(&tasklist_lock)		  for_each_process_thread()
	     // link thread			    __uclamp_sync_rt()
             write_unlock(&tasklist_lock)
             sched_post_fork()
               __uclamp_sync_rt()

Where read_unlock()'s RELEASE matches write_lock()'s ACQUIRE and
sched_post_fork() must observe the uclamp_min_t STORE.

The smp_mb__after_spinlock() might be superfluous, but like I said, brain
isn't working.
Qais Yousef July 13, 2020, 12:12 p.m. UTC | #12
On 07/13/20 13:21, Peter Zijlstra wrote:
> > +	 * 2. fork()->sched_post_fork()
> > +	 *    __setscheduler_uclamp()
> > +	 *
> > +	 *	Both of these functions could read the old value but then get
> > +	 *	preempted, during which a user might write new value to
> > +	 *	sysctl_sched_uclamp_util_min_rt_default.
> > +	 *
> > +	 *	// read sysctl_sched_uclamp_util_min_rt_default;
> > +	 *	// PREEMPT-OUT
> > +	 *	.
> > +	 *	.                  <-- sync happens here
> > +	 *	.
> > +	 *	// PREEMPT-IN
> > +	 *	// write p->uclamp_req[UCLAMP_MIN]
> > +	 *
> > +	 *	That section is protected with rcu_read_lock(), so
> > +	 *	synchronize_rcu() will guarantee it has finished before we
> > +	 *	perform the update. Hence ensure that this sync happens after
> > +	 *	any concurrent sync which should guarantee correctness.
> > +	 */
> > +	synchronize_rcu();
> > +
> > +	rcu_read_lock();
> > +	for_each_process_thread(g, p)
> > +		__uclamp_sync_util_min_rt_default(p);
> > +	rcu_read_unlock();
> > +}
> 
> It's monday, and I cannot get my brain working.. I cannot decipher the
> comments you have with the smp_[rw]mb(), what actual ordering do they
> enforce?

It was a bit of paranoia, to ensure that readers on other CPUs see the new
value after this point.

> 
> Also, your synchronize_rcu() relies on write_lock() beeing
> non-preemptible, which isn't true on PREEMPT_RT.
> 
> The below seems simpler...

Hmm, maybe I am missing something obvious, but besides the race with fork, I was
worried about another race, and that's what the synchronize_rcu() is trying to
handle.

It's the classic preemption in the middle of RMW operation race.

		copy_process()			sysctl_uclamp

		  sched_post_fork()
		    __uclamp_sync_rt()
		      // read sysctl
		      // PREEMPT
						  for_each_process_thread()
		      // RESUME
		      // write syctl to p

So to summarize we have 3 scenarios:


	1. sysctl_uclamp happens *before* sched_post_fork()

for_each_process_thread() could miss the forked task, but that's okay because
sched_post_fork() will apply the correct value.


	2. sysctl_uclamp happens *during* sched_post_fork()

There's the risk of the classic preemption in the middle of RMW where another
CPU could have changed the shared variable after the current CPU has already
read it, but before writing it back.

I protect this with rcu_read_lock(), which as far as I know means
synchronize_rcu() will wait for any such update that is in flight inside the
read-side section to finish. New forkees entering the rcu_read_lock() section
will be okay because they should see the new value.

spinlocks() and mutexes seemed inferior to this approach.

Any other potential future user that needs to do __uclamp_sync_rt() could
suffer from this race.


	3. sysctl_uclamp happens *after* sched_post_fork()

Here if for_each_process_thread() still can't see the forked task, then we have
a problem. For this case I wasn't sure if we needed the
smp_mb__after_spinlock() dance. It seemed a stretch to me not to see the forked
task after this point.

Would a simple smp_mb() in for_each_process_thread() be sufficient instead?

Though maybe better to provide a generic macro to do this dance for the benefit
of potential other future users and just call it here and not think too much.

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0ee5e696c5d8..a124e3a1cb6d 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -586,7 +586,7 @@ extern void flush_itimer_signals(void);
        list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

 #define for_each_process(p) \
-       for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+       for (smp_mb(), p = &init_task ; (p = next_task(p)) != &init_task ; )

 extern bool current_is_single_threaded(void);

Thanks

--
Qais Yousef

> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1013,8 +1013,6 @@ static void __uclamp_sync_util_min_rt_de
>  	unsigned int default_util_min;
>  	struct uclamp_se *uc_se;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> -
>  	if (!rt_task(p))
>  		return;
>  
> @@ -1024,8 +1022,6 @@ static void __uclamp_sync_util_min_rt_de
>  	if (uc_se->user_defined)
>  		return;
>  
> -	/* Sync with smp_wmb() in uclamp_sync_util_min_rt_default() */
> -	smp_rmb();
>  	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
>  	uclamp_se_set(uc_se, default_util_min, false);
>  }
> @@ -1035,47 +1031,21 @@ static void uclamp_sync_util_min_rt_defa
>  	struct task_struct *g, *p;
>  
>  	/*
> -	 * Make sure the updated sysctl_sched_uclamp_util_min_rt_default which
> -	 * was just written is synchronized against any future read on another
> -	 * cpu.
> -	 */
> -	smp_wmb();
> -
> -	/*
> -	 * Wait for all updaters to observe the new change.
> -	 *
> -	 * There are 2 races to deal with here:
> -	 *
> -	 * 1. fork()->copy_process()
> -	 *
> -	 *	If a task was concurrently forking, for_each_process_thread()
> -	 *	will not see it, hence it could have copied the old value and
> -	 *	we missed the opportunity to update it.
> -	 *
> -	 *	This should be handled by sched_post_fork() where it'll ensure
> -	 *	it performs the sync after the fork.
> -	 *
> -	 * 2. fork()->sched_post_fork()
> -	 *    __setscheduler_uclamp()
> -	 *
> -	 *	Both of these functions could read the old value but then get
> -	 *	preempted, during which a user might write new value to
> -	 *	sysctl_sched_uclamp_util_min_rt_default.
> -	 *
> -	 *	// read sysctl_sched_uclamp_util_min_rt_default;
> -	 *	// PREEMPT-OUT
> -	 *	.
> -	 *	.                  <-- sync happens here
> -	 *	.
> -	 *	// PREEMPT-IN
> -	 *	// write p->uclamp_req[UCLAMP_MIN]
> -	 *
> -	 *	That section is protected with rcu_read_lock(), so
> -	 *	synchronize_rcu() will guarantee it has finished before we
> -	 *	perform the update. Hence ensure that this sync happens after
> -	 *	any concurrent sync which should guarantee correctness.
> -	 */
> -	synchronize_rcu();
> +	 * copy_process()			sysctl_uclamp
> +	 *					  uclamp_min_rt = X;
> +	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
> +	 *   // link thread			  smp_mb__after_spinlock()
> +	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
> +	 *   sched_post_fork()			  for_each_process_thread()
> +	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
> +	 *
> +	 * Ensures that either sched_post_fork() will observe the new
> +	 * uclamp_min_rt or for_each_process_thread() will observe the new
> +	 * task.
> +	 */
> +	read_lock(&tasklist_lock);
> +	smp_mb__after_spinlock();
> +	read_unlock(&tasklist_lock);
>  
>  	rcu_read_lock();
>  	for_each_process_thread(g, p)
> @@ -1408,6 +1378,9 @@ int sysctl_sched_uclamp_handler(struct c
>  		uclamp_update_root_tg();
>  	}
>  
> +	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default)
> +		uclamp_sync_util_min_rt_default();
> +
>  	/*
>  	 * We update all RUNNABLE tasks only when task groups are in use.
>  	 * Otherwise, keep it simple and do just a lazy update at each next
> @@ -1466,9 +1439,7 @@ static void __setscheduler_uclamp(struct
>  		 * at runtime.
>  		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) {
> -			rcu_read_lock();
>  			__uclamp_sync_util_min_rt_default(p);
> -			rcu_read_unlock();
>  		} else {
>  			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
>  		}
> @@ -1521,6 +1492,11 @@ static void __init init_uclamp_rq(struct
>  	rq->uclamp_flags = 0;
>  }
>  
> +static void uclamp_post_fork(struct task_struct *p)
> +{
> +	__uclamp_sync_util_min_rt_default(p);
> +}
> +
>  static void __init init_uclamp(void)
>  {
>  	struct uclamp_se uc_max = {};
Peter Zijlstra July 13, 2020, 1:35 p.m. UTC | #13
On Mon, Jul 13, 2020 at 01:12:46PM +0100, Qais Yousef wrote:
> On 07/13/20 13:21, Peter Zijlstra wrote:

> > It's monday, and I cannot get my brain working.. I cannot decipher the
> > comments you have with the smp_[rw]mb(), what actual ordering do they
> > enforce?
> 
> It was a  bit of a paranoia to ensure that readers on other cpus see the new
> value after this point.

IIUC that's not something any barrier can provide.

Barriers can only order between (at least) two memory operations:

	X = 1;		y = Y;
	smp_wmb();	smp_rmb();
	Y = 1;		x = X;

guarantees that if y == 1, then x must also be 1. Because the left hand
side orders the store of Y after the store of X, while the right hand
side order the load of X after the load of Y. Therefore, if the first
load observes the last store, the second load must observe the first
store.

Without a second variable, barriers can't guarantee _anything_. Which is
why any barrier comment should refer to at least two variables.

> > Also, your synchronize_rcu() relies on write_lock() beeing
> > non-preemptible, which isn't true on PREEMPT_RT.
> > 
> > The below seems simpler...

> Hmm maybe I am missing something obvious, but beside the race with fork; I was
> worried about another race and that's what the synchronize_rcu() is trying to
> handle.
> 
> It's the classic preemption in the middle of RMW operation race.
> 
> 		copy_process()			sysctl_uclamp
> 
> 		  sched_post_fork()
> 		    __uclamp_sync_rt()
> 		      // read sysctl
> 		      // PREEMPT
> 						  for_each_process_thread()
> 		      // RESUME
> 		      // write syctl to p
> 

> 	2. sysctl_uclamp happens *during* sched_post_fork()
> 
> There's the risk of the classic preemption in the middle of RMW where another
> CPU could have changed the shared variable after the current CPU has already
> read it, but before writing it back.

Aah.. I see.

> I protect this with rcu_read_lock() which as far as I know synchronize_rcu()
> will ensure if we do the update during this section; we'll wait for it to
> finish. New forkees entering the rcu_read_lock() section will be okay because
> they should see the new value.
> 
> spinlocks() and mutexes seemed inferior to this approach.

Well, didn't we just write in another patch that p->uclamp_* was
protected by both rq->lock and p->pi_lock?
Qais Yousef July 13, 2020, 2:27 p.m. UTC | #14
On 07/13/20 15:35, Peter Zijlstra wrote:
> On Mon, Jul 13, 2020 at 01:12:46PM +0100, Qais Yousef wrote:
> > On 07/13/20 13:21, Peter Zijlstra wrote:
> 
> > > It's monday, and I cannot get my brain working.. I cannot decipher the
> > > comments you have with the smp_[rw]mb(), what actual ordering do they
> > > enforce?
> > 
> > It was a  bit of a paranoia to ensure that readers on other cpus see the new
> > value after this point.
> 
> IIUC that's not something any barrier can provide.
> 
> Barriers can only order between (at least) two memory operations:
> 
> 	X = 1;		y = Y;
> 	smp_wmb();	smp_rmb();
> 	Y = 1;		x = X;
> 
> guarantees that if y == 1, then x must also be 1. Because the left hand
> side orders the store of Y after the store of X, while the right hand
> side order the load of X after the load of Y. Therefore, if the first
> load observes the last store, the second load must observe the first
> store.
> 
> Without a second variable, barriers can't guarantee _anything_. Which is
> why any barrier comment should refer to at least two variables.

Hmmm okay. I thought this would order the write against the read. In my head
if, for example, the write was stuck in the write buffer of CPU0, then a read
on CPU1 would wait for it to be committed before carrying on and issuing its
own read.

So I was reading this as don't issue new reads before current writes are done.

I need to go back and read memory-barriers.rst. It's been 10 years since I last
read it..

> 
> > > Also, your synchronize_rcu() relies on write_lock() beeing
> > > non-preemptible, which isn't true on PREEMPT_RT.
> > > 
> > > The below seems simpler...
> 
> > Hmm maybe I am missing something obvious, but beside the race with fork; I was
> > worried about another race and that's what the synchronize_rcu() is trying to
> > handle.
> > 
> > It's the classic preemption in the middle of RMW operation race.
> > 
> > 		copy_process()			sysctl_uclamp
> > 
> > 		  sched_post_fork()
> > 		    __uclamp_sync_rt()
> > 		      // read sysctl
> > 		      // PREEMPT
> > 						  for_each_process_thread()
> > 		      // RESUME
> > 		      // write syctl to p
> > 
> 
> > 	2. sysctl_uclamp happens *during* sched_post_fork()
> > 
> > There's the risk of the classic preemption in the middle of RMW where another
> > CPU could have changed the shared variable after the current CPU has already
> > read it, but before writing it back.
> 
> Aah.. I see.
> 
> > I protect this with rcu_read_lock() which as far as I know synchronize_rcu()
> > will ensure if we do the update during this section; we'll wait for it to
> > finish. New forkees entering the rcu_read_lock() section will be okay because
> > they should see the new value.
> > 
> > spinlocks() and mutexes seemed inferior to this approach.
> 
> Well, didn't we just write in another patch that p->uclamp_* was
> protected by both rq->lock and p->pi_lock?

The __setscheduler_uclamp() path is holding these locks, though I'm not sure
whether that's by design or it just happens that this path holds them. I can't
see the locks taken in the uclamp_fork() path. But it's sometimes hard to
unfold the layers of callers, especially since not all call sites are annotated
with which lock is assumed to be held.

Is it safe to hold the locks in uclamp_fork() while the task is still being
created? My new code doesn't hold them, of course.

We can enforce this rule if you like. Though an RCU critical section seems
lighter weight to me.

If all of this does indeed start looking messy we can put the update in
a delayed worker and schedule that instead of doing synchronous setup.

Or go back to task_woken_rt() with a cached per-rq variable of
sysctl_util_min_rt that is more likely to be cache hot compared to the global
sysctl_util_min_rt variable.

Thanks

--
Qais Yousef
Peter Zijlstra July 13, 2020, 4:54 p.m. UTC | #15
On Mon, Jul 13, 2020 at 03:27:55PM +0100, Qais Yousef wrote:
> On 07/13/20 15:35, Peter Zijlstra wrote:
> > > I protect this with rcu_read_lock() which as far as I know synchronize_rcu()
> > > will ensure if we do the update during this section; we'll wait for it to
> > > finish. New forkees entering the rcu_read_lock() section will be okay because
> > > they should see the new value.
> > > 
> > > spinlocks() and mutexes seemed inferior to this approach.
> > 
> > Well, didn't we just write in another patch that p->uclamp_* was
> > protected by both rq->lock and p->pi_lock?
> 
> __setscheduler_uclamp() path is holding these locks, not sure by design or it
> just happened this path holds the lock. I can't see the lock in the
> uclamp_fork() path. But it's hard sometimes to unfold the layers of callers,
> especially not all call sites are annotated for which lock is assumed to be
> held.
> 
> Is it safe to hold the locks in uclamp_fork() while the task is still being
> created? My new code doesn't hold it of course.
> 
> We can enforce this rule if you like. Though rcu critical section seems lighter
> weight to me.
> 
> If all of this does indeed start looking messy we can put the update in
> a delayed worker and schedule that instead of doing synchronous setup.

sched_fork() doesn't need the locks, because at that point the task
isn't visible yet. HOWEVER, sched_post_fork() is after pid-hash (per
design) and thus the task is visible, so we can race against
sched_setattr(), so we'd better hold those locks anyway.
Qais Yousef July 13, 2020, 6:09 p.m. UTC | #16
On 07/13/20 18:54, Peter Zijlstra wrote:
> On Mon, Jul 13, 2020 at 03:27:55PM +0100, Qais Yousef wrote:
> > On 07/13/20 15:35, Peter Zijlstra wrote:
> > > > I protect this with rcu_read_lock() which as far as I know synchronize_rcu()
> > > > will ensure if we do the update during this section; we'll wait for it to
> > > > finish. New forkees entering the rcu_read_lock() section will be okay because
> > > > they should see the new value.
> > > > 
> > > > spinlocks() and mutexes seemed inferior to this approach.
> > > 
> > > Well, didn't we just write in another patch that p->uclamp_* was
> > > protected by both rq->lock and p->pi_lock?
> > 
> > __setscheduler_uclamp() path is holding these locks, not sure by design or it
> > just happened this path holds the lock. I can't see the lock in the
> > uclamp_fork() path. But it's hard sometimes to unfold the layers of callers,
> > especially not all call sites are annotated for which lock is assumed to be
> > held.
> > 
> > Is it safe to hold the locks in uclamp_fork() while the task is still being
> > created? My new code doesn't hold it of course.
> > 
> > We can enforce this rule if you like. Though rcu critical section seems lighter
> > weight to me.
> > 
> > If all of this does indeed start looking messy we can put the update in
> > a delayed worker and schedule that instead of doing synchronous setup.
> 
> sched_fork() doesn't need the locks, because at that point the task
> isn't visible yet. HOWEVER, sched_post_fork() is after pid-hash (per
> design) and thus the task is visible, so we can race against
> sched_setattr(), so we'd better hold those locks anyway.

Okay. I thought task_rq_lock() is expensive because it'll compete with other
users in the fast path.

I got the below, which I am testing. I hit a splat in the uclamp static key
too while testing this :( I'll test them together and lump the fix into this
series in the next version.

Thanks

--
Qais Yousef

---
 include/linux/sched.h | 10 ++++--
 kernel/sched/core.c   | 81 ++++++++++++++++---------------------------
 2 files changed, 37 insertions(+), 54 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12b10ce51a08..81c4eed8d9a3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,9 +686,15 @@ struct task_struct {
 	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45bd4d9d2bba..8a648da4e7f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -914,74 +914,55 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
-static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
 {
 	unsigned int default_util_min;
 	struct uclamp_se *uc_se;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	if (!rt_task(p))
-		return;
-
 	uc_se = &p->uclamp_req[UCLAMP_MIN];
 
 	/* Only sync if user didn't override the default */
 	if (uc_se->user_defined)
 		return;
 
-	/* Sync with smp_wmb() in uclamp_sync_util_min_rt_default() */
-	smp_rmb();
 	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
 	uclamp_se_set(uc_se, default_util_min, false);
 }
 
+static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!rt_task(p))
+		return;
+
+	/* Protect updates to p->uclamp_* */
+	rq = task_rq_lock(p, &rf);
+	__uclamp_sync_util_min_rt_default_locked(p);
+	task_rq_unlock(rq, p, &rf);
+}
+
 static void uclamp_sync_util_min_rt_default(void)
 {
 	struct task_struct *g, *p;
 
 	/*
-	 * Make sure the updated sysctl_sched_uclamp_util_min_rt_default which
-	 * was just written is synchronized against any future read on another
-	 * cpu.
-	 */
-	smp_wmb();
-
-	/*
-	 * Wait for all updaters to observe the new change.
-	 *
-	 * There are 2 races to deal with here:
-	 *
-	 * 1. fork()->copy_process()
-	 *
-	 *	If a task was concurrently forking, for_each_process_thread()
-	 *	will not see it, hence it could have copied the old value and
-	 *	we missed the opportunity to update it.
-	 *
-	 *	This should be handled by sched_post_fork() where it'll ensure
-	 *	it performs the sync after the fork.
-	 *
-	 * 2. fork()->sched_post_fork()
-	 *    __setscheduler_uclamp()
-	 *
-	 *	Both of these functions could read the old value but then get
-	 *	preempted, during which a user might write new value to
-	 *	sysctl_sched_uclamp_util_min_rt_default.
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
 	 *
-	 *	// read sysctl_sched_uclamp_util_min_rt_default;
-	 *	// PREEMPT-OUT
-	 *	.
-	 *	.                  <-- sync happens here
-	 *	.
-	 *	// PREEMPT-IN
-	 *	// write p->uclamp_req[UCLAMP_MIN]
-	 *
-	 *	That section is protected with rcu_read_lock(), so
-	 *	synchronize_rcu() will guarantee it has finished before we
-	 *	perform the update. Hence ensure that this sync happens after
-	 *	any concurrent sync which should guarantee correctness.
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
 	 */
-	synchronize_rcu();
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
 
 	rcu_read_lock();
 	for_each_process_thread(g, p)
@@ -1377,9 +1358,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		 * at runtime.
 		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) {
-			rcu_read_lock();
-			__uclamp_sync_util_min_rt_default(p);
-			rcu_read_unlock();
+			__uclamp_sync_util_min_rt_default_locked(p);
 		} else {
 			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
 		}
@@ -1420,9 +1399,7 @@ static void uclamp_fork(struct task_struct *p)
 
 static void uclamp_post_fork(struct task_struct *p)
 {
-	rcu_read_lock();
 	__uclamp_sync_util_min_rt_default(p);
-	rcu_read_unlock();
 }
 
 static void __init init_uclamp_rq(struct rq *rq)

Patch

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 660ac49f2b53..efabeace2bbf 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -64,6 +64,7 @@  extern int sysctl_sched_rt_runtime;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236a..e7ddab095baf 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -55,6 +55,7 @@  extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..cb555c0b3f19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2304,6 +2304,7 @@  static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..aa0a0118ad97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -791,6 +791,23 @@  unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -873,6 +890,81 @@  unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+{
+	unsigned int default_util_min;
+	struct uclamp_se *uc_se;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (!rt_task(p))
+		return;
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/* Only sync if user didn't override the default */
+	if (uc_se->user_defined)
+		return;
+
+	/* Sync with smp_wmb() in uclamp_sync_util_min_rt_default() */
+	smp_rmb();
+	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * Make sure the updated sysctl_sched_uclamp_util_min_rt_default which
+	 * was just written is synchronized against any future read on another
+	 * cpu.
+	 */
+	smp_wmb();
+
+	/*
+	 * Wait for all updaters to observe the new change.
+	 *
+	 * There are 2 races to deal with here:
+	 *
+	 * 1. fork()->copy_process()
+	 *
+	 *	If a task was concurrently forking, for_each_process_thread()
+	 *	will not see it, hence it could have copied the old value and
+	 *	we missed the opportunity to update it.
+	 *
+	 *	This should be handled by sched_post_fork() where it'll ensure
+	 *	it performs the sync after the fork.
+	 *
+	 * 2. fork()->sched_post_fork()
+	 *    __setscheduler_uclamp()
+	 *
+	 *	Both of these functions could read the old value but then get
+	 *	preempted, during which a user might write new value to
+	 *	sysctl_sched_uclamp_util_min_rt_default.
+	 *
+	 *	// read sysctl_sched_uclamp_util_min_rt_default;
+	 *	// PREEMPT-OUT
+	 *	.
+	 *	.                  <-- sync happens here
+	 *	.
+	 *	// PREEMPT-IN
+	 *	// write p->uclamp_req[UCLAMP_MIN]
+	 *
+	 *	That section is protected with rcu_read_lock(), so
+	 *	synchronize_rcu() guarantees any such critical section has
+	 *	finished before we perform the update below. This ensures our
+	 *	sync happens after any concurrent one, which guarantees
+	 *	correctness.
+	 */
+	synchronize_rcu();
+
+	rcu_read_lock();
+	for_each_process_thread(g, p)
+		__uclamp_sync_util_min_rt_default(p);
+	rcu_read_unlock();
+}
+
 static inline struct uclamp_se
 uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1114,12 +1206,13 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				void *buffer, size_t *lenp, loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1128,7 +1221,9 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		goto done;
 
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE		||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
 		result = -EINVAL;
 		goto undo;
 	}
@@ -1147,6 +1242,9 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	if (update_root_tg)
 		uclamp_update_root_tg();
 
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default)
+		uclamp_sync_util_min_rt_default();
+
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
 	 * Otherwise, keep it simple and do just a lazy update at each next
@@ -1158,6 +1256,7 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1194,17 +1293,23 @@  static void __setscheduler_uclamp(struct task_struct *p,
 	 */
 	for_each_clamp_id(clamp_id) {
 		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
-		unsigned int clamp_value = uclamp_none(clamp_id);
 
 		/* Keep using defined clamps across class changes */
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
-		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+		/*
+		 * RT tasks by default get a 100% boost value that can be
+		 * modified at runtime.
+		 */
+		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) {
+			rcu_read_lock();
+			__uclamp_sync_util_min_rt_default(p);
+			rcu_read_unlock();
+		} else {
+			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
+		}
 
-		uclamp_se_set(uc_se, clamp_value, false);
 	}
 
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1237,6 +1342,13 @@  static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void uclamp_post_fork(struct task_struct *p)
+{
+	rcu_read_lock();
+	__uclamp_sync_util_min_rt_default(p);
+	rcu_read_unlock();
+}
+
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
@@ -1278,6 +1390,7 @@  static inline int uclamp_validate(struct task_struct *p,
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
 
@@ -2963,6 +3076,11 @@  int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
+void sched_post_fork(struct task_struct *p)
+{
+	uclamp_post_fork(p);
+}
+
 unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index db1ce7af2563..d36913c9f3b5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1801,6 +1801,13 @@  static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
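As the commit message notes, the knob is intended to be driven from a device's init script. A possible usage sketch (the path assumes the standard procfs sysctl mount; the value 256 is an arbitrary example, not a recommendation):

```shell
# Cap the default boost of RT tasks whose clamps were not set by the
# user to 25% of capacity (SCHED_CAPACITY_SCALE is 1024).
echo 256 > /proc/sys/kernel/sched_util_clamp_min_rt_default

# Read back the current value. Writes above 1024 are rejected with
# -EINVAL by sysctl_sched_uclamp_handler().
cat /proc/sys/kernel/sched_util_clamp_min_rt_default
```

Note the write only affects RT tasks with user_defined == false; tasks that set their own clamps via sched_setattr() keep their requested values.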