diff mbox series

[v6,13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

Message ID 20210518094725.7701-14-will@kernel.org (mailing list archive)
State New, archived
Headers show
Series Add support for 32-bit tasks on asymmetric AArch32 systems | expand

Commit Message

Will Deacon May 18, 2021, 9:47 a.m. UTC
On asymmetric systems where the affinity of a task is restricted to
contain only the CPUs capable of running it, admission to the deadline
scheduler is likely to fail because the span of the sched domain
contains incompatible CPUs. Although this is arguably the right thing to
do, it is inconsistent with the case where the affinity of a task is
restricted after already having been admitted to the deadline scheduler.

For example, on an arm64 system where not all CPUs support 32-bit
applications, a 64-bit deadline task can exec() a 32-bit image and have
its affinity forcefully restricted.

Rather than reject these tasks altogether, favour the requested user
affinity saved in 'task_struct::user_cpus_ptr' over the actual affinity
of the task which has been restricted by the kernel.

Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Quentin Perret May 18, 2021, 10:20 a.m. UTC | #1
On Tuesday 18 May 2021 at 10:47:17 (+0100), Will Deacon wrote:
> On asymmetric systems where the affinity of a task is restricted to
> contain only the CPUs capable of running it, admission to the deadline
> scheduler is likely to fail because the span of the sched domain
> contains incompatible CPUs. Although this is arguably the right thing to
> do, it is inconsistent with the case where the affinity of a task is
> restricted after already having been admitted to the deadline scheduler.
> 
> For example, on an arm64 system where not all CPUs support 32-bit
> applications, a 64-bit deadline task can exec() a 32-bit image and have
> its affinity forcefully restricted.

So I guess the alternative would be to fail exec-ing into 32bit from a
64bit DL task, and then drop this patch?

The nice thing about your approach is that existing applications won't
really notice a difference (execve would still 'work'), but on the cons
side it breaks admission control, which is sad.

I don't expect this weird execve-to-32bit pattern from DL to be that
common in practice (at the very least not in Android), so maybe we could
start with the stricter version (fail the execve), and wait to see if
folks complain? Making things stricter later will be harder.

Thoughts?

Thanks,
Quentin
Will Deacon May 18, 2021, 10:28 a.m. UTC | #2
[Dropping Li Zefan as his mail is bouncing]

On Tue, May 18, 2021 at 10:20:38AM +0000, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 10:47:17 (+0100), Will Deacon wrote:
> > On asymmetric systems where the affinity of a task is restricted to
> > contain only the CPUs capable of running it, admission to the deadline
> > scheduler is likely to fail because the span of the sched domain
> > contains incompatible CPUs. Although this is arguably the right thing to
> > do, it is inconsistent with the case where the affinity of a task is
> > restricted after already having been admitted to the deadline scheduler.
> > 
> > For example, on an arm64 system where not all CPUs support 32-bit
> > applications, a 64-bit deadline task can exec() a 32-bit image and have
> > its affinity forcefully restricted.
> 
> So I guess the alternative would be to fail exec-ing into 32bit from a
> 64bit DL task, and then drop this patch?
> 
> The nice thing about your approach is that existing applications won't
> really notice a difference (execve would still 'work'), but on the cons
> side it breaks admission control, which is sad.

Right, with your suggestion here we would forbid any 32-bit deadline tasks
on an asymmetric system, even if you'd gone to the extraordinary effort
to cater for that (e.g. by having a separate root domain).

> I don't expect this weird execve-to-32bit pattern from DL to be that
> common in practice (at the very least not in Android), so maybe we could
> start with the stricter version (fail the execve), and wait to see if
> folks complain? Making things stricter later will be harder.
> 
> Thoughts?

I don't have strong opinions on this, but I _do_ want the admission via
sched_setattr() to be consistent with execve(). What you're suggesting
ticks that box, but how many applications are prepared to handle a failed
execve()? I suspect it will be fatal.

Probably also worth pointing out that the approach here will at least
warn in the execve() case when the affinity is overridden for a deadline
task.

Will
Quentin Perret May 18, 2021, 10:48 a.m. UTC | #3
On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> I don't have strong opinions on this, but I _do_ want the admission via
> sched_setattr() to be consistent with execve(). What you're suggesting
> ticks that box, but how many applications are prepared to handle a failed
> execve()? I suspect it will be fatal.

Yep, probably.

> Probably also worth pointing out that the approach here will at least
> warn in the execve() case when the affinity is overridden for a deadline
> task.

Right so I think either way will be imperfect, so I agree with the
above.

Maybe one thing though is that, IIRC, userspace _can_ disable admission
control if it wants to. In this case I'd have no problem with allowing
this weird behaviour when admission control is off -- the kernel won't
provide any guarantees. But if it's left on, then it's a different
story.

So what about we say, if admission control is off, we allow execve() and
sched_setattr() with appropriate warnings as you suggest, but if
admission control is on then we fail both?

We might still see random failures in the wild if admission control is
left enabled on those devices but then I think these could qualify as
a device misconfiguration, not as a kernel bug.

Thoughts?

Thanks,
Quentin
Will Deacon May 18, 2021, 10:59 a.m. UTC | #4
On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > I don't have strong opinions on this, but I _do_ want the admission via
> > sched_setattr() to be consistent with execve(). What you're suggesting
> > ticks that box, but how many applications are prepared to handle a failed
> > execve()? I suspect it will be fatal.
> 
> Yep, probably.
> 
> > Probably also worth pointing out that the approach here will at least
> > warn in the execve() case when the affinity is overridden for a deadline
> > task.
> 
> Right so I think either way will be imperfect, so I agree with the
> above.
> 
> Maybe one thing though is that, IIRC, userspace _can_ disable admission
> control if it wants to. In this case I'd have no problem with allowing
> this weird behaviour when admission control is off -- the kernel won't
> provide any guarantees. But if it's left on, then it's a different
> story.
> 
> So what about we say, if admission control is off, we allow execve() and
> sched_setattr() with appropriate warnings as you suggest, but if
> admission control is on then we fail both?

That's an interesting idea. The part that I'm not super keen about is
that it means admission control _also_ has an effect on the behaviour of
execve(), so practically you'd have to have it disabled as long as you
have the possibility of 32-bit deadline tasks anywhere in the system,
which impacts 64-bit tasks which may well want admission control enabled.

So perhaps my initial position of trying to keep sched_setattr() and
execve() consistent with each other is flawed and actually we can say:

  * Disable admission control if you want to admit a 32-bit task explicitly
    via sched_setattr()

  * If a 64-bit deadline task execve()s a 32-bit program then we warn
    and override the affinity (i.e. you should avoid doing this if you
    care about the deadlines).

That amounts to dropping this patch and tweaking the documentation.

Dunno, what do you think?

Will
Quentin Perret May 18, 2021, 1:19 p.m. UTC | #5
On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > I don't have strong opinions on this, but I _do_ want the admission via
> > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > ticks that box, but how many applications are prepared to handle a failed
> > > execve()? I suspect it will be fatal.
> > 
> > Yep, probably.
> > 
> > > Probably also worth pointing out that the approach here will at least
> > > warn in the execve() case when the affinity is overridden for a deadline
> > > task.
> > 
> > Right so I think either way will be imperfect, so I agree with the
> > above.
> > 
> > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > control if it wants to. In this case I'd have no problem with allowing
> > this weird behaviour when admission control is off -- the kernel won't
> > provide any guarantees. But if it's left on, then it's a different
> > story.
> > 
> > So what about we say, if admission control is off, we allow execve() and
> > sched_setattr() with appropriate warnings as you suggest, but if
> > admission control is on then we fail both?
> 
> That's an interesting idea. The part that I'm not super keen about is
> that it means admission control _also_ has an effect on the behaviour of
> execve()

Right, that's a good point. And it looks like fork() behaves the same
regardless of admission control being enabled or not -- it is forbidden
from DL either way. So I can't say there is a precedent :/

> so practically you'd have to have it disabled as long as you
> have the possibility of 32-bit deadline tasks anywhere in the system,
> which impacts 64-bit tasks which may well want admission control enabled.

Indeed, this is a bit sad, but I don't know if the kernel should pretend
it can guarantee to meet your deadlines and at the same time allow to do
something that wrecks the underlying theory.

I'd personally be happy with saying that admission control should be
disabled on these dumb systems (and have that documented), at least
until DL gets proper support for affinities. ISTR there was work going
in that direction, but some folks in the CC list will know better.

@Juri, maybe you would know if that's still planned?

Thanks,
Quentin
Juri Lelli May 20, 2021, 9:13 a.m. UTC | #6
Hi Quentin and Will,

Apologies for the delay in replying.

On 18/05/21 13:19, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> > On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > > I don't have strong opinions on this, but I _do_ want the admission via
> > > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > > ticks that box, but how many applications are prepared to handle a failed
> > > > execve()? I suspect it will be fatal.
> > > 
> > > Yep, probably.
> > > 
> > > > Probably also worth pointing out that the approach here will at least
> > > > warn in the execve() case when the affinity is overridden for a deadline
> > > > task.
> > > 
> > > Right so I think either way will be imperfect, so I agree with the
> > > above.
> > > 
> > > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > > control if it wants to. In this case I'd have no problem with allowing
> > > this weird behaviour when admission control is off -- the kernel won't
> > > provide any guarantees. But if it's left on, then it's a different
> > > story.
> > > 
> > > So what about we say, if admission control is off, we allow execve() and
> > > sched_setattr() with appropriate warnings as you suggest, but if
> > > admission control is on then we fail both?
> > 
> > That's an interesting idea. The part that I'm not super keen about is
> > that it means admission control _also_ has an effect on the behaviour of
> > execve()
> 
> Right, that's a good point. And it looks like fork() behaves the same
> regardless of admission control being enabled or not -- it is forbidden
> from DL either way. So I can't say there is a precedent :/
> 
> > so practically you'd have to have it disabled as long as you
> > have the possibility of 32-bit deadline tasks anywhere in the system,
> > which impacts 64-bit tasks which may well want admission control enabled.
> 
> Indeed, this is a bit sad, but I don't know if the kernel should pretend
> it can guarantee to meet your deadlines and at the same time allow to do
> something that wrecks the underlying theory.
> 
> I'd personally be happy with saying that admission control should be
> disabled on these dumb systems (and have that documented), at least
> until DL gets proper support for affinities. ISTR there was work going
> in that direction, but some folks in the CC list will know better.
> 
> @Juri, maybe you would know if that's still planned?

I won't go as far as saying planned, but that is still under "our" radar
for sure. Daniel was working on it, but I don't think he had any time to
resume that bit of work lately.

So, until we have that, I think we have been as conservative as we could
for this type of decisions. I'm a little afraid that allowing
configuration to break admission control (even with a non fatal warning
is emitted) is still risky. I'd go with fail hard if AC is on, let it
pass if AC is off (supposedly the user knows what to do). But I'm not
familiar with the mixed 32/64 apps usecase you describe, so I might be
missing details.

Best,
Juri
Will Deacon May 20, 2021, 10:16 a.m. UTC | #7
Hi Juri,

On Thu, May 20, 2021 at 11:13:39AM +0200, Juri Lelli wrote:
> Apologies for the delay in replying.

Not at all!

> On 18/05/21 13:19, Quentin Perret wrote:
> > On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> > > On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > > > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > > > I don't have strong opinions on this, but I _do_ want the admission via
> > > > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > > > ticks that box, but how many applications are prepared to handle a failed
> > > > > execve()? I suspect it will be fatal.
> > > > 
> > > > Yep, probably.
> > > > 
> > > > > Probably also worth pointing out that the approach here will at least
> > > > > warn in the execve() case when the affinity is overridden for a deadline
> > > > > task.
> > > > 
> > > > Right so I think either way will be imperfect, so I agree with the
> > > > above.
> > > > 
> > > > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > > > control if it wants to. In this case I'd have no problem with allowing
> > > > this weird behaviour when admission control is off -- the kernel won't
> > > > provide any guarantees. But if it's left on, then it's a different
> > > > story.
> > > > 
> > > > So what about we say, if admission control is off, we allow execve() and
> > > > sched_setattr() with appropriate warnings as you suggest, but if
> > > > admission control is on then we fail both?
> > > 
> > > That's an interesting idea. The part that I'm not super keen about is
> > > that it means admission control _also_ has an effect on the behaviour of
> > > execve()
> > 
> > Right, that's a good point. And it looks like fork() behaves the same
> > regardless of admission control being enabled or not -- it is forbidden
> > from DL either way. So I can't say there is a precedent :/
> > 
> > > so practically you'd have to have it disabled as long as you
> > > have the possibility of 32-bit deadline tasks anywhere in the system,
> > > which impacts 64-bit tasks which may well want admission control enabled.
> > 
> > Indeed, this is a bit sad, but I don't know if the kernel should pretend
> > it can guarantee to meet your deadlines and at the same time allow to do
> > something that wrecks the underlying theory.
> > 
> > I'd personally be happy with saying that admission control should be
> > disabled on these dumb systems (and have that documented), at least
> > until DL gets proper support for affinities. ISTR there was work going
> > in that direction, but some folks in the CC list will know better.
> > 
> > @Juri, maybe you would know if that's still planned?
> 
> I won't go as far as saying planned, but that is still under "our" radar
> for sure. Daniel was working on it, but I don't think he had any time to
> resume that bit of work lately.
> 
> So, until we have that, I think we have been as conservative as we could
> for this type of decisions. I'm a little afraid that allowing
> configuration to break admission control (even with a non fatal warning
> is emitted) is still risky. I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do). But I'm not
> familiar with the mixed 32/64 apps usecase you describe, so I might be
> missing details.

Ok, thanks for the insight. In which case, I'll go with what we discussed:
require admission control to be disabled for sched_setattr() but allow
execve() to a 32-bit task from a 64-bit deadline task with a warning (this
is probably similar to CPU hotplug?).

I'll update that for v8, and this patch will disappear.

Will
Quentin Perret May 20, 2021, 10:33 a.m. UTC | #8
On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> require admission control to be disabled for sched_setattr() but allow
> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> is probably similar to CPU hotplug?).

Still not sure that we can let execve go through ... It will break AC
all the same, so it should probably fail as well if AC is on IMO
Juri Lelli May 20, 2021, 12:38 p.m. UTC | #9
On 20/05/21 10:33, Quentin Perret wrote:
> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > require admission control to be disabled for sched_setattr() but allow
> > execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > is probably similar to CPU hotplug?).
> 
> Still not sure that we can let execve go through ... It will break AC
> all the same, so it should probably fail as well if AC is on IMO
> 

Yeah, this was my thinking as well.
Daniel Bristot de Oliveira May 20, 2021, 12:38 p.m. UTC | #10
On 5/20/21 12:33 PM, Quentin Perret wrote:
> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>> require admission control to be disabled for sched_setattr() but allow
>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>> is probably similar to CPU hotplug?).
> 
> Still not sure that we can let execve go through ... It will break AC
> all the same, so it should probably fail as well if AC is on IMO
> 

If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
the admission control needs to be re-executed, and it could fail. So I see this
operation equivalent to sched_setaffinity(). This will likely be true for future
schedulers that will allow arbitrary affinities (AC should run on affinity
change, and could fail).

I would vote with Juri: "I'd go with fail hard if AC is on, let it
pass if AC is off (supposedly the user knows what to do)," (also hope nobody
complains until we add better support for affinity, and use this as a motivation
to get back on this front).

-- Daniel
Dietmar Eggemann May 20, 2021, 3:06 p.m. UTC | #11
On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
> On 5/20/21 12:33 PM, Quentin Perret wrote:
>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>> require admission control to be disabled for sched_setattr() but allow
>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>> is probably similar to CPU hotplug?).
>>
>> Still not sure that we can let execve go through ... It will break AC
>> all the same, so it should probably fail as well if AC is on IMO
>>
> 
> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> the admission control needs to be re-executed, and it could fail. So I see this
> operation equivalent to sched_setaffinity(). This will likely be true for future
> schedulers that will allow arbitrary affinities (AC should run on affinity
> change, and could fail).
> 
> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> complains until we add better support for affinity, and use this as a motivation
> to get back on this front).
> 
> -- Daniel

(1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app

(2) # ./32bit_app &

    # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)


Wouldn't the behaviour of (1) and (2) be different w/o this patch?

In (1) __sched_setscheduler() happens before execve so it operates on
p->cpus_ptr equal span.

In (2) span != p->cpus_ptr so DL AC will fail.
Daniel Bristot de Oliveira May 20, 2021, 4 p.m. UTC | #12
On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
> On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>> require admission control to be disabled for sched_setattr() but allow
>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>> is probably similar to CPU hotplug?).
>>>
>>> Still not sure that we can let execve go through ... It will break AC
>>> all the same, so it should probably fail as well if AC is on IMO
>>>
>>
>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>> the admission control needs to be re-executed, and it could fail. So I see this
>> operation equivalent to sched_setaffinity(). This will likely be true for future
>> schedulers that will allow arbitrary affinities (AC should run on affinity
>> change, and could fail).
>>
>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>> complains until we add better support for affinity, and use this as a motivation
>> to get back on this front).
>>
>> -- Daniel
> 
> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
> 
> (2) # ./32bit_app &
> 
>     # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
> 
> 
> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
> 
> In (1) __sched_setscheduler() happens before execve so it operates on
> p->cpus_ptr equal span.
> 
> In (2) span != p->cpus_ptr so DL AC will fail.
> 

As far as I got, the case (1) would be spitted in two steps:

 - __sched_setscheduler() will work, then
 - execv() would fail because (span != p->cpus_ptr)

So... at the end, both (1) and (2) would result in a failure...

am I missing something?

-- Daniel
Dietmar Eggemann May 20, 2021, 5:55 p.m. UTC | #13
On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
> On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
>> On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>>> require admission control to be disabled for sched_setattr() but allow
>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>>> is probably similar to CPU hotplug?).
>>>>
>>>> Still not sure that we can let execve go through ... It will break AC
>>>> all the same, so it should probably fail as well if AC is on IMO
>>>>
>>>
>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>>> the admission control needs to be re-executed, and it could fail. So I see this
>>> operation equivalent to sched_setaffinity(). This will likely be true for future
>>> schedulers that will allow arbitrary affinities (AC should run on affinity
>>> change, and could fail).
>>>
>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>>> complains until we add better support for affinity, and use this as a motivation
>>> to get back on this front).
>>>
>>> -- Daniel
>>
>> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
>>
>> (2) # ./32bit_app &
>>
>>     # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
>>
>>
>> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
>>
>> In (1) __sched_setscheduler() happens before execve so it operates on
>> p->cpus_ptr equal span.
>>
>> In (2) span != p->cpus_ptr so DL AC will fail.
>>
> 
> As far as I got, the case (1) would be spitted in two steps:
> 
>  - __sched_setscheduler() will work, then
>  - execv() would fail because (span != p->cpus_ptr)
> 
> So... at the end, both (1) and (2) would result in a failure...
> 
> am I missing something?

Not sure. Reading this thread I was under the assumption that the only
change would be the drop of this patch. But I assume there is also this
'if DL AC is on then let sched_setattr() fail for this 32bit task'.

IMHO, the current patch-stack w/o this patch should let (1) succeed with
DL AC.
Will Deacon May 20, 2021, 6:01 p.m. UTC | #14
On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> On 5/20/21 12:33 PM, Quentin Perret wrote:
> > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> >> require admission control to be disabled for sched_setattr() but allow
> >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> >> is probably similar to CPU hotplug?).
> > 
> > Still not sure that we can let execve go through ... It will break AC
> > all the same, so it should probably fail as well if AC is on IMO
> > 
> 
> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> the admission control needs to be re-executed, and it could fail. So I see this
> operation equivalent to sched_setaffinity(). This will likely be true for future
> schedulers that will allow arbitrary affinities (AC should run on affinity
> change, and could fail).
> 
> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> complains until we add better support for affinity, and use this as a motivation
> to get back on this front).

I can have a go at implementing it, but I don't think it's a great solution
and here's why:

Failing an execve() is _very_ likely to be fatal to the application. It's
also very likely that the task calling execve() doesn't know whether the
program it's trying to execute is 32-bit or not. Consequently, if we go
with failing execve() then all that will happen is that people will disable
admission control altogether. That has a negative impact on "pure" 64-bit
applications and so I think we end up with the tail wagging the dog because
admission control will be disabled for everybody just because there is a
handful of 32-bit programs which may get executed. I understand that it
also means that RT throttling would be disabled.

Allowing the execve() to continue with a warning is very similar to the
case in which all the 64-bit CPUs are hot-unplugged at the point of
execve(), and this is much closer to the illusion that this patch series
intends to provide.

So, personally speaking, I would prefer the behaviour where we refuse to
admit 32-bit tasks vioa sched_set_attr() if the root domain contains
64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
64-bit deadline task.

However, you're the deadline experts so ultimately I'll implement what
you prefer. I just wanted to explain why I think it's a poor interface.

Have I changed anybody's mind?

Will
Will Deacon May 20, 2021, 6:03 p.m. UTC | #15
Hi Dietmar,

On Thu, May 20, 2021 at 07:55:27PM +0200, Dietmar Eggemann wrote:
> On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
> > On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
> >> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
> >>
> >> (2) # ./32bit_app &
> >>
> >>     # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
> >>
> >>
> >> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
> >>
> >> In (1) __sched_setscheduler() happens before execve so it operates on
> >> p->cpus_ptr equal span.
> >>
> >> In (2) span != p->cpus_ptr so DL AC will fail.
> >>
> > 
> > As far as I got, the case (1) would be spitted in two steps:
> > 
> >  - __sched_setscheduler() will work, then
> >  - execv() would fail because (span != p->cpus_ptr)
> > 
> > So... at the end, both (1) and (2) would result in a failure...
> > 
> > am I missing something?
> 
> Not sure. Reading this thread I was under the assumption that the only
> change would be the drop of this patch. But I assume there is also this
> 'if DL AC is on then let sched_setattr() fail for this 32bit task'.
> 
> IMHO, the current patch-stack w/o this patch should let (1) succeed with
> DL AC.

That's what I'm proposing, yes, but others (including Daniel) prefer to
fail the execve(). See my other reply just now for a summary [1].

Thanks!

Will

[1] https://lore.kernel.org/lkml/20210520180138.GA10523@willie-the-truck/T/#u
Juri Lelli May 21, 2021, 5:25 a.m. UTC | #16
On 20/05/21 19:01, Will Deacon wrote:
> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > >> require admission control to be disabled for sched_setattr() but allow
> > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > >> is probably similar to CPU hotplug?).
> > > 
> > > Still not sure that we can let execve go through ... It will break AC
> > > all the same, so it should probably fail as well if AC is on IMO
> > > 
> > 
> > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > the admission control needs to be re-executed, and it could fail. So I see this
> > operation equivalent to sched_setaffinity(). This will likely be true for future
> > schedulers that will allow arbitrary affinities (AC should run on affinity
> > change, and could fail).
> > 
> > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > complains until we add better support for affinity, and use this as a motivation
> > to get back on this front).
> 
> I can have a go at implementing it, but I don't think it's a great solution
> and here's why:
> 
> Failing an execve() is _very_ likely to be fatal to the application. It's
> also very likely that the task calling execve() doesn't know whether the
> program it's trying to execute is 32-bit or not. Consequently, if we go
> with failing execve() then all that will happen is that people will disable
> admission control altogether. That has a negative impact on "pure" 64-bit
> applications and so I think we end up with the tail wagging the dog because
> admission control will be disabled for everybody just because there is a
> handful of 32-bit programs which may get executed. I understand that it
> also means that RT throttling would be disabled.

Completely understand your perplexity. But how can the kernel still give
guarantees to "pure" 64-bit applications if there are 32-bit
applications around that essentially broke admission control when they
were restricted to a subset of cores?

> Allowing the execve() to continue with a warning is very similar to the
> case in which all the 64-bit CPUs are hot-unplugged at the point of
> execve(), and this is much closer to the illusion that this patch series
> intends to provide.

So, for hotplug we currently have a check that would make hotplug
operations fail if removing a CPU would mean not enough bandwidth to run
the currently admitted set of DEADLINE tasks.

> So, personally speaking, I would prefer the behaviour where we refuse to
> admit 32-bit tasks vioa sched_set_attr() if the root domain contains
> 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> 64-bit deadline task.

OK, this is interesting and I guess a very valid alternative. That would
force users to create exclusive domains for 32-bit tasks, right?

> However, you're the deadline experts so ultimately I'll implement what
> you prefer. I just wanted to explain why I think it's a poor interface.
> 
> Have I changed anybody's mind?

Partly! :)

Thanks a lot for the discussion so far.

Juri
Quentin Perret May 21, 2021, 8:15 a.m. UTC | #17
On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> On 20/05/21 19:01, Will Deacon wrote:
> > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > >> require admission control to be disabled for sched_setattr() but allow
> > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > >> is probably similar to CPU hotplug?).
> > > > 
> > > > Still not sure that we can let execve go through ... It will break AC
> > > > all the same, so it should probably fail as well if AC is on IMO
> > > > 
> > > 
> > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > the admission control needs to be re-executed, and it could fail. So I see this
> > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > change, and could fail).
> > > 
> > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > complains until we add better support for affinity, and use this as a motivation
> > > to get back on this front).
> > 
> > I can have a go at implementing it, but I don't think it's a great solution
> > and here's why:
> > 
> > Failing an execve() is _very_ likely to be fatal to the application. It's
> > also very likely that the task calling execve() doesn't know whether the
> > program it's trying to execute is 32-bit or not. Consequently, if we go
> > with failing execve() then all that will happen is that people will disable
> > admission control altogether.

Right, but only on these dumb 32bit asymmetric systems, and only if we
care about running 32bits deadline tasks -- which I seriously doubt for
the Android use-case.

Note that running deadline tasks is also a privileged operation, it
can't be done by random apps.

> > That has a negative impact on "pure" 64-bit
> > applications and so I think we end up with the tail wagging the dog because
> > admission control will be disabled for everybody just because there is a
> > handful of 32-bit programs which may get executed. I understand that it
> > also means that RT throttling would be disabled.
> 
> Completely understand your perplexity. But how can the kernel still give
> guarantees to "pure" 64-bit applications if there are 32-bit
> applications around that essentially broke admission control when they
> were restricted to a subset of cores?
> 
> > Allowing the execve() to continue with a warning is very similar to the
> > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > execve(), and this is much closer to the illusion that this patch series
> > intends to provide.
> 
> So, for hotplug we currently have a check that would make hotplug
> operations fail if removing a CPU would mean not enough bandwidth to run
> the currently admitted set of DEADLINE tasks.

Aha, wasn't aware. Any pointers to that check for my education?

> > So, personally speaking, I would prefer the behaviour where we refuse to
> > admit 32-bit tasks vioa sched_set_attr() if the root domain contains
> > 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> > 64-bit deadline task.
> 
> OK, this is interesting and I guess a very valid alternative. That would
> force users to create exclusive domains for 32-bit tasks, right?

FWIW this is not practical at all for our use-cases, the implications of
splitting the system in independent root-domains are way too important
for us to be able to recommend that. Disabling AC, OTOH, sounds simple
enough. The RT throttling part is the only 'worrying' part, but even
that may not be the end of the world.

Thanks!
Quentin
Juri Lelli May 21, 2021, 8:39 a.m. UTC | #18
On 21/05/21 08:15, Quentin Perret wrote:
> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> > On 20/05/21 19:01, Will Deacon wrote:
> > > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > > >> require admission control to be disabled for sched_setattr() but allow
> > > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > > >> is probably similar to CPU hotplug?).
> > > > > 
> > > > > Still not sure that we can let execve go through ... It will break AC
> > > > > all the same, so it should probably fail as well if AC is on IMO
> > > > > 
> > > > 
> > > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > > the admission control needs to be re-executed, and it could fail. So I see this
> > > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > > change, and could fail).
> > > > 
> > > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > > complains until we add better support for affinity, and use this as a motivation
> > > > to get back on this front).
> > > 
> > > I can have a go at implementing it, but I don't think it's a great solution
> > > and here's why:
> > > 
> > > Failing an execve() is _very_ likely to be fatal to the application. It's
> > > also very likely that the task calling execve() doesn't know whether the
> > > program it's trying to execute is 32-bit or not. Consequently, if we go
> > > with failing execve() then all that will happen is that people will disable
> > > admission control altogether.
> 
> Right, but only on these dumb 32bit asymmetric systems, and only if we
> care about running 32bits deadline tasks -- which I seriously doubt for
> the Android use-case.
> 
> Note that running deadline tasks is also a privileged operation, it
> can't be done by random apps.
> 
> > > That has a negative impact on "pure" 64-bit
> > > applications and so I think we end up with the tail wagging the dog because
> > > admission control will be disabled for everybody just because there is a
> > > handful of 32-bit programs which may get executed. I understand that it
> > > also means that RT throttling would be disabled.
> > 
> > Completely understand your perplexity. But how can the kernel still give
> > guarantees to "pure" 64-bit applications if there are 32-bit
> > applications around that essentially broke admission control when they
> > were restricted to a subset of cores?
> > 
> > > Allowing the execve() to continue with a warning is very similar to the
> > > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > > execve(), and this is much closer to the illusion that this patch series
> > > intends to provide.
> > 
> > So, for hotplug we currently have a check that would make hotplug
> > operations fail if removing a CPU would mean not enough bandwidth to run
> > the currently admitted set of DEADLINE tasks.
> 
> Aha, wasn't aware. Any pointers to that check for my education?

Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
got removed), IIRC. So, if that fails the operation in undone.

> > > So, personally speaking, I would prefer the behaviour where we refuse to
> > > admit 32-bit tasks vioa sched_set_attr() if the root domain contains
> > > 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> > > 64-bit deadline task.
> > 
> > OK, this is interesting and I guess a very valid alternative. That would
> > force users to create exclusive domains for 32-bit tasks, right?
> 
> FWIW this is not practical at all for our use-cases, the implications of
> splitting the system in independent root-domains are way too important
> for us to be able to recommend that. Disabling AC, OTOH, sounds simple
> enough. The RT throttling part is the only 'worrying' part, but even
> that may not be the end of the world.

Note that RT throttling (SCHED_{FIFO,RR}) is not handled by DEADLINE
servers yet.

Best,
Juri
Will Deacon May 21, 2021, 10:37 a.m. UTC | #19
On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
> On 21/05/21 08:15, Quentin Perret wrote:
> > On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> > > On 20/05/21 19:01, Will Deacon wrote:
> > > > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > > > >> require admission control to be disabled for sched_setattr() but allow
> > > > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > > > >> is probably similar to CPU hotplug?).
> > > > > > 
> > > > > > Still not sure that we can let execve go through ... It will break AC
> > > > > > all the same, so it should probably fail as well if AC is on IMO
> > > > > > 
> > > > > 
> > > > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > > > the admission control needs to be re-executed, and it could fail. So I see this
> > > > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > > > change, and could fail).
> > > > > 
> > > > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > > > complains until we add better support for affinity, and use this as a motivation
> > > > > to get back on this front).
> > > > 
> > > > I can have a go at implementing it, but I don't think it's a great solution
> > > > and here's why:
> > > > 
> > > > Failing an execve() is _very_ likely to be fatal to the application. It's
> > > > also very likely that the task calling execve() doesn't know whether the
> > > > program it's trying to execute is 32-bit or not. Consequently, if we go
> > > > with failing execve() then all that will happen is that people will disable
> > > > admission control altogether.
> > 
> > Right, but only on these dumb 32bit asymmetric systems, and only if we
> > care about running 32bits deadline tasks -- which I seriously doubt for
> > the Android use-case.
> > 
> > Note that running deadline tasks is also a privileged operation, it
> > can't be done by random apps.
> > 
> > > > That has a negative impact on "pure" 64-bit
> > > > applications and so I think we end up with the tail wagging the dog because
> > > > admission control will be disabled for everybody just because there is a
> > > > handful of 32-bit programs which may get executed. I understand that it
> > > > also means that RT throttling would be disabled.
> > > 
> > > Completely understand your perplexity. But how can the kernel still give
> > > guarantees to "pure" 64-bit applications if there are 32-bit
> > > applications around that essentially broke admission control when they
> > > were restricted to a subset of cores?
> > > 
> > > > Allowing the execve() to continue with a warning is very similar to the
> > > > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > > > execve(), and this is much closer to the illusion that this patch series
> > > > intends to provide.
> > > 
> > > So, for hotplug we currently have a check that would make hotplug
> > > operations fail if removing a CPU would mean not enough bandwidth to run
> > > the currently admitted set of DEADLINE tasks.
> > 
> > Aha, wasn't aware. Any pointers to that check for my education?
> 
> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
> got removed), IIRC. So, if that fails the operation in undone.

Interesting, thanks. Thinking about this some more, it strikes me that with
these silly asymmetric systems there could be an interesting additional
problem with hotplug and deadline tasks. Imagine the following sequence of
events:

  1. All online CPUs are 32-bit-capable
  2. sched_setattr() admits a 32-bit deadline task
  3. A 64-bit-only CPU is onlined
  4. Some of the 32-bit-capable CPUs are offlined

I wonder if we can get into a situation where we think we have enough
bandwidth available, but in reality the 32-bit task is in trouble because
it can't make use of the 64-bit-only CPU.

If so, then it seems to me that admission control is really just
"best-effort" for 32-bit deadline tasks on these systems because it's based
on a snapshot in time of the available resources.

Will
Dietmar Eggemann May 21, 2021, 11:23 a.m. UTC | #20
On 21/05/2021 12:37, Will Deacon wrote:
> On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
>> On 21/05/21 08:15, Quentin Perret wrote:
>>> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
>>>> On 20/05/21 19:01, Will Deacon wrote:
>>>>> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
>>>>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>>>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>>>>>> require admission control to be disabled for sched_setattr() but allow
>>>>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>>>>>> is probably similar to CPU hotplug?).
>>>>>>>
>>>>>>> Still not sure that we can let execve go through ... It will break AC
>>>>>>> all the same, so it should probably fail as well if AC is on IMO
>>>>>>>
>>>>>>
>>>>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>>>>>> the admission control needs to be re-executed, and it could fail. So I see this
>>>>>> operation equivalent to sched_setaffinity(). This will likely be true for future
>>>>>> schedulers that will allow arbitrary affinities (AC should run on affinity
>>>>>> change, and could fail).
>>>>>>
>>>>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>>>>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>>>>>> complains until we add better support for affinity, and use this as a motivation
>>>>>> to get back on this front).
>>>>>
>>>>> I can have a go at implementing it, but I don't think it's a great solution
>>>>> and here's why:
>>>>>
>>>>> Failing an execve() is _very_ likely to be fatal to the application. It's
>>>>> also very likely that the task calling execve() doesn't know whether the
>>>>> program it's trying to execute is 32-bit or not. Consequently, if we go
>>>>> with failing execve() then all that will happen is that people will disable
>>>>> admission control altogether.
>>>
>>> Right, but only on these dumb 32bit asymmetric systems, and only if we
>>> care about running 32bits deadline tasks -- which I seriously doubt for
>>> the Android use-case.
>>>
>>> Note that running deadline tasks is also a privileged operation, it
>>> can't be done by random apps.
>>>
>>>>> That has a negative impact on "pure" 64-bit
>>>>> applications and so I think we end up with the tail wagging the dog because
>>>>> admission control will be disabled for everybody just because there is a
>>>>> handful of 32-bit programs which may get executed. I understand that it
>>>>> also means that RT throttling would be disabled.
>>>>
>>>> Completely understand your perplexity. But how can the kernel still give
>>>> guarantees to "pure" 64-bit applications if there are 32-bit
>>>> applications around that essentially broke admission control when they
>>>> were restricted to a subset of cores?
>>>>
>>>>> Allowing the execve() to continue with a warning is very similar to the
>>>>> case in which all the 64-bit CPUs are hot-unplugged at the point of
>>>>> execve(), and this is much closer to the illusion that this patch series
>>>>> intends to provide.
>>>>
>>>> So, for hotplug we currently have a check that would make hotplug
>>>> operations fail if removing a CPU would mean not enough bandwidth to run
>>>> the currently admitted set of DEADLINE tasks.
>>>
>>> Aha, wasn't aware. Any pointers to that check for my education?
>>
>> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
>> got removed), IIRC. So, if that fails the operation in undone.
> 
> Interesting, thanks. Thinking about this some more, it strikes me that with
> these silly asymmetric systems there could be an interesting additional
> problem with hotplug and deadline tasks. Imagine the following sequence of
> events:
> 
>   1. All online CPUs are 32-bit-capable
>   2. sched_setattr() admits a 32-bit deadline task
>   3. A 64-bit-only CPU is onlined
>   4. Some of the 32-bit-capable CPUs are offlined
> 
> I wonder if we can get into a situation where we think we have enough
> bandwidth available, but in reality the 32-bit task is in trouble because
> it can't make use of the 64-bit-only CPU.
> 
> If so, then it seems to me that admission control is really just
> "best-effort" for 32-bit deadline tasks on these systems because it's based
> on a snapshot in time of the available resources.

IMHO DL AC is per root domain (rd). So if we have e.g. an 8 CPU system
with aarch32_el0 eq. [0-3] then we would need 2 exclusive cpusets ([0-3]
and [4-7]) to admit 32-bit DL tasks into [0-3] (i.e. to pass the `if
(!cpumask_subset(span, p->cpus_ptr) ...` test in __sched_setscheduler().

Trying to admit too many 32-bit DL tasks or trying to hp out a CPU[0-3]
would lead to `Device or resource busy` in case the rd bw wouldn't be
sufficient anymore for the set of admitted tasks. But the [0-3] DL AC
wouldn't care about hp on CPU[4-7].
Dietmar Eggemann May 21, 2021, 11:26 a.m. UTC | #21
On 20/05/2021 20:03, Will Deacon wrote:
> Hi Dietmar,
> 
> On Thu, May 20, 2021 at 07:55:27PM +0200, Dietmar Eggemann wrote:
>> On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
>>> On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
>>>> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
>>>>
>>>> (2) # ./32bit_app &
>>>>
>>>>     # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
>>>>
>>>>
>>>> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
>>>>
>>>> In (1) __sched_setscheduler() happens before execve so it operates on
>>>> p->cpus_ptr equal span.
>>>>
>>>> In (2) span != p->cpus_ptr so DL AC will fail.
>>>>
>>>
>>> As far as I got, the case (1) would be spitted in two steps:
>>>
>>>  - __sched_setscheduler() will work, then
>>>  - execv() would fail because (span != p->cpus_ptr)
>>>
>>> So... at the end, both (1) and (2) would result in a failure...
>>>
>>> am I missing something?
>>
>> Not sure. Reading this thread I was under the assumption that the only
>> change would be the drop of this patch. But I assume there is also this
>> 'if DL AC is on then let sched_setattr() fail for this 32bit task'.
>>
>> IMHO, the current patch-stack w/o this patch should let (1) succeed with
>> DL AC.
> 
> That's what I'm proposing, yes, but others (including Daniel) prefer to
> fail the execve(). See my other reply just now for a summary [1].

[...]

Thanks, Will ... Now I understand. Or at least I think I do ;-)

> [1] https://lore.kernel.org/lkml/20210520180138.GA10523@willie-the-truck/T/#u
Daniel Bristot de Oliveira May 21, 2021, 1 p.m. UTC | #22
On 5/21/21 12:37 PM, Will Deacon wrote:
> Interesting, thanks. Thinking about this some more, it strikes me that with
> these silly asymmetric systems there could be an interesting additional
> problem with hotplug and deadline tasks. Imagine the following sequence of
> events:
> 
>   1. All online CPUs are 32-bit-capable
>   2. sched_setattr() admits a 32-bit deadline task
>   3. A 64-bit-only CPU is onlined

At the point 3, the global scheduler assumption is broken. For instance, in a
system with four CPUs and five ready 32-bit-capable tasks, when the fifth CPU as
added, the working conserving rule is violated because the five highest priority
thread are not running (only four are) :-(.

So, at this point, for us to keep to the current behavior, the addition should
be.. blocked? :-((

>   4. Some of the 32-bit-capable CPUs are offlined

Assuming that point 3 does not exist (i.e., all CPUs are 32-bit-capable). At
this point, we will have an increase in the pressure on the 32-bit-capable CPUs.

This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
tasks will still be "queued" in a future time where they were supposed to be
done (leaving time for the 64-bit tasks).

> I wonder if we can get into a situation where we think we have enough
> bandwidth available, but in reality the 32-bit task is in trouble because
> it can't make use of the 64-bit-only CPU.

I would have to think more, but there might be a case where this contended
32-bit tasks could cause deadline misses for the 64-bit too.

> If so, then it seems to me that admission control is really just
> "best-effort" for 32-bit deadline tasks on these systems because it's based
> on a snapshot in time of the available resources.

The admission test as is now is "best-effort" in the sense that it allows a
workload higher than it could handle (it is necessary, but not sufficient AC).
But it should not be considered "best-effort" because of violations in the
working conserving property as a result of arbitrary affinities among tasks.
Overall, we have been trying to close any "exception left" to this later case.

I know, it is a complex situation, I am just trying to illustrate our concerns,
because, in the near future we might have a scheduler that handles arbitrary
affinity correctly. But that might require us to stick to an AC. The AC is
something precious for us.

(yeah, SP would not face this problem... now that I made progress on RV I can
get back to it).

-- Daniel
Quentin Perret May 21, 2021, 1:02 p.m. UTC | #23
On Friday 21 May 2021 at 13:23:55 (+0200), Dietmar Eggemann wrote:
> On 21/05/2021 12:37, Will Deacon wrote:
> > On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
> >> On 21/05/21 08:15, Quentin Perret wrote:
> >>> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> >>>> On 20/05/21 19:01, Will Deacon wrote:
> >>>>> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> >>>>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
> >>>>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> >>>>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> >>>>>>>> require admission control to be disabled for sched_setattr() but allow
> >>>>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> >>>>>>>> is probably similar to CPU hotplug?).
> >>>>>>>
> >>>>>>> Still not sure that we can let execve go through ... It will break AC
> >>>>>>> all the same, so it should probably fail as well if AC is on IMO
> >>>>>>>
> >>>>>>
> >>>>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> >>>>>> the admission control needs to be re-executed, and it could fail. So I see this
> >>>>>> operation equivalent to sched_setaffinity(). This will likely be true for future
> >>>>>> schedulers that will allow arbitrary affinities (AC should run on affinity
> >>>>>> change, and could fail).
> >>>>>>
> >>>>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> >>>>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> >>>>>> complains until we add better support for affinity, and use this as a motivation
> >>>>>> to get back on this front).
> >>>>>
> >>>>> I can have a go at implementing it, but I don't think it's a great solution
> >>>>> and here's why:
> >>>>>
> >>>>> Failing an execve() is _very_ likely to be fatal to the application. It's
> >>>>> also very likely that the task calling execve() doesn't know whether the
> >>>>> program it's trying to execute is 32-bit or not. Consequently, if we go
> >>>>> with failing execve() then all that will happen is that people will disable
> >>>>> admission control altogether.
> >>>
> >>> Right, but only on these dumb 32bit asymmetric systems, and only if we
> >>> care about running 32bits deadline tasks -- which I seriously doubt for
> >>> the Android use-case.
> >>>
> >>> Note that running deadline tasks is also a privileged operation, it
> >>> can't be done by random apps.
> >>>
> >>>>> That has a negative impact on "pure" 64-bit
> >>>>> applications and so I think we end up with the tail wagging the dog because
> >>>>> admission control will be disabled for everybody just because there is a
> >>>>> handful of 32-bit programs which may get executed. I understand that it
> >>>>> also means that RT throttling would be disabled.
> >>>>
> >>>> Completely understand your perplexity. But how can the kernel still give
> >>>> guarantees to "pure" 64-bit applications if there are 32-bit
> >>>> applications around that essentially broke admission control when they
> >>>> were restricted to a subset of cores?
> >>>>
> >>>>> Allowing the execve() to continue with a warning is very similar to the
> >>>>> case in which all the 64-bit CPUs are hot-unplugged at the point of
> >>>>> execve(), and this is much closer to the illusion that this patch series
> >>>>> intends to provide.
> >>>>
> >>>> So, for hotplug we currently have a check that would make hotplug
> >>>> operations fail if removing a CPU would mean not enough bandwidth to run
> >>>> the currently admitted set of DEADLINE tasks.
> >>>
> >>> Aha, wasn't aware. Any pointers to that check for my education?
> >>
> >> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
> >> got removed), IIRC. So, if that fails the operation in undone.
> > 
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> > 
> >   1. All online CPUs are 32-bit-capable
> >   2. sched_setattr() admits a 32-bit deadline task
> >   3. A 64-bit-only CPU is onlined
> >   4. Some of the 32-bit-capable CPUs are offlined
> > 
> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
> > 
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
> 
> IMHO DL AC is per root domain (rd). So if we have e.g. an 8 CPU system
> with aarch32_el0 eq. [0-3] then we would need 2 exclusive cpusets ([0-3]
> and [4-7]) to admit 32-bit DL tasks into [0-3] (i.e. to pass the `if
> (!cpumask_subset(span, p->cpus_ptr) ...` test in __sched_setscheduler().
> 
> Trying to admit too many 32-bit DL tasks or trying to hp out a CPU[0-3]
> would lead to `Device or resource busy` in case the rd bw wouldn't be
> sufficient anymore for the set of admitted tasks. But the [0-3] DL AC
> wouldn't care about hp on CPU[4-7].

So I think Will has a point since, IIRC, the root domains get rebuilt
during hotplug. So you can imagine a case with a single root domain, but
CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
a task to DL as long as its affinity mask is a superset of the rd span,
but things may get ugly when CPUs are plugged back in later on.

This looks like an existing bug though. I just tried the following on a
system with 4 CPUs:

    // Create a task affined to CPU [0-2]
    > while true; do echo "Hi" > /dev/null; done &
    [1] 560
    > mypid=$!
    > taskset -p 7 $mypid
    pid 560's current affinity mask: f
    pid 560's new affinity mask: 7

    // Try to move it DL, this should fail because of the affinity
    > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
    chrt: failed to set pid 560's policy: Operation not permitted

    // Offline CPU 3, so the rd now covers CPUs 0-2 only
    > echo 0 > /sys/devices/system/cpu/cpu3/online
    [  400.843830] CPU3: shutdown
    [  400.844100] psci: CPU3 killed (polled 0 ms)

    // Try to admit the task again, which now succeeds
    > chrt -d -T 5000000 -P 16666666 -p 0 $mypid

    // Plug CPU3 back online
    > echo 1 > /sys/devices/system/cpu/cpu3/online
    [  408.819337] Detected PIPT I-cache on CPU3
    [  408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
    [  408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]

I don't see any easy way to fix this w/o iterating over all deadline
tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
operation if it'll cause affinity issues. Urgh.
Quentin Perret May 21, 2021, 1:12 p.m. UTC | #24
On Friday 21 May 2021 at 15:00:42 (+0200), Daniel Bristot de Oliveira wrote:
> On 5/21/21 12:37 PM, Will Deacon wrote:
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> > 
> >   1. All online CPUs are 32-bit-capable
> >   2. sched_setattr() admits a 32-bit deadline task
> >   3. A 64-bit-only CPU is onlined
> 
> At the point 3, the global scheduler assumption is broken. For instance, in a
> system with four CPUs and five ready 32-bit-capable tasks, when the fifth CPU as
> added, the working conserving rule is violated because the five highest priority
> thread are not running (only four are) :-(.
> 
> So, at this point, for us to keep to the current behavior, the addition should
> be.. blocked? :-((
> 
> >   4. Some of the 32-bit-capable CPUs are offlined
> 
> Assuming that point 3 does not exist (i.e., all CPUs are 32-bit-capable). At
> this point, we will have an increase in the pressure on the 32-bit-capable CPUs.
> 
> This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
> tasks will still be "queued" in a future time where they were supposed to be
> done (leaving time for the 64-bit tasks).
> 
> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
> 
> I would have to think more, but there might be a case where this contended
> 32-bit tasks could cause deadline misses for the 64-bit too.
> 
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
> 
> The admission test as is now is "best-effort" in the sense that it allows a
> workload higher than it could handle (it is necessary, but not sufficient AC).
> But it should not be considered "best-effort" because of violations in the
> working conserving property as a result of arbitrary affinities among tasks.
> Overall, we have been trying to close any "exception left" to this later case.
> 
> I know, it is a complex situation, I am just trying to illustrate our concerns,
> because, in the near future we might have a scheduler that handles arbitrary
> affinity correctly. But that might require us to stick to an AC. The AC is
> something precious for us.

FWIW, I agree with the above. As pointed out in another reply, this
looks like an existing bug and there is nothing specific to 32bits tasks
here.

But I don't think the existence of this bug should be a reason for
breaking AC even more than it is. So it still feels cleaner to block
execve() into 32bit if AC is on until we have proper support for
affinities in DL. And if folks want to use 32bit DL tasks on these
systems they'll have to disable AC, but at least they know what they are
getting ...

Thanks,
Quentin
Juri Lelli May 21, 2021, 2:04 p.m. UTC | #25
On 21/05/21 13:02, Quentin Perret wrote:

...

> So I think Will has a point since, IIRC, the root domains get rebuilt
> during hotplug. So you can imagine a case with a single root domain, but
> CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
> a task to DL as long as its affinity mask is a superset of the rd span,
> but things may get ugly when CPUs are plugged back in later on.
> 
> This looks like an existing bug though. I just tried the following on a
> system with 4 CPUs:
> 
>     // Create a task affined to CPU [0-2]
>     > while true; do echo "Hi" > /dev/null; done &
>     [1] 560
>     > mypid=$!
>     > taskset -p 7 $mypid
>     pid 560's current affinity mask: f
>     pid 560's new affinity mask: 7
> 
>     // Try to move it DL, this should fail because of the affinity
>     > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>     chrt: failed to set pid 560's policy: Operation not permitted
> 
>     // Offline CPU 3, so the rd now covers CPUs 0-2 only
>     > echo 0 > /sys/devices/system/cpu/cpu3/online
>     [  400.843830] CPU3: shutdown
>     [  400.844100] psci: CPU3 killed (polled 0 ms)
> 
>     // Try to admit the task again, which now succeeds
>     > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
> 
>     // Plug CPU3 back online
>     > echo 1 > /sys/devices/system/cpu/cpu3/online
>     [  408.819337] Detected PIPT I-cache on CPU3
>     [  408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
>     [  408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
> 
> I don't see any easy way to fix this w/o iterating over all deadline
> tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
> operation if it'll cause affinity issues. Urgh.
> 

Yeah this looks like a plain existing bug, joy. :)

We fixed a few around AC lately, but I guess work wasn't complete.

Thanks,
Juri
Dietmar Eggemann May 21, 2021, 5:47 p.m. UTC | #26
On 21/05/2021 16:04, Juri Lelli wrote:
> On 21/05/21 13:02, Quentin Perret wrote:
> 
> ...
> 
>> So I think Will has a point since, IIRC, the root domains get rebuilt
>> during hotplug. So you can imagine a case with a single root domain, but
>> CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
>> a task to DL as long as its affinity mask is a superset of the rd span,
>> but things may get ugly when CPUs are plugged back in later on.

Yeah, that's true. I understand the condition, that the task's affinity
mask has to be a superset of the rd span, as that DL AC (i.e DL BW
management) can only work correctly if all admitted tasks can run on
every CPU in the rd.

Like you said, you can already today let tasks with reduced affinity
mask pass the DL AC in case you hp out the other CPUs and then trick DL
AC by hp in the remaining CPUs and admit more DL tasks. But these steps
require a lot of effort to create this false setup.

The dedicated rd for 32-bit tasks matching `aarch32_el0` in an exclusive
cpuset env seems to be a feasible approach to me. But I also don't see
an eminent use case for this.

>> This looks like an existing bug though. I just tried the following on a
>> system with 4 CPUs:
>>
>>     // Create a task affined to CPU [0-2]
>>     > while true; do echo "Hi" > /dev/null; done &
>>     [1] 560
>>     > mypid=$!
>>     > taskset -p 7 $mypid
>>     pid 560's current affinity mask: f
>>     pid 560's new affinity mask: 7
>>
>>     // Try to move it DL, this should fail because of the affinity
>>     > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>>     chrt: failed to set pid 560's policy: Operation not permitted
>>
>>     // Offline CPU 3, so the rd now covers CPUs 0-2 only
>>     > echo 0 > /sys/devices/system/cpu/cpu3/online
>>     [  400.843830] CPU3: shutdown
>>     [  400.844100] psci: CPU3 killed (polled 0 ms)
>>
>>     // Try to admit the task again, which now succeeds
>>     > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>>
>>     // Plug CPU3 back online
>>     > echo 1 > /sys/devices/system/cpu/cpu3/online
>>     [  408.819337] Detected PIPT I-cache on CPU3
>>     [  408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
>>     [  408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
>>
>> I don't see any easy way to fix this w/o iterating over all deadline
>> tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
>> operation if it'll cause affinity issues. Urgh.

Something like dl_cpu_busy() in cpuset_cpu_inactive() but the other way
around in cpuset_cpu_active().

We iterate over all DL tasks in partition_and_rebuild_sched_domains() ->
rebuild_root_domains() -> update_tasks_root_domain() ->
dl_add_task_root_domain(struct task_struct *p) to recreate DL BW
information after CPU hp but this is asynchronously to cpuset_cpu_active().

> 
> Yeah this looks like a plain existing bug, joy. :)
> 
> We fixed a few around AC lately, but I guess work wasn't complete.
> 
> Thanks,
> Juri
Will Deacon May 24, 2021, 8:47 p.m. UTC | #27
On Fri, May 21, 2021 at 03:00:42PM +0200, Daniel Bristot de Oliveira wrote:
> On 5/21/21 12:37 PM, Will Deacon wrote:
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> > 
> >   1. All online CPUs are 32-bit-capable
> >   2. sched_setattr() admits a 32-bit deadline task
> >   3. A 64-bit-only CPU is onlined
> 
> At the point 3, the global scheduler assumption is broken. For instance, in a
> system with four CPUs and five ready 32-bit-capable tasks, when the fifth CPU as
> added, the working conserving rule is violated because the five highest priority
> thread are not running (only four are) :-(.
> 
> So, at this point, for us to keep to the current behavior, the addition should
> be.. blocked? :-((
> 
> >   4. Some of the 32-bit-capable CPUs are offlined
> 
> Assuming that point 3 does not exist (i.e., all CPUs are 32-bit-capable). At
> this point, we will have an increase in the pressure on the 32-bit-capable CPUs.
> 
> This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
> tasks will still be "queued" in a future time where they were supposed to be
> done (leaving time for the 64-bit tasks).

That's a really interesting point that I hadn't previously considered. It
means that the effects of 32-bit tasks with forced affinity are far
reaching when it comes to deadline tasks.

> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
> 
> I would have to think more, but there might be a case where this contended
> 32-bit tasks could cause deadline misses for the 64-bit too.
> 
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
> 
> The admission test as is now is "best-effort" in the sense that it allows a
> workload higher than it could handle (it is necessary, but not sufficient AC).
> But it should not be considered "best-effort" because of violations in the
> working conserving property as a result of arbitrary affinities among tasks.
> Overall, we have been trying to close any "exception left" to this later case.
> 
> I know, it is a complex situation, I am just trying to illustrate our concerns,
> because, in the near future we might have a scheduler that handles arbitrary
> affinity correctly. But that might require us to stick to an AC. The AC is
> something precious for us.

I've implemented AC on execve() of a 32-bit program so we'll fail that system
call with -ENOEXEC if the root domain contains 64-bit-only CPUs. As expected,
the failure mode is awful because it seems as though the ELF binary is then
treated like a shell script by userspace and passed to /bin/sh:

$ sudo chrt -d -T 5000000 -P 16666666 0 ./hello32
./hello32: 1: Syntax error: word unexpected (expecting ")")

Anyway, I'll include this in v7.

Cheers,

Will
diff mbox series

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba66bcf8e812..d7d058fc012e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6403,13 +6403,14 @@  static int __sched_setscheduler(struct task_struct *p,
 		if (dl_bandwidth_enabled() && dl_policy(policy) &&
 				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
 			cpumask_t *span = rq->rd->span;
+			const cpumask_t *aff = p->user_cpus_ptr ?: p->cpus_ptr;
 
 			/*
 			 * Don't allow tasks with an affinity mask smaller than
 			 * the entire root_domain to become SCHED_DEADLINE. We
 			 * will also fail if there's no bandwidth available.
 			 */
-			if (!cpumask_subset(span, p->cpus_ptr) ||
+			if (!cpumask_subset(span, aff) ||
 			    rq->rd->dl_bw.bw == 0) {
 				retval = -EPERM;
 				goto unlock;