diff mbox series

[v4,09/14] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

Message ID 20201124155039.13804-10-will@kernel.org (mailing list archive)
State New, archived
Headers show
Series An alternative series for asymmetric AArch32 systems | expand

Commit Message

Will Deacon Nov. 24, 2020, 3:50 p.m. UTC
If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.

In preparation for allowing architectures to provide their own fallback
mask, just return early if we're not using cgroup v2 and allow
select_fallback_rq() to figure out the mask by itself.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/cgroup/cpuset.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Qais Yousef Nov. 27, 2020, 1:32 p.m. UTC | #1
On 11/24/20 15:50, Will Deacon wrote:
> If the scheduler cannot find an allowed CPU for a task,
> cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
> if cgroup v1 is in use.
> 
> In preparation for allowing architectures to provide their own fallback
> mask, just return early if we're not using cgroup v2 and allow
> select_fallback_rq() to figure out the mask by itself.

What about cpuset_attach()? When a task attaches to a new group its affinity
could be reset to possible_cpu_mask if it moves to top_cpuset or the
intersection of effective_cpus & cpu_online_mask.

Probably handled with later patches.

/me continue to look at the rest

Okay so in patch 11 we make set_cpus_allowed_ptr() fail. cpuset_attach will
just do WARN_ON_ONCE() and carry on.

I think we can fix that by making sure cpuset_can_attach() will fail. Which can
be done by fixing task_can_attach() to take into account the new arch
task_cpu_possible_mask()?

Thanks

--
Qais Yousef

> 
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  kernel/cgroup/cpuset.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 57b5b5d0a5fd..e970737c3ed2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3299,9 +3299,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
> +	if (!is_in_v2_mode())
> +		return; /* select_fallback_rq will try harder */
> +
>  	rcu_read_lock();
> -	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> -		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
> +	do_set_cpus_allowed(tsk, task_cs(tsk)->cpus_allowed);
>  	rcu_read_unlock();
>  
>  	/*
> -- 
> 2.29.2.454.gaff20da3a2-goog
>
Qais Yousef Nov. 30, 2020, 5:05 p.m. UTC | #2
On 11/27/20 13:32, Qais Yousef wrote:
> On 11/24/20 15:50, Will Deacon wrote:
> > If the scheduler cannot find an allowed CPU for a task,
> > cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
> > if cgroup v1 is in use.
> > 
> > In preparation for allowing architectures to provide their own fallback
> > mask, just return early if we're not using cgroup v2 and allow
> > select_fallback_rq() to figure out the mask by itself.
> 
> What about cpuset_attach()? When a task attaches to a new group its affinity
> could be reset to possible_cpu_mask if it moves to top_cpuset or the
> intersection of effective_cpus & cpu_online_mask.
> 
> Probably handled with later patches.
> 
> /me continue to look at the rest
> 
> Okay so in patch 11 we make set_cpus_allowed_ptr() fail. cpuset_attach will
> just do WARN_ON_ONCE() and carry on.
> 
> I think we can fix that by making sure cpuset_can_attach() will fail. Which can
> be done by fixing task_can_attach() to take into account the new arch
> task_cpu_possible_mask()?

I ran some experiments and indeed we have some problems here :(

I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
both.

If I try to move my test binary to 64bit cpuset, it moves there and I see the
WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
previously. Breaking cpusets effectively.

This can be easily fixable with my suggestion to educate task_can_attach() to
take into account the new arch_task_cpu_cpu_possible_mask(). The attachment
will fail and userspace will get an error then.

But this introduces another issue..

If I use cpumask_subset() in task_can_attach(), then an attempt to move to
'mix' cpuset will fail. In Android foreground cpuset usually contains a mixture
of cpus. So we want this to work.

Also 'background' tasks are usually bound to littles. If the littles are 64bit
only, then user space must ensure to add a 32bit capable cpu to the list. If we
can't attach to mixed cpuset, then this workaround won't work and Android will
have to create different cpusets for 32bit and 64bit apps. Which would be
difficult since 64bit apps do embed 32bit binaries and it'd be hard for
userspace to deal with that. Unless we extend execve syscall to take a cgroup
as an arg to attach to (like clone3 syscall), then potentially we can do
something about it, I think, by 'fixing' the libc wrapper.

If I use cpumask_intersects() to validate the attachment, then the 64bit
attachment will fail but the 'mix' one will succeed, as desired. BUT we'll
re-introduce the WARN_ON_ONCE() again since __set_cpus_allowed_ptr_locked()
validates against arch_task_cpu_possible_mask() with cpumask_subset() :-/
ie: cpuset_attach()::set_cpus_allowed_ptr() will fail with just a warning
printed once when attaching to 'mix'.

We can fix this by making __set_cpus_allowed_ptr_locked() perform cpumask_and()
like restrict_cpus_allowed_ptr() does.

Though this will not only truncate cpuset mask with the intersection of
arch_task_cpu_possible_mask(), but also truncate it for sched_setaffinity()
syscall. Something to keep in mind.

Finally there's the ugly problem that I'm not sure how to address of modifying
the cpuset.cpus of, for example 'mix' cpuset, such that we end up with 64bit
only cpus. If the 32bit task has already attached, then we'll end up with
a broken cpuset with a task attached having 'invalid' cpumask.

We can teach cpuset.c:update_cpumask()::validate_change() to do something about
it. By either walking the attached tasks and validating the new mask against
the arch one. Or by storing the arch mask in struct cpuset. Though for the
latter if we want to be truly generic, we need to ensure we handle the case of
2 tasks having different arch_task_cpu_possible_mask().

Long email. Hopefully it makes sense!

I can share my test script, binary/code and fixup patches if you like.

Thanks

--
Qais Yousef
Quentin Perret Nov. 30, 2020, 5:36 p.m. UTC | #3
On Monday 30 Nov 2020 at 17:05:31 (+0000), Qais Yousef wrote:
> I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
> all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
> both.
> 
> If I try to move my test binary to 64bit cpuset, it moves there and I see the
> WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
> set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
> previously. Breaking cpusets effectively.

Right, and so does exec'ing from a 64 bit task into 32 bit executable
from within a 64 bit-only cpuset :( . And there is nothing we can really
do about it, we cannot fail the exec because we need this to work for
existing apps, and there is no way the Android framework can know
upfront.

So the only thing we can do really is WARN() and proceed to ignore the
cpuset, which is what this series does :/. It's not exactly pretty but I
don't think we can do much better than that TBH, and it's the same thing
for the example you brought up. Failing cpuset_can_attach() will not
help, we can only WARN and proceed ...

Now, Android should be fine with that I think. We only need the kernel
to implement a safe fallback mechanism when userspace gives
contradictory commands, because we know there are edge cases userspace
_cannot_ deal with correctly, but this fallback doesn't need to be
highly optimized (at least for Android), but I'm happy to hear what
others think.

Thanks,
Quentin
Qais Yousef Dec. 1, 2020, 11:58 a.m. UTC | #4
On 11/30/20 17:36, Quentin Perret wrote:
> On Monday 30 Nov 2020 at 17:05:31 (+0000), Qais Yousef wrote:
> > I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
> > all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
> > both.
> > 
> > If I try to move my test binary to 64bit cpuset, it moves there and I see the
> > WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
> > set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
> > previously. Breaking cpusets effectively.
> 
> Right, and so does exec'ing from a 64 bit task into 32 bit executable
> from within a 64 bit-only cpuset :( . And there is nothing we can really

True. The kernel can decide to kill the task or force detach it then, no?
Sending SIGKILL makes more sense.

> do about it, we cannot fail the exec because we need this to work for
> existing apps, and there is no way the Android framework can know
> upfront.

It knows upfront it has enabled asym aarch32. So it needs to make sure not to
create 'invalid' cpusets?

> 
> So the only thing we can do really is WARN() and proceed to ignore the
> cpuset, which is what this series does :/. It's not exactly pretty but I
> don't think we can do much better than that TBH, and it's the same thing
> for the example you brought up. Failing cpuset_can_attach() will not
> help, we can only WARN and proceed ...

I think for cases where we can prevent userspace from doing something wrong, we
should. Like trying to attach to a cpuset that will result in an empty mask.
FWIW, it does something similar with deadline tasks. See task_can_attach().

Similarly for the case when userspace tries to modify the cpuset.cpus such that
a task will end up with empty cpumask. We now have the new case that some tasks
can only run on a subset of cpu_possible_mask. So the definition of empty
cpumask has gained an extra meaning.

> 
> Now, Android should be fine with that I think. We only need the kernel
> to implement a safe fallback mechanism when userspace gives
> contradictory commands, because we know there are edge cases userspace
> _cannot_ deal with correctly, but this fallback doesn't need to be
> highly optimized (at least for Android), but I'm happy to hear what
> others think.

Why not go with our original patch that fixes affinity then in the arch code if
the task wakes up on the wrong cpu? It is much simpler approach IMO to achieve
the same thing.

I was under the impression that if we go down teaching the scheduler about asym
ISA, then we have to deal with these edge cases. I don't think we're far away
from getting there.

Thanks

--
Qais Yousef
Quentin Perret Dec. 1, 2020, 12:37 p.m. UTC | #5
On Tuesday 01 Dec 2020 at 11:58:42 (+0000), Qais Yousef wrote:
> On 11/30/20 17:36, Quentin Perret wrote:
> > On Monday 30 Nov 2020 at 17:05:31 (+0000), Qais Yousef wrote:
> > > I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
> > > all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
> > > both.
> > > 
> > > If I try to move my test binary to 64bit cpuset, it moves there and I see the
> > > WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
> > > set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
> > > previously. Breaking cpusets effectively.
> > 
> > Right, and so does exec'ing from a 64 bit task into 32 bit executable
> > from within a 64 bit-only cpuset :( . And there is nothing we can really
> 
> True. The kernel can decide to kill the task or force detach it then, no?
> Sending SIGKILL makes more sense.

Yeah but again, we need this to work for existing apps. Just killing it
basically means we have no support, so that doesn't work for the use
case :/

> > do about it, we cannot fail the exec because we need this to work for
> > existing apps, and there is no way the Android framework can know
> > upfront.
> 
> It knows upfront it has enabled asym aarch32. So it needs to make sure not to
> create 'invalid' cpusets?

Problem is, we _really_ don't want to keep a big CPU in the background
cpuset just for that. And even if we did, we'd have to deal with hotplug.

> > 
> > So the only thing we can do really is WARN() and proceed to ignore the
> > cpuset, which is what this series does :/. It's not exactly pretty but I
> > don't think we can do much better than that TBH, and it's the same thing
> > for the example you brought up. Failing cpuset_can_attach() will not
> > help, we can only WARN and proceed ...
> 
> I think for cases where we can prevent userspace from doing something wrong, we
> should. Like trying to attach to a cpuset that will result in an empty mask.
> FWIW, it does something similar with deadline tasks. See task_can_attach().
> 
> Similarly for the case when userspace tries to modify the cpuset.cpus such that
> a task will end up with empty cpumask. We now have the new case that some tasks
> can only run on a subset of cpu_possible_mask. So the definition of empty
> cpumask has gained an extra meaning.

I see this differently, e.g. if you affine a task to a CPU and you
hotunplug it, then the kernel falls back to the remaining online CPUs
for that task. Not pretty, but it keeps things functional. I'm thinking
a similar kind of support would be good enough here.

But yes, this cpuset mess is the part of the series I hate most too.
It's just not clear we have better solutions :/

> > 
> > Now, Android should be fine with that I think. We only need the kernel
> > to implement a safe fallback mechanism when userspace gives
> > contradictory commands, because we know there are edge cases userspace
> > _cannot_ deal with correctly, but this fallback doesn't need to be
> > highly optimized (at least for Android), but I'm happy to hear what
> > others think.
> 
> Why not go with our original patch that fixes affinity then in the arch code if
> the task wakes up on the wrong cpu? It is much simpler approach IMO to achieve
> the same thing.

I personally had no issues with that patch, but as per Peter's original
reply, that's "not going to happen". Will's proposal seems to go one
step further and tries its best to honor the contract with userspace (by
keeping the subset of the affinity mask, ...) when that can be done, so
if that can be acceptable, then be it. But I'd still rather keep this
simple if at all possible. It's just my opinion though :)

Thanks,
Quentin
Qais Yousef Dec. 1, 2020, 2:11 p.m. UTC | #6
On 12/01/20 12:37, Quentin Perret wrote:
> On Tuesday 01 Dec 2020 at 11:58:42 (+0000), Qais Yousef wrote:
> > On 11/30/20 17:36, Quentin Perret wrote:
> > > On Monday 30 Nov 2020 at 17:05:31 (+0000), Qais Yousef wrote:
> > > > I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
> > > > all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
> > > > both.
> > > > 
> > > > If I try to move my test binary to 64bit cpuset, it moves there and I see the
> > > > WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
> > > > set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
> > > > previously. Breaking cpusets effectively.
> > > 
> > > Right, and so does exec'ing from a 64 bit task into 32 bit executable
> > > from within a 64 bit-only cpuset :( . And there is nothing we can really
> > 
> > True. The kernel can decide to kill the task or force detach it then, no?
> > Sending SIGKILL makes more sense.
> 
> Yeah but again, we need this to work for existing apps. Just killing it
> basically means we have no support, so that doesn't work for the use
> case :/

I don't get you. Unless what you're saying is you want to support this without
userspace having to do anything to opt-in, then this statement is hard to
digest for me. Existing apps will work. OEMs and vendors who want to enable
this feature must configure their userspace to deal with this correctly. This
includes passing the cmdline option, and parsing the cpumask in
/sys/device/system/cpu/aarch32_el0 to ensure their cpusets contains at least
one cpu from this list.

> 
> > > do about it, we cannot fail the exec because we need this to work for
> > > existing apps, and there is no way the Android framework can know
> > > upfront.
> > 
> > It knows upfront it has enabled asym aarch32. So it needs to make sure not to
> > create 'invalid' cpusets?
> 
> Problem is, we _really_ don't want to keep a big CPU in the background

I don't get you here too.

AFAIU, OEMs have to define their cpusets. So it makes sense to me for them to
define it correctly if they want to enable asym aarch32.

Systems that don't care about this feature shouldn't be affected. If they do,
then I'm missing something.

> cpuset just for that. And even if we did, we'd have to deal with hotplug.

We deal with hotplug by not allowing one of the aarch32 cpus from going
offline.

> 
> > > 
> > > So the only thing we can do really is WARN() and proceed to ignore the
> > > cpuset, which is what this series does :/. It's not exactly pretty but I
> > > don't think we can do much better than that TBH, and it's the same thing
> > > for the example you brought up. Failing cpuset_can_attach() will not
> > > help, we can only WARN and proceed ...
> > 
> > I think for cases where we can prevent userspace from doing something wrong, we
> > should. Like trying to attach to a cpuset that will result in an empty mask.
> > FWIW, it does something similar with deadline tasks. See task_can_attach().
> > 
> > Similarly for the case when userspace tries to modify the cpuset.cpus such that
> > a task will end up with empty cpumask. We now have the new case that some tasks
> > can only run on a subset of cpu_possible_mask. So the definition of empty
> > cpumask has gained an extra meaning.
> 
> I see this differently, e.g. if you affine a task to a CPU and you
> hotunplug it, then the kernel falls back to the remaining online CPUs
> for that task. Not pretty, but it keeps things functional. I'm thinking
> a similar kind of support would be good enough here.

Sorry I can't see it this way :(

The comparison is fundamentally different. Being able to attach to a cpuset, or
modify its cpus is different than short circuiting a cpu out of the system.

For hotplug we have to make sure a single cpu stays alive. The fallback you're
talking about should still work the same if the task is not attached to
a cpuset. Just it has to take the intersection with the
arch_task_cpu_possible_cpu() into account.

For cpusets, if hotunplug results in an empty cpuset, then all tasks are moved
to the nearest ancestor if I read the code correctly. In our case, only 32bit
tasks have to move out to retain this behavior. Since now for the first time we
have tasks that can't run on all cpus.

Which by the way might be the right behavior for 64bit tasks execing 32bit
binary in a 64bit only cpuset. I suggested SIGKILL'ing them but maybe moving
them to the nearest ancestor too is more aligned with the behavior above.

> 
> But yes, this cpuset mess is the part of the series I hate most too.
> It's just not clear we have better solutions :/
> 
> > > 
> > > Now, Android should be fine with that I think. We only need the kernel
> > > to implement a safe fallback mechanism when userspace gives
> > > contradictory commands, because we know there are edge cases userspace
> > > _cannot_ deal with correctly, but this fallback doesn't need to be
> > > highly optimized (at least for Android), but I'm happy to hear what
> > > others think.
> > 
> > Why not go with our original patch that fixes affinity then in the arch code if
> > the task wakes up on the wrong cpu? It is much simpler approach IMO to achieve
> > the same thing.
> 
> I personally had no issues with that patch, but as per Peter's original
> reply, that's "not going to happen". Will's proposal seems to go one
> step further and tries its best to honor the contract with userspace (by

The only difference I see is that this series blocks sched_setaffinity(). Other
than that we fix up the task affinity in a similar way, just in different
places.

If the intersection was empty, we sent a SIGKILL; making it fallback to
something sensible is dead easy. I opted for SIGKILL because I didn't expect
Peter to agree with the fallback. But if we're arguing it's okay now, then it
should be okay in that patch too then.

> keeping the subset of the affinity mask, ...) when that can be done, so
> if that can be acceptable, then be it. But I'd still rather keep this
> simple if at all possible. It's just my opinion though :)

The way I see it, this series adds support to handle generic asym ISA. Asym
AArch32 will be the first and only user for now, but it's paving the way for
others.

It's up to the maintainers to decide, but IMHO if we'll do this via enabling
scheduler to deal with generic asym ISA support, we should deal with these
corner cases. Otherwise a specific solution for the current problem at hand in
the arch code makes more sense to me.

It all depends how much the community wants/is willing to cater for these
systems in the future too.

Thanks

--
Qais Yousef
Quentin Perret Dec. 1, 2020, 3:56 p.m. UTC | #7
On Tuesday 01 Dec 2020 at 14:11:21 (+0000), Qais Yousef wrote:
> AFAIU, OEMs have to define their cpusets. So it makes sense to me for them to
> define it correctly if they want to enable asym aarch32.
> 
> Systems that don't care about this feature shouldn't be affected. If they do,
> then I'm missing something.

Right, but there are 2 cases for 32 bit tasks in Android:

  1. 32 bit apps; these are not an issue, the Android framework knows
     about them and it's fine to expect it to setup cpusets accordingly
     IMO.

  2. 64 bit apps that also happen to have a 32 bit binary payload, and
     exec into it. The Android framework has no visibility over that,
     all it sees is a 64 bit app. Sadly we can't detect this stupid
     pattern, but we need these to remain somewhat functional.

I was only talking about 2. the whole time, sorry if that wasn't clear.
With that said, see below for the discussion about cpuset/hotplug.

> We deal with hotplug by not allowing one of the aarch32 cpus from going
> offline.

Sure, but that would only work if we have that 32 bit CPU present in
_all_ cpusets, no? What I'd like to avoid is to keep a (big) 32
bit CPU in the background cpuset of 64 bit tasks. That would make that
big CPU available to _all_ 64 bit apps in the background, whether they
need 32 bit support or not, because again we cannot distinguish them.
And yeah, I expect this to be not go down well in practice.


So, if we're going to support this, a requirement for Android is that
some cpusets will be 64 bit only, and it's possible that we'll exec into
32 bit from within these cpusets. It's an edge case, we don't really
want to optimize for it, but it needs to not fall apart completely.
I'm not fundamentally against doing smarter things at all, I'm saying we
(Android) just don't _need_ smarter things ATM, so we may want to keep
it simple.

My point in the previous message is, if we're accepting this for exec,
a logical next step could be to accept it for cpuset migrations too.
Failing the cgroup migration is hard since: there is no guarantee the
source cpuset has 32 bit CPUs anyway (assuming the exec'd task is kept
in the same cpuset), so why bother; userspace just doesn't know there
are 32 bit tasks in an app and would keep trying to migrate it to 64 bit
cpuset over and over again; you could end up with apps being stuck
halfway through a top-app->background transition where some tasks have
migrated but not others, ...

It's a bit of a mess :/


<snip>
> For hotplug we have to make sure a single cpu stays alive. The fallback you're
> talking about should still work the same if the task is not attached to
> a cpuset. Just it has to take the intersection with the
> arch_task_cpu_possible_cpu() into account.

Yep, agreed, there's probably room for improvement there.

> For cpusets, if hotunplug results in an empty cpuset, then all tasks are moved
> to the nearest ancestor if I read the code correctly. In our case, only 32bit
> tasks have to move out to retain this behavior. Since now for the first time we
> have tasks that can't run on all cpus.
> 
> Which by the way might be the right behavior for 64bit tasks execing 32bit
> binary in a 64bit only cpuset. I suggested SIGKILL'ing them but maybe moving
> them to the nearest ancestor too is more aligned with the behavior above.

Hmm, I guess that means putting all 32-bit-execd-from-64-bit tasks in
the root group in Android. I'll try and check the implications, but that
might be just fine... Sounds like a sensible behaviour to me anyways.

Thanks,
Quentin
Will Deacon Dec. 1, 2020, 10:30 p.m. UTC | #8
On Tue, Dec 01, 2020 at 03:56:49PM +0000, Quentin Perret wrote:
> On Tuesday 01 Dec 2020 at 14:11:21 (+0000), Qais Yousef wrote:
> > For cpusets, if hotunplug results in an empty cpuset, then all tasks are moved
> > to the nearest ancestor if I read the code correctly. In our case, only 32bit
> > tasks have to move out to retain this behavior. Since now for the first time we
> > have tasks that can't run on all cpus.
> > 
> > Which by the way might be the right behavior for 64bit tasks execing 32bit
> > binary in a 64bit only cpuset. I suggested SIGKILL'ing them but maybe moving
> > them to the nearest ancestor too is more aligned with the behavior above.
> 
> Hmm, I guess that means putting all 32-bit-execd-from-64-bit tasks in
> the root group in Android. I'll try and check the implications, but that
> might be just fine... Sounds like a sensible behaviour to me anyways.

I'll look into this -- anything we can do to avoid forcefully resetting the
affinity mask to the arch_task_cpu_possible_mask() is worth considering.

Will
Qais Yousef Dec. 2, 2020, 11:33 a.m. UTC | #9
On 12/01/20 15:56, Quentin Perret wrote:
> On Tuesday 01 Dec 2020 at 14:11:21 (+0000), Qais Yousef wrote:
> > AFAIU, OEMs have to define their cpusets. So it makes sense to me for them to
> > define it correctly if they want to enable asym aarch32.
> > 
> > Systems that don't care about this feature shouldn't be affected. If they do,
> > then I'm missing something.
> 
> Right, but there are 2 cases for 32 bit tasks in Android:
> 
>   1. 32 bit apps; these are not an issue, the Android framework knows
>      about them and it's fine to expect it to setup cpusets accordingly
>      IMO.
> 
>   2. 64 bit apps that also happen to have a 32 bit binary payload, and
>      exec into it. The Android framework has no visibility over that,
>      all it sees is a 64 bit app. Sadly we can't detect this stupid
>      pattern, but we need these to remain somewhat functional.
> 
> I was only talking about 2. the whole time, sorry if that wasn't clear.
> With that said, see below for the discussion about cpuset/hotplug.

Yep, I was referring to 2 too. I found out about the app that embeds the 32 bit
binary, it was our major concern if we go with user space managing affinities.

> 
> > We deal with hotplug by not allowing one of the aarch32 cpus from going
> > offline.
> 
> Sure, but that would only work if we have that 32 bit CPU present in
> _all_ cpusets, no? What I'd like to avoid is to keep a (big) 32
> bit CPU in the background cpuset of 64 bit tasks. That would make that
> big CPU available to _all_ 64 bit apps in the background, whether they
> need 32 bit support or not, because again we cannot distinguish them.
> And yeah, I expect this to be not go down well in practice.
> 
> 
> So, if we're going to support this, a requirement for Android is that
> some cpusets will be 64 bit only, and it's possible that we'll exec into
> 32 bit from within these cpusets. It's an edge case, we don't really
> want to optimize for it, but it needs to not fall apart completely.
> I'm not fundamentally against doing smarter things at all, I'm saying we
> (Android) just don't _need_ smarter things ATM, so we may want to keep
> it simple.

Fair enough. But in that case I find it neater to fix the affinities up in the
arch code as a specific solution. I'm not seeing there's a difference in the
end results between the two implementations if we don't address these issues
:(

> 
> My point in the previous message is, if we're accepting this for exec,
> a logical next step could be to accept it for cpuset migrations too.
> Failing the cgroup migration is hard since: there is no guarantee the
> source cpuset has 32 bit CPUs anyway (assuming the exec'd task is kept
> in the same cpuset), so why bother; userspace just doesn't know there
> are 32 bit tasks in an app and would keep trying to migrate it to 64 bit
> cpuset over and over again; you could end up with apps being stuck
> halfway through a top-app->background transition where some tasks have
> migrated but not others, ...
> 
> It's a bit of a mess :/

It is. I think I addressed these concerns in other parts from my previous
email. Judging by your reply below I think you see what I was talking about and
we're more on the same page now :-)

> 
> 
> <snip>
> > For hotplug we have to make sure a single cpu stays alive. The fallback you're
> > talking about should still work the same if the task is not attached to
> > a cpuset. Just it has to take the intersection with the
> > arch_task_cpu_possible_cpu() into account.
> 
> Yep, agreed, there's probably room for improvement there.
> 
> > For cpusets, if hotunplug results in an empty cpuset, then all tasks are moved
> > to the nearest ancestor if I read the code correctly. In our case, only 32bit
> > tasks have to move out to retain this behavior. Since now for the first time we
> > have tasks that can't run on all cpus.
> > 
> > Which by the way might be the right behavior for 64bit tasks execing 32bit
> > binary in a 64bit only cpuset. I suggested SIGKILL'ing them but maybe moving
> > them to the nearest ancestor too is more aligned with the behavior above.
> 
> Hmm, I guess that means putting all 32-bit-execd-from-64-bit tasks in
> the root group in Android. I'll try and check the implications, but that
> might be just fine... Sounds like a sensible behaviour to me anyways.

It'd be only the compat tasks that will have to move to root group. And only
for those minority of apps that embed a 32bit binary. I think the impact is
minimum. And I think the behavior makes sense generically.

Thanks!

--
Qais Yousef
Qais Yousef Dec. 2, 2020, 11:34 a.m. UTC | #10
On 12/01/20 22:30, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 03:56:49PM +0000, Quentin Perret wrote:
> > On Tuesday 01 Dec 2020 at 14:11:21 (+0000), Qais Yousef wrote:
> > > For cpusets, if hotunplug results in an empty cpuset, then all tasks are moved
> > > to the nearest ancestor if I read the code correctly. In our case, only 32bit
> > > tasks have to move out to retain this behavior. Since now for the first time we
> > > have tasks that can't run on all cpus.
> > > 
> > > Which by the way might be the right behavior for 64bit tasks execing 32bit
> > > binary in a 64bit only cpuset. I suggested SIGKILL'ing them but maybe moving
> > > them to the nearest ancestor too is more aligned with the behavior above.
> > 
> > Hmm, I guess that means putting all 32-bit-execd-from-64-bit tasks in
> > the root group in Android. I'll try and check the implications, but that
> > might be just fine... Sounds like a sensible behaviour to me anyways.
> 
> I'll look into this -- anything we can do to avoid forcefully resetting the
> affinity mask to the arch_task_cpu_possible_mask() is worth considering.

Happy to lend a hand, just let me know.

Thanks

--
Qais Yousef
diff mbox series

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d0a5fd..e970737c3ed2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3299,9 +3299,11 @@  void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
+	if (!is_in_v2_mode())
+		return; /* select_fallback_rq will try harder */
+
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
-		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
+	do_set_cpus_allowed(tsk, task_cs(tsk)->cpus_allowed);
 	rcu_read_unlock();
 
 	/*