[v5,00/15] An alternative series for asymmetric AArch32 systems

Message-ID: <20201208132835.6151-1-will@kernel.org>
From: Will Deacon <will@kernel.org>
To: linux-arm-kernel@lists.infradead.org
Cc: linux-arch@vger.kernel.org, Marc Zyngier <maz@kernel.org>,
    kernel-team@android.com, Vincent Guittot <vincent.guittot@linaro.org>,
    Juri Lelli <juri.lelli@redhat.com>, Quentin Perret <qperret@google.com>,
    Peter Zijlstra <peterz@infradead.org>,
    Catalin Marinas <catalin.marinas@arm.com>,
    Johannes Weiner <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org,
    Qais Yousef <qais.yousef@arm.com>, Suren Baghdasaryan <surenb@google.com>,
    Ingo Molnar <mingo@redhat.com>, Li Zefan <lizefan@huawei.com>,
    Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Tejun Heo <tj@kernel.org>,
    Will Deacon <will@kernel.org>, Morten Rasmussen <morten.rasmussen@arm.com>
Subject: [PATCH v5 00/15] An alternative series for asymmetric AArch32 systems
Date: Tue, 8 Dec 2020 13:28:20 +0000

Will Deacon Dec. 8, 2020, 1:28 p.m. UTC

Hi all,

Christmas has come early: it's time for version five of these patches
which have previously appeared here:

  v1: https://lore.kernel.org/r/20201027215118.27003-1-will@kernel.org
  v2: https://lore.kernel.org/r/20201109213023.15092-1-will@kernel.org
  v3: https://lore.kernel.org/r/20201113093720.21106-1-will@kernel.org
  v4: https://lore.kernel.org/r/20201124155039.13804-1-will@kernel.org

and which started life as a reimplementation of some patches from Qais:

  https://lore.kernel.org/r/20201021104611.2744565-1-qais.yousef@arm.com

There's also now a nice writeup on LWN:

  https://lwn.net/Articles/838339/

and rumours of a feature film are doing the rounds.

[subscriber-only, but if you're reading this then you should really
 subscribe.]

The aim of this series is to allow 32-bit ARM applications to run on
arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
Unfortunately, such SoCs are real and will continue to be productised
over the next few years at least. I can assure you that I'm not just
doing this for fun.

Changes in v5 include:

  * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
    we can avoid returning incompatible CPUs for a given task. This
    means that sched_setaffinity() can be used with larger masks (like
    the online mask) from userspace and also allows us to take into
    account the cpuset hierarchy when forcefully overriding the affinity
    for a task on execve().

  * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
    so that the resulting affinity mask does not contain any incompatible
    CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).

  * Moved overriding of the affinity mask into the scheduler core rather
    than munge affinity masks directly in the architecture backend.

  * Extended comments and documentation.

  * Some renaming and cosmetic changes.
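
A rough way to picture the first two items above is as plain bitmask arithmetic. The sketch below is a toy model only: masks are Python ints (bit n set means CPU n), and the two functions merely borrow the names cpuset_cpus_allowed() and sched_setaffinity() for illustration. Their parameters and the ValueError are assumptions, not the kernel's actual signatures or error handling.

```python
# Toy model: CPU masks are ints, bit n set means CPU n is in the mask.
# Function names mirror the thread for readability; signatures are hypothetical.

def cpuset_cpus_allowed(cpuset_mask: int, possible_mask: int) -> int:
    """Limit the cpuset's CPUs to those the task is able to run on."""
    return cpuset_mask & possible_mask


def sched_setaffinity(requested: int, cpuset_mask: int, possible_mask: int) -> int:
    """Narrow a userspace request with the limiting mask, then validate it."""
    new_mask = requested & cpuset_cpus_allowed(cpuset_mask, possible_mask)
    if new_mask == 0:
        # Assumption: an empty result is reported as an error to the caller.
        raise ValueError("no compatible CPU left in the requested mask")
    return new_mask
```

With this model, a 32-bit task (possible mask 0x0f) asking for all eight CPUs (0xff) inside a cpuset of 0x3c ends up with 0x0c rather than having the request rejected outright.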

I'm pretty happy with this now, although it still needs review and will
require rebasing to play nicely with the SCA changes in -next.

Cheers,

Will

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: kernel-team@android.com

--->8


Will Deacon (15):
  arm64: cpuinfo: Split AArch32 registers out into a separate struct
  arm64: Allow mismatched 32-bit EL0 support
  KVM: arm64: Kill 32-bit vCPUs on systems with mismatched EL0 support
  arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs
  arm64: Advertise CPUs capable of running 32-bit applications in sysfs
  sched: Introduce task_cpu_possible_mask() to limit fallback rq
    selection
  cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
  sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  sched: Introduce force_compatible_cpus_allowed_ptr() to limit CPU
    affinity
  arm64: Implement task_cpu_possible_mask()
  arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit
    EL0
  arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched
    system
  arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0
  arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores

 .../ABI/testing/sysfs-devices-system-cpu      |   9 +
 .../admin-guide/kernel-parameters.txt         |   8 +
 arch/arm64/include/asm/cpu.h                  |  44 ++--
 arch/arm64/include/asm/cpucaps.h              |   3 +-
 arch/arm64/include/asm/cpufeature.h           |   8 +-
 arch/arm64/include/asm/mmu_context.h          |  13 ++
 arch/arm64/kernel/cpufeature.c                | 219 ++++++++++++++----
 arch/arm64/kernel/cpuinfo.c                   |  53 +++--
 arch/arm64/kernel/process.c                   |  19 +-
 arch/arm64/kvm/arm.c                          |  11 +-
 include/linux/cpuset.h                        |   3 +-
 include/linux/mmu_context.h                   |   8 +
 include/linux/sched.h                         |   1 +
 kernel/cgroup/cpuset.c                        |  39 ++--
 kernel/sched/core.c                           | 112 +++++++--
 15 files changed, 426 insertions(+), 124 deletions(-)

Peter Zijlstra Dec. 15, 2020, 5:36 p.m. UTC | #1

On Tue, Dec 08, 2020 at 01:28:20PM +0000, Will Deacon wrote:
> The aim of this series is to allow 32-bit ARM applications to run on
> arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> Unfortunately, such SoCs are real and will continue to be productised
> over the next few years at least. I can assure you that I'm not just
> doing this for fun.
> 
> Changes in v5 include:
> 
>   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
>     we can avoid returning incompatible CPUs for a given task. This
>     means that sched_setaffinity() can be used with larger masks (like
>     the online mask) from userspace and also allows us to take into
>     account the cpuset hierarchy when forcefully overriding the affinity
>     for a task on execve().
> 
>   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
>     so that the resulting affinity mask does not contain any incompatible
>     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> 
>   * Moved overriding of the affinity mask into the scheduler core rather
>     than munge affinity masks directly in the architecture backend.

Hurmph... so if I can still read, this thing will auto truncate the
affinity mask to something that only contains compatible CPUs, right?

Assuming our system has 8 CPUs (0xFF), half of which are 32bit capable
(0x0F), then, when our native task (with affinity 0x3c) does a
fork()+execve() of a 32bit thingy the resulting task has 0x0c.

If that in turn does fork()+execve() of a native task, it will retain
the truncated affinity mask (0x0c), instead of returning to the wider
mask (0x3c).

IOW, any (accidental or otherwise) trip through a 32bit helper, will
destroy user state (the affinity mask: 0x3c).
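
The arithmetic in this example can be replayed with a few lines of Python. This is only a model of the behaviour being described, with masks as ints; affinity_after_execve() is a made-up helper name and the 0xff/0x0f system layout is the one assumed in the example above.

```python
ALL_CPUS = 0xFF      # 8 CPUs in the example system
CPUS_32BIT = 0x0F    # the half that can run 32-bit code


def affinity_after_execve(current: int, is_32bit: bool) -> int:
    """Model of the v5 behaviour: the stored mask itself is truncated on exec."""
    possible = CPUS_32BIT if is_32bit else ALL_CPUS
    return current & possible


parent = 0x3C                                   # native task
helper = affinity_after_execve(parent, True)    # 0x3c & 0x0f == 0x0c
child = affinity_after_execve(helper, False)    # 0x0c & 0xff == 0x0c, not 0x3c
print(hex(helper), hex(child))
```

The second exec has nothing to widen back to, because the only copy of the mask was already narrowed by the first one.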

Should we perhaps split task_struct::cpus_mask, one to keep an original
copy of the user state, and one to be an effective cpumask for the task?
That way, the moment a task constricts or widens its
task_cpu_possible_mask() we can re-compute the effective mask without
loss of information.
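
One way to read this suggestion, sketched in Python rather than kernel C: keep the user-requested mask untouched and derive the effective mask on demand. The Task class, its field names and the execve() method are purely illustrative assumptions, not a proposed layout for task_struct.

```python
from dataclasses import dataclass


@dataclass
class Task:
    user_mask: int       # what userspace asked for; never truncated
    possible_mask: int   # stand-in for task_cpu_possible_mask()

    @property
    def effective_mask(self) -> int:
        """Recomputed from the two inputs, so nothing is lost on narrowing."""
        return self.user_mask & self.possible_mask

    def execve(self, new_possible_mask: int) -> None:
        """Changing ISA only changes the limiting mask, not the user state."""
        self.possible_mask = new_possible_mask
```

In this model a task with user mask 0x3c that execs a 32-bit binary sees an effective mask of 0x0c, and a later exec of a native binary sees 0x3c again.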

Will Deacon Dec. 15, 2020, 6:50 p.m. UTC | #2

Hi Peter,

Cheers for taking a look.

On Tue, Dec 15, 2020 at 06:36:45PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 08, 2020 at 01:28:20PM +0000, Will Deacon wrote:
> > The aim of this series is to allow 32-bit ARM applications to run on
> > arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> > Unfortunately, such SoCs are real and will continue to be productised
> > over the next few years at least. I can assure you that I'm not just
> > doing this for fun.
> > 
> > Changes in v5 include:
> > 
> >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> >     we can avoid returning incompatible CPUs for a given task. This
> >     means that sched_setaffinity() can be used with larger masks (like
> >     the online mask) from userspace and also allows us to take into
> >     account the cpuset hierarchy when forcefully overriding the affinity
> >     for a task on execve().
> > 
> >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> >     so that the resulting affinity mask does not contain any incompatible
> >     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> > 
> >   * Moved overriding of the affinity mask into the scheduler core rather
> >     than munge affinity masks directly in the architecture backend.
> 
> Hurmph... so if I can still read, this thing will auto truncate the
> affinity mask to something that only contains compatible CPUs, right?
> 
> Assuming our system has 8 CPUs (0xFF), half of which are 32bit capable
> (0x0F), then, when our native task (with affinity 0x3c) does a
> fork()+execve() of a 32bit thingy the resulting task has 0x0c.
> 
> If that in turn does fork()+execve() of a native task, it will retain
> the truncated affinity mask (0x0c), instead of returning to the wider
> mask (0x3c).
> 
> IOW, any (accidental or otherwise) trip through a 32bit helper, will
> destroy user state (the affinity mask: 0x3c).

Yes, that's correct, and I agree that it's a rough edge. If you're happy
with the idea of adding an extra mask to make this work, then I can start
hacking that up (although I doubt I'll get something out before the new
year at this point).

> Should we perhaps split task_struct::cpus_mask, one to keep an original
> copy of the user state, and one to be an effective cpumask for the task?
> That way, the moment a task constricts or widens its
> task_cpu_possible_mask() we can re-compute the effective mask without
> loss of information.

Hmm, we might already have most of the pieces in place for this (modulo
the extra field), since cpuset_cpus_allowed() now provides the limiting
mask, so this might be relatively straightforward.

Famous last words...

Will

Qais Yousef Dec. 16, 2020, 11:16 a.m. UTC | #3

Hi Will

On 12/08/20 13:28, Will Deacon wrote:
> Hi all,
> 
> Christmas has come early: it's time for version five of these patches
> which have previously appeared here:
> 
>   v1: https://lore.kernel.org/r/20201027215118.27003-1-will@kernel.org
>   v2: https://lore.kernel.org/r/20201109213023.15092-1-will@kernel.org
>   v3: https://lore.kernel.org/r/20201113093720.21106-1-will@kernel.org
>   v4: https://lore.kernel.org/r/20201124155039.13804-1-will@kernel.org
> 
> and which started life as a reimplementation of some patches from Qais:
> 
>   https://lore.kernel.org/r/20201021104611.2744565-1-qais.yousef@arm.com
> 
> There's also now a nice writeup on LWN:
> 
>   https://lwn.net/Articles/838339/
> 
> and rumours of a feature film are doing the rounds.
> 
> [subscriber-only, but if you're reading this then you should really
>  subscribe.]
> 
> The aim of this series is to allow 32-bit ARM applications to run on
> arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> Unfortunately, such SoCs are real and will continue to be productised
> over the next few years at least. I can assure you that I'm not just
> doing this for fun.
> 
> Changes in v5 include:
> 
>   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
>     we can avoid returning incompatible CPUs for a given task. This
>     means that sched_setaffinity() can be used with larger masks (like
>     the online mask) from userspace and also allows us to take into
>     account the cpuset hierarchy when forcefully overriding the affinity
>     for a task on execve().
> 
>   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
>     so that the resulting affinity mask does not contain any incompatible
>     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> 
>   * Moved overriding of the affinity mask into the scheduler core rather
>     than munge affinity masks directly in the architecture backend.
> 
>   * Extended comments and documentation.
> 
>   * Some renaming and cosmetic changes.
> 
> I'm pretty happy with this now, although it still needs review and will
> require rebasing to play nicely with the SCA changes in -next.

I still have concerns about the cpuset v1 handling. Specifically:

	1. Attaching a 32bit task to 64bit only cpuset is allowed.

	   I think the right behavior here is to prevent that as the
	   intersection will appear as offline cpus for the 32bit tasks. So it
	   shouldn't be allowed to move there.

	2. Modifying cpuset.cpus could result in an empty set for 32bit tasks.

	   It is a variation of the above, it's just the cpuset transforms into
	   64bit only after we attach.

	   I think the right behavior here is to move the 32bit tasks to the
	   nearest ancestor like we do when all cpuset.cpus are hotplugged out.

	   We could also return an error if the new set would result in an
	   empty set for the 32bit tasks, in a similar manner to how it fails
	   if you write a cpu that is offline.

	3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit
	   binary, the 32bit task will inherit the cgroup setting.

	   Like above, we should move this to the nearest ancestor.
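
As a sketch of what "move to the nearest ancestor" would mean for such a task, the following Python models a cpuset hierarchy and walks upwards until some CPU is compatible. The Cpuset class and nearest_compatible_ancestor() are hypothetical names for illustration; the real cpuset code is structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cpuset:
    name: str
    cpus: int                          # CPU mask as an int, bit n == CPU n
    parent: Optional["Cpuset"] = None  # None for the root cpuset


def nearest_compatible_ancestor(cs: Cpuset, possible_mask: int) -> Cpuset:
    """Walk up until the cpuset shares at least one CPU with possible_mask.

    Stops at the root regardless, mirroring the idea that the root cpuset
    is the last resort.
    """
    while cs.parent is not None and (cs.cpus & possible_mask) == 0:
        cs = cs.parent
    return cs
```

For a 32-bit task (possible mask 0x0f) sitting in a cpuset of 0xc0 whose parent is 0xf0 and whose grandparent is the root (0xff), the walk ends at the root.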

I was worried about the case where, in a hierarchy, the parent cpuset.cpus is
modified such that the children no longer have a valid cpu for 32bit tasks. But
I checked for v1 and this isn't a problem. You'll get an error if you try to
change it in a way that ends up with an empty cpuset.

I played with v2, and the model allows tasks to remain attached even if cpus
are hotplugged, or cpuset.cpus is modified in such a way we end up with an
empty cpuset. So I think breaking the affinity of the cpuset for v2 is okay.

To simplify the problem for v1, we could say that asym ISA tasks can only live
in the root cpuset for v1. This will simplify the solution too since we will
only need to ensure that these tasks are moved to the root group on exec and
block any future move to anything else. Of course this dictates that such
systems must use cpuset v2 if they care. Not a terrible restriction IMO.

I hacked a patch to fix the exec scenario and it was easy to do. I just need to
stop clone3 (cgroup_post_fork()) and task_can_attach() from allowing these
tasks to move anywhere else.

Thanks

--
Qais Yousef

Will Deacon Dec. 16, 2020, 2:14 p.m. UTC | #4

Hi Qais,

On Wed, Dec 16, 2020 at 11:16:46AM +0000, Qais Yousef wrote:
> On 12/08/20 13:28, Will Deacon wrote:
> > Changes in v5 include:
> > 
> >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> >     we can avoid returning incompatible CPUs for a given task. This
> >     means that sched_setaffinity() can be used with larger masks (like
> >     the online mask) from userspace and also allows us to take into
> >     account the cpuset hierarchy when forcefully overriding the affinity
> >     for a task on execve().
> > 
> >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> >     so that the resulting affinity mask does not contain any incompatible
> >     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> > 
> >   * Moved overriding of the affinity mask into the scheduler core rather
> >     than munge affinity masks directly in the architecture backend.
> > 
> >   * Extended comments and documentation.
> > 
> >   * Some renaming and cosmetic changes.
> > 
> > I'm pretty happy with this now, although it still needs review and will
> > require rebasing to play nicely with the SCA changes in -next.
> 
> I still have concerns about the cpuset v1 handling. Specifically:
> 
> 	1. Attaching a 32bit task to 64bit only cpuset is allowed.
> 
> 	   I think the right behavior here is to prevent that as the
> 	   intersection will appear as offline cpus for the 32bit tasks. So it
> 	   shouldn't be allowed to move there.

Suren or Quentin can correct me if I'm wrong here, but I think Android
relies on this working so it's not an option for us to prevent the attach.
I also don't think it really achieves much, since as you point out, the same
problem exists in other cases such as execve() of a 32-bit binary, or
hotplugging off all 32-bit CPUs within a mixed cpuset. Allowing the attach
and immediately reparenting would probably be better, but see below.

> 	2. Modifying cpuset.cpus could result in an empty set for 32bit tasks.
> 
> 	   It is a variation of the above, it's just the cpuset transforms into
> 	   64bit only after we attach.
> 
> 	   I think the right behavior here is to move the 32bit tasks to the
> 	   nearest ancestor like we do when all cpuset.cpus are hotplugged out.
> 
> 	   We could also return an error if the new set would result in an
> 	   empty set for the 32bit tasks, in a similar manner to how it fails
> 	   if you write a cpu that is offline.
> 
> 	3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit
> 	   binary, the 32bit task will inherit the cgroup setting.
> 
> 	   Like above, we should move this to the nearest ancestor.

I considered this when I was writing the patches, but the reality is that
by allowing 32-bit tasks to attach to a 64-bit only cpuset (which is required
by Android), we have no choice but to expose a new ABI to userspace. This is
all gated behind a command-line option, so I think that's fine, but then why
not just have the same behaviour as cgroup v2? I don't see the point in
creating two new ABIs (for cgroup v1 and v2 respectively) if we don't need
to. If it was _identical_ to the hotplug case, then we would surely just
follow the existing behaviour, but it's really quite different in this
situation because the cpuset is not empty.

One thing we should definitely do though is add this to the documentation
for the command-line option.

> To simplify the problem for v1, we could say that asym ISA tasks can only live
> in the root cpuset for v1. This will simplify the solution too since we will
> only need to ensure that these tasks are moved to the root group on exec and
> block any future move to anything else. Of course this dictates that such
> systems must use cpuset v2 if they care. Not a terrible restriction IMO.

Sadly, I think Android is still on cgroup v1 for cpuset, but Suren will know
better the status of cgroup v2 for cpusets. If it's just around the corner,
then maybe we could simplify things here. Suren?

Will

Qais Yousef Dec. 16, 2020, 4:48 p.m. UTC | #5

On 12/16/20 14:14, Will Deacon wrote:
> Hi Qais,
> 
> On Wed, Dec 16, 2020 at 11:16:46AM +0000, Qais Yousef wrote:
> > On 12/08/20 13:28, Will Deacon wrote:
> > > Changes in v5 include:
> > > 
> > >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> > >     we can avoid returning incompatible CPUs for a given task. This
> > >     means that sched_setaffinity() can be used with larger masks (like
> > >     the online mask) from userspace and also allows us to take into
> > >     account the cpuset hierarchy when forcefully overriding the affinity
> > >     for a task on execve().
> > > 
> > >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> > >     so that the resulting affinity mask does not contain any incompatible
> > >     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> > > 
> > >   * Moved overriding of the affinity mask into the scheduler core rather
> > >     than munge affinity masks directly in the architecture backend.
> > > 
> > >   * Extended comments and documentation.
> > > 
> > >   * Some renaming and cosmetic changes.
> > > 
> > > I'm pretty happy with this now, although it still needs review and will
> > > require rebasing to play nicely with the SCA changes in -next.
> > 
> > I still have concerns about the cpuset v1 handling. Specifically:
> > 
> > 	1. Attaching a 32bit task to 64bit only cpuset is allowed.
> > 
> > 	   I think the right behavior here is to prevent that as the
> > 	   intersection will appear as offline cpus for the 32bit tasks. So it
> > 	   shouldn't be allowed to move there.
> 
> Suren or Quentin can correct me if I'm wrong here, but I think Android
> relies on this working so it's not an option for us to prevent the attach.

I don't think so. It's just a matter of who handles the error, i.e. either the
kernel fixes it up silently and effectively makes the cpuset a NOP since we
don't respect the affinity of the cpuset, or user space picks the next best
thing. Since this could return an error anyway, user space likely already
handles this.

> I also don't think it really achieves much, since as you point out, the same
> problem exists in other cases such as execve() of a 32-bit binary, or
> hotplugging off all 32-bit CPUs within a mixed cpuset. Allowing the attach
> and immediately reparenting would probably be better, but see below.

I am just wary that we're introducing generic asymmetric ISA support, so my
concerns have been related to making sure the behavior is sane generally. When
this gets merged, I can bet more 'fun' hardware will appear all over the place.
We're opening the flood gates I'm afraid :p

> > 	2. Modifying cpuset.cpus could result in an empty set for 32bit tasks.
> > 
> > 	   It is a variation of the above, it's just the cpuset transforms into
> > 	   64bit only after we attach.
> > 
> > 	   I think the right behavior here is to move the 32bit tasks to the
> > 	   nearest ancestor like we do when all cpuset.cpus are hotplugged out.
> > 
> > 	   We could also return an error if the new set would result in an
> > 	   empty set for the 32bit tasks, in a similar manner to how it fails
> > 	   if you write a cpu that is offline.
> > 
> > 	3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit
> > 	   binary, the 32bit task will inherit the cgroup setting.
> > 
> > 	   Like above, we should move this to the nearest ancestor.
> 
> I considered this when I was writing the patches, but the reality is that
> by allowing 32-bit tasks to attach to a 64-bit only cpuset (which is required
> by Android), we have no choice but to expose a new ABI to userspace. This is
> all gated behind a command-line option, so I think that's fine, but then why
> not just have the same behaviour as cgroup v2? I don't see the point in
> creating two new ABIs (for cgroup v1 and v2 respectively) if we don't need

Ultimately it's up to Tejun and Peter I guess. I thought we need to preserve
the v1 behavior for the new class of tasks. I won't object to the new ABI
myself. Maybe we just need to make the commit messages and cgroup-v1
documentation reflect that explicitly.

> to. If it was _identical_ to the hotplug case, then we would surely just
> follow the existing behaviour, but it's really quite different in this
> situation because the cpuset is not empty.

It is actually effectively empty for those tasks. But I see that one could look
at it from two different angles.
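
The two angles can be made concrete with a pair of predicates. This is a toy Python model with masks as ints; both function names are invented for illustration and do not correspond to kernel helpers.

```python
def cpuset_is_empty(cpuset_cpus: int, online_mask: int) -> bool:
    """The hotplug case: no online CPU is left in the cpuset for anyone."""
    return (cpuset_cpus & online_mask) == 0


def cpuset_is_empty_for_task(cpuset_cpus: int, online_mask: int,
                             possible_mask: int) -> bool:
    """The asymmetric case: CPUs remain, but none this task can run on."""
    return (cpuset_cpus & online_mask & possible_mask) == 0
```

A cpuset of 0xf0 on a fully online 8-CPU system is not empty by the first test, yet it is empty for a 32-bit task whose possible mask is 0x0f.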

> One thing we should definitely do though is add this to the documentation
> for the command-line option.

+1

By the way, should the command-line option be renamed to something more
generic? This has already grown beyond just enabling the support for one
isolated case. No strong opinion, just a suggestion.

Thanks

--
Qais Yousef

Suren Baghdasaryan Dec. 16, 2020, 6:21 p.m. UTC | #6

On Wed, Dec 16, 2020 at 8:48 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 12/16/20 14:14, Will Deacon wrote:
> > Hi Qais,
> >
> > On Wed, Dec 16, 2020 at 11:16:46AM +0000, Qais Yousef wrote:
> > > On 12/08/20 13:28, Will Deacon wrote:
> > > > Changes in v5 include:
> > > >
> > > >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> > > >     we can avoid returning incompatible CPUs for a given task. This
> > > >     means that sched_setaffinity() can be used with larger masks (like
> > > >     the online mask) from userspace and also allows us to take into
> > > >     account the cpuset hierarchy when forcefully overriding the affinity
> > > >     for a task on execve().
> > > >
> > > >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> > > >     so that the resulting affinity mask does not contain any incompatible
> > > >     CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> > > >
> > > >   * Moved overriding of the affinity mask into the scheduler core rather
> > > >     than munge affinity masks directly in the architecture backend.
> > > >
> > > >   * Extended comments and documentation.
> > > >
> > > >   * Some renaming and cosmetic changes.
> > > >
> > > > I'm pretty happy with this now, although it still needs review and will
> > > > require rebasing to play nicely with the SCA changes in -next.
> > >
> > > I still have concerns about the cpuset v1 handling. Specifically:
> > >
> > >     1. Attaching a 32bit task to 64bit only cpuset is allowed.
> > >
> > >        I think the right behavior here is to prevent that as the
> > >        intersection will appear as offline cpus for the 32bit tasks. So it
> > >        shouldn't be allowed to move there.
> >
> > Suren or Quentin can correct me if I'm wrong here, but I think Android
> > relies on this working so it's not an option for us to prevent the attach.
>
> I don't think so. It's just a matter of who handles the error, i.e. either the
> kernel fixes it up silently and effectively makes the cpuset a NOP since we
> don't respect the affinity of the cpuset, or user space picks the next best
> thing. Since this could return an error anyway, user space likely already
> handles this.

Moving a 32bit task around the hierarchy when it loses the last 32bit
capable CPU in its affinity mask would not work for Android. We move
the tasks in the hierarchy only when they change their role
(background/foreground/etc) and do not expect the tasks to migrate
by themselves. I think the current approach of adjusting affinity
without migration, while not ideal, is much better. Consistency with
cgroup v2 is a big plus as well.
We do plan on moving the cpuset controller to cgroup v2 but the transition
is slow, so my guess is that we will stick with v1 for another Android
release.

> > I also don't think it really achieves much, since as you point out, the same
> > problem exists in other cases such as execve() of a 32-bit binary, or
> > hotplugging off all 32-bit CPUs within a mixed cpuset. Allowing the attach
> > and immediately reparenting would probably be better, but see below.
>
> I am just wary that we're introducing generic asymmetric ISA support, so my
> concerns have been related to making sure the behavior is sane generally. When
> this gets merged, I can bet more 'fun' hardware will appear all over the place.
> We're opening the flood gates I'm afraid :p
>
> > >     2. Modifying cpuset.cpus could result in an empty set for 32bit tasks.
> > >
> > >        It is a variation of the above, it's just the cpuset transforms into
> > >        64bit only after we attach.
> > >
> > >        I think the right behavior here is to move the 32bit tasks to the
> > >        nearest ancestor like we do when all cpuset.cpus are hotplugged out.
> > >
> > >        We could also return an error if the new set would result in an
> > >        empty set for the 32bit tasks, in a similar manner to how it fails
> > >        if you write a cpu that is offline.
> > >
> > >     3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit
> > >        binary, the 32bit task will inherit the cgroup setting.
> > >
> > >        Like above, we should move this to the nearest ancestor.
> >
> > I considered this when I was writing the patches, but the reality is that
> > by allowing 32-bit tasks to attach to a 64-bit only cpuset (which is required
> > by Android), we have no choice but to expose a new ABI to userspace. This is
> > all gated behind a command-line option, so I think that's fine, but then why
> > not just have the same behaviour as cgroup v2? I don't see the point in
> > creating two new ABIs (for cgroup v1 and v2 respectively) if we don't need
>
> Ultimately it's up to Tejun and Peter I guess. I thought we need to preserve
> the v1 behavior for the new class of tasks. I won't object to the new ABI
> myself. Maybe we just need to make the commit messages and cgroup-v1
> documentation reflect that explicitly.
>
> > to. If it was _identical_ to the hotplug case, then we would surely just
> > follow the existing behaviour, but it's really quite different in this
> > situation because the cpuset is not empty.
>
> It is actually effectively empty for those tasks. But I see that one could look
> at it from two different angles.
>
> > One thing we should definitely do though is add this to the documentation
> > for the command-line option.
>
> +1
>
> By the way, should the command-line option be renamed to something more
> generic? This has already grown beyond just enabling the support for one
> isolated case. No strong opinion, just a suggestion.
>
> Thanks
>
> --
> Qais Yousef

Peter Zijlstra Dec. 17, 2020, 10:55 a.m. UTC | #7

On Tue, Dec 15, 2020 at 06:50:12PM +0000, Will Deacon wrote:
> On Tue, Dec 15, 2020 at 06:36:45PM +0100, Peter Zijlstra wrote:

> > IOW, any (accidental or otherwise) trip through a 32bit helper, will
> > destroy user state (the affinity mask: 0x3c).
> 
> Yes, that's correct, and I agree that it's a rough edge. If you're happy
> with the idea of adding an extra mask to make this work, then I can start
> hacking that up

Yeah, I'm afraid we'll have to, this asymmetric muck is only going to
get worse from here on.

Anyway, I think we can avoid adding another cpumask_t to task_struct and
do with a cpumask_t * instead. After all, for 'normal' tasks, the
task_cpu_possible_mask() will be cpu_possible_mask and we don't need to
carry anything extra.

Only once we hit one of these asymmetric ISA things, can the task
allocate the additional cpumask and retain the full mask.
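
A minimal sketch of that lazy-allocation idea, in Python with None standing in for a NULL cpumask_t pointer. The Task class, saved_user_mask and restrict_to() are hypothetical names, and dropping the saved copy once the restriction goes away is an assumption of this sketch rather than something stated above.

```python
from typing import Optional

ALL_CPUS = 0xFF  # stand-in for cpu_possible_mask on an 8-CPU system


class Task:
    def __init__(self, cpus_mask: int) -> None:
        self.cpus_mask = cpus_mask
        # Analogue of a NULL pointer: normal tasks carry nothing extra.
        self.saved_user_mask: Optional[int] = None

    def restrict_to(self, possible_mask: int) -> None:
        """Apply a new task_cpu_possible_mask()-style limit to the task."""
        if possible_mask == ALL_CPUS:
            # Unrestricted again: restore the full mask and free the copy.
            if self.saved_user_mask is not None:
                self.cpus_mask = self.saved_user_mask
                self.saved_user_mask = None
            return
        if self.saved_user_mask is None:
            # First time the task is narrowed: "allocate" the extra mask.
            self.saved_user_mask = self.cpus_mask
        self.cpus_mask = self.saved_user_mask & possible_mask
```

Tasks that never run on an asymmetric configuration never populate saved_user_mask, which is the point of using a pointer instead of embedding a second mask.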

> (although I doubt I'll get something out before the new
> year at this point).

Yeah, we're all about to shut down for a bit, I'll not be looking at
email for 2 weeks either, so even if you send it, I might not see it
until the next year.
