
[RFC,v1] pin_on_cpu: Introduce thread CPU pinning system call

Message ID 20200121160312.26545-1-mathieu.desnoyers@efficios.com (mailing list archive)
State New
Series [RFC,v1] pin_on_cpu: Introduce thread CPU pinning system call

Commit Message

Mathieu Desnoyers Jan. 21, 2020, 4:03 p.m. UTC
There is an important use-case which is not possible with the
"rseq" (Restartable Sequences) system call and which was left
as future work.

That use-case is to modify user-space per-cpu data structures
belonging to specific CPUs which may be brought offline and
online again by CPU hotplug. This can be used by memory
allocators to migrate free memory pools when CPUs are brought
offline, or by ring buffer consumers to target specific per-CPU
buffers, even when CPUs are brought offline.

A few rather complex prior attempts were made to solve this.
Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
That complexity was generally frowned upon, even by their author.

This patch fulfills this use-case in a refreshingly simple way:
it introduces a "pin_on_cpu" system call, which allows user-space
threads to pin themselves on a specific CPU (which needs to be
present in the thread's allowed cpu mask), and then clear this
pinned state.

"But this can already be done with sched_setaffinity", some
would rightfully reply. However, there is a significant twist
in the way pin_on_cpu deals with CPU hotplug compared to the
allowed cpu mask.

When all CPUs part of the thread's allowed cpu mask are offlined,
this mask is effectively reset to include all CPUs. This behavior
is completely incompatible with modifying per-cpu data structures:
the updates then become racy between concurrent CPUs trying to
update the given per-cpu data.

Conversely, all threads pinned on a given CPU with pin_on_cpu are
guaranteed to be scheduled on the same runqueue when that CPU is
offline. If that CPU is brought back online, the CPU hotplug
scheduler hooks are responsible for migrating back the tasks to
their pinned CPU.

For instance, this allows implementing the following userspace library API
for incrementing a per-cpu counter for a specific cpu number
received as a parameter:

static inline __attribute__((always_inline))
int percpu_addv(intptr_t *v, intptr_t count, int cpu)
{
        int ret;

        ret = rseq_addv(v, count, cpu);
check:
        if (rseq_unlikely(ret)) {
                pin_on_cpu_set(cpu);
                ret = rseq_addv(v, count, percpu_current_cpu());
                pin_on_cpu_clear();
                goto check;
        }
        return 0;
}
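
pin_on_cpu_set() and pin_on_cpu_clear() are expected to be thin userspace
wrappers around the new system call. A minimal sketch based on the command
enum and syscall signature introduced by this patch (the zero flags value
and the ignored cpu argument for the clear command are illustrative
assumptions, not settled ABI):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* From the patched uapi headers; repeated here for older toolchains. */
#ifndef __NR_pin_on_cpu
#define __NR_pin_on_cpu         436
#endif
#ifndef PIN_ON_CPU_CMD_SET
#define PIN_ON_CPU_CMD_SET      (1 << 0)
#define PIN_ON_CPU_CMD_CLEAR    (1 << 1)
#endif

static int pin_on_cpu_set(int cpu)
{
        /* Pin the calling thread on @cpu; it must be in the allowed mask. */
        return syscall(__NR_pin_on_cpu, PIN_ON_CPU_CMD_SET, 0, cpu);
}

static int pin_on_cpu_clear(void)
{
        /* Clear the pinned state; flags and cpu are assumed unused here. */
        return syscall(__NR_pin_on_cpu, PIN_ON_CPU_CMD_CLEAR, 0, 0);
}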

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |   1 +
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/sched.h             |   6 +
 init/init_task.c                       |   1 +
 kernel/sched/core.c                    | 321 +++++++++++++++++++++++--
 kernel/sched/deadline.c                |  54 +++--
 kernel/sched/fair.c                    |  19 ++
 kernel/sched/rt.c                      |  15 +-
 kernel/sched/sched.h                   |  28 +++
 kernel/sys_ni.c                        |   1 +
 14 files changed, 413 insertions(+), 42 deletions(-)

Comments

Jann Horn Jan. 21, 2020, 5:20 p.m. UTC | #1
On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> There is an important use-case which is not possible with the
> "rseq" (Restartable Sequences) system call, which was left as
> future work.
>
> That use-case is to modify user-space per-cpu data structures
> belonging to specific CPUs which may be brought offline and
> online again by CPU hotplug. This can be used by memory
> allocators to migrate free memory pools when CPUs are brought
> offline, or by ring buffer consumers to target specific per-CPU
> buffers, even when CPUs are brought offline.
>
> A few rather complex prior attempts were made to solve this.
> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> That complexity was generally frowned upon, even by their author.
>
> This patch fulfills this use-case in a refreshingly simple way:
> it introduces a "pin_on_cpu" system call, which allows user-space
> threads to pin themselves on a specific CPU (which needs to be
> present in the thread's allowed cpu mask), and then clear this
> pinned state.
[...]
> For instance, this allows implementing this userspace library API
> for incrementing a per-cpu counter for a specific cpu number
> received as parameter:
>
> static inline __attribute__((always_inline))
> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> {
>         int ret;
>
>         ret = rseq_addv(v, count, cpu);
> check:
>         if (rseq_unlikely(ret)) {
>                 pin_on_cpu_set(cpu);
>                 ret = rseq_addv(v, count, percpu_current_cpu());
>                 pin_on_cpu_clear();
>                 goto check;
>         }
>         return 0;
> }

What does userspace have to do if the set of allowed CPUs switches all
the time? For example, on Android, if you first open Chrome and then
look at its allowed CPUs, Chrome is allowed to use all CPU cores
because it's running in the foreground:

walleye:/ # ps -AZ | grep 'android.chrome$'
u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
197868 SyS_epoll_wait f09c0194 S com.android.chrome
walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
/proc/7845/status
3:cpuset:/top-app
Cpus_allowed_list: 0-7

But if you then switch to the home screen, the application is moved
into a different cgroup, and is restricted to two CPU cores:

walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
/proc/7845/status
3:cpuset:/background
Cpus_allowed_list: 0-1

At the same time, I also wonder whether it is a good idea to allow
userspace to stay active on a CPU even after the task has been told to
move to another CPU core - that's probably not exactly a big deal, but
seems suboptimal to me.


I'm wondering whether it might be possible to rework this mechanism
such that, instead of moving the current task onto a target CPU, it
prevents all *other* threads of the current process from running on
that CPU (either entirely or in user mode). That might be the easiest
way to take care of issues like CPU hotplugging and changing cpusets
all at once? The only potential issue I see with that approach would
be that you wouldn't be able to use it for inter-process
communication; and I have no idea whether this would be good or bad
performance-wise.
Mathieu Desnoyers Jan. 21, 2020, 7:47 p.m. UTC | #2
----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:

> On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> There is an important use-case which is not possible with the
>> "rseq" (Restartable Sequences) system call, which was left as
>> future work.
>>
>> That use-case is to modify user-space per-cpu data structures
>> belonging to specific CPUs which may be brought offline and
>> online again by CPU hotplug. This can be used by memory
>> allocators to migrate free memory pools when CPUs are brought
>> offline, or by ring buffer consumers to target specific per-CPU
>> buffers, even when CPUs are brought offline.
>>
>> A few rather complex prior attempts were made to solve this.
>> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
>> That complexity was generally frowned upon, even by their author.
>>
>> This patch fulfills this use-case in a refreshingly simple way:
>> it introduces a "pin_on_cpu" system call, which allows user-space
>> threads to pin themselves on a specific CPU (which needs to be
>> present in the thread's allowed cpu mask), and then clear this
>> pinned state.
> [...]
>> For instance, this allows implementing this userspace library API
>> for incrementing a per-cpu counter for a specific cpu number
>> received as parameter:
>>
>> static inline __attribute__((always_inline))
>> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
>> {
>>         int ret;
>>
>>         ret = rseq_addv(v, count, cpu);
>> check:
>>         if (rseq_unlikely(ret)) {
>>                 pin_on_cpu_set(cpu);
>>                 ret = rseq_addv(v, count, percpu_current_cpu());
>>                 pin_on_cpu_clear();
>>                 goto check;
>>         }
>>         return 0;
>> }
> 
> What does userspace have to do if the set of allowed CPUs switches all
> the time? For example, on Android, if you first open Chrome and then
> look at its allowed CPUs, Chrome is allowed to use all CPU cores
> because it's running in the foreground:
> 
> walleye:/ # ps -AZ | grep 'android.chrome$'
> u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
> 197868 SyS_epoll_wait f09c0194 S com.android.chrome
> walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> /proc/7845/status
> 3:cpuset:/top-app
> Cpus_allowed_list: 0-7
> 
> But if you then switch to the home screen, the application is moved
> into a different cgroup, and is restricted to two CPU cores:
> 
> walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> /proc/7845/status
> 3:cpuset:/background
> Cpus_allowed_list: 0-1

Then at that point, pin_on_cpu() would only be allowed to pin on
CPUs 0 and 1.

> At the same time, I also wonder whether it is a good idea to allow
> userspace to stay active on a CPU even after the task has been told to
> move to another CPU core - that's probably not exactly a big deal, but
> seems suboptimal to me.

Do you mean the following scenario for a given task ?

1) set affinity to CPU 0,1,2
2) pin on CPU 2
3) set affinity to CPU 0,1

In the patch I propose here, step (3) is forbidden by this added check in
__set_cpus_allowed_ptr, which is used by sched_setaffinity(2):

+       /* Prevent removing the currently pinned CPU from the allowed cpu mask. */
+       if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
+               ret = -EINVAL;
+               goto out;
+       }
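
Concretely, the intended userspace-visible behaviour would be along these
lines (a sketch only; pin_on_cpu_set()/pin_on_cpu_clear() are the wrappers
described in the patch, and the CPU numbers are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <err.h>

static void demo(void)
{
        cpu_set_t mask;

        /* Assume the allowed mask currently contains CPUs 0, 1 and 2. */
        if (pin_on_cpu_set(2))
                err(1, "pin_on_cpu_set");

        /* Dropping the pinned CPU from the allowed mask is now refused. */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask))
                warn("sched_setaffinity");      /* expected: EINVAL */

        pin_on_cpu_clear();
}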


> I'm wondering whether it might be possible to rework this mechanism
> such that, instead of moving the current task onto a target CPU, it
> prevents all *other* threads of the current process from running on
> that CPU (either entirely or in user mode). That might be the easiest
> way to take care of issues like CPU hotplugging and changing cpusets
> all at once? The only potential issue I see with that approach would
> be that you wouldn't be able to use it for inter-process
> communication; and I have no idea whether this would be good or bad
> performance-wise.

Firstly, inter-process communication over shared memory is one of my use-cases
(for lttng-ust's ring buffer).

I'm not convinced that applying constraints on all other threads belonging to
the current process would be easier or faster than migrating the current thread
over to the target CPU. I'm unsure how those additional constraints would
fit with other threads already having their own cpu affinity masks (which
could generate an empty cpumask by adding an extra constraint).

Thanks for the feedback!

Mathieu
Jann Horn Jan. 21, 2020, 8:35 p.m. UTC | #3
On Tue, Jan 21, 2020 at 8:47 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:
>
> > On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >> There is an important use-case which is not possible with the
> >> "rseq" (Restartable Sequences) system call, which was left as
> >> future work.
> >>
> >> That use-case is to modify user-space per-cpu data structures
> >> belonging to specific CPUs which may be brought offline and
> >> online again by CPU hotplug. This can be used by memory
> >> allocators to migrate free memory pools when CPUs are brought
> >> offline, or by ring buffer consumers to target specific per-CPU
> >> buffers, even when CPUs are brought offline.
> >>
> >> A few rather complex prior attempts were made to solve this.
> >> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> >> That complexity was generally frowned upon, even by their author.
> >>
> >> This patch fulfills this use-case in a refreshingly simple way:
> >> it introduces a "pin_on_cpu" system call, which allows user-space
> >> threads to pin themselves on a specific CPU (which needs to be
> >> present in the thread's allowed cpu mask), and then clear this
> >> pinned state.
> > [...]
> >> For instance, this allows implementing this userspace library API
> >> for incrementing a per-cpu counter for a specific cpu number
> >> received as parameter:
> >>
> >> static inline __attribute__((always_inline))
> >> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> >> {
> >>         int ret;
> >>
> >>         ret = rseq_addv(v, count, cpu);
> >> check:
> >>         if (rseq_unlikely(ret)) {
> >>                 pin_on_cpu_set(cpu);
> >>                 ret = rseq_addv(v, count, percpu_current_cpu());
> >>                 pin_on_cpu_clear();
> >>                 goto check;
> >>         }
> >>         return 0;
> >> }
> >
> > What does userspace have to do if the set of allowed CPUs switches all
> > the time? For example, on Android, if you first open Chrome and then
> > look at its allowed CPUs, Chrome is allowed to use all CPU cores
> > because it's running in the foreground:
> >
> > walleye:/ # ps -AZ | grep 'android.chrome$'
> > u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
> > 197868 SyS_epoll_wait f09c0194 S com.android.chrome
> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> > /proc/7845/status
> > 3:cpuset:/top-app
> > Cpus_allowed_list: 0-7
> >
> > But if you then switch to the home screen, the application is moved
> > into a different cgroup, and is restricted to two CPU cores:
> >
> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> > /proc/7845/status
> > 3:cpuset:/background
> > Cpus_allowed_list: 0-1
>
> Then at that point, pin_on_cpu() would only be allowed to pin on
> CPUs 0 and 1.

Which means that you can't actually reliably use pin_on_cpu_set() to
manipulate percpu data structures since you have to call it with the
assumption that it might randomly fail at any time, right? And then
userspace needs to code a fallback path that somehow halts all the
threads with thread-directed signals or something?
Especially if the task trying to use pin_on_cpu_set() isn't allowed to
pin to the target CPU, but all the other tasks using the shared data
structure are allowed to do that. Or if the CPU you want to pin to is
always removed from your affinity mask immediately before
pin_on_cpu_set() and added back immediately afterwards.

> > At the same time, I also wonder whether it is a good idea to allow
> > userspace to stay active on a CPU even after the task has been told to
> > move to another CPU core - that's probably not exactly a big deal, but
> > seems suboptimal to me.
>
> Do you mean the following scenario for a given task ?
>
> 1) set affinity to CPU 0,1,2
> 2) pin on CPU 2
> 3) set affinity to CPU 0,1
>
> In the patch I propose here, step (3) is forbidden by this added check in
> __set_cpus_allowed_ptr, which is used by sched_setaffinity(2):
>
> +       /* Prevent removing the currently pinned CPU from the allowed cpu mask. */
> +       if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }

How does that interact with things like cgroups? What happens when
root tries to move a process from the cpuset:/top-app cgroup to the
cpuset:/background cgroup? Does that also just throw an error code?

> > I'm wondering whether it might be possible to rework this mechanism
> > such that, instead of moving the current task onto a target CPU, it
> > prevents all *other* threads of the current process from running on
> > that CPU (either entirely or in user mode). That might be the easiest
> > way to take care of issues like CPU hotplugging and changing cpusets
> > all at once? The only potential issue I see with that approach would
> > be that you wouldn't be able to use it for inter-process
> > communication; and I have no idea whether this would be good or bad
> > performance-wise.
>
> Firstly, inter-process communication over shared memory is one of my use-cases
> (for lttng-ust's ring buffer).
>
> I'm not convinced that applying constraints on all other threads belonging to
> the current process would be easier or faster than migrating the current thread
> over to the target CPU. I'm unsure how those additional constraints would
> fit with other threads already having their own cpu affinity masks (which
> could generate an empty cpumask by adding an extra constraint).

Hm - is an empty cpumask a problem here? If one task is in the middle
of a critical section for performing maintenance on a percpu data
structure, isn't it a nice design property to exclude concurrent
access from other tasks to that data structure automatically (by
preventing those tasks from running on that CPU)? That way the
(presumably rarely-executed) update path doesn't have to be
rseq-reentrancy-safe.
Mathieu Desnoyers Jan. 21, 2020, 9:18 p.m. UTC | #4
----- On Jan 21, 2020, at 3:35 PM, Jann Horn jannh@google.com wrote:

> On Tue, Jan 21, 2020 at 8:47 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:
>>
>> > On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >> There is an important use-case which is not possible with the
>> >> "rseq" (Restartable Sequences) system call, which was left as
>> >> future work.
>> >>
>> >> That use-case is to modify user-space per-cpu data structures
>> >> belonging to specific CPUs which may be brought offline and
>> >> online again by CPU hotplug. This can be used by memory
>> >> allocators to migrate free memory pools when CPUs are brought
>> >> offline, or by ring buffer consumers to target specific per-CPU
>> >> buffers, even when CPUs are brought offline.
>> >>
>> >> A few rather complex prior attempts were made to solve this.
>> >> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
>> >> That complexity was generally frowned upon, even by their author.
>> >>
>> >> This patch fulfills this use-case in a refreshingly simple way:
>> >> it introduces a "pin_on_cpu" system call, which allows user-space
>> >> threads to pin themselves on a specific CPU (which needs to be
>> >> present in the thread's allowed cpu mask), and then clear this
>> >> pinned state.
>> > [...]
>> >> For instance, this allows implementing this userspace library API
>> >> for incrementing a per-cpu counter for a specific cpu number
>> >> received as parameter:
>> >>
>> >> static inline __attribute__((always_inline))
>> >> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
>> >> {
>> >>         int ret;
>> >>
>> >>         ret = rseq_addv(v, count, cpu);
>> >> check:
>> >>         if (rseq_unlikely(ret)) {
>> >>                 pin_on_cpu_set(cpu);
>> >>                 ret = rseq_addv(v, count, percpu_current_cpu());
>> >>                 pin_on_cpu_clear();
>> >>                 goto check;
>> >>         }
>> >>         return 0;
>> >> }
>> >
>> > What does userspace have to do if the set of allowed CPUs switches all
>> > the time? For example, on Android, if you first open Chrome and then
>> > look at its allowed CPUs, Chrome is allowed to use all CPU cores
>> > because it's running in the foreground:
>> >
>> > walleye:/ # ps -AZ | grep 'android.chrome$'
>> > u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
>> > 197868 SyS_epoll_wait f09c0194 S com.android.chrome
>> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
>> > /proc/7845/status
>> > 3:cpuset:/top-app
>> > Cpus_allowed_list: 0-7
>> >
>> > But if you then switch to the home screen, the application is moved
>> > into a different cgroup, and is restricted to two CPU cores:
>> >
>> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
>> > /proc/7845/status
>> > 3:cpuset:/background
>> > Cpus_allowed_list: 0-1
>>
>> Then at that point, pin_on_cpu() would only be allowed to pin on
>> CPUs 0 and 1.
> 
> Which means that you can't actually reliably use pin_on_cpu_set() to
> manipulate percpu data structures since you have to call it with the
> assumption that it might randomly fail at any time, right?

Only if the cpu affinity of the thread is being changed concurrently
by another thread, which is a possibility in some applications, indeed.

> And then
> userspace needs to code a fallback path that somehow halts all the
> threads with thread-directed signals or something?

The example use of pin_on_cpu() did not include handling of the return
value in that case (-1, errno=EINVAL) for conciseness. But yes, the
application would have to handle this.

It's not so different from error handling which is required when using
sched_setaffinity(), which can fail with -1, errno=EINVAL in the following
scenario:

       EINVAL The  affinity bit mask mask contains no processors that are cur‐
              rently physically on the system  and  permitted  to  the  thread
              according  to  any  restrictions  that  may  be  imposed  by the
              "cpuset" mechanism described in cpuset(7).

> Especially if the task trying to use pin_on_cpu_set() isn't allowed to
> pin to the target CPU, but all the other tasks using the shared data
> structure are allowed to do that. Or if the CPU you want to pin to is
> always removed from your affinity mask immediately before
> pin_on_cpu_set() and added back immediately afterwards.

I am tempted to state that using pin_on_cpu() targeting a disallowed cpu
should be considered a programming error and handled accordingly by the
application.

> 
>> > At the same time, I also wonder whether it is a good idea to allow
>> > userspace to stay active on a CPU even after the task has been told to
>> > move to another CPU core - that's probably not exactly a big deal, but
>> > seems suboptimal to me.
>>
>> Do you mean the following scenario for a given task ?
>>
>> 1) set affinity to CPU 0,1,2
>> 2) pin on CPU 2
>> 3) set affinity to CPU 0,1
>>
>> In the patch I propose here, step (3) is forbidden by this added check in
>> __set_cpus_allowed_ptr, which is used by sched_setaffinity(2):
>>
>> +       /* Prevent removing the currently pinned CPU from the allowed cpu mask.
>> */
>> +       if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
>> +               ret = -EINVAL;
>> +               goto out;
>> +       }
> 
> How does that interact with things like cgroups? What happens when
> root tries to move a process from the cpuset:/top-app cgroup to the
> cpuset:/background cgroup? Does that also just throw an error code?

Looking at kernel/cgroup/cpuset.c's use of set_cpus_allowed_ptr(), it
either:

A) silently ignores the error (3 instances),
or
B) triggers a WARN_ON_ONCE() (1 instance).

So yeah, that cgroup code seems rather optimistic that this operation
should always succeed (?!?)

I can say up front that I am not entirely sure what the best way to proceed
here is. Either sched_setaffinity and cgroup can fail when trying to remove a
CPU from the affinity mask while the task is pinned, or we let them succeed
and leave the task running on the pinned CPU anyway. Or is there a better
idea ?

> 
>> > I'm wondering whether it might be possible to rework this mechanism
>> > such that, instead of moving the current task onto a target CPU, it
>> > prevents all *other* threads of the current process from running on
>> > that CPU (either entirely or in user mode). That might be the easiest
>> > way to take care of issues like CPU hotplugging and changing cpusets
>> > all at once? The only potential issue I see with that approach would
>> > be that you wouldn't be able to use it for inter-process
>> > communication; and I have no idea whether this would be good or bad
>> > performance-wise.
>>
>> Firstly, inter-process communication over shared memory is one of my use-cases
>> (for lttng-ust's ring buffer).
>>
>> I'm not convinced that applying constraints on all other threads belonging to
>> the current process would be easier or faster than migrating the current thread
>> over to the target CPU. I'm unsure how those additional constraints would
>> fit with other threads already having their own cpu affinity masks (which
>> could generate an empty cpumask by adding an extra constraint).
> 
> Hm - is an empty cpumask a problem here? If one task is in the middle
> of a critical section for performing maintenance on a percpu data
> structure, isn't it a nice design property to exclude concurrent
> access from other tasks to that data structure automatically (by
> preventing those tasks by running on that CPU)? That way the
> (presumably rarely-executed) update path doesn't have to be
> rseq-reentrancy-safe.

Given we already have rseq, simply using it to protect against other
threads trying to touch the same per-cpu data seems rather more lightweight
than trying to exclude all other threads from that CPU for a possibly
unbounded amount of time.

Allowing a completely empty cpumask could effectively allow those
critical sections to prevent execution of possibly higher priority
threads on the system, which ends up being the definition of a priority
inversion, which I'd like to avoid.

Thanks,

Mathieu
Christoph Lameter (Ampere) Jan. 21, 2020, 9:44 p.m. UTC | #5
These scenarios are all pretty complex and will be difficult for the users
of these APIs to understand.

I think the easiest solution (and most comprehensible) is for the user
space process that does per cpu operations to get some sort of signal. If
it's not able to handle that then terminate it. The code makes a basic
assumption after all that the process is running on a specific cpu. If
this is no longer the case then it's better to abort if the process cannot
handle moving to a different processor.
Mathieu Desnoyers Jan. 22, 2020, 1:11 a.m. UTC | #6
----- On Jan 21, 2020, at 4:44 PM, Chris Lameter cl@linux.com wrote:

> These scenarios are all pretty complex and will be difficult to understand
> for the user of these APIs.
> 
> I think the easiest solution (and most comprehensible) is for the user
> space process that does per cpu operations to get some sort of signal. If
> its not able to handle that then terminate it. The code makes a basic
> assumption after all that the process is running on a specific cpu. If
> this is no longer the case then its better to abort if the process cannot
> handle moving to a different processor.

The point of pin_on_cpu() is to allow threads to access per-cpu data
structures belonging to a given CPU even if they cannot run on that
CPU (because it is offline).

I am not sure what scenario your signal delivery proposal aims to cover.

Just to try to put this into the context of a specific scenario to see
if I understand your point, is the following what you have in mind ?

1. Thread A issues pin_on_cpu(5),
2. Thread B issues sched_setaffinity removing cpu 5 from thread A's
   affinity mask,
3. Noticing that it would generate an invalid combination, rather than
   failing sched_setaffinity, it would send a SIGSEGV (or other) signal
   to thread A.

Or do you have something entirely different in mind ?

Thanks,

Mathieu
Jann Horn Jan. 22, 2020, 8:23 a.m. UTC | #7
On Tue, Jan 21, 2020 at 10:18 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Jan 21, 2020, at 3:35 PM, Jann Horn jannh@google.com wrote:
>
> > On Tue, Jan 21, 2020 at 8:47 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> ----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:
> >>
> >> > On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
> >> > <mathieu.desnoyers@efficios.com> wrote:
> >> >> There is an important use-case which is not possible with the
> >> >> "rseq" (Restartable Sequences) system call, which was left as
> >> >> future work.
> >> >>
> >> >> That use-case is to modify user-space per-cpu data structures
> >> >> belonging to specific CPUs which may be brought offline and
> >> >> online again by CPU hotplug. This can be used by memory
> >> >> allocators to migrate free memory pools when CPUs are brought
> >> >> offline, or by ring buffer consumers to target specific per-CPU
> >> >> buffers, even when CPUs are brought offline.
> >> >>
> >> >> A few rather complex prior attempts were made to solve this.
> >> >> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> >> >> That complexity was generally frowned upon, even by their author.
> >> >>
> >> >> This patch fulfills this use-case in a refreshingly simple way:
> >> >> it introduces a "pin_on_cpu" system call, which allows user-space
> >> >> threads to pin themselves on a specific CPU (which needs to be
> >> >> present in the thread's allowed cpu mask), and then clear this
> >> >> pinned state.
> >> > [...]
> >> >> For instance, this allows implementing this userspace library API
> >> >> for incrementing a per-cpu counter for a specific cpu number
> >> >> received as parameter:
> >> >>
> >> >> static inline __attribute__((always_inline))
> >> >> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> >> >> {
> >> >>         int ret;
> >> >>
> >> >>         ret = rseq_addv(v, count, cpu);
> >> >> check:
> >> >>         if (rseq_unlikely(ret)) {
> >> >>                 pin_on_cpu_set(cpu);
> >> >>                 ret = rseq_addv(v, count, percpu_current_cpu());
> >> >>                 pin_on_cpu_clear();
> >> >>                 goto check;
> >> >>         }
> >> >>         return 0;
> >> >> }
> >> >
> >> > What does userspace have to do if the set of allowed CPUs switches all
> >> > the time? For example, on Android, if you first open Chrome and then
> >> > look at its allowed CPUs, Chrome is allowed to use all CPU cores
> >> > because it's running in the foreground:
> >> >
> >> > walleye:/ # ps -AZ | grep 'android.chrome$'
> >> > u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
> >> > 197868 SyS_epoll_wait f09c0194 S com.android.chrome
> >> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> >> > /proc/7845/status
> >> > 3:cpuset:/top-app
> >> > Cpus_allowed_list: 0-7
> >> >
> >> > But if you then switch to the home screen, the application is moved
> >> > into a different cgroup, and is restricted to two CPU cores:
> >> >
> >> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> >> > /proc/7845/status
> >> > 3:cpuset:/background
> >> > Cpus_allowed_list: 0-1
> >>
> >> Then at that point, pin_on_cpu() would only be allowed to pin on
> >> CPUs 0 and 1.
> >
> > Which means that you can't actually reliably use pin_on_cpu_set() to
> > manipulate percpu data structures since you have to call it with the
> > assumption that it might randomly fail at any time, right?
>
> Only if the cpu affinity of the thread is being changed concurrently
> by another thread which is a possibility in some applications, indeed.

Not just some applications, but also some environments, right? See the
Android example - the set of permitted CPUs is changed not by the
application itself, but by a management process that uses cgroup
modifications to indirectly change the set of permitted CPUs. I
wouldn't be surprised if the same could happen in e.g. container
environments.

> > And then
> > userspace needs to code a fallback path that somehow halts all the
> > threads with thread-directed signals or something?
>
> The example use of pin_on_cpu() did not include handling of the return
> value in that case (-1, errno=EINVAL) for conciseness. But yes, the
> application would have to handle this.
>
> It's not so different from error handling which is required when using
> sched_setaffinity(), which can fail with -1, errno=EINVAL in the following
> scenario:
>
>        EINVAL The  affinity bit mask mask contains no processors that are cur‐
>               rently physically on the system  and  permitted  to  the  thread
>               according  to  any  restrictions  that  may  be  imposed  by the
>               "cpuset" mechanism described in cpuset(7).

Except that sched_setaffinity() is normally just a performance
optimization, right? Whereas pin_on_cpu() is more of a correctness
thing?

> > Especially if the task trying to use pin_on_cpu_set() isn't allowed to
> > pin to the target CPU, but all the other tasks using the shared data
> > structure are allowed to do that. Or if the CPU you want to pin to is
> > always removed from your affinity mask immediately before
> > pin_on_cpu_set() and added back immediately afterwards.
>
> I am tempted to state that using pin_on_cpu() targeting a disallowed cpu
> should be considered a programming error and handled accordingly by the
> application.

How can it be a programming error if that situation can be triggered
by legitimate external modifications to CPU affinity?

[...]
> >> > I'm wondering whether it might be possible to rework this mechanism
> >> > such that, instead of moving the current task onto a target CPU, it
> >> > prevents all *other* threads of the current process from running on
> >> > that CPU (either entirely or in user mode). That might be the easiest
> >> > way to take care of issues like CPU hotplugging and changing cpusets
> >> > all at once? The only potential issue I see with that approach would
> >> > be that you wouldn't be able to use it for inter-process
> >> > communication; and I have no idea whether this would be good or bad
> >> > performance-wise.
> >>
> >> Firstly, inter-process communication over shared memory is one of my use-cases
> >> (for lttng-ust's ring buffer).
> >>
> >> I'm not convinced that applying constraints on all other threads belonging to
> >> the current process would be easier or faster than migrating the current thread
> >> over to the target CPU. I'm unsure how those additional constraints would
> >> fit with other threads already having their own cpu affinity masks (which
> >> could generate an empty cpumask by adding an extra constraint).
> >
> > Hm - is an empty cpumask a problem here? If one task is in the middle
> > of a critical section for performing maintenance on a percpu data
> > structure, isn't it a nice design property to exclude concurrent
> > access from other tasks to that data structure automatically (by
> > preventing those tasks by running on that CPU)? That way the
> > (presumably rarely-executed) update path doesn't have to be
> > rseq-reentrancy-safe.
>
> Given we already have rseq, simply using it to protect against other
> threads trying to touch the same per-cpu data seems rather more lightweight
> than to try to exclude all other threads from that CPU for a possibly
> unbounded amount of time.

That only works if the cpu-targeted maintenance operation is something
that can be implemented in rseq, right? I was thinking that it might
be nice to avoid that limitation - but I don't know much about the
kinds of data structures that one might want to build on top of rseq,
so maybe that's a silly idea.

> Allowing a completely empty cpumask could effectively allow those
> critical sections to prevent execution of possibly higher priority
> threads on the system, which ends up being the definition of a priority
> inversion, which I'd like to avoid.

Linux does have the infrastructure for RT futexes, right? Maybe that
could be useful here.
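
(For concreteness, that PI-futex infrastructure is already reachable from
userspace through priority-inheritance pthread mutexes; a minimal sketch,
not tied to this patch:)

#include <pthread.h>

static pthread_mutex_t percpu_maintenance_lock;

static void init_pi_lock(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* Priority inheritance: a low-priority holder gets boosted while a
         * higher-priority waiter is blocked on the lock. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&percpu_maintenance_lock, &attr);
        pthread_mutexattr_destroy(&attr);
}

The rarely-executed maintenance path could take such a lock instead of (or
in addition to) pinning, so a preempted holder cannot indefinitely delay a
higher-priority waiter.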
H. Peter Anvin Jan. 23, 2020, 7:53 a.m. UTC | #8
On 2020-01-21 17:11, Mathieu Desnoyers wrote:
> ----- On Jan 21, 2020, at 4:44 PM, Chris Lameter cl@linux.com wrote:
> 
>> These scenarios are all pretty complex and will be difficult to understand
>> for the user of these APIs.
>>
>> I think the easiest solution (and most comprehensible) is for the user
>> space process that does per cpu operations to get some sort of signal. If
>> its not able to handle that then terminate it. The code makes a basic
>> assumption after all that the process is running on a specific cpu. If
>> this is no longer the case then its better to abort if the process cannot
>> handle moving to a different processor.
> 
> The point of pin_on_cpu() is to allow threads to access per-cpu data
> structures belonging to a given CPU even if they cannot run on that
> CPU (because it is offline).
> 
> I am not sure what scenario your signal delivery proposal aims to cover.
> 
> Just to try to put this into the context of a specific scenario to see
> if I understand your point, is the following what you have in mind ?
> 
> 1. Thread A issues pin_on_cpu(5),
> 2. Thread B issues sched_setaffinity removing cpu 5 from thread A's
>    affinity mask,
> 3. Noticing that it would generate an invalid combination, rather than
>    failing sched_setaffinity, it would send a SIGSEGV (or other) signal
>    to thread A.
> 
> Or so you have something entirely different in mind ?
> 

I would agree that this seems like the only sane option, or you will be in a
world of hurt because of conflicting semantics. It is not just offlining, but
what happens if a policy manager calls sched_setaffinity() on another thread
-- and now the universe breaks because a library is updated to use this new
system call which collides with the expectations of the policy manager.

There doesn't seem to be any way to get this to be a local event which doesn't
break assumptions elsewhere in the system without making this an abort event
of some type. However, signals are painful in their own right, mostly because
of the lack of any infrastructure for allocating signals to libraries in user
space. I was actually thinking about exactly that issue just this weekend.

	-hpa
Florian Weimer Jan. 23, 2020, 8:19 a.m. UTC | #9
* H. Peter Anvin:

> On 2020-01-21 17:11, Mathieu Desnoyers wrote:
>> ----- On Jan 21, 2020, at 4:44 PM, Chris Lameter cl@linux.com wrote:
>> 
>>> These scenarios are all pretty complex and will be difficult to understand
>>> for the user of these APIs.
>>>
>>> I think the easiest solution (and most comprehensible) is for the user
>>> space process that does per cpu operations to get some sort of signal. If
>>> its not able to handle that then terminate it. The code makes a basic
>>> assumption after all that the process is running on a specific cpu. If
>>> this is no longer the case then its better to abort if the process cannot
>>> handle moving to a different processor.
>> 
>> The point of pin_on_cpu() is to allow threads to access per-cpu data
>> structures belonging to a given CPU even if they cannot run on that
>> CPU (because it is offline).
>> 
>> I am not sure what scenario your signal delivery proposal aims to cover.
>> 
>> Just to try to put this into the context of a specific scenario to see
>> if I understand your point, is the following what you have in mind ?
>> 
>> 1. Thread A issues pin_on_cpu(5),
>> 2. Thread B issues sched_setaffinity removing cpu 5 from thread A's
>>    affinity mask,
>> 3. Noticing that it would generate an invalid combination, rather than
>>    failing sched_setaffinity, it would send a SIGSEGV (or other) signal
>>    to thread A.
>> 
>> Or so you have something entirely different in mind ?
>> 
>
> I would agree that this seems like the only sane option, or you will
> be in a world of hurt because of conflicting semantics. It is not just
> offlining, but what happens if a policy manager calls
> sched_setaffinity() on another thread -- and now the universe breaks
> because a library is updated to use this new system call which
> collides with the expectations of the policy manager.

Yes, this new interface seems fundamentally incompatible with how
affinity masks are changed today.

Would it be possible to make pin_on_cpu_set use fallback
synchronization via a futex if the CPU cannot be acquired by running on
it?  The rseq section would need to check the futex as well, but would
not have to acquire it.
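
To sketch the idea (illustrative only: the per-CPU lock words, their sizing
and the helper names are made up here, and the rseq fast path would merely
load the word and fall back if it is non-zero, never acquire it):

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define NR_LOCK_WORDS   512     /* sized arbitrarily for the sketch */

/* One lock word per CPU; a cross-process user would place these in the
 * shared mapping next to the per-cpu data. */
static _Atomic int percpu_fallback_lock[NR_LOCK_WORDS];

/* Slow path, taken only when the thread cannot run on @cpu. */
static void percpu_fallback_acquire(int cpu)
{
        int expected = 0;

        while (!atomic_compare_exchange_strong(&percpu_fallback_lock[cpu],
                                               &expected, 1)) {
                /* Sleep until the current holder releases the word. */
                syscall(SYS_futex, &percpu_fallback_lock[cpu], FUTEX_WAIT,
                        expected, NULL, NULL, 0);
                expected = 0;
        }
}

static void percpu_fallback_release(int cpu)
{
        atomic_store(&percpu_fallback_lock[cpu], 0);
        syscall(SYS_futex, &percpu_fallback_lock[cpu], FUTEX_WAKE,
                1, NULL, NULL, 0);
}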

Thanks,
Florian
Mathieu Desnoyers Jan. 27, 2020, 7:39 p.m. UTC | #10
----- On Jan 23, 2020, at 3:19 AM, Florian Weimer fweimer@redhat.com wrote:

> * H. Peter Anvin:
> 
>> On 2020-01-21 17:11, Mathieu Desnoyers wrote:
>>> ----- On Jan 21, 2020, at 4:44 PM, Chris Lameter cl@linux.com wrote:
>>> 
>>>> These scenarios are all pretty complex and will be difficult to understand
>>>> for the user of these APIs.
>>>>
>>>> I think the easiest solution (and most comprehensible) is for the user
>>>> space process that does per cpu operations to get some sort of signal. If
>>>> its not able to handle that then terminate it. The code makes a basic
>>>> assumption after all that the process is running on a specific cpu. If
>>>> this is no longer the case then its better to abort if the process cannot
>>>> handle moving to a different processor.
>>> 
>>> The point of pin_on_cpu() is to allow threads to access per-cpu data
>>> structures belonging to a given CPU even if they cannot run on that
>>> CPU (because it is offline).
>>> 
>>> I am not sure what scenario your signal delivery proposal aims to cover.
>>> 
>>> Just to try to put this into the context of a specific scenario to see
>>> if I understand your point, is the following what you have in mind ?
>>> 
>>> 1. Thread A issues pin_on_cpu(5),
>>> 2. Thread B issues sched_setaffinity removing cpu 5 from thread A's
>>>    affinity mask,
>>> 3. Noticing that it would generate an invalid combination, rather than
>>>    failing sched_setaffinity, it would send a SIGSEGV (or other) signal
>>>    to thread A.
>>> 
>>> Or so you have something entirely different in mind ?
>>> 
>>
>> I would agree that this seems like the only sane option, or you will
>> be in a world of hurt because of conflicting semantics. It is not just
>> offlining, but what happens if a policy manager calls
>> sched_setaffinity() on another thread -- and now the universe breaks
>> because a library is updated to use this new system call which
>> collides with the expectations of the policy manager.
> 
> Yes, this new interface seems fundamentally incompatible with how
> affinity masks are changed today.
> 
> Would it be possible to make pin_on_cpu_set to use fallback
> synchronization via a futex if the CPU cannot be acquired by running on
> it?  The rseq section would need to check the futex as well, but would
> not have to acquire it.

I would really prefer to avoid adding any "mutual-exclusion-based" locking to
rseq, because it would then remove lock-freedom guarantees, which are really
useful when designing data structures shared across processes over shared
memory. Also, I would prefer to avoid adding additional loads and comparisons
to the rseq fast-path, because those quickly add up in terms of overhead
compared to the "basic" fast-path.

It brings an interesting idea to the table though. Let's assume for now that
the only intended use of pin_on_cpu(2) would be to allow rseq(2) critical
sections to update per-cpu data on specific cpu number targets. In fact,
considering that userspace can be preempted at any point, we still need a
mechanism to guarantee atomicity with respect to other threads running on
the same runqueue, which rseq(2) provides. Therefore, that assumption does
not appear too far-fetched.

There are 2 scenarios we need to consider here:

A) pin_on_cpu(2) targets a CPU which is not part of the affinity mask.

This case is easy: pin_on_cpu can return an error, and the caller needs to act
accordingly (e.g. figure out that this is a design error and report it, or
decide that it really did not want to touch that per-cpu data that badly and
make the entire process fall back to a mechanism which does not use per-cpu
data at all from that point onwards).

B) pin_on_cpu(2) targets a CPU which is part of the affinity mask, and which
   is then removed by cpuset or sched_setaffinity before unpin.

When the pinned cpu is removed from the affinity mask, we make sure the target
task enters the kernel (or is already in the kernel) so it can update
__rseq_abi.cpu_id when returning to user-space setting it to:

__rseq_abi.cpu_id = RSEQ_CPU_ID_PINNED_DISALLOWED  (= -3)

Which would ensure that all rseq critical sections fail. It would be restored
to the proper CPU number value when unpinning.

This would allow us to ensure user-space does not update the per-cpu data while
it is in the wrong state without having to play tricks with signals, which can be
rather cumbersome to do from userspace libraries.
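
To make that concrete, here is a hedged sketch of a caller noticing the
special value (rseq_current_cpu_raw() is the librseq-style accessor for
__rseq_abi.cpu_id; percpu_counter_inc_fallback() is a hypothetical
application-provided fallback, and RSEQ_CPU_ID_PINNED_DISALLOWED is only
the value proposed above):

/* counters[] is indexed by cpu number (padding to avoid false sharing
 * omitted for brevity). */
static int percpu_counter_inc(intptr_t *counters, int cpu)
{
        int ret, cur;

        ret = rseq_addv(&counters[cpu], 1, cpu);
        while (rseq_unlikely(ret)) {
                cur = rseq_current_cpu_raw();   /* reads __rseq_abi.cpu_id */
                if (cur < 0) {
                        /*
                         * -1/-2 already exist (unregistered / registration
                         * failed); -3 would be RSEQ_CPU_ID_PINNED_DISALLOWED,
                         * i.e. pinned && disallowed: stop using the per-cpu
                         * fast path.
                         */
                        return percpu_counter_inc_fallback(counters, cpu);
                }
                if (pin_on_cpu_set(cpu))
                        return percpu_counter_inc_fallback(counters, cpu);
                cur = rseq_current_cpu_raw();
                if (cur >= 0)
                        ret = rseq_addv(&counters[cpu], 1, cur);
                pin_on_cpu_clear();
        }
        return 0;
}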

Independently of the mechanism we choose to deliver information about this
unexpected state (pinned && disallowed), whether it's signal delivery or
through a special __rseq_abi.cpu_id value, the important part is to figure
out how exactly we expect applications to handle this condition. If the application
has a fallback available which does not require per-cpu data, it could be
enabled from that point onwards (quiescence of that transition could be
provided by a rseq barrier). Or if the application really requires per-cpu
data by design, it could simply choose to abort.

Thoughts ?

Thanks,

Mathieu
Florian Weimer Jan. 30, 2020, 11:10 a.m. UTC | #11
* Mathieu Desnoyers:

> It brings an interesting idea to the table though. Let's assume for now that
> the only intended use of pin_on_cpu(2) would be to allow rseq(2) critical
> sections to update per-cpu data on specific cpu number targets. In fact,
> considering that userspace can be preempted at any point, we still need a
> mechanism to guarantee atomicity with respect to other threads running on
> the same runqueue, which rseq(2) provides. Therefore, that assumption does
> not appear too far-fetched.
>
> There are 2 scenarios we need to consider here:
>
> A) pin_on_cpu(2) targets a CPU which is not part of the affinity mask.
>
> This case is easy: pin_on_cpu can return an error, and the caller needs to act
> accordingly (e.g. figure out that this is a design error and report it, or
> decide that it really did not want to touch that per-cpu data that badly and
> make the entire process fall-back to a mechanism which does not use per-cpu
> data at all from that point onwards)

Affinity masks currently are not like process memory: there is an
expectation that they can be altered from outside the process.

Given that the caller may not have any way to recover from the
suggested pin_on_cpu behavior, that seems problematic.

What I would expect is that if pin_on_cpu cannot achieve implied
exclusion by running on the associated CPU, it acquires a lock that
prevents other pin_on_cpu calls from entering the critical section, and
tasks in the same task group from running on that CPU (if the CPU
becomes available to the task group).  The second part should maintain
exclusion of rseq sequences even if their fast path is not changed.

(On the other hand, I'm worried that per-CPU data structures are a dead
end for user space unless we get containerized affinity masks, so that
containers only see resources that are actually available to them.)

Thanks,
Florian
Mathieu Desnoyers Feb. 14, 2020, 4:54 p.m. UTC | #12
----- On Jan 30, 2020, at 6:10 AM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> It brings an interesting idea to the table though. Let's assume for now that
>> the only intended use of pin_on_cpu(2) would be to allow rseq(2) critical
>> sections to update per-cpu data on specific cpu number targets. In fact,
>> considering that userspace can be preempted at any point, we still need a
>> mechanism to guarantee atomicity with respect to other threads running on
>> the same runqueue, which rseq(2) provides. Therefore, that assumption does
>> not appear too far-fetched.
>>
>> There are 2 scenarios we need to consider here:
>>
>> A) pin_on_cpu(2) targets a CPU which is not part of the affinity mask.
>>
>> This case is easy: pin_on_cpu can return an error, and the caller needs to act
>> accordingly (e.g. figure out that this is a design error and report it, or
>> decide that it really did not want to touch that per-cpu data that badly and
>> make the entire process fall-back to a mechanism which does not use per-cpu
>> data at all from that point onwards)
> 
> Affinity masks currently are not like process memory: there is an
> expectation that they can be altered from outside the process.

Yes, that's my main issue.

> Given that the caller may not have any ways to recover from the
> suggested pin_on_cpu behavior, that seems problematic.

Indeed.

> 
> What I would expect is that if pin_on_cpu cannot achieve implied
> exclusion by running on the associated CPU, it acquires a lock that
> prevents others pin_on_cpu calls from entering the critical section, and
> tasks in the same task group from running on that CPU (if the CPU
> becomes available to the task group).  The second part should maintain
> exclusion of rseq sequences even if their fast path is not changed.

I try to avoid mutual exclusion over shared memory as an rseq fallback whenever
I can, so we can use rseq from lock-free algorithms without losing lock-freedom.

> (On the other hand, I'm worried that per-CPU data structures are a dead
> end for user space unless we get containerized affinity masks, so that
> contains only see resources that are actually available to them.)

I'm currently implementing a prototype of the following ideas, and I'm curious to
read your thoughts on those:

I'm adding an "affinity_pinned" flag to the task struct of each thread. It can
be set and cleared only by the owner thread through pin_on_cpu syscall commands.
When the affinity is pinned by a thread, trying to change its affinity (from an
external thread, or possibly from itself) will fail.

Whenever a thread would (temporarily) pin itself on a specific CPU, it would
also pin its affinity mask as a side-effect. When a thread unpins from a CPU,
the affinity mask stays pinned. The purpose of keeping this affinity pinned
state per-thread is to ensure we don't end up with tiny race windows where
changing the thread's affinity mask "typically" works, but fails once in a
while because it's done concurrently with a 1ms long cpu pinning. This would
lead to flaky code, and I try hard to avoid that.

How changing this affinity should fail (from sched_setaffinity and cpusets) is a
big unanswered question. I see two major alternatives so far:

1) We deliver a signal to the target thread (SIGKILL ? SIGSEGV ?), considering
   that being unable to change its affinity mask means we need to send a
   signal. How exactly the killed application would recover (or whether it
   should) is still unclear.

2) Return an error to the sched_setaffinity or cpusets caller, and let it deal
   with the error as it sees fit: ignore it, log it, or send a signal.

I think option (2) provides the most flexibility, and moves policy outside of
the kernel, which is a good thing. However, looking at how cpusets seem to
simply ignore errors when setting a task's cpumask, I wonder if asking
cpusets to handle any kind of error is asking too much. :-/
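
If option (2) were chosen, the check could look roughly like this inside
__set_cpus_allowed_ptr() (a sketch of the idea described above, assuming
the affinity_pinned field; this is not part of the posted patch, and the
exact errno is an open choice):

        /* The thread has pinned its affinity via pin_on_cpu: refuse the change. */
        if (p->affinity_pinned) {
                ret = -EBUSY;   /* or -EINVAL */
                goto out;
        }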

Thanks,

Mathieu

Patch

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 15908eb9b17e..0b1081a9b872 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@ 
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
+436	i386	pin_on_cpu		sys_pin_on_cpu			__ia32_sys_pin_on_cpu
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..90f9b3cab88d 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@ 
 433	common	fspick			__x64_sys_fspick
 434	common	pidfd_open		__x64_sys_pidfd_open
 435	common	clone3			__x64_sys_clone3/ptregs
+436	common	pin_on_cpu		__x64_sys_pin_on_cpu
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index c27231234764..6d882dbdd1e3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1827,6 +1827,7 @@  static int __do_execve_file(int fd, struct filename *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	current->pinned_cpu = -1;
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7f0bb6fff27c..ac0cac7b8d1d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -651,6 +651,7 @@  struct task_struct {
 	/* Current CPU: */
 	unsigned int			cpu;
 #endif
+	int				pinned_cpu;
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index be0d0cf788ba..46fee5af99e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1000,6 +1000,7 @@  asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
+asmlinkage long sys_pin_on_cpu(int cmd, int flags, int cpu);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..43b0c956cc3c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@  __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
 
+#define __NR_pin_on_cpu 436
+__SYSCALL(__NR_pin_on_cpu, sys_pin_on_cpu)
+
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..590cdc613698 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -114,4 +114,10 @@  struct clone_args {
 			 SCHED_FLAG_KEEP_ALL		| \
 			 SCHED_FLAG_UTIL_CLAMP)
 
+enum pin_on_cpu_cmd {
+	PIN_ON_CPU_CMD_QUERY				= 0,
+	PIN_ON_CPU_CMD_SET				= (1 << 0),
+	PIN_ON_CPU_CMD_CLEAR				= (1 << 1),
+};
+
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..9aabce589cc7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -88,6 +88,7 @@  struct task_struct init_task
 	.tasks		= LIST_HEAD_INIT(init_task.tasks),
 #ifdef CONFIG_SMP
 	.pushable_tasks	= PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
+	.pinned_cpu	= -1,
 #endif
 #ifdef CONFIG_CGROUP_SCHED
 	.sched_task_group = &root_task_group,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dacda4b0362..6ca904d6e0ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -52,6 +52,8 @@  const_debug unsigned int sysctl_sched_features =
 #undef SCHED_FEAT
 #endif
 
+#define PIN_ON_CPU_CMD_BITMASK	(PIN_ON_CPU_CMD_SET | PIN_ON_CPU_CMD_CLEAR)
+
 /*
  * Number of tasks to iterate in a single balance run.
  * Limited because this is done with IRQs disabled.
@@ -1457,8 +1459,13 @@  static inline bool is_per_cpu_kthread(struct task_struct *p)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
-	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
-		return false;
+	if (is_pinned_task(p)) {
+		if (!allowed_pinned_cpu(p, cpu))
+			return false;
+	} else {
+		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+			return false;
+	}
 
 	if (is_per_cpu_kthread(p))
 		return cpu_online(cpu);
@@ -1662,6 +1669,12 @@  static int __set_cpus_allowed_ptr(struct task_struct *p,
 		goto out;
 	}
 
+	/* Prevent removing the currently pinned CPU from the allowed cpu mask. */
+	if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	do_set_cpus_allowed(p, new_mask);
 
 	if (p->flags & PF_KTHREAD) {
@@ -1674,6 +1687,10 @@  static int __set_cpus_allowed_ptr(struct task_struct *p,
 			p->nr_cpus_allowed != 1);
 	}
 
+	/* Task pinned to a CPU overrides allowed cpu mask. */
+	if (is_pinned_task(p))
+		goto out;
+
 	/* Can the task run on the task's current CPU? If so, we're done */
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;
@@ -1813,11 +1830,20 @@  static int migrate_swap_stop(void *data)
 	if (task_cpu(arg->src_task) != arg->src_cpu)
 		goto unlock;
 
-	if (!cpumask_test_cpu(arg->dst_cpu, arg->src_task->cpus_ptr))
-		goto unlock;
-
-	if (!cpumask_test_cpu(arg->src_cpu, arg->dst_task->cpus_ptr))
-		goto unlock;
+	if (is_pinned_task(arg->src_task)) {
+		if (!allowed_pinned_cpu(arg->src_task, arg->dst_cpu))
+			goto unlock;
+	} else {
+		if (!cpumask_test_cpu(arg->dst_cpu, arg->src_task->cpus_ptr))
+			goto unlock;
+	}
+	if (is_pinned_task(arg->dst_task)) {
+		if (!allowed_pinned_cpu(arg->dst_task, arg->src_cpu))
+			goto unlock;
+	} else {
+		if (!cpumask_test_cpu(arg->src_cpu, arg->dst_task->cpus_ptr))
+			goto unlock;
+	}
 
 	__migrate_swap_task(arg->src_task, arg->dst_cpu);
 	__migrate_swap_task(arg->dst_task, arg->src_cpu);
@@ -1858,11 +1884,21 @@  int migrate_swap(struct task_struct *cur, struct task_struct *p,
 	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
 		goto out;
 
-	if (!cpumask_test_cpu(arg.dst_cpu, arg.src_task->cpus_ptr))
-		goto out;
+	if (is_pinned_task(arg.src_task)) {
+		if (!allowed_pinned_cpu(arg.src_task, arg.dst_cpu))
+			goto out;
+	} else {
+		if (!cpumask_test_cpu(arg.dst_cpu, arg.src_task->cpus_ptr))
+			goto out;
+	}
 
-	if (!cpumask_test_cpu(arg.src_cpu, arg.dst_task->cpus_ptr))
-		goto out;
+	if (is_pinned_task(arg.dst_task)) {
+		if (!allowed_pinned_cpu(arg.dst_task, arg.src_cpu))
+			goto out;
+	} else {
+		if (!cpumask_test_cpu(arg.src_cpu, arg.dst_task->cpus_ptr))
+			goto out;
+	}
 
 	trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
 	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
@@ -2034,6 +2070,18 @@  static int select_fallback_rq(int cpu, struct task_struct *p)
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
+	/*
+	 * If the task is pinned to a CPU which is online, pick that pinned CPU
+	 * number.
+	 * If the task is pinned to a CPU which is offline, pick a CPU which is
+	 * guaranteed to be the same for all tasks pinned to that offlined CPU.
+	 */
+	if (is_pinned_task(p)) {
+		if (cpu_online(p->pinned_cpu))
+			return p->pinned_cpu;
+		else
+			return pinned_cpu_offline_offload(p);
+	}
 	/*
 	 * If the node that the CPU is on has been offlined, cpu_to_node()
 	 * will return -1. There is no CPU on the node, and we should
@@ -2104,10 +2152,15 @@  int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 {
 	lockdep_assert_held(&p->pi_lock);
 
-	if (p->nr_cpus_allowed > 1)
-		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
-	else
-		cpu = cpumask_any(p->cpus_ptr);
+	if (is_pinned_task(p)) {
+		cpu = p->pinned_cpu;
+	} else {
+		if (p->nr_cpus_allowed > 1)
+			cpu = p->sched_class->select_task_rq(p, cpu, sd_flags,
+							     wake_flags);
+		else
+			cpu = cpumask_any(p->cpus_ptr);
+	}
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -6130,8 +6183,13 @@  int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (curr_cpu == target_cpu)
 		return 0;
 
-	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
-		return -EINVAL;
+	if (is_pinned_task(p)) {
+		if (!allowed_pinned_cpu(p, target_cpu))
+			return -EINVAL;
+	} else {
+		if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
+			return -EINVAL;
+	}
 
 	/* TODO: This is not properly updating schedstats */
 
@@ -6300,6 +6358,7 @@  static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf)
 
 	rq->stop = stop;
 }
+
 #endif /* CONFIG_HOTPLUG_CPU */
 
 void set_rq_online(struct rq *rq)
@@ -6380,11 +6439,100 @@  static int cpuset_cpu_inactive(unsigned int cpu)
 	return 0;
 }
 
+static bool skip_pinned_task(int pinned_cpu, int cpu,
+			     bool first_online)
+{
+	if (pinned_cpu < 0)
+		return true;
+	if (first_online) {
+		if (cpu_online(pinned_cpu) && pinned_cpu != cpu)
+			return true;
+	} else {
+		if (pinned_cpu != cpu)
+			return true;
+	}
+	return false;
+}
+
+static void sched_cpu_migrate_pinned_tasks(unsigned int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *p, *t;
+	bool first_online = false;
+
+	if (cpu == cpumask_first(cpu_online_mask))
+		first_online = true;
+
+	/*
+	 * The (online && !active) state reached while going online
+	 * only allows bound kthreads to be scheduled.
+	 * At this point, the CPU is completely online and running,
+	 * but no userspace tasks are scheduled on it yet.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(p, t) {
+		struct rq *target_rq;
+		struct rq_flags rf;
+		int pinned_cpu;
+
+		/*
+		 * Migrate t to cpu if pinned to this cpu.
+		 *
+		 * Migrate t to cpu if its pinned cpu is offline
+		 * and cpu becomes the new first online cpu.
+		 *
+		 * Transition of t->pinned_cpu to cpu can only
+		 * happen if the thread is scheduled on cpu, which
+		 * is impossible at this point because the cpu is
+		 * not active.
+		 *
+		 * Transition of t->pinned_cpu from cpu to -1 or some
+		 * other cpu number may happen concurrently. Therefore,
+		 * skip early (without rq lock), and check again with
+		 * the rq lock held to eliminate concurrent transitions
+		 * from cpu to -1 or some other cpu number.
+		 */
+		pinned_cpu = READ_ONCE(t->pinned_cpu);
+		if (skip_pinned_task(pinned_cpu, cpu, first_online))
+			continue;
+		if (pinned_cpu == cpu)
+			printk(KERN_DEBUG "pin_on_cpu migrate to owner: online cpu %d\n",
+			       cpu);
+		if (first_online && !cpu_online(pinned_cpu))
+			printk(KERN_DEBUG "pin_on_cpu migrate to new offload cpu %d\n",
+			       cpu);
+		target_rq = task_rq_lock(t, &rf);
+		pinned_cpu = t->pinned_cpu;
+		if (skip_pinned_task(pinned_cpu, cpu, first_online))
+			goto unlock;
+		WARN_ON_ONCE(target_rq == rq);
+		update_rq_clock(target_rq);
+		if (task_running(target_rq, t) || t->state == TASK_WAKING) {
+			struct migration_arg arg = { t, cpu };
+			/* Need help from migration thread: drop lock and wait. */
+			task_rq_unlock(target_rq, t, &rf);
+			stop_one_cpu(cpu_of(target_rq), migration_cpu_stop, &arg);
+			continue;
+		} else if (task_on_rq_queued(t)) {
+			/*
+			 * OK, since we're going to drop the lock immediately
+			 * afterwards anyway.
+			 */
+			rq = move_queued_task(target_rq, &rf, t, cpu);
+		}
+	unlock:
+		task_rq_unlock(target_rq, t, &rf);
+	}
+	read_unlock(&tasklist_lock);
+}
+
 int sched_cpu_activate(unsigned int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
+	sched_cpu_migrate_pinned_tasks(cpu);
+
 #ifdef CONFIG_SCHED_SMT
 	/*
 	 * When going up, increment the number of cores with SMT present.
@@ -7899,6 +8047,145 @@  struct cgroup_subsys cpu_cgrp_subsys = {
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+static void do_set_pinned_cpu(struct task_struct *p, int cpu)
+{
+	struct rq *rq = task_rq(p);
+	bool queued, running;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	queued = task_on_rq_queued(p);
+	running = task_current(rq, p);
+
+	if (queued) {
+		/*
+		 * rq->lock is only needed to dequeue/enqueue the task,
+		 * so only assert it when the task is actually queued.
+		 */
+		lockdep_assert_held(&rq->lock);
+		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+	}
+	if (running)
+		put_prev_task(rq, p);
+
+	WRITE_ONCE(p->pinned_cpu, cpu);
+
+	if (queued)
+		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+	if (running)
+		set_next_task(rq, p);
+}
+
+static int __do_pin_on_cpu(int cpu)
+{
+	struct task_struct *p = current;
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0, dest_cpu;
+	struct migration_arg arg = { p };
+
+	cpus_read_lock();
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+	if (cpu >= 0 && !cpumask_test_cpu(cpu, current->cpus_ptr)) {
+		ret = -EINVAL;
+		goto out;
+	}
+#ifdef CONFIG_SMP
+	do_set_pinned_cpu(p, cpu);
+	if (cpu >= 0) {
+		if (cpu_online(cpu))
+			dest_cpu = cpu;
+		else
+			dest_cpu = pinned_cpu_offline_offload(p);
+		if (task_cpu(p) == dest_cpu) {
+			/*
+			 * If the task already runs on the pinned cpu, we're
+			 * done.
+			 */
+			goto out;
+		}
+	} else {
+		/*
+		 * When clearing the pinned cpu, we may need to migrate the
+		 * current task if it is currently sitting on a runqueue which
+		 * does not belong to the allowed mask.
+		 */
+		dest_cpu = cpumask_any(p->cpus_ptr);
+	}
+	arg.dest_cpu = dest_cpu;
+	/* Need help from migration thread: drop lock and wait. */
+	task_rq_unlock(rq, p, &rf);
+	stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+
+	/* Preempt disable prevents hotplug on current cpu. */
+	preempt_disable();
+	WARN_ON_ONCE(cpu >= 0 && cpu_online(cpu) &&
+		     smp_processor_id() != cpu);
+	preempt_enable();
+	cpus_read_unlock();
+	return 0;
+#endif
+out:
+	task_rq_unlock(rq, p, &rf);
+	cpus_read_unlock();
+	return ret;
+}
+
+static int pin_on_cpu_set(int cpu)
+{
+	if (cpu < 0 || !cpu_possible(cpu))
+		return -EINVAL;
+	return __do_pin_on_cpu(cpu);
+}
+
+static int pin_on_cpu_clear(void)
+{
+	return __do_pin_on_cpu(-1);
+}
+
+/**
+ * sys_pin_on_cpu - pin the current task to a specific cpu.
+ * @cmd: command to issue (enum pin_on_cpu_cmd)
+ * @flags: system call flags
+ * @cpu: cpu where the task should run.
+ *
+ * Returns -EINVAL if cmd is unknown.
+ * Returns -EINVAL if flags are unknown.
+ * Returns -EINVAL if the CPU is not part of the possible CPUs.
+ * Returns -EINVAL if the CPU is not part of the allowed cpu mask
+ * for the current task.
+ *
+ * PIN_ON_CPU_CMD_QUERY: Return the mask of supported commands.
+ * PIN_ON_CPU_CMD_SET: Pin the current task to a specific CPU.
+ * PIN_ON_CPU_CMD_CLEAR: Clear cpu pinning for current task.
+ *
+ * If the pinned CPU is online, the current task will run on that CPU.
+ *
+ * If the pinned CPU is offline, the scheduler guarantees that
+ * all tasks pinned to that CPU number are moved to the same
+ * runqueue.
+ *
+ * Removing the pinned CPU from the task's allowed cpu mask is
+ * forbidden.
+ */
+SYSCALL_DEFINE3(pin_on_cpu, int, cmd, int, flags, int, cpu)
+{
+	if (unlikely(flags))
+		return -EINVAL;
+	switch (cmd) {
+	case PIN_ON_CPU_CMD_QUERY:
+		return PIN_ON_CPU_CMD_BITMASK;
+	case PIN_ON_CPU_CMD_SET:
+		return pin_on_cpu_set(cpu);
+	case PIN_ON_CPU_CMD_CLEAR:
+		return pin_on_cpu_clear();
+	default:
+		return -EINVAL;
+	}
+}
+
 void dump_cpu_task(int cpu)
 {
 	pr_info("Task dump for CPU %d:\n", cpu);
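
To make the calling convention documented above concrete, here is a hedged
sketch of the pin/update/clear pattern, reusing the illustrative pin_on_cpu()
wrapper from the earlier sketch. The run_pinned() helper and its callback are
assumptions of the example; the per-cpu update itself is still expected to use
an rseq-style mechanism, since pinning only guarantees placement:

/*
 * Run cb(cpu, arg) while the current thread is pinned to cpu.
 * Pinning only guarantees placement (on cpu if it is online, or on
 * the shared offload runqueue if it is offline); cb remains
 * responsible for the actual per-cpu update (e.g. via rseq).
 */
static int run_pinned(int cpu, int (*cb)(int cpu, void *arg), void *arg)
{
	int ret, cb_ret;

	ret = pin_on_cpu(PIN_ON_CPU_CMD_SET, 0, cpu);
	if (ret)
		return ret;	/* -1 with errno set, e.g. EINVAL */
	cb_ret = cb(cpu, arg);
	ret = pin_on_cpu(PIN_ON_CPU_CMD_CLEAR, 0, 0);
	return ret ? ret : cb_ret;
}
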
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a8a08030a8f7..8a1581e8509e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -535,24 +535,31 @@  static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	if (!later_rq) {
 		int cpu;
 
-		/*
-		 * If we cannot preempt any rq, fall back to pick any
-		 * online CPU:
-		 */
-		cpu = cpumask_any_and(cpu_active_mask, p->cpus_ptr);
-		if (cpu >= nr_cpu_ids) {
-			/*
-			 * Failed to find any suitable CPU.
-			 * The task will never come back!
-			 */
-			BUG_ON(dl_bandwidth_enabled());
-
+		if (is_pinned_task(p)) {
+			if (cpu_online(p->pinned_cpu))
+				cpu = p->pinned_cpu;
+			else
+				cpu = pinned_cpu_offline_offload(p);
+		} else {
 			/*
-			 * If admission control is disabled we
-			 * try a little harder to let the task
-			 * run.
+			 * If we cannot preempt any rq, fall back to pick any
+			 * online CPU:
 			 */
-			cpu = cpumask_any(cpu_active_mask);
+			cpu = cpumask_any_and(cpu_active_mask, p->cpus_ptr);
+			if (cpu >= nr_cpu_ids) {
+				/*
+				 * Failed to find any suitable CPU.
+				 * The task will never come back!
+				 */
+				BUG_ON(dl_bandwidth_enabled());
+
+				/*
+				 * If admission control is disabled we
+				 * try a little harder to let the task
+				 * run.
+				 */
+				cpu = cpumask_any(cpu_active_mask);
+			}
 		}
 		later_rq = cpu_rq(cpu);
 		double_lock_balance(rq, later_rq);
@@ -1836,9 +1843,15 @@  static void task_fork_dl(struct task_struct *p)
 
 static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
-	    cpumask_test_cpu(cpu, p->cpus_ptr))
-		return 1;
+	if (!task_running(rq, p)) {
+		if (is_pinned_task(p)) {
+			if (allowed_pinned_cpu(p, cpu))
+				return 1;
+		} else {
+			if (cpumask_test_cpu(cpu, p->cpus_ptr))
+				return 1;
+		}
+	}
 	return 0;
 }
 
@@ -1987,7 +2000,8 @@  static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
 		/* Retry if something changed. */
 		if (double_lock_balance(rq, later_rq)) {
 			if (unlikely(task_rq(task) != rq ||
-				     !cpumask_test_cpu(later_rq->cpu, task->cpus_ptr) ||
+				     (is_pinned_task(task) && !allowed_pinned_cpu(task, later_rq->cpu)) ||
+				     (!is_pinned_task(task) && !cpumask_test_cpu(later_rq->cpu, task->cpus_ptr)) ||
 				     task_running(rq, task) ||
 				     !dl_task(task) ||
 				     !task_on_rq_queued(task))) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69a81a5709ff..e96ae1ce9829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7223,6 +7223,25 @@  int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
+	if (is_pinned_task(p)) {
+		if (task_running(env->src_rq, p)) {
+			schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+			return 0;
+		}
+		if (cpu_online(p->pinned_cpu)) {
+			if (env->dst_cpu == p->pinned_cpu)
+				return 1;
+			else
+				return 0;
+		} else {
+			if (env->dst_cpu ==
+			    pinned_cpu_offline_offload(p))
+				return 1;
+			else
+				return 0;
+		}
+	}
+
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9b8adc01be3d..2774311e5750 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1600,9 +1600,15 @@  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
-	    cpumask_test_cpu(cpu, p->cpus_ptr))
-		return 1;
+	if (!task_running(rq, p)) {
+		if (is_pinned_task(p)) {
+			if (allowed_pinned_cpu(p, cpu))
+				return 1;
+		} else {
+			if (cpumask_test_cpu(cpu, p->cpus_ptr))
+				return 1;
+		}
+	}
 
 	return 0;
 }
@@ -1738,7 +1744,8 @@  static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 			 * Also make sure that it wasn't scheduled on its rq.
 			 */
 			if (unlikely(task_rq(task) != rq ||
-				     !cpumask_test_cpu(lowest_rq->cpu, task->cpus_ptr) ||
+				     (is_pinned_task(task) && !allowed_pinned_cpu(task, lowest_rq->cpu)) ||
+				     (!is_pinned_task(task) && !cpumask_test_cpu(lowest_rq->cpu, task->cpus_ptr)) ||
 				     task_running(rq, task) ||
 				     !rt_task(task) ||
 				     !task_on_rq_queued(task))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 49ed949f850c..922bc618cc87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,6 +187,34 @@  static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+/*
+ * All tasks pinned to an offlined CPU are sent to the runqueue
+ * of the first online CPU.
+ */
+static inline bool is_pinned_task(struct task_struct *p)
+{
+	return p->pinned_cpu >= 0;
+}
+
+static inline int pinned_cpu_offline_offload(struct task_struct *p)
+{
+	return cpumask_first(cpu_online_mask);
+}
+
+static inline bool allowed_pinned_cpu(struct task_struct *p, int cpu)
+{
+	if (!cpu_possible(cpu))
+		return false;
+	if (cpu_online(p->pinned_cpu)) {
+		if (p->pinned_cpu == cpu)
+			return true;
+	} else {
+		if (cpu == pinned_cpu_offline_offload(p))
+			return true;
+	}
+	return false;
+}
+
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 34b76895b81e..7e5192cd8d9d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -449,3 +449,4 @@  COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+COND_SYSCALL(pin_on_cpu);
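
As an illustration of the observable behavior (and of what a kselftest could
assert), a sketch that pins the calling thread to each allowed CPU in turn and
checks placement with sched_getcpu(). It reuses the hypothetical pin_on_cpu()
wrapper from the first sketch and assumes every CPU in the affinity mask is
online; an offlined CPU in the mask would legitimately be offloaded to the
first online CPU instead:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t allowed;
	int cpu;

	if (sched_getaffinity(0, sizeof(allowed), &allowed))
		exit(EXIT_FAILURE);

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		if (pin_on_cpu(PIN_ON_CPU_CMD_SET, 0, cpu))
			exit(EXIT_FAILURE);
		/* Only valid while cpu is online; offline cpus offload. */
		if (sched_getcpu() != cpu) {
			fprintf(stderr, "not running on cpu %d\n", cpu);
			exit(EXIT_FAILURE);
		}
		if (pin_on_cpu(PIN_ON_CPU_CMD_CLEAR, 0, 0))
			exit(EXIT_FAILURE);
	}
	printf("pin_on_cpu placement check passed\n");
	return EXIT_SUCCESS;
}
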