
[v0.9.1,3/6] sched/umcg: implement UMCG syscalls

Message ID 20211122211327.5931-4-posk@google.com (mailing list archive)
State New
Series sched,mm,x86/uaccess: implement User Managed Concurrency Groups

Commit Message

Peter Oskolkov Nov. 22, 2021, 9:13 p.m. UTC
Define struct umcg_task and two syscalls: sys_umcg_ctl() and sys_umcg_wait().

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).

In addition, M:N threading and cooperative user space scheduling
enables synchronous coding style and better cache locality when
compared to asynchronous callback/continuation style of programming.

The UMCG kernel API is built around the following ideas:

* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
  scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
  server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
  can be ORed with the task state to communicate additional information to
  the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
  residing in the TLS, that fully reflects the current task's UMCG state
  and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
  server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
  wake another task, potentially context-switching between the two tasks
  on-CPU synchronously (a minimal userspace sketch follows this list).
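
For illustration only, below is a minimal userspace sketch of registration
and waiting; it is not part of this patch. The struct layout mirrors the
description above, the syscall numbers come from the x86-64 syscall table
in this series, and the wrapper names, flag values and exact argument
lists are assumptions made for the sketch:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Syscall numbers from the x86-64 table in this series. */
#define __NR_umcg_ctl   450
#define __NR_umcg_wait  451

/* Registration flags; the values here are assumed for the sketch. */
#define UMCG_CTL_REGISTER       0x00001
#define UMCG_CTL_WORKER         0x10000

/*
 * Per-thread control block, usually in TLS; the field layout mirrors
 * the struct umcg_task described above.
 */
struct umcg_task {
        uint64_t state_ts;              /* task state + timestamp */
        uint32_t next_tid;              /* context-switch target */
        uint32_t flags;                 /* reserved, must be zero */
        uint64_t idle_workers_ptr;      /* kernel-appended idle worker list */
        uint64_t idle_server_tid_ptr;   /* where an idle server TID lives */
} __attribute__((packed, aligned(64)));

static __thread struct umcg_task umcg_self;

/* Register the calling thread as a UMCG server (hypothetical wrapper). */
static long umcg_register_server(void)
{
        memset(&umcg_self, 0, sizeof(umcg_self));
        return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
}

/* Register the calling thread as a UMCG worker. */
static long umcg_register_worker(void)
{
        memset(&umcg_self, 0, sizeof(umcg_self));
        return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
                       &umcg_self);
}

/* Put the calling UMCG task to sleep; 0 means no timeout. */
static long umcg_wait(uint64_t abs_timeout)
{
        return syscall(__NR_umcg_wait, 0, abs_timeout);
}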

In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
  run because it has no server assigned to it (e.g. because all
  available servers are busy "running" other workers).

Usually the number of servers in a process equals the number of CPUs
available to the kernel if the process is meant to consume the whole
machine, or fewer if the process shares the machine with other
workloads. The number of workers in a process can grow very large:
tens of thousands is normal, and hundreds of thousands or even millions
would be desirable to reach in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers are
(intended to be) conceptually similar to them.

Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.txt (see sibling patches).

Some high-level implementation notes:

UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).

The main UMCG server/worker interaction looks like:

a. worker W1 is RUNNING, with a server S attached to it sleeping
   in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
   as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
   W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait()
   (see the sketch after this list);
e. when the blocking operation of W1 completes, the worker
   is marked by the kernel as IDLE and added to the idle workers
   list (see struct umcg_task) for the userspace to pick up and
   later run (the "wake detection" event).
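
As an editorial illustration of step (d), a hypothetical server-side
sketch follows, building on the wrappers in the earlier sketch; the
state constants, the umcg_set_state() helper and the idle-list handling
are assumptions, not the exact ABI of this patch:

/* Illustrative state values; the real ones live in the uapi header. */
#define UMCG_TASK_RUNNING       0x1
#define UMCG_TASK_IDLE          0x2

/*
 * Assumed helper: atomically replace the six state bits of ->state_ts;
 * a real implementation would also refresh the timestamp bits (the
 * kernel-side counterpart is umcg_update_state() in kernel/sched/umcg.c).
 */
static void umcg_set_state(struct umcg_task *ut, uint32_t state)
{
        uint64_t old = __atomic_load_n(&ut->state_ts, __ATOMIC_ACQUIRE);
        uint64_t new;

        do {
                new = (old & ~0x3full) | state;
        } while (!__atomic_compare_exchange_n(&ut->state_ts, &old, new, 0,
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED));
}

/*
 * Step (d): run worker @w in place of this server, then sleep until the
 * next block/wake event (steps (c) and (e)) wakes the server again.
 */
static void server_run_worker(struct umcg_task *w, uint32_t worker_tid,
                              uint32_t server_tid)
{
        w->next_tid = server_tid;               /* whom to wake on block */
        umcg_set_state(w, UMCG_TASK_RUNNING);

        umcg_self.next_tid = worker_tid;        /* whom to switch into */
        umcg_set_state(&umcg_self, UMCG_TASK_IDLE);
        umcg_wait(0);                           /* returns on block/wake */

        /*
         * Back as RUNNING: walk umcg_self.idle_workers_ptr here to pick
         * up wake-detected workers (step (e)) and choose the next one.
         */
}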

While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.

Specifically:

- most operations are conceptually context switches:
    - scheduling a worker: a running server goes to sleep and "runs"
      a worker in its place;
    - block detection: worker is descheduled, and its server is woken;
    - wake detection: woken worker, running in the kernel, is descheduled,
      and if there is an idle server, it is woken to process the wake
      detection event;
- to facilitate low scheduling latencies and cache locality, most
  server/worker interactions described above are performed synchronously
  "on CPU" via the WF_CURRENT_CPU flag passed to ttwu; while at the moment
  the context switches are simulated by putting the switched-out task to
  sleep and waking the switched-into task on the same CPU, it is very much
  the long-term goal of this project to make the context switch much
  lighter, by tweaking runtime accounting and, maybe, even bypassing
  __schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
  above, the server is woken on the same CPU, synchronously;
  this code may not pagefault, so to be able to access the worker's and
  the server's userspace memory (struct umcg_task), the pages containing
  those structs are pinned when the worker exits to userspace and
  unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
  in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
  pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
  with the UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
  on the exit to usermode the worker is intercepted and its server is woken
  (see Documentation/userspace-api/umcg.txt for more details);
- each state change is tagged with a unique timestamp (of the
  CLOCK_MONOTONIC variety), so that
    - scheduling instrumentation is naturally available;
    - racing state changes are easily detected and ABA issues are
      avoided;
  see umcg_update_state() in umcg.c for implementation details,
  Documentation/userspace-api/umcg.txt for a higher-level description,
  and the userspace-side decoding sketch after this list.
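
As a userspace-side illustration of the state/timestamp packing (the bit
layout below follows the struct umcg_task documentation in this patch:
bits 0-5 task state, bits 6-7 state flags, bits 18-63 a 46-bit
CLOCK_MONOTONIC timestamp at 16ns resolution; the helper and macro names
are editorial):

#include <stdint.h>

/*
 * state_ts bit layout, as documented in the patch:
 *   bits  0 -  5: task state
 *   bits  6 -  7: state flags
 *   bits  8 - 12: reserved
 *   bits 13 - 17: for userspace use
 *   bits 18 - 63: CLOCK_MONOTONIC timestamp, 16ns resolution
 */
#define UMCG_TASK_STATE_MASK    0x3fULL
#define UMCG_TASK_FLAGS_SHIFT   6
#define UMCG_TASK_FLAGS_MASK    (0x3ULL << UMCG_TASK_FLAGS_SHIFT)
#define UMCG_TS_SHIFT           18
#define UMCG_TS_RESOLUTION_NS   16

static inline uint32_t umcg_task_state(uint64_t state_ts)
{
        return (uint32_t)(state_ts & UMCG_TASK_STATE_MASK);
}

static inline uint32_t umcg_task_flags(uint64_t state_ts)
{
        return (uint32_t)((state_ts & UMCG_TASK_FLAGS_MASK)
                          >> UMCG_TASK_FLAGS_SHIFT);
}

/*
 * Timestamp of the last state change, in nanoseconds; two reads that
 * return the same state but different timestamps indicate a racing
 * state change (the ABA case mentioned above).
 */
static inline uint64_t umcg_state_change_ns(uint64_t state_ts)
{
        return (state_ts >> UMCG_TS_SHIFT) * UMCG_TS_RESOLUTION_NS;
}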

The previous version of the patchset can be found at
https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/
and contains some additional context and links to earlier discussions.

More details are available in Documentation/userspace-api/umcg.txt
in sibling patches, and in doc-comments in the code.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |  71 ++
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/umcg.h              | 137 ++++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |   4 +-
 kernel/exit.c                          |   5 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |   9 +-
 kernel/sched/umcg.c                    | 949 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   4 +
 13 files changed, 1199 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

--
2.25.1

Comments

kernel test robot Nov. 24, 2021, 6:36 p.m. UTC | #1
Hi Peter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf]

url:    https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
base:   cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf
config: arm64-randconfig-r031-20211124 (https://download.01.org/0day-ci/archive/20211125/202111250209.9dBNZjdP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 67a1c45def8a75061203461ab0060c75c864df1c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/942655474fa2cd59ea3d11a1cc03775dd79a508e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
        git checkout 942655474fa2cd59ea3d11a1cc03775dd79a508e
        # save the config file to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:34:1: note: expanded from here
   __arm64_sys_recvmsg
   ^
   kernel/sys_ni.c:257:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:263:1: warning: no previous prototype for function '__arm64_sys_mremap' [-Wmissing-prototypes]
   COND_SYSCALL(mremap);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:39:1: note: expanded from here
   __arm64_sys_mremap
   ^
   kernel/sys_ni.c:263:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:266:1: warning: no previous prototype for function '__arm64_sys_add_key' [-Wmissing-prototypes]
   COND_SYSCALL(add_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:40:1: note: expanded from here
   __arm64_sys_add_key
   ^
   kernel/sys_ni.c:266:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:267:1: warning: no previous prototype for function '__arm64_sys_request_key' [-Wmissing-prototypes]
   COND_SYSCALL(request_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:41:1: note: expanded from here
   __arm64_sys_request_key
   ^
   kernel/sys_ni.c:267:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:268:1: warning: no previous prototype for function '__arm64_sys_keyctl' [-Wmissing-prototypes]
   COND_SYSCALL(keyctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:42:1: note: expanded from here
   __arm64_sys_keyctl
   ^
   kernel/sys_ni.c:268:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:272:1: warning: no previous prototype for function '__arm64_sys_landlock_create_ruleset' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_create_ruleset);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:47:1: note: expanded from here
   __arm64_sys_landlock_create_ruleset
   ^
   kernel/sys_ni.c:272:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:273:1: warning: no previous prototype for function '__arm64_sys_landlock_add_rule' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_add_rule);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:48:1: note: expanded from here
   __arm64_sys_landlock_add_rule
   ^
   kernel/sys_ni.c:273:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:274:1: warning: no previous prototype for function '__arm64_sys_landlock_restrict_self' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_restrict_self);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:49:1: note: expanded from here
   __arm64_sys_landlock_restrict_self
   ^
   kernel/sys_ni.c:274:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:277:1: warning: no previous prototype for function '__arm64_sys_umcg_ctl' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_ctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:50:1: note: expanded from here
   __arm64_sys_umcg_ctl
   ^
   kernel/sys_ni.c:277:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:278:1: warning: no previous prototype for function '__arm64_sys_umcg_wait' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_wait);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:51:1: note: expanded from here
   __arm64_sys_umcg_wait
   ^
   kernel/sys_ni.c:278:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:283:1: warning: no previous prototype for function '__arm64_sys_fadvise64_64' [-Wmissing-prototypes]
   COND_SYSCALL(fadvise64_64);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:52:1: note: expanded from here
   __arm64_sys_fadvise64_64
   ^
   kernel/sys_ni.c:283:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:286:1: warning: no previous prototype for function '__arm64_sys_swapon' [-Wmissing-prototypes]
   COND_SYSCALL(swapon);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:53:1: note: expanded from here
   __arm64_sys_swapon
   ^
   kernel/sys_ni.c:286:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:287:1: warning: no previous prototype for function '__arm64_sys_swapoff' [-Wmissing-prototypes]
   COND_SYSCALL(swapoff);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:54:1: note: expanded from here
   __arm64_sys_swapoff
   ^
   kernel/sys_ni.c:287:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:288:1: warning: no previous prototype for function '__arm64_sys_mprotect' [-Wmissing-prototypes]
   COND_SYSCALL(mprotect);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:55:1: note: expanded from here
   __arm64_sys_mprotect
   ^
   kernel/sys_ni.c:288:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:289:1: warning: no previous prototype for function '__arm64_sys_msync' [-Wmissing-prototypes]
   COND_SYSCALL(msync);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:56:1: note: expanded from here
   __arm64_sys_msync
   ^
   kernel/sys_ni.c:289:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:290:1: warning: no previous prototype for function '__arm64_sys_mlock' [-Wmissing-prototypes]
   COND_SYSCALL(mlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:57:1: note: expanded from here
   __arm64_sys_mlock
   ^
   kernel/sys_ni.c:290:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:291:1: warning: no previous prototype for function '__arm64_sys_munlock' [-Wmissing-prototypes]
   COND_SYSCALL(munlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:58:1: note: expanded from here
   __arm64_sys_munlock
   ^
   kernel/sys_ni.c:291:1: note: declare 'static' if the function is not intended to be used outside of this translation unit


vim +/__arm64_sys_umcg_ctl +277 kernel/sys_ni.c

   275	
   276	/* kernel/sched/umcg.c */
 > 277	COND_SYSCALL(umcg_ctl);
 > 278	COND_SYSCALL(umcg_wait);
   279	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Peter Zijlstra Nov. 24, 2021, 8:08 p.m. UTC | #2
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +	/**
> +	 * @state_ts: the current state of the UMCG task described by
> +	 *            this struct, with a unique timestamp indicating
> +	 *            when the last state change happened.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace.
> +	 *
> +	 * UMCG task state:
> +	 *   bits  0 -  5: task state;
> +	 *   bits  6 -  7: state flags;
> +	 *   bits  8 - 12: reserved; must be zeroes;
> +	 *   bits 13 - 17: for userspace use;
> +	 *   bits 18 - 63: timestamp (see below).
> +	 *
> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +	 * See Documentation/userspace-api/umcg.txt for details.
> +	 */
> +	__u64	state_ts;		/* r/w */
> +
> +	/**
> +	 * @next_tid: the TID of the UMCG task that should be context-switched
> +	 *            into in sys_umcg_wait(). Can be zero.
> +	 *
> +	 * Running UMCG workers must have next_tid set to point to IDLE
> +	 * UMCG servers.
> +	 *
> +	 * Read-only for the kernel, read/write for the userspace.
> +	 */
> +	__u32	next_tid;		/* r   */
> +
> +	__u32	flags;			/* Reserved; must be zero. */
> +
> +	/**
> +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace: the
> +	 * kernel adds items to the list, the userspace removes them.
> +	 */
> +	__u64	idle_workers_ptr;	/* r/w */
> +
> +	/**
> +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> +	 *                       Readonly.
> +	 */
> +	__u64	idle_server_tid_ptr;	/* r   */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));

The thing is; I really don't see how this is supposed to be used. Where
did the blocked and runnable list go ?

I also don't see why the kernel cares about idle workers at all; that
seems something userspace can sort itself just fine.

The whole next_tid thing seems confused too, how can it be the next task
when it must be the server? Also, what if there isn't an idle server?

This just all isn't making any sense to me.
Peter Zijlstra Nov. 24, 2021, 9:19 p.m. UTC | #3
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:

> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.

> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> +				bool may_fault)
> +{
> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;

I'm still very hesitant to use ktime (fear the HPET); but I suppose it
makes sense to use a time base that's accessible to userspace. Was
MONOTONIC_RAW considered?
Peter Zijlstra Nov. 24, 2021, 9:32 p.m. UTC | #4
On Wed, Nov 24, 2021 at 09:08:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> > +/**
> > + * struct umcg_task - controls the state of UMCG tasks.
> > + *
> > + * The struct is aligned at 64 bytes to ensure that it fits into
> > + * a single cache line.
> > + */
> > +struct umcg_task {
> > +	/**
> > +	 * @state_ts: the current state of the UMCG task described by
> > +	 *            this struct, with a unique timestamp indicating
> > +	 *            when the last state change happened.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace.
> > +	 *
> > +	 * UMCG task state:
> > +	 *   bits  0 -  5: task state;
> > +	 *   bits  6 -  7: state flags;
> > +	 *   bits  8 - 12: reserved; must be zeroes;
> > +	 *   bits 13 - 17: for userspace use;
> > +	 *   bits 18 - 63: timestamp (see below).
> > +	 *
> > +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> > +	 * See Documentation/userspace-api/umcg.txt for details.
> > +	 */
> > +	__u64	state_ts;		/* r/w */
> > +
> > +	/**
> > +	 * @next_tid: the TID of the UMCG task that should be context-switched
> > +	 *            into in sys_umcg_wait(). Can be zero.
> > +	 *
> > +	 * Running UMCG workers must have next_tid set to point to IDLE
> > +	 * UMCG servers.
> > +	 *
> > +	 * Read-only for the kernel, read/write for the userspace.
> > +	 */
> > +	__u32	next_tid;		/* r   */
> > +
> > +	__u32	flags;			/* Reserved; must be zero. */
> > +
> > +	/**
> > +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace: the
> > +	 * kernel adds items to the list, the userspace removes them.
> > +	 */
> > +	__u64	idle_workers_ptr;	/* r/w */
> > +
> > +	/**
> > +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> > +	 *                       Readonly.
> > +	 */
> > +	__u64	idle_server_tid_ptr;	/* r   */
> > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
> 
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
> 
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
> 
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
> 
> This just all isn't making any sense to me.

Oooh, someone made things super confusing by doing s/runnable/idle/ on
the whole thing :-( That only took me most of the day to figure out.
Naming is important, don't mess about with stuff like this.
Peter Zijlstra Nov. 24, 2021, 9:41 p.m. UTC | #5
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	while (true) {

(you have 2 inf. loops in umcg and you chose a different expression for each)

> +		u64 umcg_state;
> +
> +		/*
> +		 * We need to read from userspace _after_ the task is marked
> +		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
> +		 * but faulting is not allowed; so we try a fast no-fault read,
> +		 * and if it fails, pin the page temporarily.
> +		 */

That comment is misleading! Faulting *is* allowed, but it can scribble
__state. If faulting would not be allowed, you wouldn't be able to call
pin_user_pages_fast().

> +retry_once:
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		/* Order set_current_state above with get_user below. */
> +		smp_mb();

And just in case you hadn't yet seen, that smp_mb() is implied by
set_current_state().

> +		ret = -EFAULT;
> +		if (get_user_nofault(umcg_state, &self->state_ts)) {
> +			set_current_state(TASK_RUNNING);
> +
> +			if (pinned_page)
> +				goto out;
> +			else if (1 != pin_user_pages_fast((unsigned long)self,
> +						1, 0, &pinned_page))

That else is pointless, and that '1 != foo' coding style is evil.

> +					goto out;
> +
> +			goto retry_once;
> +		}

And, as you could've seen from the big patch, all that goto isn't
actually needed here, break / continue seem to be sufficient.

> +
> +		if (pinned_page) {
> +			unpin_user_page(pinned_page);
> +			pinned_page = NULL;
> +		}
Peter Zijlstra Nov. 24, 2021, 9:58 p.m. UTC | #6
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	if (abs_timeout) {
> +		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> +				HRTIMER_MODE_ABS);

Using CLOCK_REALTIME timers while the rest of the thing runs off of
CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
have timeouts subject to DST shifts and crap like that?
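
(Editorial note: computing an absolute nanosecond deadline against
CLOCK_MONOTONIC for the abs_timeout argument, as the review suggests,
could look like the sketch below; the helper name is hypothetical.)

#include <stdint.h>
#include <time.h>

/*
 * Absolute CLOCK_MONOTONIC deadline, in nanoseconds, @delta_ns from now;
 * usable as a u64 abs_timeout if the wait path runs off the same clock
 * as the rest of the API.
 */
static uint64_t umcg_abs_timeout_ns(uint64_t delta_ns)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec + delta_ns;
}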
Peter Zijlstra Nov. 24, 2021, 10:18 p.m. UTC | #7
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +die:
> +	pr_warn("%s: killing task %d\n", __func__, current->pid);
> +	force_sig(SIGKILL);

That pr_warn() might need to be pr_warn_ratelimited() in order to not be
a system log DoS.

Because, AFAICT, you can craft userspace to trigger this arbitrarily
often, just spawn a worker and make it misbehave.
Peter Oskolkov Nov. 25, 2021, 5:28 p.m. UTC | #8
Thanks, Peter, for the review!

Some of your comments, like ratelimiting pr_warn and removing gotos,
are obvious in how to address them, so I'll just do that and won't
mention them here. Some comments are less clear re: what should be
done about them, so I have them below with my own comments/questions.

At a higher level, I get that the uaccess patch is bad and needs
serious changes. But based on your comments on this main patch so far,
it looks like the overall approach did not raise many objections - is
that so? Have you finished reviewing the patch?

Please also look at my questions/comments below.

Thanks,
Peter

[...]
> > +struct umcg_task {
[...]

>
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
>
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
>
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
>
> This just all isn't making any sense to me.

Based on your later comments I assume it is clearer now. The doc patch
5 has a lot of extra explanations and examples. Please let me know if
something is still unclear here.

> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

I believe it was considered. I'll re-consider it, and add a comment if
the new consideration arrives at the same conclusion.

> Using CLOCK_REALTIME timers while the rest of the thing runs off of
> CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
> have timeouts subject to DST shifts and crap like that?

Yes, these should be the same if at all possible. I'll definitely
reconsider what clock to use in both timeouts and state timestamps.

> Oooh, someone made things super confusing by doing s/runnable/idle/ on
> the whole thing :-( That only took me most of the day to figure out.
> Naming is important, don't mess about with stuff like this.

I clearly remember I had four states: blocked, pending, runnable,
running (I still believe that four states better reflect what is going
on here). The current blocked/idle/running is the result of an early
discussion. Something along the lines of:

<start of a recollection>
pending workers (=unblocked workers that the userspace still thinks
are blocked) are better named as idle; also the kernel does not really
care about what userspace thinks, so idle workers and runnable workers
are the same from the kernel point of view, so let's have one state
for these workers, not two.
<end of the recollection>

Please let me know if you want me to change anything here. I'll gladly
name workers on the idle worker list as idle (or whatever you prefer),
and workers that the userspace took out of the list as "runnable".
Just as an FYI, workers blocked in umcg_wait() will also be called
"runnable" then, as they are sitting in umcg_idle_loop() and can be
woken or swapped into.
Peter Zijlstra Nov. 26, 2021, 5:09 p.m. UTC | #9
On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:

> it looks like the overall approach did not raise many objections - is
> it so? Have you finished reviewing the patch?

I've been trying to make sense of it, and while doing so deleted a bunch
of things and rewrote the rest.

Things that went *poof*:

 - wait_wake_only
 - server_tid_ptr (now: server_tid)
 - state_ts (now: state,blocked_ts,runnable_ts)

I've also changed next_tid to only be used as a context switch target,
never to find the server to enqueue the runnable tasks on.

All xchg() users seem to have disappeared.

Signals should now be handled, after which it'll go back to waiting on
RUNNING.

The code could fairly easily be changed to work on 32bit; big-endian is
the tricky bit; for now it is 64bit only.

Anyway, I only *think* the below code will work (it compiles with gcc-10
and gcc-11) but I've not yet come around to writing/updating the
userspace part, so it might explode on first contact -- I'll try that
next week if you don't beat me to it.

That said, the below code seems somewhat sensible to me (I would say,
having written it :), but I'm fairly sure I killed some capabilities the
other thing had (notably the first two items above).

If you want either of them restored, can you please give a use-case for
them? Because I cannot seem to think of any sane cases for either
wait_wake_only or server_tid_ptr.

Anyway, in large order it's very like what you did, but it's different
in pretty much all details.

Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
(still nop), sys_exit and notify_resume.

---
Subject: sched: User Mode Concurency Groups
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 26 17:24:27 CET 2021

XXX split and changelog

Originally-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -248,6 +248,7 @@ config X86
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UMCG			if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT			if SMP
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_UMCG		6	/* UMCG return to user hook */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_UMCG		(1 << TIF_UMCG)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -341,6 +341,24 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 #ifdef CONFIG_X86_32
@@ -411,6 +429,34 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new),				\
+		       [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 /* FIXME: this hack is definitely wrong -AK */
@@ -505,6 +551,21 @@ do {										\
 } while (0)
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
+extern void __try_cmpxchg_user_wrong_size(void);
+
+#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
 /*
  * We want the unsafe accessors to always be inlined and use
  * the error labels - thus the macro games.
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -22,6 +22,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_UMCG
+# define _TIF_UMCG			(0)
+#endif
+
 /*
  * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
  */
@@ -42,11 +46,13 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,15 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_UMCG
+	clockid_t		umcg_clock;
+	struct umcg_task __user	*umcg_task;
+	struct page		*umcg_worker_page;
+	struct task_struct	*umcg_server;
+	struct umcg_task __user *umcg_server_task;
+	struct page		*umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
@@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
 {
 }
 
+#endif
+
+#ifdef CONFIG_UMCG
+
+extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
+extern void umcg_sys_exit(struct pt_regs *regs);
+extern void umcg_notify_resume(struct pt_regs *regs);
+extern void umcg_worker_exit(void);
+extern void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_worker_exit();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
+extern void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+}
+
+static inline void umcg_sys_exit(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_notify_resume(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
 
 /*
  * Architecture-specific system calls
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_UMCG,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +56,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_UMCG	BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
 #endif
 
 #include <asm/thread_info.h>
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ *
+ *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
+ *   |                                           v |            |
+ * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
+ *   ^                                           | ^            |
+ *   `--------------(sys_umcg_wait)--------------' `------------'
+ *
+ */
+#define UMCG_TASK_NONE			0x0000U
+#define UMCG_TASK_RUNNING		0x0001U
+#define UMCG_TASK_RUNNABLE		0x0002U
+#define UMCG_TASK_BLOCKED		0x0003U
+
+#define UMCG_TASK_MASK			0x00ffU
+
+/*
+ * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
+ *
+ * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
+ * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
+ * it. That is, it will wake next_tid/server_tid, transfer to RUNNABLE and
+ * enqueue on the server's runnable list.
+ *
+ */
+#define UMCG_TF_PREEMPT			0x0100U
+
+#define UMCG_TF_MASK			0xff00U
+
+#define UMCG_TASK_ALIGN			64
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state: the current state of the UMCG task described by this
+	 *         struct; timestamps of the last BLOCKED/RUNNABLE transitions
+	 *         are kept separately in @blocked_ts and @runnable_ts.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 31: for userspace use;
+	 */
+	__u32	state;				/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero, in which case
+	 *            it'll switch to server_tid.
+	 *
+	 * @server_tid: the TID of the UMCG server that hosts this task,
+	 *		when RUNNABLE this task will get added to its
+	 *		runnable_workers_ptr list.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;			/* r   */
+	__u32	server_tid;			/* r   */
+
+	__u32	__hole[1];
+
+	/*
+	 * Timestamps of when we last became BLOCKED or RUNNABLE, in CLOCK_MONOTONIC.
+	 */
+	__u64	blocked_ts;			/*   w */
+	__u64   runnable_ts;			/*   w */
+
+	/**
+	 * @runnable_workers_ptr: a single-linked list of runnable workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	runnable_workers_ptr;		/* r/w */
+
+	__u64	__zero[3];
+
+} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,21 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config HAVE_UMCG
+	bool
+
+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on 64BIT
+	depends on GENERIC_ENTRY
+	depends on HAVE_UMCG
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 
 #include "common.h"
 
@@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
 	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, syscall);
 
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_enter(regs, syscall);
+
 	syscall_enter_audit(regs, syscall);
 
 	return ret ? : syscall;
@@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & _TIF_NEED_RESCHED)
@@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
+		/* must be before handle_signal_work(); terminates on sigpending */
+		if (ti_work & _TIF_UMCG)
+			umcg_notify_resume(regs);
+
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
@@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
 		tick_nohz_user_enter_prepare();
 
 		ti_work = READ_ONCE(current_thread_info()->flags);
-	}
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
@@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
@@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
 		arch_syscall_exit_tracehook(regs, step);
+
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_exit(regs);
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
 
 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();
 
 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
@@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
 
 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
-
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include <asm/syscall.h>
+
+#include "sched.h"
+
+static struct task_struct *umcg_get_task(u32 tid)
+{
+	struct task_struct *tsk = NULL;
+
+	if (tid) {
+		rcu_read_lock();
+		tsk = find_task_by_vpid(tid);
+		if (tsk && current->mm == tsk->mm && tsk->umcg_task)
+			get_task_struct(tsk);
+		else
+			tsk = NULL;
+		rcu_read_unlock();
+	}
+
+	return tsk;
+}
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ */
+static int umcg_pin_pages(void)
+{
+	struct task_struct *server = NULL, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	int server_tid;
+
+	if (tsk->umcg_worker_page ||
+	    tsk->umcg_server_page ||
+	    tsk->umcg_server_task ||
+	    tsk->umcg_server)
+		return -EBUSY;
+
+	if (get_user(server_tid, &self->server_tid))
+		return -EFAULT;
+
+	server = umcg_get_task(server_tid);
+	if (!server)
+		return -EINVAL;
+
+	if (pin_user_pages_fast((unsigned long)self, 1, 0,
+				&tsk->umcg_worker_page) != 1)
+		goto clear_self;
+
+	/* must cache due to possible concurrent change vs access_ok() */
+	tsk->umcg_server_task = server->umcg_task;
+	if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
+				&tsk->umcg_server_page) != 1)
+		goto clear_server;
+
+	tsk->umcg_server = server;
+
+	return 0;
+
+clear_server:
+	tsk->umcg_server_task = NULL;
+	tsk->umcg_server_page = NULL;
+
+	unpin_user_page(tsk->umcg_worker_page);
+clear_self:
+	tsk->umcg_worker_page = NULL;
+	put_task_struct(server);
+
+	return -EFAULT;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->umcg_server) {
+		unpin_user_page(tsk->umcg_worker_page);
+		tsk->umcg_worker_page = NULL;
+
+		unpin_user_page(tsk->umcg_server_page);
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		put_task_struct(tsk->umcg_server);
+		tsk->umcg_server = NULL;
+	}
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+		tsk->umcg_server = NULL;
+
+		/* These can be simple writes - see the comment above. */
+		tsk->umcg_worker_page = NULL;
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		tsk->flags &= ~PF_UMCG_WORKER;
+		clear_task_syscall_work(tsk, SYSCALL_UMCG);
+		clear_tsk_thread_flag(tsk, TIF_UMCG);
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called by both normally (unregister) and abnormally exiting workers. */
+void umcg_worker_exit(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/*
+ * Do a state transition, @from -> @to, and possibly read @next after that.
+ *
+ * Will clear UMCG_TF_PREEMPT.
+ *
+ * When @to == {BLOCKED,RUNNABLE}, update timestamps.
+ *
+ * Returns:
+ *   0: success
+ *   -EAGAIN: when self->state != @from
+ *   -EFAULT
+ */
+static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
+{
+	struct umcg_task *self = tsk->umcg_task;
+	u32 old, new;
+	u64 now;
+
+	if (to >= UMCG_TASK_RUNNABLE) {
+		switch (tsk->umcg_clock) {
+		case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
+		case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
+		case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
+		case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
+		}
+	}
+
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(old, &self->state, Efault);
+	do {
+		if ((old & UMCG_TASK_MASK) != from)
+			goto fail;
+
+		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new |= to & UMCG_TASK_MASK;
+
+	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
+
+	if (to == UMCG_TASK_BLOCKED)
+		unsafe_put_user(now, &self->blocked_ts, Efault);
+	if (to == UMCG_TASK_RUNNABLE)
+		unsafe_put_user(now, &self->runnable_ts, Efault);
+
+	if (next)
+		unsafe_get_user(*next, &self->next_tid, Efault);
+
+	user_access_end();
+	return 0;
+
+fail:
+	user_access_end();
+	return -EAGAIN;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/* Called from syscall enter path */
+void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+	/* avoid recursion vs our own syscalls */
+	if (syscall == __NR_umcg_wait ||
+	    syscall == __NR_umcg_ctl)
+		return;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+static int umcg_wake_task(struct task_struct *tsk)
+{
+	int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+	if (ret)
+		return ret;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+	return 0;
+}
+
+/*
+ * Wake @next_tid or server.
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
+{
+	struct task_struct *next = NULL;
+	int ret;
+
+	next = umcg_get_task(next_tid);
+	/*
+	 * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
+	 * and 'spuriously' not wake @next_tid but instead try and wake the
+	 * server.
+	 *
+	 * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
+	 *
+	 * umcg_wake_task() can also fail because @next is not in the right
+	 * state, in which case we also try to wake the server.
+	 *
+	 * If we cannot wake the server due to state issues, too bad.
+	 */
+	if (!next || umcg_wake_task(next)) {
+		ret = umcg_wake_task(tsk->umcg_server);
+		if (ret == -EFAULT)
+			goto out;
+	}
+	ret = 0;
+out:
+	if (next)
+		put_task_struct(next);
+
+	return ret;
+}
+
+/* pre-schedule() */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	int next_tid;
+
+	/* Must not fault, mmap_sem might be held. */
+	pagefault_disable();
+
+	if (WARN_ON_ONCE(!tsk->umcg_server))
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
+		goto die;
+
+	if (umcg_wake_next(tsk, next_tid))
+		goto die;
+
+	pagefault_enable();
+
+	/*
+	 * We're going to sleep; make sure to unpin the pages so that the
+	 * pins stay temporary.
+	 */
+	umcg_unpin_pages();
+
+	return;
+
+die:
+	pagefault_enable();
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* post-schedule() */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	/* nothing here, see umcg_sys_exit() */
+}
+
+/*
+ * Enqueue @tsk on its server's runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg-based singly linked list add such that list integrity is never
+ * violated.  Userspace *MUST* remove a worker from the list before changing
+ * its ->state.  As such, we must change state to RUNNABLE before enqueue;
+ * see the illustrative userspace fragment after this function.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_runnable(struct task_struct *tsk)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
+	do {
+		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
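+
+/*
+ * Illustrative only, not part of this patch: a userspace server could
+ * consume its runnable_workers_ptr list by atomically detaching the whole
+ * chain, e.g.:
+ *
+ *	u64 head = __atomic_exchange_n(&server->runnable_workers_ptr, 0,
+ *				       __ATOMIC_ACQ_REL);
+ *	for (struct umcg_task *w = (void *)head; w;
+ *	     w = (void *)w->runnable_workers_ptr)
+ *		userspace_enqueue_runnable(w);	// placeholder for the u-sched
+ *
+ * Detaching the chain first satisfies the "remove before changing ->state"
+ * rule above; the kernel's cmpxchg loop simply retries against the new head.
+ */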
+
+/*
+ * umcg_wait: Wait for ->state to become RUNNING
+ *
+ * Returns:
+ *   0: success
+ *   -EINTR: pending signal
+ *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
+ *   -ETIMEDOUT
+ *   -EFAULT
+ */
+int umcg_wait(u64 timo)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	struct hrtimer_sleeper timeout;
+	struct page *page = NULL;
+	u32 state;
+	int ret;
+
+	if (timo) {
+		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
+					      HRTIMER_MODE_ABS);
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
+					     tsk->timer_slack_ns);
+	}
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		/*
+		 * Faults can block and scribble our wait state.
+		 */
+		pagefault_disable();
+		if (get_user(state, &self->state)) {
+			pagefault_enable();
+
+			ret = -EFAULT;
+			if (page) {
+				unpin_user_page(page);
+				page = NULL;
+				break;
+			}
+
+			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
+				page = NULL;
+				break;
+			}
+
+			continue;
+		}
+
+		if (page) {
+			unpin_user_page(page);
+			page = NULL;
+		}
+		pagefault_enable();
+
+		state &= UMCG_TASK_MASK;
+		if (state != UMCG_TASK_RUNNABLE) {
+			ret = 0;
+			if (state == UMCG_TASK_RUNNING)
+				break;
+
+			ret = -EINVAL;
+			break;
+		}
+
+		if (timo)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		freezable_schedule();
+
+		ret = -ETIMEDOUT;
+		if (timo && !timeout.task)
+			break;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	if (timo) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+
+	return ret;
+}
+
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	long syscall = syscall_get_nr(tsk, regs);
+
+	if (syscall == __NR_umcg_wait)
+		return;
+
+	/*
+	 * sys_umcg_ctl() will get here without having called umcg_sys_enter();
+	 * as such it will look like a syscall that blocked.
+	 */
+
+	if (tsk->umcg_server) {
+		/*
+		 * Didn't block, we're done.
+		 */
+		umcg_unpin_pages();
+		return;
+	}
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
+		goto die_unpin;
+
+	if (umcg_enqueue_runnable(tsk))
+		goto die_unpin;
+
+	/* server might not be runnable, too bad */
+	if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
+		goto die_unpin;
+
+	umcg_unpin_pages();
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* notify_resume will continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+	current->flags |= PF_UMCG_WORKER;
+
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+void umcg_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u32 state, next_tid;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (get_user(state, &self->state))
+		goto die;
+
+	state &= UMCG_TASK_MASK | UMCG_TF_MASK;
+	if (state == UMCG_TASK_RUNNING)
+		goto done;
+
+	if (state & UMCG_TF_PREEMPT) {
+		umcg_pin_pages();
+
+		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
+				      UMCG_TASK_RUNNABLE, &next_tid))
+			goto die_unpin;
+
+		if (umcg_enqueue_runnable(tsk))
+			goto die_unpin;
+
+		if (umcg_wake_next(tsk, next_tid))
+			goto die_unpin;
+
+		umcg_unpin_pages();
+	}
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* we'll continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+done:
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        reserved, must be zero.
+ * @timo:         absolute timeout in nanoseconds, in the clock selected at
+ *                registration time; zero for no timeout.
+ *
+ * Returns:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task, the server or next.
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. the current task is not a
+ *                 UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
+{
+	struct task_struct *next, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	u32 next_tid;
+	int ret;
+
+	if (!self || flags)
+		return -EINVAL;
+
+	if (worker)
+		tsk->flags &= ~PF_UMCG_WORKER;
+
+	/* see umcg_sys_{enter,exit}() */
+	umcg_pin_pages();
+
+	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
+	if (ret)
+		goto unpin;
+
+	next = umcg_get_task(next_tid);
+	if (!next) {
+		ret = -ESRCH;
+		goto unblock;
+	}
+
+	if (worker) {
+		ret = umcg_enqueue_runnable(tsk);
+		if (ret)
+			goto put_task;
+	}
+
+	ret = umcg_wake_task(next);
+	if (ret)
+		goto put_task;
+
+	put_task_struct(next);
+	umcg_unpin_pages();
+
+	ret = umcg_wait(timo);
+	switch (ret) {
+	case -EINTR:	/* umcg_notify_resume() will continue the wait */
+	case 0:		/* all done */
+		ret = 0;
+		break;
+
+	default:
+		/*
+		 * If this fails you get to keep the pieces; you'll get stuck
+		 * in umcg_notify_resume().
+		 */
+		umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+		break;
+	}
+out:
+	if (worker)
+		tsk->flags |= PF_UMCG_WORKER;
+	return ret;
+
+put_task:
+	put_task_struct(next);
+unblock:
+	umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+unpin:
+	umcg_unpin_pages();
+	goto out;
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to its
+ *         server's runnable_workers_ptr list and put to sleep; the server
+ *         (identified by self->server_tid) will be woken, if it is RUNNABLE.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
+{
+	struct umcg_task ut;
+
+	if ((unsigned long)self % UMCG_TASK_ALIGN)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_worker_exit();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	switch (which_clock) {
+	case CLOCK_REALTIME:
+	case CLOCK_MONOTONIC:
+	case CLOCK_BOOTTIME:
+	case CLOCK_TAI:
+		current->umcg_clock = which_clock;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;	/* hook schedule() */
+		set_syscall_work(SYSCALL_UMCG);		/* hook syscall */
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_sys_exit() will transition to RUNNABLE and wait */
+
+	} else {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_notify_resume() would block if not RUNNING */
+	}
+
+	return 0;
+}
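+
+/*
+ * Illustrative only, not part of the ABI: a minimal userspace registration
+ * sequence could look like this (error handling omitted):
+ *
+ *	struct umcg_task *self = aligned_alloc(UMCG_TASK_ALIGN, sizeof(*self));
+ *	memset(self, 0, sizeof(*self));
+ *
+ *	// server thread:
+ *	self->state = UMCG_TASK_RUNNING;
+ *	syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, self, CLOCK_MONOTONIC);
+ *
+ *	// worker thread (with its own struct umcg_task):
+ *	self->state = UMCG_TASK_BLOCKED;
+ *	self->server_tid = server_tid;		// an already registered server
+ *	syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER, self,
+ *		CLOCK_MONOTONIC);
+ *
+ * The worker call only returns to userspace once a server switches it to
+ * RUNNING; see umcg_sys_exit() above.
+ */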
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */
Thomas Gleixner Nov. 26, 2021, 9:08 p.m. UTC | #10
On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> +
> +	if (timo) {
> +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +					      HRTIMER_MODE_ABS);
> +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +					     tsk->timer_slack_ns);
> +	}
> +
> +	for (;;) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		ret = -EINTR;
> +		if (signal_pending(current))
> +			break;
> +
> +		/*
> +		 * Faults can block and scribble our wait state.
> +		 */
> +		pagefault_disable();
> +		if (get_user(state, &self->state)) {
> +			pagefault_enable();
> +
> +			ret = -EFAULT;
> +			if (page) {
> +				unpin_user_page(page);
> +				page = NULL;
> +				break;
> +			}
> +
> +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +				page = NULL;
> +				break;
> +			}
> +
> +			continue;
> +		}
> +
> +		if (page) {
> +			unpin_user_page(page);
> +			page = NULL;
> +		}
> +		pagefault_enable();
> +
> +		state &= UMCG_TASK_MASK;
> +		if (state != UMCG_TASK_RUNNABLE) {
> +			ret = 0;
> +			if (state == UMCG_TASK_RUNNING)
> +				break;
> +
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		if (timo)
> +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +		freezable_schedule();

You can replace the whole hrtimer foo with

                if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
                                                    tsk->timer_slack_ns,
                                                    HRTIMER_MODE_ABS,
                                                    tsk->umcg_clock)) {
                	ret = -ETIMEDOUT;
                        break;
                }

Thanks,

        tglx
Thomas Gleixner Nov. 26, 2021, 9:11 p.m. UTC | #11
On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>
>> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>
>> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> +				bool may_fault)
>> +{
>> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>
> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
solve the HPET crap either.

Thanks,

        tglx
Peter Zijlstra Nov. 26, 2021, 9:52 p.m. UTC | #12
On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> >
> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> >
> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> >> +				bool may_fault)
> >> +{
> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> >
> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> > makes sense to use a time base that's accessible to userspace. Was
> > MONOTONIC_RAW considered?
> 
> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
> solve the HPET crap either.

But its ns are of equal size to sched_clock(), if both share TSC IIRC.
Whereas MONOTONIC, being subject to ntp rate stuff, has differently
sized ns.

The only time that's relevant though is when you're going to mix these
timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
interesting.

But yeah, not being able to sleep on it ruins the party.
Peter Zijlstra Nov. 26, 2021, 9:59 p.m. UTC | #13
On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> > +
> > +	if (timo) {
> > +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> > +					      HRTIMER_MODE_ABS);
> > +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> > +					     tsk->timer_slack_ns);
> > +	}
> > +
> > +	for (;;) {
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +
> > +		ret = -EINTR;
> > +		if (signal_pending(current))
> > +			break;
> > +
> > +		/*
> > +		 * Faults can block and scribble our wait state.
> > +		 */
> > +		pagefault_disable();
> > +		if (get_user(state, &self->state)) {
> > +			pagefault_enable();
> > +
> > +			ret = -EFAULT;
> > +			if (page) {
> > +				unpin_user_page(page);
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		if (page) {
> > +			unpin_user_page(page);
> > +			page = NULL;
> > +		}
> > +		pagefault_enable();
> > +
> > +		state &= UMCG_TASK_MASK;
> > +		if (state != UMCG_TASK_RUNNABLE) {
> > +			ret = 0;
> > +			if (state == UMCG_TASK_RUNNING)
> > +				break;
> > +
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +
> > +		if (timo)
> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> > +
> > +		freezable_schedule();
> 
> You can replace the whole hrtimer foo with
> 
>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>                                                     tsk->timer_slack_ns,
>                                                     HRTIMER_MODE_ABS,
>                                                     tsk->umcg_clock)) {
>                 	ret = -ETIMEOUT;
>                         break;
>                 }

That seems to lose the freezable crud.. then again, since we're
interruptible, that shouldn't matter. Lemme go do that.
Peter Zijlstra Nov. 26, 2021, 10:07 p.m. UTC | #14
On Fri, Nov 26, 2021 at 10:59:44PM +0100, Peter Zijlstra wrote:

> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.


---

--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -52,7 +52,7 @@ static int umcg_pin_pages(void)
 
 	server = umcg_get_task(server_tid);
 	if (!server)
-		return -EINVAL;
+		return -ESRCH;
 
 	if (pin_user_pages_fast((unsigned long)self, 1, 0,
 				&tsk->umcg_worker_page) != 1)
@@ -358,18 +358,10 @@ int umcg_wait(u64 timo)
 {
 	struct task_struct *tsk = current;
 	struct umcg_task __user *self = tsk->umcg_task;
-	struct hrtimer_sleeper timeout;
 	struct page *page = NULL;
 	u32 state;
 	int ret;
 
-	if (timo) {
-		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
-					      HRTIMER_MODE_ABS);
-		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
-					     tsk->timer_slack_ns);
-	}
-
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -415,22 +407,16 @@ int umcg_wait(u64 timo)
 			break;
 		}
 
-		if (timo)
-			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
-
-		freezable_schedule();
-
-		ret = -ETIMEDOUT;
-		if (timo && !timeout.task)
+		if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
+						    tsk->timer_slack_ns,
+						    HRTIMER_MODE_ABS,
+						    tsk->umcg_clock)) {
+			ret = -ETIMEDOUT;
 			break;
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
-	if (timo) {
-		hrtimer_cancel(&timeout.timer);
-		destroy_hrtimer_on_stack(&timeout.timer);
-	}
-
 	return ret;
 }
 
@@ -515,7 +501,8 @@ void umcg_notify_resume(struct pt_regs *
 		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
-		umcg_pin_pages();
+		if (umcg_pin_pages())
+			goto die;
 
 		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
 				      UMCG_TASK_RUNNABLE, &next_tid))
@@ -586,7 +573,9 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 		tsk->flags &= ~PF_UMCG_WORKER;
 
 	/* see umcg_sys_{enter,exit}() */
-	umcg_pin_pages();
+	ret = umcg_pin_pages();
+	if (ret)
+		return ret;
 
 	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
 	if (ret)
Peter Zijlstra Nov. 26, 2021, 10:16 p.m. UTC | #15
On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:

> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>  	 * Before returning to user space ensure that all pending work
>  	 * items have been completed.
>  	 */
> -	while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +	do {
>  		local_irq_enable_exit_to_user(ti_work);
>  
>  		if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>  		if (ti_work & _TIF_PATCH_PENDING)
>  			klp_update_patch_state(current);
>  
> +		/* must be before handle_signal_work(); terminates on sigpending */
> +		if (ti_work & _TIF_UMCG)
> +			umcg_notify_resume(regs);
> +
>  		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>  			handle_signal_work(regs, ti_work);
>  
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>  		tick_nohz_user_enter_prepare();
>  
>  		ti_work = READ_ONCE(current_thread_info()->flags);
> -	}
> +	} while (ti_work & EXIT_TO_USER_MODE_WORK);
>  
>  	/* Return the latest work state for arch_exit_to_user_mode() */
>  	return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>  	/* Flush pending rcuog wakeup before the last need_resched() check */
>  	tick_nohz_user_enter_prepare();
>  
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>  
>  	arch_exit_to_user_mode_prepare(regs, ti_work);

Thomas, since you're looking at this: I'm not quite sure I got this
right. The intent is that when _TIF_UMCG is set (and it is never cleared
until the task unregisters) umcg_notify_resume() is called at least once.

The thinking is that if umcg_wait() gets interrupted, we'll drop out,
handle the signal and then resume the wait, which can obviously happen
any number of times.

It's just that I'm never quite sure where signal crud happens; I'm
assuming handle_signal_work() simply mucks about with regs (sets sp and
ip etc.. to the signal stack) and drops out of kernel mode, and on
re-entry we do this whole merry cycle once again. But I never actually
dug that deep.
Thomas Gleixner Nov. 27, 2021, 12:45 a.m. UTC | #16
On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
>> > +		if (timo)
>> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
>> > +
>> > +		freezable_schedule();
>> 
>> You can replace the whole hrtimer foo with
>> 
>>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>>                                                     tsk->timer_slack_ns,
>>                                                     HRTIMER_MODE_ABS,
>>                                                     tsk->umcg_clock)) {
>>                 	ret = -ETIMEDOUT;
>>                         break;
>>                 }
>
> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.

We could add a freezable wrapper for that if necessary.

Thanks,

        tglx
Thomas Gleixner Nov. 27, 2021, 1:16 a.m. UTC | #17
On Fri, Nov 26 2021 at 23:16, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:
>>  
>> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>>  
>>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>
> Thomas, since you're looking at this. I'm not quite sure I got this
> right. The intent is that when _TIF_UMCG is set (and it is never cleared
> until the task unregisters) it is called at least once.

Right.

> The thinking is that if umcg_wait() gets interrupted, we'll drop out,
> handle the signal and then resume the wait, which can obviously happen
> any number of times.

Right.

> It's just that I'm never quite sure where signal crud happens; I'm
> assuming handle_signal_work() simply mucks about with regs (sets sp and
> ip etc.. to the signal stack) and drops out of kernel mode, and on
> re-entry we do this whole merry cycle once again. But I never actually
> dug that deep.

Yes. It sets up the signal frame and once the loop is left because there
are no more TIF flags to handle it drops back to user space into the
signal handler. That returns to the kernel via sys_[rt_]sigreturn()
which undoes the regs damage either by restoring the previous state or
fiddling it to restart the syscall instead of dropping back to user
space.

So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
have no real good idea how to avoid that yet, but let me think about it
some more.

Thanks,

        tglx
Peter Oskolkov Nov. 29, 2021, 12:29 a.m. UTC | #18
On Fri, Nov 26, 2021 at 9:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:
>
> > it looks like the overall approach did not raise many objections - is
> > it so? Have you finished reviewing the patch?
>
> I've been trying to make sense of it, and while doing so deleted a bunch
> of things and rewrote the rest.

Thanks a lot, Peter! If we can get this in, and work the kinks out
later, that would be great!

>
> Things that went *poof*:
>
>  - wait_wake_only
>  - server_tid_ptr (now: server_tid)
>  - state_ts (now: state,blocked_ts,runnable_ts)
>
> I've also changed next_tid to only be used as a context switch target,
> never to find the server to enqueue the runnable tasks on.
>
> All xchg() users seem to have disappeared.
>
> Signals should now be handled, after which it'll go back to waiting on
> RUNNING.
>
> The code could fairly easily be changed to work on 32bit, big-endian is
> the tricky bit, for now 64bit only.
>
> Anyway, I only *think* the below code will work (it compiles with gcc-10
> and gcc-11) but I've not yet come around to writing/updating the
> userspace part, so it might explode on first contact -- I'll try that
> next week if you don't beat me to it.

It'll take me some time to fully test this (got some other stuff to
look at at the moment); some notes are below. I'd prefer you to merge
whatever you believe is working, and to later adjust things that need
adjusting, rather than keep the endless stream of patchsets that go
nowhere.

>
> That said, the below code seems somewhat sensible to me (I would say,
> having written it :), but I'm fairly sure I killed some capabilities the
> other thing had (notably the first two items above).
>
> If you want either of them restored, can you please give a use-case for
> them? Because I cannot seem to think of any sane cases for either
> wait_wake_only or server_tid_ptr.

wait_wake_only is not needed if you have both next_tid and server_tid,
as your patch does. In my version of the patch, next_tid is the same as
server_tid, so the flag is needed to indicate to the kernel that
next_tid is the wakee, not the server.
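
For illustration, my reading of your patch is that a worker-to-worker
context switch then becomes just the following (untested sketch, assuming
updated uapi headers; umcg_ctx_switch() is a made-up helper name):

	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>

	/* Switch from the current RUNNING worker to the worker @target_tid. */
	static int umcg_ctx_switch(struct umcg_task *self, uint32_t target_tid)
	{
		/* The kernel reads next_tid when we go RUNNING -> RUNNABLE. */
		__atomic_store_n(&self->next_tid, target_tid, __ATOMIC_RELEASE);
		return syscall(__NR_umcg_wait, 0, 0ULL);	/* no timeout */
	}

i.e. the target is woken RUNNABLE -> RUNNING, and the caller is enqueued
on its server's runnable_workers_ptr list until a server runs it again.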

re: (idle_)server_tid_ptr: it seems that you assume that blocked
workers keep their servers, while in my patch they "lose them" once
they block, and so there should be a global idle server pointer to
wake the server in my scheme (if there is an idle one). The main
difference is that in my approach a server has only a single running
worker assigned to it, while in your approach it can have a number of
blocked/idle workers to take care of as well.

The main difference between our approaches, as I see it: in my
approach if a worker is running, its server is sleeping, period. If we
have N servers, and N running workers, there are no servers to wake
when a previously blocked worker finishes its blocking op. In your
approach, it seems that each of the N servers has a bunch of workers
pointing at it, with a single one of them running. If a previously blocked
worker wakes up, it wakes the server it was assigned to previously,
and so now we have more than N physical tasks/threads running: N
workers and the woken server. This is not ideal: if the process is
affined to only N CPUs, that means a worker will be preempted to let
the woken server run, which is somewhat against the goal of letting
the workers run more or less uninterrupted. This is not deal breaking,
but maybe something to keep in mind.

Another big concern I have is that you removed UMCG_TF_LOCKED. I
definitely needed it to guard workers during "sched work" in the
userspace in my approach. I'm not sure if the flag is absolutely
needed with your approach, but most likely it is - the kernel-side
scheduler does lock tasks and runqueues and disables interrupts and
migrations and other things so that the scheduling logic is not
hijacked by concurrent stuff. Why do you assume that the userspace
scheduling code does not need similar protections?

In summary, again, I'm fine with your patch/approach getting in,
provided things like UMCG_TF_LOCKED are considered later.




>
> Anyway, in large order it's very like what you did, but it's different
> in pretty much all details.
>
> Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
> (still nop), sys_exit and notify_resume.
>
> ---
> Subject: sched: User Mode Concurency Groups
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 26 17:24:27 CET 2021
>
> XXX split and changelog
>
> Originally-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -248,6 +248,7 @@ config X86
>         select HAVE_RSEQ
>         select HAVE_SYSCALL_TRACEPOINTS
>         select HAVE_UNSTABLE_SCHED_CLOCK
> +       select HAVE_UMCG                        if X86_64
>         select HAVE_USER_RETURN_NOTIFIER
>         select HAVE_GENERIC_VDSO
>         select HOTPLUG_SMT                      if SMP
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -371,6 +371,8 @@
>  447    common  memfd_secret            sys_memfd_secret
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
> +450    common  umcg_ctl                sys_umcg_ctl
> +451    common  umcg_wait               sys_umcg_wait
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -83,6 +83,7 @@ struct thread_info {
>  #define TIF_NEED_RESCHED       3       /* rescheduling necessary */
>  #define TIF_SINGLESTEP         4       /* reenable singlestep on user return*/
>  #define TIF_SSBD               5       /* Speculative store bypass disable */
> +#define TIF_UMCG               6       /* UMCG return to user hook */
>  #define TIF_SPEC_IB            9       /* Indirect branch speculation mitigation */
>  #define TIF_SPEC_L1D_FLUSH     10      /* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */
> @@ -107,6 +108,7 @@ struct thread_info {
>  #define _TIF_NEED_RESCHED      (1 << TIF_NEED_RESCHED)
>  #define _TIF_SINGLESTEP                (1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD              (1 << TIF_SSBD)
> +#define _TIF_UMCG              (1 << TIF_UMCG)
>  #define _TIF_SPEC_IB           (1 << TIF_SPEC_IB)
>  #define _TIF_SPEC_L1D_FLUSH    (1 << TIF_SPEC_L1D_FLUSH)
>  #define _TIF_USER_RETURN_NOTIFY        (1 << TIF_USER_RETURN_NOTIFY)
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -341,6 +341,24 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr))                           \
>                      : : label)
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm_volatile_goto("\n"                                          \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    _ASM_EXTABLE_UA(1b, %l[label])                     \
> +                    : CC_OUT(z) (success),                             \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new)                                \
> +                    : "memory", "cc"                                   \
> +                    : label);                                          \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  #ifdef CONFIG_X86_32
> @@ -411,6 +429,34 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr)),                          \
>                        [efault] "i" (-EFAULT), "0" (err))
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       int __err = 0;                                                  \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm volatile("\n"                                               \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    CC_SET(z)                                          \
> +                    "2:\n"                                             \
> +                    ".pushsection .fixup,\"ax\"\n"                     \
> +                    "3:        mov %[efault], %[errout]\n"             \
> +                    "          jmp 2b\n"                               \
> +                    ".popsection\n"                                    \
> +                    _ASM_EXTABLE_UA(1b, 3b)                            \
> +                    : CC_OUT(z) (success),                             \
> +                      [errout] "+r" (__err),                           \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new),                               \
> +                      [efault] "i" (-EFAULT)                           \
> +                    : "memory", "cc");                                 \
> +       if (unlikely(__err))                                            \
> +               goto label;                                             \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  /* FIXME: this hack is definitely wrong -AK */
> @@ -505,6 +551,21 @@ do {                                                                               \
>  } while (0)
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
> +extern void __try_cmpxchg_user_wrong_size(void);
> +
> +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({         \
> +       __typeof__(*(_ptr)) __ret;                                      \
> +       switch (sizeof(__ret)) {                                        \
> +       case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       default: __try_cmpxchg_user_wrong_size();                       \
> +       }                                                               \
> +       __ret;                                          })
> +
>  /*
>   * We want the unsafe accessors to always be inlined and use
>   * the error labels - thus the macro games.
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
>         current->fs->in_exec = 0;
>         current->in_execve = 0;
>         rseq_execve(current);
> +       umcg_execve(current);
>         acct_update_integrals(current);
>         task_numa_free(current, false);
>         return retval;
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -22,6 +22,10 @@
>  # define _TIF_UPROBE                   (0)
>  #endif
>
> +#ifndef _TIF_UMCG
> +# define _TIF_UMCG                     (0)
> +#endif
> +
>  /*
>   * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
>   */
> @@ -42,11 +46,13 @@
>                                  SYSCALL_WORK_SYSCALL_EMU |             \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  ARCH_SYSCALL_WORK_ENTER)
>  #define SYSCALL_WORK_EXIT      (SYSCALL_WORK_SYSCALL_TRACEPOINT |      \
>                                  SYSCALL_WORK_SYSCALL_TRACE |           \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  SYSCALL_WORK_SYSCALL_EXIT_TRAP |       \
>                                  ARCH_SYSCALL_WORK_EXIT)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -67,6 +67,7 @@ struct sighand_struct;
>  struct signal_struct;
>  struct task_delay_info;
>  struct task_group;
> +struct umcg_task;
>
>  /*
>   * Task state bitmask. NOTE! These bits are also
> @@ -1294,6 +1295,15 @@ struct task_struct {
>         unsigned long rseq_event_mask;
>  #endif
>
> +#ifdef CONFIG_UMCG
> +       clockid_t               umcg_clock;
> +       struct umcg_task __user *umcg_task;
> +       struct page             *umcg_worker_page;
> +       struct task_struct      *umcg_server;
> +       struct umcg_task __user *umcg_server_task;
> +       struct page             *umcg_server_page;
> +#endif
> +
>         struct tlbflush_unmap_batch     tlb_ubc;
>
>         union {
> @@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
>  #define PF_KTHREAD             0x00200000      /* I am a kernel thread */
>  #define PF_RANDOMIZE           0x00400000      /* Randomize virtual address space */
>  #define PF_SWAPWRITE           0x00800000      /* Allowed to write to swap */
> +
> +#ifdef CONFIG_UMCG
> +#define PF_UMCG_WORKER         0x01000000      /* UMCG worker */
> +#else
> +#define PF_UMCG_WORKER         0x00000000
> +#endif
> +
>  #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
>  #define PF_MCE_EARLY           0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMALLOC_PIN                0x10000000      /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
>  {
>  }
>
> +#endif
> +
> +#ifdef CONFIG_UMCG
> +
> +extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
> +extern void umcg_sys_exit(struct pt_regs *regs);
> +extern void umcg_notify_resume(struct pt_regs *regs);
> +extern void umcg_worker_exit(void);
> +extern void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +       if (tsk->umcg_task)
> +               umcg_clear_child(tsk);
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> +       if (current->flags & PF_UMCG_WORKER)
> +               umcg_worker_exit();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +extern void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else  /* CONFIG_UMCG */
> +
> +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +}
> +
> +static inline void umcg_sys_exit(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_notify_resume(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_clear_child(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +}
> +
>  #endif
>
>  #ifdef CONFIG_DEBUG_RSEQ
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -72,6 +72,7 @@ struct open_how;
>  struct mount_attr;
>  struct landlock_ruleset_attr;
>  enum landlock_rule_type;
> +struct umcg_task;
>
>  #include <linux/types.h>
>  #include <linux/aio_abi.h>
> @@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
>                 const void __user *rule_attr, __u32 flags);
>  asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
>  asmlinkage long sys_memfd_secret(unsigned int flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
>
>  /*
>   * Architecture-specific system calls
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -46,6 +46,7 @@ enum syscall_work_bit {
>         SYSCALL_WORK_BIT_SYSCALL_AUDIT,
>         SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
>         SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
> +       SYSCALL_WORK_BIT_SYSCALL_UMCG,
>  };
>
>  #define SYSCALL_WORK_SECCOMP           BIT(SYSCALL_WORK_BIT_SECCOMP)
> @@ -55,6 +56,7 @@ enum syscall_work_bit {
>  #define SYSCALL_WORK_SYSCALL_AUDIT     BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
>  #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
>  #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
> +#define SYSCALL_WORK_SYSCALL_UMCG      BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
>  #endif
>
>  #include <asm/thread_info.h>
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
>  #define __NR_futex_waitv 449
>  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>
> +#define __NR_umcg_ctl 450
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 451
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
>  #undef __NR_syscalls
> -#define __NR_syscalls 450
> +
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + *      sys_umcg_ctl()  - register/unregister UMCG tasks;
> + *      sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + */
> +
> +/*
> + * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
> + * The states represent the user space point of view.
> + *
> + *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
> + *   |                                           v |            |
> + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
> + *   ^                                           | ^            |
> + *   `--------------(sys_umcg_wait)--------------' `------------'
> + *
> + */
> +#define UMCG_TASK_NONE                 0x0000U
> +#define UMCG_TASK_RUNNING              0x0001U
> +#define UMCG_TASK_RUNNABLE             0x0002U
> +#define UMCG_TASK_BLOCKED              0x0003U
> +
> +#define UMCG_TASK_MASK                 0x00ffU
> +
> +/*
> + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
> + *
> + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
> + * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
> + * it. That is, it will wake next_tid/server_tid, transfer to RUNNABLE and
> + * enqueue on the server's runnable list.
> + *
> + */
> +#define UMCG_TF_PREEMPT                        0x0100U
> +
> +#define UMCG_TF_MASK                   0xff00U
> +
> +#define UMCG_TASK_ALIGN                        64
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +       /**
> +        * @state_ts: the current state of the UMCG task described by
> +        *            this struct, with a unique timestamp indicating
> +        *            when the last state change happened.
> +        *
> +        * Readable/writable by both the kernel and the userspace.
> +        *
> +        * UMCG task state:
> +        *   bits  0 -  7: task state;
> +        *   bits  8 - 15: state flags;
> +        *   bits 16 - 31: for userspace use;
> +        */
> +       __u32   state;                          /* r/w */
> +
> +       /**
> +        * @next_tid: the TID of the UMCG task that should be context-switched
> +        *            into in sys_umcg_wait(). Can be zero, in which case
> +        *            it'll switch to server_tid.
> +        *
> +        * @server_tid: the TID of the UMCG server that hosts this task,
> +        *              when RUNNABLE this task will get added to it's
> +        *              runnable_workers_ptr list.
> +        *
> +        * Read-only for the kernel, read/write for the userspace.
> +        */
> +       __u32   next_tid;                       /* r   */
> +       __u32   server_tid;                     /* r   */
> +
> +       __u32   __hole[1];
> +
> +       /*
> +        * Timestamps for when last we became BLOCKED, RUNNABLE, in CLOCK_MONOTONIC.
> +        */
> +       __u64   blocked_ts;                     /*   w */
> +       __u64   runnable_ts;                    /*   w */
> +
> +       /**
> +        * @runnable_workers_ptr: a single-linked list of runnable workers.
> +        *
> +        * Readable/writable by both the kernel and the userspace: the
> +        * kernel adds items to the list, userspace removes them.
> +        */
> +       __u64   runnable_workers_ptr;           /* r/w */
> +
> +       __u64   __zero[3];
> +
> +} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> +       UMCG_CTL_REGISTER       = 0x00001,
> +       UMCG_CTL_UNREGISTER     = 0x00002,
> +       UMCG_CTL_WORKER         = 0x10000,
> +};
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1693,6 +1693,21 @@ config MEMBARRIER
>
>           If unsure, say Y.
>
> +config HAVE_UMCG
> +       bool
> +
> +config UMCG
> +       bool "Enable User Managed Concurrency Groups API"
> +       depends on 64BIT
> +       depends on GENERIC_ENTRY
> +       depends on HAVE_UMCG
> +       default n
> +       help
> +         Enable User Managed Concurrency Groups API, which form the basis
> +         for an in-process M:N userspace scheduling framework.
> +         At the moment this is an experimental/RFC feature that is not
> +         guaranteed to be backward-compatible.
> +
>  config KALLSYMS
>         bool "Load all symbols for debugging/ksymoops" if EXPERT
>         default y
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/sched.h>
>
>  #include "common.h"
>
> @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
>         if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
>                 trace_sys_enter(regs, syscall);
>
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_enter(regs, syscall);
> +
>         syscall_enter_audit(regs, syscall);
>
>         return ret ? : syscall;
> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>          * Before returning to user space ensure that all pending work
>          * items have been completed.
>          */
> -       while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +       do {
>                 local_irq_enable_exit_to_user(ti_work);
>
>                 if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>                 if (ti_work & _TIF_PATCH_PENDING)
>                         klp_update_patch_state(current);
>
> +               /* must be before handle_signal_work(); terminates on sigpending */
> +               if (ti_work & _TIF_UMCG)
> +                       umcg_notify_resume(regs);
> +
>                 if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>                         handle_signal_work(regs, ti_work);
>
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>                 tick_nohz_user_enter_prepare();
>
>                 ti_work = READ_ONCE(current_thread_info()->flags);
> -       }
> +       } while (ti_work & EXIT_TO_USER_MODE_WORK);
>
>         /* Return the latest work state for arch_exit_to_user_mode() */
>         return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>         /* Flush pending rcuog wakeup before the last need_resched() check */
>         tick_nohz_user_enter_prepare();
>
> -       if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +       if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>                 ti_work = exit_to_user_mode_loop(regs, ti_work);
>
>         arch_exit_to_user_mode_prepare(regs, ti_work);
> @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
>         step = report_single_step(work);
>         if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
>                 arch_syscall_exit_tracehook(regs, step);
> +
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_exit(regs);
>  }
>
>  /*
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
>         if (unlikely(!tsk->pid))
>                 panic("Attempted to kill the idle task!");
>
> +       /* Turn off UMCG sched hooks. */
> +       if (unlikely(tsk->flags & PF_UMCG_WORKER))
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
>         /*
>          * If do_exit is called because this processes oopsed, it's possible
>          * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> @@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
>
>         io_uring_files_cancel();
>         exit_signals(tsk);  /* sets PF_EXITING */
> +       umcg_handle_exit();
>
>         /* sync mm's RSS info before statistics gathering */
>         if (tsk->mm)
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  obj-$(CONFIG_CPU_ISOLATION) += isolation.o
>  obj-$(CONFIG_PSI) += psi.o
>  obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
>   * Return: %true if @p->state changes (an actual wakeup was done),
>   *        %false otherwise.
>   */
> -static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
>         unsigned long flags;
>         int cpu, success = 0;
> @@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
>         p->wake_entry.u_flags = CSD_TYPE_TTWU;
>         p->migration_pending = NULL;
>  #endif
> +       umcg_clear_child(p);
>  }
>
>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
>          * If a worker goes to sleep, notify and ask workqueue whether it
>          * wants to wake up a task to maintain concurrency.
>          */
> -       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (task_flags & PF_WQ_WORKER)
>                         wq_worker_sleeping(tsk);
> +               else if (task_flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_sleeping(tsk);
>                 else
>                         io_wq_worker_sleeping(tsk);
>         }
> @@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
>
>  static void sched_update_worker(struct task_struct *tsk)
>  {
> -       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (tsk->flags & PF_WQ_WORKER)
>                         wq_worker_running(tsk);
> +               else if (tsk->flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_running(tsk);
>                 else
>                         io_wq_worker_running(tsk);
>         }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
>         if (wake_flags & WF_TTWU) {
>                 record_wakee(p);
>
> +               if ((wake_flags & WF_CURRENT_CPU) &&
> +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> +                       return cpu;
> +
>                 if (sched_energy_enabled()) {
>                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>                         if (new_cpu >= 0)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
>  }
>
>  /* Wake flags. The first three directly map to some SD flag value */
> -#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> -#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> -#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> -
> -#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
> -#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
> -#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
> +#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> +#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> +#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> +
> +#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
> +#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
> +#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
> +#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
>
>  #ifdef CONFIG_SMP
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
> @@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
>  extern void swake_up_all_locked(struct swait_queue_head *q);
>  extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
>
> +extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  extern int preempt_dynamic_mode;
>  extern int sched_dynamic_mode(const char *str);
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,744 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include <asm/syscall.h>
> +
> +#include "sched.h"
> +
> +static struct task_struct *umcg_get_task(u32 tid)
> +{
> +       struct task_struct *tsk = NULL;
> +
> +       if (tid) {
> +               rcu_read_lock();
> +               tsk = find_task_by_vpid(tid);
> +               if (tsk && current->mm == tsk->mm && tsk->umcg_task)
> +                       get_task_struct(tsk);
> +               else
> +                       tsk = NULL;
> +               rcu_read_unlock();
> +       }
> +
> +       return tsk;
> +}
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + *                 and its server.
> + */
> +static int umcg_pin_pages(void)
> +{
> +       struct task_struct *server = NULL, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       int server_tid;
> +
> +       if (tsk->umcg_worker_page ||
> +           tsk->umcg_server_page ||
> +           tsk->umcg_server_task ||
> +           tsk->umcg_server)
> +               return -EBUSY;
> +
> +       if (get_user(server_tid, &self->server_tid))
> +               return -EFAULT;
> +
> +       server = umcg_get_task(server_tid);
> +       if (!server)
> +               return -EINVAL;
> +
> +       if (pin_user_pages_fast((unsigned long)self, 1, 0,
> +                               &tsk->umcg_worker_page) != 1)
> +               goto clear_self;
> +
> +       /* must cache due to possible concurrent change vs access_ok() */
> +       tsk->umcg_server_task = server->umcg_task;
> +       if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
> +                               &tsk->umcg_server_page) != 1)
> +               goto clear_server;
> +
> +       tsk->umcg_server = server;
> +
> +       return 0;
> +
> +clear_server:
> +       tsk->umcg_server_task = NULL;
> +       tsk->umcg_server_page = NULL;
> +
> +       unpin_user_page(tsk->umcg_worker_page);
> +clear_self:
> +       tsk->umcg_worker_page = NULL;
> +       put_task_struct(server);
> +
> +       return -EFAULT;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> +       struct task_struct *tsk = current;
> +
> +       if (tsk->umcg_server) {
> +               unpin_user_page(tsk->umcg_worker_page);
> +               tsk->umcg_worker_page = NULL;
> +
> +               unpin_user_page(tsk->umcg_server_page);
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               put_task_struct(tsk->umcg_server);
> +               tsk->umcg_server = NULL;
> +       }
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> +       /*
> +        * This is either called for the current task, or for a newly forked
> +        * task that is not yet running, so we don't need strict atomicity
> +        * below.
> +        */
> +       if (tsk->umcg_task) {
> +               WRITE_ONCE(tsk->umcg_task, NULL);
> +               tsk->umcg_server = NULL;
> +
> +               /* These can be simple writes - see the comment above. */
> +               tsk->umcg_worker_page = NULL;
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +               clear_task_syscall_work(tsk, SYSCALL_UMCG);
> +               clear_tsk_thread_flag(tsk, TIF_UMCG);
> +       }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> +       umcg_clear_task(tsk);
> +}
> +
> +/* Called by both normally (unregister) and abnormally exiting workers. */
> +void umcg_worker_exit(void)
> +{
> +       umcg_unpin_pages();
> +       umcg_clear_task(current);
> +}
> +
> +/*
> + * Do a state transition, @from -> @to, and possibly read @next after that.
> + *
> + * Will clear UMCG_TF_PREEMPT.
> + *
> + * When @to == {BLOCKED,RUNNABLE}, update timestamps.
> + *
> + * Returns:
> + *   0: success
> + *   -EAGAIN: when self->state != @from
> + *   -EFAULT
> + */
> +static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
> +{
> +       struct umcg_task *self = tsk->umcg_task;
> +       u32 old, new;
> +       u64 now;
> +
> +       if (to >= UMCG_TASK_RUNNABLE) {
> +               switch (tsk->umcg_clock) {
> +               case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
> +               case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
> +               case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
> +               case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
> +               }
> +       }
> +
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(old, &self->state, Efault);
> +       do {
> +               if ((old & UMCG_TASK_MASK) != from)
> +                       goto fail;
> +
> +               new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
> +               new |= to & UMCG_TASK_MASK;
> +
> +       } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
> +
> +       if (to == UMCG_TASK_BLOCKED)
> +               unsafe_put_user(now, &self->blocked_ts, Efault);
> +       if (to == UMCG_TASK_RUNNABLE)
> +               unsafe_put_user(now, &self->runnable_ts, Efault);
> +
> +       if (next)
> +               unsafe_get_user(*next, &self->next_tid, Efault);
> +
> +       user_access_end();
> +       return 0;
> +
> +fail:
> +       user_access_end();
> +       return -EAGAIN;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/* Called from syscall enter path */
> +void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +       /* avoid recursion vs our own syscalls */
> +       if (syscall == __NR_umcg_wait ||
> +           syscall == __NR_umcg_ctl)
> +               return;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +static int umcg_wake_task(struct task_struct *tsk)
> +{
> +       int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +       if (ret)
> +               return ret;
> +
> +       try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
> +       return 0;
> +}
> +
> +/*
> + * Wake @next_tid or server.
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
> +{
> +       struct task_struct *next = NULL;
> +       int ret;
> +
> +       next = umcg_get_task(next_tid);
> +       /*
> +        * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
> +        * and 'spuriously' not wake @next_tid but instead try and wake the
> +        * server.
> +        *
> +        * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
> +        *
> +        * umcg_wake_task() can also fail due to next not having the right
> +        * state, then too will we try and wake the server.
> +        *
> +        * If we cannot wake the server due to state issues, too bad.
> +        */
> +       if (!next || umcg_wake_task(next)) {
> +               ret = umcg_wake_task(tsk->umcg_server);
> +               if (ret == -EFAULT)
> +                       goto out;
> +       }
> +       ret = 0;
> +out:
> +       if (next)
> +               put_task_struct(next);
> +
> +       return ret;
> +}
> +
> +/* pre-schedule() */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +       int next_tid;
> +
> +       /* Must not fault, mmap_sem might be held. */
> +       pagefault_disable();
> +
> +       if (WARN_ON_ONCE(!tsk->umcg_server))
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
> +               goto die;
> +
> +       if (umcg_wake_next(tsk, next_tid))
> +               goto die;
> +
> +       pagefault_enable();
> +
> +       /*
> +        * We're going to sleep, make sure to unpin the pages, this ensures
> +        * the pins are temporary.
> +        */
> +       umcg_unpin_pages();
> +
> +       return;
> +
> +die:
> +       pagefault_enable();
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/* post-schedule() */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +       /* nothing here, see umcg_sys_exit() */
> +}
> +
> +/*
> + * Enqueue @tsk on its server's runnable list
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * cmpxchg based single linked list add such that list integrity is never
> + * violated.  Userspace *MUST* remove it from the list before changing ->state.
> + * As such, we must change state to RUNNABLE before enqueue.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_enqueue_runnable(struct task_struct *tsk)
> +{
> +       struct umcg_task __user *server = tsk->umcg_server_task;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u64 self_ptr = (unsigned long)self;
> +       u64 first_ptr;
> +
> +       /*
> +        * umcg_pin_pages() did access_ok() on both pointers, use self here
> +        * only because __user_access_begin() isn't available in generic code.
> +        */
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
> +       do {
> +               unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
> +       } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
> +
> +       user_access_end();
> +       return 0;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/*
> + * umcg_wait: Wait for ->state to become RUNNING
> + *
> + * Returns:
> + *   0: success
> + *   -EINTR: pending signal
> + *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
> + *   -ETIMEDOUT
> + *   -EFAULT
> + */
> +int umcg_wait(u64 timo)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       struct hrtimer_sleeper timeout;
> +       struct page *page = NULL;
> +       u32 state;
> +       int ret;
> +
> +       if (timo) {
> +               hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +                                             HRTIMER_MODE_ABS);
> +               hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +                                            tsk->timer_slack_ns);
> +       }
> +
> +       for (;;) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +
> +               ret = -EINTR;
> +               if (signal_pending(current))
> +                       break;
> +
> +               /*
> +                * Faults can block and scribble our wait state.
> +                */
> +               pagefault_disable();
> +               if (get_user(state, &self->state)) {
> +                       pagefault_enable();
> +
> +                       ret = -EFAULT;
> +                       if (page) {
> +                               unpin_user_page(page);
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       continue;
> +               }
> +
> +               if (page) {
> +                       unpin_user_page(page);
> +                       page = NULL;
> +               }
> +               pagefault_enable();
> +
> +               state &= UMCG_TASK_MASK;
> +               if (state != UMCG_TASK_RUNNABLE) {
> +                       ret = 0;
> +                       if (state == UMCG_TASK_RUNNING)
> +                               break;
> +
> +                       ret = -EINVAL;
> +                       break;
> +               }
> +
> +               if (timo)
> +                       hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +               freezable_schedule();
> +
> +               ret = -ETIMEDOUT;
> +               if (timo && !timeout.task)
> +                       break;
> +       }
> +       __set_current_state(TASK_RUNNING);
> +
> +       if (timo) {
> +               hrtimer_cancel(&timeout.timer);
> +               destroy_hrtimer_on_stack(&timeout.timer);
> +       }
> +
> +       return ret;
> +}
> +
> +void umcg_sys_exit(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       long syscall = syscall_get_nr(tsk, regs);
> +
> +       if (syscall == __NR_umcg_wait)
> +               return;
> +
> +       /*
> +        * sys_umcg_ctl() will get here without having called umcg_sys_enter();
> +        * as such it will look like a syscall that blocked.
> +        */
> +
> +       if (tsk->umcg_server) {
> +               /*
> +                * Didn't block, we're done.
> +                */
> +               umcg_unpin_pages();
> +               return;
> +       }
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
> +               goto die_unpin;
> +
> +       if (umcg_enqueue_runnable(tsk))
> +               goto die_unpin;
> +
> +       /* server might not be runnable, too bad */
> +       if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
> +               goto die_unpin;
> +
> +       umcg_unpin_pages();
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +               /* notify_resume will continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       current->flags |= PF_UMCG_WORKER;
> +
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +void umcg_notify_resume(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u32 state, next_tid;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (get_user(state, &self->state))
> +               goto die;
> +
> +       state &= UMCG_TASK_MASK | UMCG_TF_MASK;
> +       if (state == UMCG_TASK_RUNNING)
> +               goto done;
> +
> +       if (state & UMCG_TF_PREEMPT) {
> +               umcg_pin_pages();
> +
> +               if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
> +                                     UMCG_TASK_RUNNABLE, &next_tid))
> +                       goto die_unpin;
> +
> +               if (umcg_enqueue_runnable(tsk))
> +                       goto die_unpin;
> +
> +               if (umcg_wake_next(tsk, next_tid))
> +                       goto die_unpin;
> +
> +               umcg_unpin_pages();
> +       }
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +               /* we'll continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +done:
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags:        zero or a value from enum umcg_wait_flag.
> + * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
> + *
> + *
> + *
> + * Returns:
> + * 0             - OK;
> + * -ETIMEDOUT    - the timeout expired;
> + * -EFAULT       - failed accessing struct umcg_task __user of the current
> + *                 task, the server or next.
> + * -ESRCH        - the task to wake was not found or is not a UMCG task;
> + * -EINVAL       - another error happened (e.g. the current task is not a
> + *                 UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
> +{
> +       struct task_struct *next, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       bool worker = tsk->flags & PF_UMCG_WORKER;
> +       u32 next_tid;
> +       int ret;
> +
> +       if (!self || flags)
> +               return -EINVAL;
> +
> +       if (worker)
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
> +       /* see umcg_sys_{enter,exit}() */
> +       umcg_pin_pages();
> +
> +       ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
> +       if (ret)
> +               goto unpin;
> +
> +       next = umcg_get_task(next_tid);
> +       if (!next) {
> +               ret = -ESRCH;
> +               goto unblock;
> +       }
> +
> +       if (worker) {
> +               ret = umcg_enqueue_runnable(tsk);
> +               if (ret)
> +                       goto put_task;
> +       }
> +
> +       ret = umcg_wake_task(next);
> +       if (ret)
> +               goto put_task;
> +
> +       put_task_struct(next);
> +       umcg_unpin_pages();
> +
> +       ret = umcg_wait(timo);
> +       switch (ret) {
> +       case -EINTR:    /* umcg_notify_resume() will continue the wait */
> +       case 0:         /* all done */
> +               ret = 0;
> +               break;
> +
> +       default:
> +               /*
> +                * If this fails you get to keep the pieces; you'll get stuck
> +                * in umcg_notify_resume().
> +                */
> +               umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +               break;
> +       }
> +out:
> +       if (worker)
> +               tsk->flags |= PF_UMCG_WORKER;
> +       return ret;
> +
> +put_task:
> +       put_task_struct(next);
> +unblock:
> +       umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +unpin:
> +       umcg_unpin_pages();
> +       goto out;
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags:       ORed values from enum umcg_ctl_flag; see below;
> + * @self:        a pointer to struct umcg_task that describes this
> + *               task and governs the behavior of sys_umcg_wait if
> + *               registering; must be NULL if unregistering.
> + * @which_clock: clockid used for state timestamps and sys_umcg_wait()
> + *               timeouts; one of CLOCK_REALTIME, CLOCK_MONOTONIC,
> + *               CLOCK_BOOTTIME, CLOCK_TAI.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + *         UMCG workers:
> + *              - @flags & UMCG_CTL_WORKER
> + *              - self->state must be UMCG_TASK_BLOCKED
> + *         UMCG servers:
> + *              - !(@flags & UMCG_CTL_WORKER)
> + *              - self->state must be UMCG_TASK_RUNNING
> + *
> + *         All tasks:
> + *              - self->next_tid must be zero
> + *
> + *         If the conditions above are met, sys_umcg_ctl() immediately returns
> + *         if the registered task is a server; a worker will be added to
> + *         runnable_workers_ptr, and the worker put to sleep; a runnable server
> + *         from runnable_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + *           is a UMCG worker, the userspace is responsible for waking its
> + *           server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0                - success
> + * -EFAULT          - failed to read @self
> + * -EINVAL          - some other error occurred
> + */
> +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
> +{
> +       struct umcg_task ut;
> +
> +       if ((unsigned long)self % UMCG_TASK_ALIGN)
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_UNREGISTER) {
> +               if (self || !current->umcg_task)
> +                       return -EINVAL;
> +
> +               if (current->flags & PF_UMCG_WORKER)
> +                       umcg_worker_exit();
> +               else
> +                       umcg_clear_task(current);
> +
> +               return 0;
> +       }
> +
> +       if (!(flags & UMCG_CTL_REGISTER))
> +               return -EINVAL;
> +
> +       switch (which_clock) {
> +       case CLOCK_REALTIME:
> +       case CLOCK_MONOTONIC:
> +       case CLOCK_BOOTTIME:
> +       case CLOCK_TAI:
> +               current->umcg_clock = which_clock;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       flags &= ~UMCG_CTL_REGISTER;
> +       if (flags && flags != UMCG_CTL_WORKER)
> +               return -EINVAL;
> +
> +       if (current->umcg_task || !self)
> +               return -EINVAL;
> +
> +       if (copy_from_user(&ut, self, sizeof(ut)))
> +               return -EFAULT;
> +
> +       if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_WORKER) {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               current->flags |= PF_UMCG_WORKER;       /* hook schedule() */
> +               set_syscall_work(SYSCALL_UMCG);         /* hook syscall */
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_sys_exit() will transition to RUNNABLE and wait */
> +
> +       } else {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_notify_resume() would block if not RUNNING */
> +       }
> +
> +       return 0;
> +}
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
>  COND_SYSCALL(landlock_add_rule);
>  COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
>  /* arch/example/kernel/sys_example.c */
>
>  /* mm/fadvise.c */
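
For orientation, below is a minimal userspace registration sketch against the
two syscalls above. It is illustrative only and not part of the patch: the
UMCG_* constants and struct umcg_task layout are assumed to come from the uapi
header added in a sibling patch, and error handling is omitted.

	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	/* assumed: <linux/umcg.h> from the sibling uapi patch */
	static __thread struct umcg_task self
			__attribute__((aligned(UMCG_TASK_ALIGN)));

	static long umcg_register_server(void)
	{
		self.state = UMCG_TASK_RUNNING;	/* servers register RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER,
			       &self, CLOCK_MONOTONIC);
	}

	static long umcg_register_worker(unsigned int server_tid)
	{
		self.state = UMCG_TASK_BLOCKED;	/* workers register BLOCKED */
		self.server_tid = server_tid;	/* server that will run us */
		/* does not return until a server makes this worker RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
			       &self, CLOCK_MONOTONIC);
	}
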
Peter Zijlstra Nov. 29, 2021, 3:05 p.m. UTC | #19
On Sat, Nov 27, 2021 at 01:45:20AM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> > On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> >> > +		if (timo)
> >> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> >> > +
> >> > +		freezable_schedule();
> >> 
> >> You can replace the whole hrtimer foo with
> >> 
> >>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
> >>                                                     tsk->timer_slack_ns,
> >>                                                     HRTIMER_MODE_ABS,
> >>                                                     tsk->umcg_clock)) {
> >>                 	ret = -ETIMEDOUT;
> >>                         break;
> >>                 }
> >
> > That seems to loose the freezable crud.. then again, since we're
> > interruptible, that shouldn't matter. Lemme go do that.
> 
> We could add a freezable wrapper for that if necessary.

I should just finish rewriting that freezer crap and then we can delete
it all :-) But I don't think that's needed in this case, as long as
we're interruptible we'll pass through the signal path which has a
try_to_freeze() in it.
Peter Zijlstra Nov. 29, 2021, 3:07 p.m. UTC | #20
On Sat, Nov 27, 2021 at 02:16:43AM +0100, Thomas Gleixner wrote:

> So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
> have no real good idea how to avoid that yet, but let me think about it
> some more.

Yeah, that, I couldn't come up with anything saner either.
Peter Zijlstra Nov. 29, 2021, 4:41 p.m. UTC | #21
On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:

> wait_wake_only is not needed if you have both next_tid and server_tid,
> as your patch has. In my version of the patch, next_tid is the same as
> server_tid, so the flag is needed to indicate to the kernel that
> next_tid is the wakee, not the server.

Ah, okay.

> re: (idle_)server_tid_ptr: it seems that you assume that blocked
> workers keep their servers, while in my patch they "lose them" once
> they block, and so there should be a global idle server pointer to
> wake the server in my scheme (if there is an idle one). The main
> difference is that in my approach a server has only a single, running,
> worker assigned to it, while in your approach it can have a number of
> blocked/idle workers to take care of as well.

Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.
fundamentally the RUNNABLE tasks need to go somewhere when all servers
are busy. So at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue this same, since we have wakeup balancing etc. And I've not really
thought about how to best do wakeup-balancing, also see below.

> The main difference between our approaches, as I see it: in my
> approach if a worker is running, its server is sleeping, period. If we
> have N servers, and N running workers, there are no servers to wake
> when a previously blocked worker finishes its blocking op. In your
> approach, it seems that N servers have each a bunch of workers
> pointing at them, and a single worker running. If a previously blocked
> worker wakes up, it wakes the server it was assigned to previously,

Right; it does that. It can check the ::state of its current task,
possibly set TF_PREEMPT or just go back to sleep.

> and so now we have more than N physical tasks/threads running: N
> workers and the woken server. This is not ideal: if the process is
> affined to only N CPUs, that means a worker will be preempted to let
> the woken server run, which is somewhat against the goal of letting
> the workers run more or less uninterrupted. This is not deal breaking,
> but maybe something to keep in mind.

I suppose it's easy enough to make this behaviour configurable though;
simply enqueue and not wake.... Hmm.. how would this worker know if the
server was 'busy' or not? The whole 'current' thing is a user-space
construct. I suppose that's what your pointer was for? Puts an actual
idle server in there, if there is one. Let me ponder that a bit.

However, do note this whole scheme fundamentally has some of that, the
moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
all tasks, they can consume however much time the syscall needs there.

Also, timeout on sys_umcg_wait() gets you the exact same situation (or
worse, multiple running workers).

> Another big concern I have is that you removed UMCG_TF_LOCKED. I

OOh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.

> definitely needed it to guard workers during "sched work" in the
> userspace in my approach. I'm not sure if the flag is absolutely
> needed with your approach, but most likely it is - the kernel-side
> scheduler does lock tasks and runqueues and disables interrupts and
> migrations and other things so that the scheduling logic is not
> hijacked by concurrent stuff. Why do you assume that the userspace
> scheduling code does not need similar protections?

I've not yet come across a case where this is needed. Migration for
instance is possible when RUNNABLE, simply write ::server_tid before
::state. Userspace just needs to make sure who actually owns the task,
but it can do that outside of this state.

But like I said; I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.
Peter Oskolkov Nov. 29, 2021, 5:34 p.m. UTC | #22
On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:
>
> > wait_wake_only is not needed if you have both next_tid and server_tid,
> > as your patch has. In my version of the patch, next_tid is the same as
> > server_tid, so the flag is needed to indicate to the kernel that
> > next_tid is the wakee, not the server.
>
> Ah, okay.
>
> > re: (idle_)server_tid_ptr: it seems that you assume that blocked
> > workers keep their servers, while in my patch they "lose them" once
> > they block, and so there should be a global idle server pointer to
> > wake the server in my scheme (if there is an idle one). The main
> > difference is that in my approach a server has only a single, running,
> > worker assigned to it, while in your approach it can have a number of
> > blocked/idle workers to take care of as well.
>
> Correct; I've been thinking in analogues of the way we schedule CPUs.
> Each CPU has a ready/run queue along with the current task.
> fundamentally the RUNNABLE tasks need to go somewhere when all servers
> are busy. So at that point the previous server is as good a place as
> any.
>
> Now, I sympathise with a blocked task not having a relation; I often
> argue this same, since we have wakeup balancing etc. And I've not really
> thought about how to best do wakeup-balancing, also see below.
>
> > The main difference between our approaches, as I see it: in my
> > approach if a worker is running, its server is sleeping, period. If we
> > have N servers, and N running workers, there are no servers to wake
> > when a previously blocked worker finishes its blocking op. In your
> > approach, it seems that N servers have each a bunch of workers
> > pointing at them, and a single worker running. If a previously blocked
> > worker wakes up, it wakes the server it was assigned to previously,
>
> Right; it does that. It can check the ::state of its current task,
> possibly set TF_PREEMPT or just go back to sleep.
>
> > and so now we have more than N physical tasks/threads running: N
> > workers and the woken server. This is not ideal: if the process is
> > affined to only N CPUs, that means a worker will be preempted to let
> > the woken server run, which is somewhat against the goal of letting
> > the workers run more or less uninterrupted. This is not deal breaking,
> > but maybe something to keep in mind.
>
> I suppose it's easy enough to make this behaviour configurable though;
> simply enqueue and not wake.... Hmm.. how would this worker know if the
> server was 'busy' or not? The whole 'current' thing is a user-space
> construct. I suppose that's what your pointer was for? Puts an actual
> idle server in there, if there is one. Let me ponder that a bit.

Yes, the idle_server_ptr was there to point to an idle server; this
naturally did wakeup balancing.

>
> However, do note this whole scheme fundamentally has some of that, the
> moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> all tasks, they can consume however much time the syscall needs there.
>
> Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> worse, multiple running workers).

It should not. Timed out workers should be added to the runnable list
and not become running unless a server chooses so. So sys_umcg_wait()
with a timeout should behave similarly to a normal sleep, in that the
server is woken upon the worker blocking, and upon the worker wakeup
the worker is added to the woken workers list and waits for a server
to run it. The only difference is that in a sleep the worker becomes
BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
time.

Why then have sys_umcg_wait() with a timeout at all, instead of
calling nanosleep()? Because the worker in sys_umcg_wait() can be
context-switched into by another worker, or made running by a server;
if the worker is in nanosleep(), it just sleeps.

>
> > Another big concern I have is that you removed UMCG_TF_LOCKED. I
>
> OOh yes, I forgot to mention that. I couldn't figure out what it was
> supposed to do.
>
> > definitely needed it to guard workers during "sched work" in the
> > userspace in my approach. I'm not sure if the flag is absolutely
> > needed with your approach, but most likely it is - the kernel-side
> > scheduler does lock tasks and runqueues and disables interrupts and
> > migrations and other things so that the scheduling logic is not
> > hijacked by concurrent stuff. Why do you assume that the userspace
> > scheduling code does not need similar protections?
>
> I've not yet come across a case where this is needed. Migration for
> instance is possible when RUNNABLE, simply write ::server_tid before
> ::state. Userspace just needs to make sure who actually owns the task,
> but it can do that outside of this state.
>
> But like I said; I've not yet done the userspace part (and I lost most
> of today trying to install a new machine), so perhaps I'll run into it
> soon enough.

The most obvious scenario where I needed locking is when worker A
wants to context switch into worker B, while another worker C wants to
context switch into worker A, and worker A pagefaults. This involves:

worker A context: worker A context switches into worker B:

- worker B::server_tid = worker A::server_tid
- worker A::server_tid = none
- worker A::state = runnable
- worker B::state = running
- worker A::next_tid = worker B
- worker A calls sys_umcg_wait()

worker B context: before the above completes, worker C wants to
context switch into worker A, with similar steps.

"interrupt context": in the middle of the mess above, worker A pagefaults

Too many moving parts. UMCG_TF_LOCKED helped me make this mess
manageable. Maybe without pagefaults clever ordering of the operations
listed above could make things work, but pagefaults mess things badly,
so some kind of "preempt_disable()" for the userspace scheduling code
was needed, and UMCG_TF_LOCKED was the solution I had.
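
For concreteness, the worker A -> worker B switch listed above, in rough
pseudo-code (atomics, error handling and the pagefault window deliberately
left out; a sketch, not tested code):

	/* worker A's userspace context, switching into worker B */
	B::server_tid = A::server_tid;	/* hand A's server over to B */
	A::server_tid = 0;
	A::state      = RUNNABLE;	/* A gives up the cpu */
	B::state      = RUNNING;	/* B is the one to run next */
	A::next_tid   = B.tid;		/* tell the kernel whom to wake */
	sys_umcg_wait();		/* ...unless A faults before getting here */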



>
>
Peter Zijlstra Nov. 29, 2021, 9:08 p.m. UTC | #23
On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > However, do note this whole scheme fundamentally has some of that, the
> > moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> > all tasks, they can consume however much time the syscall needs there.
> >
> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.

OK, that's somewhat subtle and I hadn't gotten that either.

Currently it returns -ETIMEDOUT in RUNNING state for both server and
worker callers.

Let me go fix that then.

> > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> >
> > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > supposed to do.
> >
> > > definitely needed it to guard workers during "sched work" in the
> > > userspace in my approach. I'm not sure if the flag is absolutely
> > > needed with your approach, but most likely it is - the kernel-side
> > > scheduler does lock tasks and runqueues and disables interrupts and
> > > migrations and other things so that the scheduling logic is not
> > > hijacked by concurrent stuff. Why do you assume that the userspace
> > > scheduling code does not need similar protections?
> >
> > I've not yet come across a case where this is needed. Migration for
> > instance is possible when RUNNABLE, simply write ::server_tid before
> > ::state. Userspace just needs to make sure who actually owns the task,
> > but it can do that outside of this state.
> >
> > But like I said; I've not yet done the userspace part (and I lost most
> > of today trying to install a new machine), so perhaps I'll run into it
> > soon enough.
> 
> The most obvious scenario where I needed locking is when worker A
> wants to context switch into worker B, while another worker C wants to
> context switch into worker A, and worker A pagefaults. This involves:
> 
> worker A context: worker A context switches into worker B:
> 
> - worker B::server_tid = worker A::server_tid
> - worker A::server_tid = none
> - worker A::state = runnable
> - worker B::state = running
> - worker A::next_tid = worker B
> - worker A calls sys_umcg_wait()
> 
> worker B context: before the above completes, worker C wants to
> context switch into worker A, with similar steps.
> 
> "interrupt context": in the middle of the mess above, worker A pagefaults
> 
> Too many moving parts. UMCG_TF_LOCKED helped me make this mess
> manageable. Maybe without pagefaults clever ordering of the operations
> listed above could make things work, but pagefaults mess things badly,
> so some kind of "preempt_disable()" for the userspace scheduling code
> was needed, and UMCG_TF_LOCKED was the solution I had.

I'm not sure I'm following. For this to be true A and C must be running
on a different server right?

So we have something like:

	S0 running A			S1 running B

Therefore:

	S0::state == RUNNABLE		S1::state == RUNNABLE
	A::server_tid == S0.tid		B::server_tid == S1.tid
	A::state == RUNNING		B::state == RUNNING

Now, you want A to switch to C, therefore C had better be with S0, eg we
have:

	C::server_tid == S0.tid
	C::state == RUNNABLE

So then A does:

	A::next_tid = C.tid;
	sys_umcg_wait();

Which will:

	pin(A);
	pin(S0);

	cmpxchg(A::state, RUNNING, RUNNABLE);

	next_tid = A::next_tid; // C

	enqueue(S0::runnable, A);

At which point B steals S0's runnable queue, and tries to make A go.

					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
					A::server_tid = S1.tid;
					B::next_tid = A.tid;
					sys_umcg_wait();

	wake(C)
	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*


Something like that, right?

What currently happens is that S0 goes back to S0 and S1 ends up in A.
That is, if, for any reason we fail to wake next_tid, we'll wake
server_tid.

So then S0 wakes up and gets to re-evaluate life. If it has another
worker it can go run that, otherwise it can try and steal a worker
somewhere or just idle out.

Now arguably, the only reason A->C can fault is because C is garbage, at
which point your program is malformed and it doesn't matter what
happens one way or the other.
Peter Zijlstra Nov. 29, 2021, 9:29 p.m. UTC | #24
On Mon, Nov 29, 2021 at 10:08:41PM +0100, Peter Zijlstra wrote:
> I'm not sure I'm following. For this to be true A and C must be running
> on a different server right?
> 
> So we have something like:
> 
> 	S0 running A			S1 running B
> 
> Therefore:
> 
> 	S0::state == RUNNABLE		S1::state == RUNNABLE
> 	A::server_tid == S0.tid		B::server_tid == S1.tid
> 	A::state == RUNNING		B::state == RUNNING
> 
> Now, you want A to switch to C, therefore C had better be with S0, eg we
> have:
> 
> 	C::server_tid == S0.tid
> 	C::state == RUNNABLE
> 
> So then A does:
> 
> 	A::next_tid = C.tid;
> 	sys_umcg_wait();
> 
> Which will:
> 
> 	pin(A);
> 	pin(S0);
> 
> 	cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> 	next_tid = A::next_tid; // C
> 
> 	enqueue(S0::runnable, A);
> 
> At which point B steals S0's runnable queue, and tries to make A go.
> 
> 					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
> 					A::server_tid = S1.tid;
> 					B::next_tid = A.tid;
> 					sys_umcg_wait();
> 
> 	wake(C)
> 	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*
> 
> 
> Something like that, right?

And note that there's an XXX in the code about exactly this case; it has
a question whether we want to add pin(next) to umcg_pin_pages().

That would not in fact help here, because sys_umcg_wait() is faultable
and the only reason it'll return -EFAULT is because, as stated below, C
is garbage. But it does make a difference for when we do something like:

	self->next_tid = someone;
	sys_something_we_expect_to_block();
	// handle not blocking

Because in that case userspace must have taken 'someone' from the
runnable queue and made it 'next', but then we'll not wake next but the
server, which then needs to figure out something went sideways.

So I'm tempted to add that optional 3rd pin, simply to reduce the
failure cases.

> What currently happens is that S0 goes back to S0 and S1 ends up in A.
> That is, if, for any reason we fail to wake next_tid, we'll wake
> server_tid.
> 
> So then S0 wakes up and gets to re-evaluate life. If it has another
> worker it can go run that, otherwise it can try and steal a worker
> somewhere or just idle out.
> 
> Now arguably, the only reason A->C can fault is because C is garbage, at
> which point your program is malformed and it doesn't matter what
> happens one way or the other.
Thomas Gleixner Nov. 29, 2021, 10:07 p.m. UTC | #25
On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
>> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>> >
>> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>> >
>> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> >> +				bool may_fault)
>> >> +{
>> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>> >
>> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
>> > makes sense to use a time base that's accessible to userspace. Was
>> > MONOTONIC_RAW considered?
>> 
>> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
>> solve the HPET crap either.
>
> But it's ns are of equal size to sched_clock(), if both share TSC IIRC.
> Whereas MONOTONIC, being subject to ntp rate stuff, has differently
> sized ns.

The size is the same, i.e. 1 bit per nanosecond :)

> The only time that's relevant though is when you're going to mix these
> timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> interesting.

Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
to be really nasty. Actually you can sleep on that clock, but that's a
completely different universe. If anything like that is desired then we
need to rewrite that posix CPU timer muck completely with all the bells
and whistels and race conditions attached to it. *Shudder*

Thanks,

        tglx
Peter Zijlstra Nov. 29, 2021, 10:22 p.m. UTC | #26
On Mon, Nov 29, 2021 at 11:07:07PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:

> The size is the same, i.e. 1 bit per nanosecond :)

:-)

> > The only time that's relevant though is when you're going to mix these
> > timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> > interesting.
> 
> Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
> to be really nasty. Actually you can sleep on that clock, but that's a
> completely different universe. If anything like that is desired then we
> need to rewrite that posix CPU timer muck completely with all the bells
> and whistels and race conditions attached to it. *Shudder*

Oh, I wasn't thinking anything as terrible as that. Sleeping on that
clock is fundamentally daft since it doesn't run when thats is
sleeping, consider trying to sleep on your own runtime :-)

I was only considering combining THREAD_CPUTIME timestamps with the
UMCG timestamps to compute how much unmanaged time there was, or other
such things.

Anyway, lets forget I bought this up and assume that for practical
purposes all [ns] are of equal length.
Peter Oskolkov Nov. 29, 2021, 11:38 p.m. UTC | #27
On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > >
> > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > supposed to do.
[...]
>
> So then A does:
>
>         A::next_tid = C.tid;
>         sys_umcg_wait();
>
> Which will:
>
>         pin(A);
>         pin(S0);
>
>         cmpxchg(A::state, RUNNING, RUNNABLE);

Hmm.... That's another difference between your patch and mine: my
approach was "the side that initiates the change updates the state".
So in my code the userspace changes the current task's state RUNNING
=> RUNNABLE and the next task's state, or the server's state, RUNNABLE
=> RUNNING before calling sys_umcg_wait(). The kernel changed worker
states to BLOCKED/RUNNABLE during block/wake detection, and marked
servers RUNNING when waking them during block/wake detection; but all
applicable state changes for sys_umcg_wait() happen in the userspace.

The reasoning behind this approach was:
- do in kernel only that which cannot be done in the userspace, to
make the kernel code smaller/simpler
- similar to how futexes work: futex_wait does not change the futex
value to the desired value, but just checks whether the futex value
matches the desired value
- similar to how futexes work, concurrent state changes can happen in
the userspace without calling into the kernel at all
    for example:
        - (a): worker A goes to sleep into sys_umcg_wait()
        - (b): worker B wants to context switch into worker A "a moment" later
        - due to preemption/interrupts/pagefaults/whatnot, (b) happens
in reality before (a)
    in my patchset, the situation above happily resolves in the
userspace so that worker A keeps running without ever calling
sys_umcg_wait().

Again, I don't think this is deal breaking, and your approach will
work, just a bit less efficiently in some cases :)

I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
A transfers its server to worker B that A intends to context switch
into, and then worker A pagefaults or gets interrupted before calling
sys_umcg_wait()? The server will be woken up and will see that it is
assigned to worker B; now what? If worker A is "locked" before the
whole thing starts, the pagefault/interrupt will not trigger
block/wake detection, worker A will keep RUNNING for all intended
purposes, and eventually will call sys_umcg_wait() as it had
intended...

[...]
Peter Zijlstra Dec. 6, 2021, 11:32 a.m. UTC | #28
Sorry, I haven't been feeling too well and as such procastinated on this
because thinking is required :/ Trying to pick up the bits.

On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > > >
> > > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > > supposed to do.
> [...]
> >
> > So then A does:
> >
> >         A::next_tid = C.tid;
> >         sys_umcg_wait();
> >
> > Which will:
> >
> >         pin(A);
> >         pin(S0);
> >
> >         cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> Hmm.... That's another difference between your patch and mine: my
> approach was "the side that initiates the change updates the state".
> So in my code the userspace changes the current task's state RUNNING
> => RUNNABLE and the next task's state,

I couldn't make that work for wakeups; when a thread blocks in a
random syscall there is no userspace to wake the next thread. And since
it seems required in this case, it's easier and more consistent to
always do it.

> or the server's state, RUNNABLE
> => RUNNING before calling sys_umcg_wait().

Yes, this is indeed required; I've found the same when trying to build
the userspace server loop. And yes, I'm starting to see where you're
coming from.

> I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
> A transfers its server to worker B that A intends to context switch

	S0 running A

Therefore:

	S0::state == RUNNABLE
	A::server_tid = S0.tid
	A::state == RUNNING

you want A to switch to B, therefore:

	B::state == RUNNABLE

if B is not yet on S0 then:

	B::server_tid = S0.tid;

finally:

0:
	A::next_tid = B.tid;
1:
	A::state = RUNNABLE:
2:
	sys_umcg_wait();
3:

> into, and then worker A pagefaults or gets interrupted before calling
> sys_umcg_wait()?

So the problem is tripping umcg_notify_resume() on the labels 1 and 2,
right? tripping it on 0 and 3 is trivially correct.

If we trip it on 1 and !(A::state & TF_PREEMPT), then nothing, since
::state == RUNNING we'll just continue onwards and all is well. That is,
nothing has happened yet.

However, if we trip it on 2: we're screwed. Because at that point
::state is scribbled.

> The server will be woken up and will see that it is
> assigned to worker B; now what? If worker A is "locked" before the
> whole thing starts, the pagefault/interrupt will not trigger
> block/wake detection, worker A will keep RUNNING for all intended
> purposes, and eventually will call sys_umcg_wait() as it had
> intended...

No, the failure case is different; umcg_notify_resume() will simply
block A until someone sets A::state == RUNNING and kicks it, which will
be no-one.

Now, the above situation is actually simple to fix, but it gets more
interesting when we're using sys_umcg_wait() to build wait primitives.
Because in that case we get stuff like:

	for (;;) {
		self->state = RUNNABLE;
		smp_mb();
		if (cond)
			break;
		sys_umcg_wait();
	}
	self->state = RUNNING;

And we really need to not block and also not do sys_umcg_wait() early.

So yes, I agree that we need a special case here that ensures
umcg_notify_resume() doesn't block. Let me ponder naming and comments.
Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Now, obviously if you do a random syscall anywhere around here, you get
to keep the pieces :-)



I've also added ::next_tid to the whole umcg_pin_pages() thing, and made
it so that ::next_tid gets cleared when it's been used. That way things
like:

	self->next_tid = pick_from_runqueue();
	sys_that_is_expected_to_sleep();
	if (self->next_tid) {
		return_to_runqueue(self->next_tid);
		self->next_tid = 0;
	}

Are much simpler to manage. Either it did sleep and ::next_tid is
consumed, or it didn't sleep and it needs to be returned to the
runqueue.
Peter Zijlstra Dec. 6, 2021, 11:47 a.m. UTC | #29
On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.
> 
> Why then have sys_umcg_wait() with a timeout at all, instead of
> calling nanosleep()? Because the worker in sys_umcg_wait() can be
> context-switched into by another worker, or made running by a server;
> if the worker is in nanosleep(), it just sleeps.

I've been trying to figure out the semantics of that timeout thing, and
I can't seem to make sense of it.

Consider two workers:

	S0 running A				S1 running B

therefore:

	S0::state == RUNNABLE			S1::state == RUNNABLE
	A::server_tid == S0.tid			B::server_tid == S1.tid
	A::state == RUNNING			B::state == RUNNING

Doing:

	self->state = RUNNABLE;			self->state = RUNNABLE;
	sys_umcg_wait(0);			sys_umcg_wait(10);
	  umcg_enqueue_runnable()		  umcg_enqueue_runnable()
	  umcg_wake()				  umcg_wake()
	  umcg_wait()				  umcg_wait()
						    hrtimer_start()

In both cases we get the exact same outcome:

	A::state == RUNNABLE			B::state == RUNNABLE
	S0::state == RUNNING			S1::state == RUNNING
	S0::runnable_ptr == &A			S1::runnable_ptr == &B


Which is, AFAICT, the exact state you wanted to achieve, except B now
has an active timer, but what do you want it to do when that goes off?

I'm tempted to say workers cannot have a timeout, and servers can use it
to wake themselves.
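
For instance (sketch only, against the interface in the patch as posted
below; umcg_self is a made-up TLS variable that is assumed to have been
registered via sys_umcg_ctl(); timestamp maintenance and error handling
are elided):

	#include <errno.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>			/* from this series */

	#ifndef __NR_umcg_wait
	#define __NR_umcg_wait 451		/* x86-64 number added below */
	#endif

	/* Assumed already registered as a server via sys_umcg_ctl(). */
	static __thread struct umcg_task umcg_self;

	static void server_idle_with_timeout(unsigned long long period_ns)
	{
		struct timespec ts;
		unsigned long long abs_timeout;

		/* umcg_idle_loop() arms an absolute CLOCK_REALTIME timer. */
		clock_gettime(CLOCK_REALTIME, &ts);
		abs_timeout = ts.tv_sec * 1000000000ull + ts.tv_nsec + period_ns;

		/* Real code would use an atomic op and keep the timestamp bits. */
		umcg_self.state_ts = UMCG_TASK_IDLE;

		if (syscall(__NR_umcg_wait, 0, abs_timeout) && errno == ETIMEDOUT) {
			/* Nobody woke us; do periodic housekeeping here. */
		}

		umcg_self.state_ts = UMCG_TASK_RUNNING;
	}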
Peter Zijlstra Dec. 6, 2021, 12:04 p.m. UTC | #30
On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:

> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Hurmph... OTOH, since self above hasn't actually done anything yet, it
isn't reported as runnable yet, and so for all intents and purposes the
userspace state thinks it's running (which is true); nobody should be
trying a concurrent wakeup, and there aren't any races.

Bah, now I'm confused again :-) Let me go think more.
Peter Zijlstra Dec. 13, 2021, 1:55 p.m. UTC | #31
On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> 
> Sorry, I haven't been feeling too well and as such procrastinated on this
> because thinking is required :/ Trying to pick up the bits.

*sigh* and yet another week gone... someone was unhappy about refcount_t.


> No, the failure case is different; umcg_notify_resume() will simply
> block A until someone sets A::state == RUNNING and kicks it, which will
> be no-one.
> 
> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.
> 
> Now, obviously if you do a random syscall anywhere around here, you get
> to keep the pieces :-)

Something like so I suppose..

--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -42,6 +42,32 @@
  *
  */
 #define UMCG_TF_PREEMPT			0x0100U
+/*
+ * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait()
+ *
+ * Enables server loops like (vs umcg_sys_exit()):
+ *
+ *   for(;;) {
+ *	self->status = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
+ *	// smp_mb() implied by xchg()
+ *
+ *	runnable_ptr = xchg(self->runnable_workers_ptr, NULL);
+ *	while (runnable_ptr) {
+ *		next = runnable_ptr->runnable_workers_ptr;
+ *
+ *		umcg_server_add_runnable(self, runnable_ptr);
+ *
+ *		runnable_ptr = next;
+ *	}
+ *
+ *	self->next = umcg_server_pick_next(self);
+ *	sys_umcg_wait(0, 0);
+ *   }
+ *
+ * without a signal or interrupt in between setting umcg_task::state and
+ * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume().
+ */
+#define UMCG_TF_COND_WAIT		0x0200U
 
 #define UMCG_TF_MASK			0xff00U
 
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -180,7 +180,7 @@ void umcg_worker_exit(void)
 /*
  * Do a state transition, @from -> @to, and possible read @next after that.
  *
- * Will clear UMCG_TF_PREEMPT.
+ * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT.
  *
  * When @to == {BLOCKED,RUNNABLE}, update timestamps.
  *
@@ -216,7 +216,8 @@ static int umcg_update_state(struct task
 		if ((old & UMCG_TASK_MASK) != from)
 			goto fail;
 
-		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new = old & ~(UMCG_TASK_MASK |
+			      UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT);
 		new |= to & UMCG_TASK_MASK;
 
 	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
@@ -567,11 +568,13 @@ void umcg_notify_resume(struct pt_regs *
 	if (state == UMCG_TASK_RUNNING)
 		goto done;
 
-	// XXX can get here when:
-	//
-	// self->state = RUNNABLE
-	// <signal>
-	// sys_umcg_wait();
+	/*
+	 * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call
+	 * sys_umcg_wait() and signals/interrupts shouldn't block
+	 * return-to-user.
+	 */
+	if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT))
+		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
 		if (umcg_pin_pages())
@@ -658,6 +661,13 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 	if (ret)
 		goto unblock;
 
+	/*
+	 * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE.
+	 */
+	ret = umcg_update_state(self, tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE);
+	if (ret)
+		goto unpin;
+
 	if (worker) {
 		ret = umcg_enqueue_runnable(tsk);
 		if (ret)
Peter Oskolkov Jan. 19, 2022, 5:26 p.m. UTC | #32
On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > worse, multiple running workers).
> >
> > It should not. Timed out workers should be added to the runnable list
> > and not become running unless a server chooses so. So sys_umcg_wait()
> > with a timeout should behave similarly to a normal sleep, in that the
> > server is woken upon the worker blocking, and upon the worker wakeup
> > the worker is added to the woken workers list and waits for a server
> > to run it. The only difference is that in a sleep the worker becomes
> > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > time.
> >
> > Why then have sys_umcg_wait() with a timeout at all, instead of
> > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > context-switched into by another worker, or made running by a server;
> > if the worker is in nanosleep(), it just sleeps.
>
> I've been trying to figure out the semantics of that timeout thing, and
> I can't seem to make sense of it.
>
> Consider two workers:
>
>         S0 running A                            S1 running B
>
> therefore:
>
>         S0::state == RUNNABLE                   S1::state == RUNNABLE
>         A::server_tid == S0.tid                 B::server_tid = S1.tid
>         A::state == RUNNING                     B::state == RUNNING
>
> Doing:
>
>         self->state = RUNNABLE;                 self->state = RUNNABLE;
>         sys_umcg_wait(0);                       sys_umcg_wait(10);
>           umcg_enqueue_runnable()                 umcg_enqueue_runnable()

sys_umcg_wait() should not enqueue the worker as runnable; workers are
enqueued to indicate wakeup events.

>           umcg_wake()                             umcg_wake()
>           umcg_wait()                             umcg_wait()
>                                                     hrtimer_start()
>
> In both cases we get the exact same outcome:
>
>         A::state == RUNNABLE                    B::state == RUNNABLE
>         S0::state == RUNNING                    S1::state == RUNNING
>         S0::runnable_ptr == &A                  S1::runnable_ptr = &B

So without sys_umcg_wait enqueueing into the queue, the state now is

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == NULL

>
>
> Which is, AFAICT, the exact state you wanted to achieve, except B now
> has an active timer, but what do you want it to do when that goes?

When the timer goes off, _then_ B is enqueued into the queue, so the
state becomes

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == &B

So worker timeouts in sys_umcg_wait are treated as wakeup events, with
the difference that when the worker is eventually scheduled by a
server, sys_umcg_wait returns with ETIMEDOUT.
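
To make that concrete, a worker-side sketch of these semantics might
look like the below (illustrative only, not necessarily what the patch
as posted implements; cond_signalled() and the state_ts/next_tid setup
are placeholders for the userspace scheduler's own protocol):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef __NR_umcg_wait
	#define __NR_umcg_wait 451	/* x86-64 number, per this series */
	#endif

	extern int cond_signalled(void);	/* made-up condition check */

	static int worker_timed_wait(unsigned long long abs_timeout)
	{
		for (;;) {
			if (cond_signalled())
				return 0;

			/* ... set state_ts to IDLE and next_tid per the protocol ... */

			if (syscall(__NR_umcg_wait, 0, abs_timeout) == 0)
				continue;	/* woken by a server; re-check */

			if (errno == ETIMEDOUT) {
				/*
				 * The timeout was treated as a wakeup event;
				 * a server has since scheduled us again, so
				 * report the timeout to the caller.
				 */
				return -ETIMEDOUT;
			}

			/* EINTR etc.: re-check the condition and retry. */
		}
	}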

>
> I'm tempted to say workers cannot have timeout, and servers can use it
> to wake themselves.
Peter Zijlstra Jan. 20, 2022, 11:07 a.m. UTC | #33
On Wed, Jan 19, 2022 at 09:26:41AM -0800, Peter Oskolkov wrote:
> On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > > worse, multiple running workers).
> > >
> > > It should not. Timed out workers should be added to the runnable list
> > > and not become running unless a server chooses so. So sys_umcg_wait()
> > > with a timeout should behave similarly to a normal sleep, in that the
> > > server is woken upon the worker blocking, and upon the worker wakeup
> > > the worker is added to the woken workers list and waits for a server
> > > to run it. The only difference is that in a sleep the worker becomes
> > > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > > time.
> > >
> > > Why then have sys_umcg_wait() with a timeout at all, instead of
> > > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > > context-switched into by another worker, or made running by a server;
> > > if the worker is in nanosleep(), it just sleeps.
> >
> > I've been trying to figure out the semantics of that timeout thing, and
> > I can't seem to make sense of it.
> >
> > Consider two workers:
> >
> >         S0 running A                            S1 running B
> >
> > therefore:
> >
> >         S0::state == RUNNABLE                   S1::state == RUNNABLE
> >         A::server_tid == S0.tid                 B::server_tid = S1.tid
> >         A::state == RUNNING                     B::state == RUNNING
> >
> > Doing:
> >
> >         self->state = RUNNABLE;                 self->state = RUNNABLE;
> >         sys_umcg_wait(0);                       sys_umcg_wait(10);
> >           umcg_enqueue_runnable()                 umcg_enqueue_runnable()
> 
> sys_umcg_wait() should not enqueue the worker as runnable; workers are
> enqueued to indicate wakeup events.

Oooh... I see.

> So worker timeouts in sys_umcg_wait are treated as wakeup events, with
> the difference that when the worker is eventually scheduled by a
> server, sys_umcg_wait returns with ETIMEDOUT.

Right.. OK, let me go fold and polish what I have now before I go change
things again though.
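
For completeness, registration against the uapi as posted below looks
roughly like this (sketch only; umcg_self is a made-up TLS variable, no
error handling, and idle_workers_ptr/idle_server_tid_ptr are assumed to
have been pointed at the userspace scheduler's shared structures before
a worker registers):

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>			/* from this series */

	#ifndef __NR_umcg_ctl
	#define __NR_umcg_ctl 450		/* x86-64 number added below */
	#endif

	/* Zero-initialized; 64-byte alignment comes from the uapi struct. */
	static __thread struct umcg_task umcg_self;

	static long umcg_register_server(void)
	{
		umcg_self.state_ts = UMCG_TASK_RUNNING;	/* servers register RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
	}

	static long umcg_register_worker(void)
	{
		umcg_self.state_ts = UMCG_TASK_BLOCKED;	/* workers register BLOCKED */
		return syscall(__NR_umcg_ctl,
			       UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &umcg_self);
	}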
diff mbox series

Patch

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..f09f96bb7f35 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@ 
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..1749f0f74fed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..dc9a8b8c5761 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@  struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;

 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,12 @@  struct task_struct {
 	unsigned long rseq_event_mask;
 #endif

+#ifdef CONFIG_UMCG
+	struct umcg_task __user	*umcg_task;
+	struct page		*pinned_umcg_worker_page;  /* self */
+	struct page		*pinned_umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	union {
@@ -1687,6 +1694,13 @@  extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2287,6 +2301,63 @@  static inline void rseq_execve(struct task_struct *t)

 #endif

+#ifdef CONFIG_UMCG
+
+void umcg_handle_resuming_worker(void);
+void umcg_handle_exiting_worker(void);
+void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
+static inline void umcg_handle_notify_resume(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_resuming_worker();
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_exiting_worker();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_notify_resume(void)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
+#endif
+
 #ifdef CONFIG_DEBUG_RSEQ

 void rseq_syscall(struct pt_regs *regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..424a4686be74 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@  struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;

 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@  asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..6d29b3896d4c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@  __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)

+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452

 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..cd9f60002821
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,137 @@ 
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ */
+#define UMCG_TASK_NONE			0ULL
+#define UMCG_TASK_RUNNING		1ULL
+#define UMCG_TASK_IDLE			2ULL
+#define UMCG_TASK_BLOCKED		3ULL
+
+/* UMCG task state flags, bits 6-7 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED			(1ULL << 6)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED		(1ULL << 7)
+
+/* The first six bits: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK		0x3fULL
+
+/* The full state mask: the first 18 bits. */
+#define UMCG_TASK_STATE_MASK_FULL	0x3ffffULL
+
+/*
+ * The number of bits reserved for UMCG state timestamp in
+ * struct umcg_task.state_ts.
+ */
+#define UMCG_STATE_TIMESTAMP_BITS	46
+
+/* The number of bits truncated from UMCG state timestamp. */
+#define UMCG_STATE_TIMESTAMP_GRANULARITY	4
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state_ts: the current state of the UMCG task described by
+	 *            this struct, with a unique timestamp indicating
+	 *            when the last state change happened.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  5: task state;
+	 *   bits  6 -  7: state flags;
+	 *   bits  8 - 12: reserved; must be zeroes;
+	 *   bits 13 - 17: for userspace use;
+	 *   bits 18 - 63: timestamp (see below).
+	 *
+	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+	__u64	state_ts;		/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero.
+	 *
+	 * Running UMCG workers must have next_tid set to point to IDLE
+	 * UMCG servers.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;		/* r   */
+
+	__u32	flags;			/* Reserved; must be zero. */
+
+	/**
+	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, the userspace removes them.
+	 */
+	__u64	idle_workers_ptr;	/* r/w */
+
+	/**
+	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+	 *                       Readonly.
+	 */
+	__u64	idle_server_tid_ptr;	/* r   */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY:      wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ *                            (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ *                            must be set.
+ */
+enum umcg_wait_flag {
+	UMCG_WAIT_WAKE_ONLY			= 1,
+	UMCG_WAIT_WF_CURRENT_CPU		= 2,
+};
+
+/* See Documentation/userspace-api/umcg.txt.*/
+#define UMCG_IDLE_NODE_PENDING (1ULL)
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 036b750e8d8a..365802b25100 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,16 @@  config MEMBARRIER

 	  If unsure, say Y.

+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on X86_64
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..62453772a0c7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,8 +171,10 @@  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);

-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			umcg_handle_notify_resume();
 			tracehook_notify_resume(regs);
+		}

 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..4bdd51c75aee 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@  void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");

+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@  void __noreturn do_exit(long code)

 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();

 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index c7421f2d05e1..c03eea9bc738 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@  obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5344aa0afe5a..26362cfcee84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4269,6 +4269,7 @@  static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6327,9 +6328,11 @@  static inline void sched_submit_work(struct task_struct *tsk)
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6347,9 +6350,11 @@  static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..8f43a9f786c1
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,949 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+
+/**
+ * get_user_nofault - get user value without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nofault(out, uaddr)			\
+({							\
+	int ret = -EFAULT;				\
+							\
+	if (access_ok((uaddr), sizeof(*(uaddr)))) {	\
+		pagefault_disable();			\
+							\
+		if (!__get_user((out), (uaddr)))	\
+			ret = 0;			\
+							\
+		pagefault_enable();			\
+	}						\
+	ret;						\
+})
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ *
+ * The pages are pinned when the worker exits to the userspace and unpinned
+ * when the worker is in sched_submit_work(), i.e. when the worker is
+ * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
+ * are pinned at any one time across the whole system.
+ *
+ * The pinning is needed so that going-to-sleep workers can access
+ * their and their servers' userspace umcg_task structs without page faults,
+ * as the code path can be executed in the context of a pagefault, with
+ * mm lock held.
+ */
+static int umcg_pin_pages(u32 server_tid)
+{
+	struct umcg_task __user *worker_ut = current->umcg_task;
+	struct umcg_task __user *server_ut = NULL;
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		server_ut = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!server_ut)
+		return -EINVAL;
+
+	tsk = current;
+
+	/* worker_ut is stable, don't need to repin */
+	if (!tsk->pinned_umcg_worker_page)
+		if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+					&tsk->pinned_umcg_worker_page))
+			return -EFAULT;
+
+	/* server_ut may change, need to repin */
+	if (tsk->pinned_umcg_server_page) {
+		unpin_user_page(tsk->pinned_umcg_server_page);
+		tsk->pinned_umcg_server_page = NULL;
+	}
+
+	if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+				&tsk->pinned_umcg_server_page))
+		return -EFAULT;
+
+	return 0;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->pinned_umcg_worker_page)
+		unpin_user_page(tsk->pinned_umcg_worker_page);
+	if (tsk->pinned_umcg_server_page)
+		unpin_user_page(tsk->pinned_umcg_server_page);
+
+	tsk->pinned_umcg_worker_page = NULL;
+	tsk->pinned_umcg_server_page = NULL;
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+
+		/* These can be simple writes - see the comment above. */
+		tsk->pinned_umcg_worker_page = NULL;
+		tsk->pinned_umcg_server_page = NULL;
+		tsk->flags &= ~PF_UMCG_WORKER;
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts   - points to the state_ts member of struct umcg_task to update;
+ * @expected   - the expected value of state_ts, including the timestamp;
+ * @desired    - the desired value of state_ts, state part only;
+ * @may_fault  - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+				bool may_fault)
+{
+	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+	/* Cut higher order bits. */
+	next_ts &= (1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1;
+
+	if (next_ts == curr_ts)
+		++next_ts;
+
+	/* Remove an old timestamp, if any. */
+	desired &= UMCG_TASK_STATE_MASK_FULL;
+
+	/* Set the new timestamp. */
+	desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+	if (may_fault)
+		return cmpxchg_user_64(state_ts, expected, desired);
+
+	return cmpxchg_user_64_nofault(state_ts, expected, desired);
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to
+ *         idle_workers_ptr, and the worker put to sleep; an idle server
+ *         from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+	struct umcg_task ut;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_handle_exiting_worker();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;
+
+		/* Trigger umcg_handle_resuming_worker() */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	} else {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+	}
+
+	return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ *                          upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+	u64 curr_state, next_state;
+	int ret;
+
+	if (get_user(curr_state, &self->state_ts))
+		return -EFAULT;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+		/* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+		next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+		next_state |= UMCG_TASK_BLOCKED;
+
+		ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
+		if (ret)
+			return ret;
+
+		return -ETIMEDOUT;
+	}
+
+	return 0;  /* Not really timed out. */
+}
+
+/*
+ * umcg_should_idle - return true if tasks with @state should block in
+ *                    umcg_idle_loop().
+ */
+static bool umcg_should_idle(u64 state)
+{
+	switch (state & UMCG_TASK_STATE_MASK) {
+	case UMCG_TASK_RUNNING:
+		return state & UMCG_TF_LOCKED;
+	case UMCG_TASK_IDLE:
+		return !(state & UMCG_TF_LOCKED);
+	case UMCG_TASK_BLOCKED:
+		return false;
+	default:
+		WARN_ONCE(true, "unknown UMCG task state");
+		return false;
+	}
+}
+
+/**
+ * umcg_idle_loop - sleep until !umcg_should_idle() or a timeout expires
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * freezable_schedule().
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ *       and because servers should not be running WITH attached workers,
+ *       the function returns only on fatal signal pending and ignores/flushes
+ *       all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+	int ret;
+	struct page *pinned_page = NULL;
+	struct hrtimer_sleeper timeout;
+	struct umcg_task __user *self = current->umcg_task;
+	const bool worker = current->flags & PF_UMCG_WORKER;
+
+	/* Clear PF_UMCG_WORKER to elide workqueue handlers. */
+	if (worker)
+		current->flags &= ~PF_UMCG_WORKER;
+
+	if (abs_timeout) {
+		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+				HRTIMER_MODE_ABS);
+
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+				current->timer_slack_ns);
+	}
+
+	while (true) {
+		u64 umcg_state;
+
+		/*
+		 * We need to read from userspace _after_ the task is marked
+		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
+		 * but faulting is not allowed; so we try a fast no-fault read,
+		 * and if it fails, pin the page temporarily.
+		 */
+retry_once:
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user_nofault(umcg_state, &self->state_ts)) {
+			set_current_state(TASK_RUNNING);
+
+			if (pinned_page)
+				goto out;
+			else if (1 != pin_user_pages_fast((unsigned long)self,
+						1, 0, &pinned_page))
+				goto out;
+
+			goto retry_once;
+		}
+
+		if (pinned_page) {
+			unpin_user_page(pinned_page);
+			pinned_page = NULL;
+		}
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state)) {
+			set_current_state(TASK_RUNNING);
+			goto out;
+		}
+
+		if (abs_timeout)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		if (!abs_timeout || timeout.task)
+			freezable_schedule();
+
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * Check for timeout before checking the state, as workers
+		 * are not going to return from freezable_schedule() unless
+		 * they are RUNNING.
+		 */
+		ret = -ETIMEDOUT;
+		if (abs_timeout && !timeout.task)
+			goto out;
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user(umcg_state, &self->state_ts))
+			goto out;
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state))
+			goto out;
+
+		ret = -EINTR;
+		if (fatal_signal_pending(current))
+			goto out;
+
+		if (signal_pending(current))
+			flush_signals(current);
+	}
+
+out:
+	if (pinned_page) {
+		unpin_user_page(pinned_page);
+		pinned_page = NULL;
+	}
+	if (abs_timeout) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+	if (worker) {
+		current->flags |= PF_UMCG_WORKER;
+
+		if (ret == -ETIMEDOUT)
+			ret = handle_timedout_worker(self);
+
+		/* Workers must go through workqueue handlers upon wakeup. */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	}
+	return ret;
+}
+
+/**
+ * umcg_wakeup_allowed - check whether @current can wake @tsk.
+ *
+ * Currently a placeholder that allows wakeups within a single process
+ * only (same mm). In the future the requirement will be relaxed (securely).
+ */
+static bool umcg_wakeup_allowed(struct task_struct *tsk)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
+		return true;
+
+	return false;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set. May be called
+ * cross-process.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ *       ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+	struct task_struct *next;
+
+	rcu_read_lock();
+	next = find_task_by_vpid(next_tid);
+	if (!next || !umcg_wakeup_allowed(next)) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	/* The result of ttwu below is ignored. */
+	try_to_wake_up(next, TASK_NORMAL, wake_flags);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+	int ret;
+
+	ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
+	if (ret)
+		return ret;
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        zero or a value from enum umcg_wait_flag.
+ * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY) (also see umcg_idle_loop and
+ * umcg_should_idle above).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ *
+ * Return:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task;
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. bad @flags, or the current
+ *                 task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+	struct umcg_task __user *self = current->umcg_task;
+	u32 next_tid;
+
+	if (!self)
+		return -EINVAL;
+
+	if (get_user(next_tid, &self->next_tid))
+		return -EFAULT;
+
+	if (flags & UMCG_WAIT_WAKE_ONLY) {
+		if (!next_tid || abs_timeout)
+			return -EINVAL;
+
+		flags &= ~UMCG_WAIT_WAKE_ONLY;
+		if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+			return -EINVAL;
+
+		return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+					WF_CURRENT_CPU : 0);
+	}
+
+	/* Unlock the worker, if locked. */
+	if (current->flags & PF_UMCG_WORKER) {
+		u64 umcg_state;
+
+		if (get_user(umcg_state, &self->state_ts))
+			return -EFAULT;
+
+		if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
+					&self->state_ts, &umcg_state,
+					umcg_state & ~UMCG_TF_LOCKED, true))
+			return -EFAULT;
+	}
+
+	if (next_tid)
+		return umcg_do_context_switch(next_tid, abs_timeout);
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, or
+ *       syscall exit to usermode loop, so all errors result in the
+ *       termination of the current task (via SIGKILL).
+ */
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_nofault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+
+	if (!ut_server)
+		goto out_rcu;
+
+	ret = -EFAULT;
+	if (get_user_nofault(state, &ut_server->state_ts))
+		goto out_rcu;
+
+	ret = -EAGAIN;
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		goto out_rcu;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			false);
+
+	if (ret)
+		goto out_rcu;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+
+out_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_may_fault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!ut_server)
+		return -EINVAL;
+
+	if (get_user(state, &ut_server->state_ts))
+		return -EFAULT;
+
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		return -EAGAIN;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			true);
+	if (ret)
+		return ret;
+
+	/*
+	 * umcg_ttwu will call find_task_by_vpid again; but we cannot
+	 * elide this, as we cannot do get_user() from an rcu-locked
+	 * code block.
+	 */
+	return umcg_ttwu(server_tid, WF_CURRENT_CPU);
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
+{
+	int ret = umcg_wake_idle_server_nofault(server_tid);
+
+	if (!ret)
+		return 0;
+
+	if (!may_fault || ret != -EFAULT)
+		return ret;
+
+	return umcg_wake_idle_server_may_fault(server_tid);
+}
+
+/*
+ * Called in sched_submit_work() context for UMCG workers. In the common case,
+ * the worker's state changes RUNNING => BLOCKED, and its server's state
+ * changes IDLE => RUNNING, and the server is ttwu-ed.
+ *
+ * Under some conditions (e.g. the worker is "locked", see
+ * Documentation/userspace-api/umcg.txt for more details), the
+ * function does nothing.
+ *
+ * The function is called with preempt disabled to make sure the retry_once
+ * logic below works correctly.
+ */
+static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+	bool retried = false;
+	u32 tid;
+	int ret;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
+		return;
+
+	/* If the worker has no server, do nothing. */
+	if (unlikely(!tsk->pinned_umcg_server_page))
+		return;
+
+	if (get_user_nofault(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	/*
+	 * The userspace is allowed to concurrently change a RUNNING worker's
+	 * state only once in a "short" period of time, so we retry state
+	 * change at most once. As this retry block is within a
+	 * preempt_disable region, "short" is truly short here.
+	 *
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+retry_once:
+	if (curr_state & UMCG_TF_LOCKED)
+		return;
+
+	if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
+			"Unexpected UMCG worker state."))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_BLOCKED;
+
+	ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
+	if (ret == -EAGAIN) {
+		if (retried)
+			goto die;
+
+		retried = true;
+		goto retry_once;
+	}
+	if (ret)
+		goto die;
+
+	smp_mb();  /* Order state read/write above and getting next_tid below. */
+	if (get_user_nofault(tid, &ut_worker->next_tid))
+		goto die;
+
+	*server_tid = tid;
+	return;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work(). Must not fault/sleep. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	u32 server_tid;
+
+	/*
+	 * Disable preemption so that retry_once in process_sleeping_worker
+	 * works properly.
+	 */
+	preempt_disable();
+	process_sleeping_worker(tsk, &server_tid);
+	preempt_enable();
+
+	if (server_tid) {
+		int ret = umcg_wake_idle_server_nofault(server_tid);
+
+		if (ret && ret != -EAGAIN)
+			goto die;
+	}
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	umcg_unpin_pages();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+	u64 __user *node = &ut_worker->idle_workers_ptr;
+	u64 __user *head_ptr;
+	u64 first = (u64)node;
+	u64 head;
+
+	if (get_user(head, node) || !head)
+		return false;
+
+	head_ptr = (u64 __user *)head;
+
+	/* Mark the worker as pending. */
+	if (put_user(UMCG_IDLE_NODE_PENDING, node))
+		return false;
+
+	/* Make the head point to the worker. */
+	if (xchg_user_64(head_ptr, &first))
+		return false;
+
+	/* Make the worker point to the previous head. */
+	if (put_user(first, node))
+		return false;
+
+	return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+	u64 server_tid_ptr;
+	u32 tid;
+
+	/* Empty result is OK. */
+	*server_tid = 0;
+
+	if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+		return false;
+
+	if (!server_tid_ptr)
+		return false;
+
+	tid = 0;
+	if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
+		return false;
+
+	*server_tid = tid;
+	return true;
+}
+
+/*
+ * Returns true to wait for the userspace to schedule this worker, false
+ * to return to the userspace.
+ *
+ * In the common case, a BLOCKED worker is marked IDLE and enqueued
+ * to idle_workers_ptr list. The idle server is woken (if present).
+ *
+ * If a RUNNING worker is preempted, this function will trigger, in which
+ * case the worker is moved to IDLE state and its server is woken.
+ *
+ * Sets @server_tid to point to the server to be woken if the worker
+ * is going to sleep; sets @server_tid to point to the server assigned
+ * to this RUNNING worker if the worker is to return to the userspace.
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+		return false;
+
+	if (fatal_signal_pending(tsk))
+		return false;
+
+	if (get_user(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+		u32 tid;
+
+		/* Wakeup: wait but don't enqueue. */
+		if (curr_state & UMCG_TF_LOCKED)
+			return true;
+
+		smp_mb();  /* Order getting state and getting server_tid */
+		if (get_user(tid, &ut_worker->next_tid))
+			goto die;
+
+		if (!tid)
+			/* RUNNING workers must have servers. */
+			goto die;
+
+		*server_tid = tid;
+
+		/* pass-through: RUNNING with a server. */
+		if (!(curr_state & UMCG_TF_PREEMPTED))
+			return false;
+
+		/*
+		 * Fallthrough to mark the worker IDLE: the worker is
+		 * PREEMPTED.
+		 */
+	} else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+			(curr_state & UMCG_TF_LOCKED)))
+		/* The worker prepares to sleep or to unregister. */
+		return false;
+
+	if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_IDLE;
+
+	if (umcg_update_state(&ut_worker->state_ts, &curr_state,
+			next_state, true))
+		goto die;
+
+	if (!enqueue_idle_worker(ut_worker))
+		goto die;
+
+	smp_mb();  /* Order enqueuing the worker with getting the server. */
+	if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+		goto die;
+
+	return true;
+
+die:
+	pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
+	force_sig(SIGKILL);
+	return false;
+}
+
+/*
+ * Called from sched_update_worker(): defer all work until later, as
+ * sched_update_worker() may be called with in-kernel locks held.
+ */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+}
+
+/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
+void umcg_handle_resuming_worker(void)
+{
+	u32 server_tid;
+
+	/* Avoid recursion by removing PF_UMCG_WORKER */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	do {
+		bool should_wait;
+
+		should_wait = process_waking_worker(current, &server_tid);
+		if (!should_wait)
+			break;
+
+		if (server_tid) {
+			int ret = umcg_wake_idle_server(server_tid, true);
+
+			if (ret && ret != -EAGAIN)
+				goto die;
+		}
+
+		umcg_idle_loop(0);
+	} while (true);
+
+	if (!server_tid)
+		/* No server => no reason to pin pages. */
+		umcg_unpin_pages();
+	else if (umcg_pin_pages(server_tid))
+		goto die;
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..82d233aa2648 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@  COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);

+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */

 /* mm/fadvise.c */