
[v0.9.1,3/6] sched/umcg: implement UMCG syscalls

Message ID 20211122211327.5931-4-posk@google.com (mailing list archive)
State New
Series sched,mm,x86/uaccess: implement User Managed Concurrency Groups

Commit Message

Peter Oskolkov Nov. 22, 2021, 9:13 p.m. UTC
Define struct umcg_task and two syscalls: sys_umcg_ctl() and sys_umcg_wait().

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).

In addition, M:N threading and cooperative user space scheduling
enables synchronous coding style and better cache locality when
compared to asynchronous callback/continuation style of programming.

The UMCG kernel API is built around the following ideas:

* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
  scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
  server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
  can be ORed with the task state to communicate additional information to
  the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
  residing in the TLS, that fully reflects the current task's UMCG state
  and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
  server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
  wake another task, potentially context-switching between the two tasks
  on-CPU synchronously (a minimal userspace sketch follows this list).
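
For illustration only, below is a minimal userspace sketch of registration
and waiting; it is not part of this patch. The struct layout mirrors the
description above, the syscall numbers come from the x86-64 syscall table
in this series, and the wrapper names, flag values and exact argument
lists are assumptions made for the sketch:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Syscall numbers from the x86-64 table in this series. */
#define __NR_umcg_ctl   450
#define __NR_umcg_wait  451

/* Registration flags; the values here are assumed for the sketch. */
#define UMCG_CTL_REGISTER       0x00001
#define UMCG_CTL_WORKER         0x10000

/*
 * Per-thread control block, usually in TLS; the field layout mirrors
 * the struct umcg_task described above.
 */
struct umcg_task {
        uint64_t state_ts;              /* task state + timestamp */
        uint32_t next_tid;              /* context-switch target */
        uint32_t flags;                 /* reserved, must be zero */
        uint64_t idle_workers_ptr;      /* kernel-appended idle worker list */
        uint64_t idle_server_tid_ptr;   /* where an idle server TID lives */
} __attribute__((packed, aligned(64)));

static __thread struct umcg_task umcg_self;

/* Register the calling thread as a UMCG server (hypothetical wrapper). */
static long umcg_register_server(void)
{
        memset(&umcg_self, 0, sizeof(umcg_self));
        return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
}

/* Register the calling thread as a UMCG worker. */
static long umcg_register_worker(void)
{
        memset(&umcg_self, 0, sizeof(umcg_self));
        return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
                       &umcg_self);
}

/* Put the calling UMCG task to sleep; 0 means no timeout. */
static long umcg_wait(uint64_t abs_timeout)
{
        return syscall(__NR_umcg_wait, 0, abs_timeout);
}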

In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
  run because it has no server assigned to it (e.g. because all
  available servers are busy "running" other workers).

Usually the number of servers in a process equals the number of CPUs
available to the kernel if the process is meant to consume the whole
machine, or fewer if the process shares the machine with other
workloads. The number of workers in a process can grow very large:
tens of thousands is normal, and hundreds of thousands or even millions
would be desirable to reach in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers are
(intended to be) conceptually similar to them.

Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.txt (see sibling patches).

Some high-level implementation notes:

UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).

The main UMCG server/worker interaction looks like:

a. worker W1 is RUNNING, with a server S attached to it sleeping
   in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
   as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
   W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait()
   (see the sketch after this list);
e. when the blocking operation of W1 completes, the worker
   is marked by the kernel as IDLE and added to the idle workers
   list (see struct umcg_task) for the userspace to pick up and
   later run (the "wake detection" event).
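
As an editorial illustration of step (d), a hypothetical server-side
sketch follows, building on the wrappers in the earlier sketch; the
state constants, the umcg_set_state() helper and the idle-list handling
are assumptions, not the exact ABI of this patch:

/* Illustrative state values; the real ones live in the uapi header. */
#define UMCG_TASK_RUNNING       0x1
#define UMCG_TASK_IDLE          0x2

/*
 * Assumed helper: atomically replace the six state bits of ->state_ts;
 * a real implementation would also refresh the timestamp bits (the
 * kernel-side counterpart is umcg_update_state() in kernel/sched/umcg.c).
 */
static void umcg_set_state(struct umcg_task *ut, uint32_t state)
{
        uint64_t old = __atomic_load_n(&ut->state_ts, __ATOMIC_ACQUIRE);
        uint64_t new;

        do {
                new = (old & ~0x3full) | state;
        } while (!__atomic_compare_exchange_n(&ut->state_ts, &old, new, 0,
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED));
}

/*
 * Step (d): run worker @w in place of this server, then sleep until the
 * next block/wake event (steps (c) and (e)) wakes the server again.
 */
static void server_run_worker(struct umcg_task *w, uint32_t worker_tid,
                              uint32_t server_tid)
{
        w->next_tid = server_tid;               /* whom to wake on block */
        umcg_set_state(w, UMCG_TASK_RUNNING);

        umcg_self.next_tid = worker_tid;        /* whom to switch into */
        umcg_set_state(&umcg_self, UMCG_TASK_IDLE);
        umcg_wait(0);                           /* returns on block/wake */

        /*
         * Back as RUNNING: walk umcg_self.idle_workers_ptr here to pick
         * up wake-detected workers (step (e)) and choose the next one.
         */
}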

While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.

Specifically:

- most operations are conceptually context switches:
    - scheduling a worker: a running server goes to sleep and "runs"
      a worker in its place;
    - block detection: worker is descheduled, and its server is woken;
    - wake detection: woken worker, running in the kernel, is descheduled,
      and if there is an idle server, it is woken to process the wake
      detection event;
- to facilitate low scheduling latencies and cache locality, most
  server/worker interactions described above are performed synchronously
  "on CPU" via the WF_CURRENT_CPU flag passed to ttwu; while at the moment
  the context switches are simulated by putting the switched-out task to
  sleep and waking the switched-into task on the same CPU, it is very much
  the long-term goal of this project to make the context switch much
  lighter, by tweaking runtime accounting and, maybe, even bypassing
  __schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
  above, the server is woken on the same CPU, synchronously;
  this code may not pagefault, so to be able to access the worker's and
  the server's userspace memory (struct umcg_task), the pages containing
  those structs are pinned when the worker exits to userspace and
  unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
  in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
  pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
  with the UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
  on the exit to usermode the worker is intercepted and its server is woken
  (see Documentation/userspace-api/umcg.txt for more details);
- each state change is tagged with a unique timestamp (of the
  CLOCK_MONOTONIC variety), so that
    - scheduling instrumentation is naturally available;
    - racing state changes are easily detected and ABA issues are
      avoided;
  see umcg_update_state() in umcg.c for implementation details,
  Documentation/userspace-api/umcg.txt for a higher-level description,
  and the userspace-side decoding sketch after this list.
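
As a userspace-side illustration of the state/timestamp packing (the bit
layout below follows the struct umcg_task documentation in this patch:
bits 0-5 task state, bits 6-7 state flags, bits 18-63 a 46-bit
CLOCK_MONOTONIC timestamp at 16ns resolution; the helper and macro names
are editorial):

#include <stdint.h>

/*
 * state_ts bit layout, as documented in the patch:
 *   bits  0 -  5: task state
 *   bits  6 -  7: state flags
 *   bits  8 - 12: reserved
 *   bits 13 - 17: for userspace use
 *   bits 18 - 63: CLOCK_MONOTONIC timestamp, 16ns resolution
 */
#define UMCG_TASK_STATE_MASK    0x3fULL
#define UMCG_TASK_FLAGS_SHIFT   6
#define UMCG_TASK_FLAGS_MASK    (0x3ULL << UMCG_TASK_FLAGS_SHIFT)
#define UMCG_TS_SHIFT           18
#define UMCG_TS_RESOLUTION_NS   16

static inline uint32_t umcg_task_state(uint64_t state_ts)
{
        return (uint32_t)(state_ts & UMCG_TASK_STATE_MASK);
}

static inline uint32_t umcg_task_flags(uint64_t state_ts)
{
        return (uint32_t)((state_ts & UMCG_TASK_FLAGS_MASK)
                          >> UMCG_TASK_FLAGS_SHIFT);
}

/*
 * Timestamp of the last state change, in nanoseconds; two reads that
 * return the same state but different timestamps indicate a racing
 * state change (the ABA case mentioned above).
 */
static inline uint64_t umcg_state_change_ns(uint64_t state_ts)
{
        return (state_ts >> UMCG_TS_SHIFT) * UMCG_TS_RESOLUTION_NS;
}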

The previous version of the patchset can be found at
https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/
and contains some additional context and links to earlier discussions.

More details are available in Documentation/userspace-api/umcg.txt
in sibling patches, and in doc-comments in the code.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |  71 ++
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/umcg.h              | 137 ++++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |   4 +-
 kernel/exit.c                          |   5 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |   9 +-
 kernel/sched/umcg.c                    | 949 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   4 +
 13 files changed, 1199 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

--
2.25.1

Comments

kernel test robot Nov. 24, 2021, 6:36 p.m. UTC | #1
Hi Peter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf]

url:    https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
base:   cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf
config: arm64-randconfig-r031-20211124 (https://download.01.org/0day-ci/archive/20211125/202111250209.9dBNZjdP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 67a1c45def8a75061203461ab0060c75c864df1c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/942655474fa2cd59ea3d11a1cc03775dd79a508e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
        git checkout 942655474fa2cd59ea3d11a1cc03775dd79a508e
        # save the config file to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:34:1: note: expanded from here
   __arm64_sys_recvmsg
   ^
   kernel/sys_ni.c:257:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:263:1: warning: no previous prototype for function '__arm64_sys_mremap' [-Wmissing-prototypes]
   COND_SYSCALL(mremap);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:39:1: note: expanded from here
   __arm64_sys_mremap
   ^
   kernel/sys_ni.c:263:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:266:1: warning: no previous prototype for function '__arm64_sys_add_key' [-Wmissing-prototypes]
   COND_SYSCALL(add_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:40:1: note: expanded from here
   __arm64_sys_add_key
   ^
   kernel/sys_ni.c:266:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:267:1: warning: no previous prototype for function '__arm64_sys_request_key' [-Wmissing-prototypes]
   COND_SYSCALL(request_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:41:1: note: expanded from here
   __arm64_sys_request_key
   ^
   kernel/sys_ni.c:267:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:268:1: warning: no previous prototype for function '__arm64_sys_keyctl' [-Wmissing-prototypes]
   COND_SYSCALL(keyctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:42:1: note: expanded from here
   __arm64_sys_keyctl
   ^
   kernel/sys_ni.c:268:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:272:1: warning: no previous prototype for function '__arm64_sys_landlock_create_ruleset' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_create_ruleset);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:47:1: note: expanded from here
   __arm64_sys_landlock_create_ruleset
   ^
   kernel/sys_ni.c:272:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:273:1: warning: no previous prototype for function '__arm64_sys_landlock_add_rule' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_add_rule);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:48:1: note: expanded from here
   __arm64_sys_landlock_add_rule
   ^
   kernel/sys_ni.c:273:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:274:1: warning: no previous prototype for function '__arm64_sys_landlock_restrict_self' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_restrict_self);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:49:1: note: expanded from here
   __arm64_sys_landlock_restrict_self
   ^
   kernel/sys_ni.c:274:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:277:1: warning: no previous prototype for function '__arm64_sys_umcg_ctl' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_ctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:50:1: note: expanded from here
   __arm64_sys_umcg_ctl
   ^
   kernel/sys_ni.c:277:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:278:1: warning: no previous prototype for function '__arm64_sys_umcg_wait' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_wait);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:51:1: note: expanded from here
   __arm64_sys_umcg_wait
   ^
   kernel/sys_ni.c:278:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:283:1: warning: no previous prototype for function '__arm64_sys_fadvise64_64' [-Wmissing-prototypes]
   COND_SYSCALL(fadvise64_64);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:52:1: note: expanded from here
   __arm64_sys_fadvise64_64
   ^
   kernel/sys_ni.c:283:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:286:1: warning: no previous prototype for function '__arm64_sys_swapon' [-Wmissing-prototypes]
   COND_SYSCALL(swapon);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:53:1: note: expanded from here
   __arm64_sys_swapon
   ^
   kernel/sys_ni.c:286:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:287:1: warning: no previous prototype for function '__arm64_sys_swapoff' [-Wmissing-prototypes]
   COND_SYSCALL(swapoff);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:54:1: note: expanded from here
   __arm64_sys_swapoff
   ^
   kernel/sys_ni.c:287:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:288:1: warning: no previous prototype for function '__arm64_sys_mprotect' [-Wmissing-prototypes]
   COND_SYSCALL(mprotect);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:55:1: note: expanded from here
   __arm64_sys_mprotect
   ^
   kernel/sys_ni.c:288:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:289:1: warning: no previous prototype for function '__arm64_sys_msync' [-Wmissing-prototypes]
   COND_SYSCALL(msync);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:56:1: note: expanded from here
   __arm64_sys_msync
   ^
   kernel/sys_ni.c:289:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:290:1: warning: no previous prototype for function '__arm64_sys_mlock' [-Wmissing-prototypes]
   COND_SYSCALL(mlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:57:1: note: expanded from here
   __arm64_sys_mlock
   ^
   kernel/sys_ni.c:290:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:291:1: warning: no previous prototype for function '__arm64_sys_munlock' [-Wmissing-prototypes]
   COND_SYSCALL(munlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:58:1: note: expanded from here
   __arm64_sys_munlock
   ^
   kernel/sys_ni.c:291:1: note: declare 'static' if the function is not intended to be used outside of this translation unit


vim +/__arm64_sys_umcg_ctl +277 kernel/sys_ni.c

   275	
   276	/* kernel/sched/umcg.c */
 > 277	COND_SYSCALL(umcg_ctl);
 > 278	COND_SYSCALL(umcg_wait);
   279	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Peter Zijlstra Nov. 24, 2021, 8:08 p.m. UTC | #2
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +	/**
> +	 * @state_ts: the current state of the UMCG task described by
> +	 *            this struct, with a unique timestamp indicating
> +	 *            when the last state change happened.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace.
> +	 *
> +	 * UMCG task state:
> +	 *   bits  0 -  5: task state;
> +	 *   bits  6 -  7: state flags;
> +	 *   bits  8 - 12: reserved; must be zeroes;
> +	 *   bits 13 - 17: for userspace use;
> +	 *   bits 18 - 63: timestamp (see below).
> +	 *
> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +	 * See Documentation/userspace-api/umcg.txt for details.
> +	 */
> +	__u64	state_ts;		/* r/w */
> +
> +	/**
> +	 * @next_tid: the TID of the UMCG task that should be context-switched
> +	 *            into in sys_umcg_wait(). Can be zero.
> +	 *
> +	 * Running UMCG workers must have next_tid set to point to IDLE
> +	 * UMCG servers.
> +	 *
> +	 * Read-only for the kernel, read/write for the userspace.
> +	 */
> +	__u32	next_tid;		/* r   */
> +
> +	__u32	flags;			/* Reserved; must be zero. */
> +
> +	/**
> +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace: the
> +	 * kernel adds items to the list, the userspace removes them.
> +	 */
> +	__u64	idle_workers_ptr;	/* r/w */
> +
> +	/**
> +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> +	 *                       Readonly.
> +	 */
> +	__u64	idle_server_tid_ptr;	/* r   */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));

The thing is; I really don't see how this is supposed to be used. Where
did the blocked and runnable list go ?

I also don't see why the kernel cares about idle workers at all; that
seems something userspace can sort itself just fine.

The whole next_tid thing seems confused too, how can it be the next task
when it must be the server? Also, what if there isn't an idle server?

This just all isn't making any sense to me.
Peter Zijlstra Nov. 24, 2021, 9:19 p.m. UTC | #3
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:

> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.

> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> +				bool may_fault)
> +{
> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;

I'm still very hesitant to use ktime (fear the HPET); but I suppose it
makes sense to use a time base that's accessible to userspace. Was
MONOTONIC_RAW considered?
Peter Zijlstra Nov. 24, 2021, 9:32 p.m. UTC | #4
On Wed, Nov 24, 2021 at 09:08:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> > +/**
> > + * struct umcg_task - controls the state of UMCG tasks.
> > + *
> > + * The struct is aligned at 64 bytes to ensure that it fits into
> > + * a single cache line.
> > + */
> > +struct umcg_task {
> > +	/**
> > +	 * @state_ts: the current state of the UMCG task described by
> > +	 *            this struct, with a unique timestamp indicating
> > +	 *            when the last state change happened.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace.
> > +	 *
> > +	 * UMCG task state:
> > +	 *   bits  0 -  5: task state;
> > +	 *   bits  6 -  7: state flags;
> > +	 *   bits  8 - 12: reserved; must be zeroes;
> > +	 *   bits 13 - 17: for userspace use;
> > +	 *   bits 18 - 63: timestamp (see below).
> > +	 *
> > +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> > +	 * See Documentation/userspace-api/umcg.txt for details.
> > +	 */
> > +	__u64	state_ts;		/* r/w */
> > +
> > +	/**
> > +	 * @next_tid: the TID of the UMCG task that should be context-switched
> > +	 *            into in sys_umcg_wait(). Can be zero.
> > +	 *
> > +	 * Running UMCG workers must have next_tid set to point to IDLE
> > +	 * UMCG servers.
> > +	 *
> > +	 * Read-only for the kernel, read/write for the userspace.
> > +	 */
> > +	__u32	next_tid;		/* r   */
> > +
> > +	__u32	flags;			/* Reserved; must be zero. */
> > +
> > +	/**
> > +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace: the
> > +	 * kernel adds items to the list, the userspace removes them.
> > +	 */
> > +	__u64	idle_workers_ptr;	/* r/w */
> > +
> > +	/**
> > +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> > +	 *                       Readonly.
> > +	 */
> > +	__u64	idle_server_tid_ptr;	/* r   */
> > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
> 
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
> 
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
> 
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
> 
> This just all isn't making any sense to me.

Oooh, someone made things super confusing by doing s/runnable/idle/ on
the whole thing :-( That only took me most of the day to figure out.
Naming is important, don't mess about with stuff like this.
Peter Zijlstra Nov. 24, 2021, 9:41 p.m. UTC | #5
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	while (true) {

(you have 2 inf. loops in umcg and you chose a different expression for each)

> +		u64 umcg_state;
> +
> +		/*
> +		 * We need to read from userspace _after_ the task is marked
> +		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
> +		 * but faulting is not allowed; so we try a fast no-fault read,
> +		 * and if it fails, pin the page temporarily.
> +		 */

That comment is misleading! Faulting *is* allowed, but it can scribble
__state. If faulting would not be allowed, you wouldn't be able to call
pin_user_pages_fast().

> +retry_once:
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		/* Order set_current_state above with get_user below. */
> +		smp_mb();

And just in case you hadn't yet seen, that smp_mb() is implied by
set_current_state().

> +		ret = -EFAULT;
> +		if (get_user_nofault(umcg_state, &self->state_ts)) {
> +			set_current_state(TASK_RUNNING);
> +
> +			if (pinned_page)
> +				goto out;
> +			else if (1 != pin_user_pages_fast((unsigned long)self,
> +						1, 0, &pinned_page))

That else is pointless, and that '1 != foo' coding style is evil.

> +					goto out;
> +
> +			goto retry_once;
> +		}

And, as you could've seen from the big patch, all that goto isn't
actually needed here, break / continue seem to be sufficient.

> +
> +		if (pinned_page) {
> +			unpin_user_page(pinned_page);
> +			pinned_page = NULL;
> +		}
Peter Zijlstra Nov. 24, 2021, 9:58 p.m. UTC | #6
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	if (abs_timeout) {
> +		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> +				HRTIMER_MODE_ABS);

Using CLOCK_REALTIME timers while the rest of the thing runs off of
CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
have timeouts subject to DST shifts and crap like that?
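
(Editorial note: computing an absolute nanosecond deadline against
CLOCK_MONOTONIC for the abs_timeout argument, as the review suggests,
could look like the sketch below; the helper name is hypothetical.)

#include <stdint.h>
#include <time.h>

/*
 * Absolute CLOCK_MONOTONIC deadline, in nanoseconds, @delta_ns from now;
 * usable as a u64 abs_timeout if the wait path runs off the same clock
 * as the rest of the API.
 */
static uint64_t umcg_abs_timeout_ns(uint64_t delta_ns)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec + delta_ns;
}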
Peter Zijlstra Nov. 24, 2021, 10:18 p.m. UTC | #7
On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +die:
> +	pr_warn("%s: killing task %d\n", __func__, current->pid);
> +	force_sig(SIGKILL);

That pr_warn() might need to be pr_warn_ratelimited() in order to not be
a system log DoS.

Because, AFAICT, you can craft userspace to trigger this arbitrarily
often, just spawn a worker and make it misbehave.
Peter Oskolkov Nov. 25, 2021, 5:28 p.m. UTC | #8
Thanks, Peter, for the review!

Some of your comments, like ratelimiting pr_warn and removing gotos,
are obvious in how to address them, so I'll just do that and won't
mention them here. Some comments are less clear re: what should be
done about them, so I have them below with my own comments/questions.

At a higher level, I get that the uaccess patch is bad and needs
serious changes. But based on your comments on this main patch so far,
it looks like the overall approach did not raise many objections - is
that so? Have you finished reviewing the patch?

Please also look at my questions/comments below.

Thanks,
Peter

[...]
> > +struct umcg_task {
[...]

>
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
>
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
>
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
>
> This just all isn't making any sense to me.

Based on your later comments I assume it is clearer now. The doc patch
5 has a lot of extra explanations and examples. Please let me know if
something is still unclear here.

> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

I believe it was considered. I'll re-consider it, and add a comment if
the new consideration arrives at the same conclusion.

> Using CLOCK_REALTIME timers while the rest of the thing runs off of
> CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
> have timeouts subject to DST shifts and crap like that?

Yes, these should be the same if at all possible. I'll definitely
reconsider what clock to use in both timeouts and state timestamps.

> Oooh, someone made things super confusing by doing s/runnable/idle/ on
> the whole thing :-( That only took me most of the day to figure out.
> Naming is important, don't mess about with stuff like this.

I clearly remember I had four states: blocked, pending, runnable,
running (I still believe that four states better reflect what is going
on here). The current blocked/idle/running is the result of an early
discussion. Something along the lines of:

<start of a recollection>
pending workers (=unblocked workers that the userspace still thinks
are blocked) are better named as idle; also the kernel does not really
care about what userspace thinks, so idle workers and runnable workers
are the same from the kernel point of view, so let's have one state
for these workers, not two.
<end of the recollection>

Please let me know if you want me to change anything here. I'll gladly
name workers on the idle worker list as idle (or whatever you prefer),
and workers that the userspace took out of the list as "runnable".
Just as an FYI, workers blocked in umcg_wait() will also be called
"runnable" then, as they are sitting in umcg_idle_loop() and can be
woken or swapped into.
Peter Zijlstra Nov. 26, 2021, 5:09 p.m. UTC | #9
On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:

> it looks like the overall approach did not raise many objections - is
> it so? Have you finished reviewing the patch?

I've been trying to make sense of it, and while doing so deleted a bunch
of things and rewrote the rest.

Things that went *poof*:

 - wait_wake_only
 - server_tid_ptr (now: server_tid)
 - state_ts (now: state,blocked_ts,runnable_ts)

I've also changed next_tid to only be used as a context switch target,
never to find the server to enqueue the runnable tasks on.

All xchg() users seem to have disappeared.

Signals should now be handled, after which it'll go back to waiting on
RUNNING.

The code could fairly easily be changed to work on 32bit; big-endian is
the tricky bit; for now it is 64bit only.

Anyway, I only *think* the below code will work (it compiles with gcc-10
and gcc-11) but I've not yet come around to writing/updating the
userspace part, so it might explode on first contact -- I'll try that
next week if you don't beat me to it.

That said, the below code seems somewhat sensible to me (I would say,
having written it :), but I'm fairly sure I killed some capabilities the
other thing had (notably the first two items above).

If you want either of them restored, can you please give a use-case for
them? Because I cannot seem to think of any sane cases for either
wait_wake_only or server_tid_ptr.

Anyway, in large order it's very like what you did, but it's different
in pretty much all details.

Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
(still nop), sys_exit and notify_resume.

---
Subject: sched: User Mode Concurency Groups
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 26 17:24:27 CET 2021

XXX split and changelog

Originally-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -248,6 +248,7 @@ config X86
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UMCG			if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT			if SMP
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_UMCG		6	/* UMCG return to user hook */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_UMCG		(1 << TIF_UMCG)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -341,6 +341,24 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 #ifdef CONFIG_X86_32
@@ -411,6 +429,34 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new),				\
+		       [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 /* FIXME: this hack is definitely wrong -AK */
@@ -505,6 +551,21 @@ do {										\
 } while (0)
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
+extern void __try_cmpxchg_user_wrong_size(void);
+
+#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
 /*
  * We want the unsafe accessors to always be inlined and use
  * the error labels - thus the macro games.
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -22,6 +22,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_UMCG
+# define _TIF_UMCG			(0)
+#endif
+
 /*
  * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
  */
@@ -42,11 +46,13 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,15 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_UMCG
+	clockid_t		umcg_clock;
+	struct umcg_task __user	*umcg_task;
+	struct page		*umcg_worker_page;
+	struct task_struct	*umcg_server;
+	struct umcg_task __user *umcg_server_task;
+	struct page		*umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
@@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
 {
 }
 
+#endif
+
+#ifdef CONFIG_UMCG
+
+extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
+extern void umcg_sys_exit(struct pt_regs *regs);
+extern void umcg_notify_resume(struct pt_regs *regs);
+extern void umcg_worker_exit(void);
+extern void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_worker_exit();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
+extern void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+}
+
+static inline void umcg_sys_exit(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_notify_resume(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
 
 /*
  * Architecture-specific system calls
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_UMCG,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +56,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_UMCG	BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
 #endif
 
 #include <asm/thread_info.h>
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ *
+ *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
+ *   |                                           v |            |
+ * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
+ *   ^                                           | ^            |
+ *   `--------------(sys_umcg_wait)--------------' `------------'
+ *
+ */
+#define UMCG_TASK_NONE			0x0000U
+#define UMCG_TASK_RUNNING		0x0001U
+#define UMCG_TASK_RUNNABLE		0x0002U
+#define UMCG_TASK_BLOCKED		0x0003U
+
+#define UMCG_TASK_MASK			0x00ffU
+
+/*
+ * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
+ *
+ * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
+ * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
+ * it. That is, it will wake next_tid/server_tid, transfer to RUNNABLE and
+ * enqueue on the server's runnable list.
+ *
+ */
+#define UMCG_TF_PREEMPT			0x0100U
+
+#define UMCG_TF_MASK			0xff00U
+
+#define UMCG_TASK_ALIGN			64
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state: the current state of the UMCG task described by this
+	 *         struct; timestamps of the last BLOCKED/RUNNABLE transitions
+	 *         are kept separately in @blocked_ts and @runnable_ts.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 31: for userspace use;
+	 */
+	__u32	state;				/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero, in which case
+	 *            it'll switch to server_tid.
+	 *
+	 * @server_tid: the TID of the UMCG server that hosts this task,
+	 *		when RUNNABLE this task will get added to its
+	 *		runnable_workers_ptr list.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;			/* r   */
+	__u32	server_tid;			/* r   */
+
+	__u32	__hole[1];
+
+	/*
+	 * Timestamps of when we last became BLOCKED or RUNNABLE, in CLOCK_MONOTONIC.
+	 */
+	__u64	blocked_ts;			/*   w */
+	__u64   runnable_ts;			/*   w */
+
+	/**
+	 * @runnable_workers_ptr: a single-linked list of runnable workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	runnable_workers_ptr;		/* r/w */
+
+	__u64	__zero[3];
+
+} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,21 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config HAVE_UMCG
+	bool
+
+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on 64BIT
+	depends on GENERIC_ENTRY
+	depends on HAVE_UMCG
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 
 #include "common.h"
 
@@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
 	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, syscall);
 
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_enter(regs, syscall);
+
 	syscall_enter_audit(regs, syscall);
 
 	return ret ? : syscall;
@@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & _TIF_NEED_RESCHED)
@@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
+		/* must be before handle_signal_work(); terminates on sigpending */
+		if (ti_work & _TIF_UMCG)
+			umcg_notify_resume(regs);
+
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
@@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
 		tick_nohz_user_enter_prepare();
 
 		ti_work = READ_ONCE(current_thread_info()->flags);
-	}
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
@@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
@@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
 		arch_syscall_exit_tracehook(regs, step);
+
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_exit(regs);
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
 
 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();
 
 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
@@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
 
 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
-
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include <asm/syscall.h>
+
+#include "sched.h"
+
+static struct task_struct *umcg_get_task(u32 tid)
+{
+	struct task_struct *tsk = NULL;
+
+	if (tid) {
+		rcu_read_lock();
+		tsk = find_task_by_vpid(tid);
+		if (tsk && current->mm == tsk->mm && tsk->umcg_task)
+			get_task_struct(tsk);
+		else
+			tsk = NULL;
+		rcu_read_unlock();
+	}
+
+	return tsk;
+}
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ */
+static int umcg_pin_pages(void)
+{
+	struct task_struct *server = NULL, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	int server_tid;
+
+	if (tsk->umcg_worker_page ||
+	    tsk->umcg_server_page ||
+	    tsk->umcg_server_task ||
+	    tsk->umcg_server)
+		return -EBUSY;
+
+	if (get_user(server_tid, &self->server_tid))
+		return -EFAULT;
+
+	server = umcg_get_task(server_tid);
+	if (!server)
+		return -EINVAL;
+
+	if (pin_user_pages_fast((unsigned long)self, 1, 0,
+				&tsk->umcg_worker_page) != 1)
+		goto clear_self;
+
+	/* must cache due to possible concurrent change vs access_ok() */
+	tsk->umcg_server_task = server->umcg_task;
+	if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
+				&tsk->umcg_server_page) != 1)
+		goto clear_server;
+
+	tsk->umcg_server = server;
+
+	return 0;
+
+clear_server:
+	tsk->umcg_server_task = NULL;
+	tsk->umcg_server_page = NULL;
+
+	unpin_user_page(tsk->umcg_worker_page);
+clear_self:
+	tsk->umcg_worker_page = NULL;
+	put_task_struct(server);
+
+	return -EFAULT;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->umcg_server) {
+		unpin_user_page(tsk->umcg_worker_page);
+		tsk->umcg_worker_page = NULL;
+
+		unpin_user_page(tsk->umcg_server_page);
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		put_task_struct(tsk->umcg_server);
+		tsk->umcg_server = NULL;
+	}
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+		tsk->umcg_server = NULL;
+
+		/* These can be simple writes - see the comment above. */
+		tsk->umcg_worker_page = NULL;
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		tsk->flags &= ~PF_UMCG_WORKER;
+		clear_task_syscall_work(tsk, SYSCALL_UMCG);
+		clear_tsk_thread_flag(tsk, TIF_UMCG);
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called by both normally (unregister) and abnormally exiting workers. */
+void umcg_worker_exit(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/*
+ * Do a state transition, @from -> @to, and possibly read @next after that.
+ *
+ * Will clear UMCG_TF_PREEMPT.
+ *
+ * When @to == {BLOCKED,RUNNABLE}, update timestamps.
+ *
+ * Returns:
+ *   0: success
+ *   -EAGAIN: when self->state != @from
+ *   -EFAULT
+ */
+static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
+{
+	struct umcg_task *self = tsk->umcg_task;
+	u32 old, new;
+	u64 now;
+
+	if (to >= UMCG_TASK_RUNNABLE) {
+		switch (tsk->umcg_clock) {
+		case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
+		case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
+		case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
+		case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
+		}
+	}
+
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(old, &self->state, Efault);
+	do {
+		if ((old & UMCG_TASK_MASK) != from)
+			goto fail;
+
+		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new |= to & UMCG_TASK_MASK;
+
+	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
+
+	if (to == UMCG_TASK_BLOCKED)
+		unsafe_put_user(now, &self->blocked_ts, Efault);
+	if (to == UMCG_TASK_RUNNABLE)
+		unsafe_put_user(now, &self->runnable_ts, Efault);
+
+	if (next)
+		unsafe_get_user(*next, &self->next_tid, Efault);
+
+	user_access_end();
+	return 0;
+
+fail:
+	user_access_end();
+	return -EAGAIN;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/* Called from syscall enter path */
+void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+	/* avoid recursion vs our own syscalls */
+	if (syscall == __NR_umcg_wait ||
+	    syscall == __NR_umcg_ctl)
+		return;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+static int umcg_wake_task(struct task_struct *tsk)
+{
+	int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+	if (ret)
+		return ret;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+	return 0;
+}
+
+/*
+ * Wake @next_tid or server.
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
+{
+	struct task_struct *next = NULL;
+	int ret;
+
+	next = umcg_get_task(next_tid);
+	/*
+	 * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
+	 * and 'spuriously' not wake @next_tid but instead try and wake the
+	 * server.
+	 *
+	 * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
+	 *
+	 * umcg_wake_task() can also fail because @next is not in the right
+	 * state, in which case we also try to wake the server.
+	 *
+	 * If we cannot wake the server due to state issues, too bad.
+	 */
+	if (!next || umcg_wake_task(next)) {
+		ret = umcg_wake_task(tsk->umcg_server);
+		if (ret == -EFAULT)
+			goto out;
+	}
+	ret = 0;
+out:
+	if (next)
+		put_task_struct(next);
+
+	return ret;
+}
+
+/* pre-schedule() */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	int next_tid;
+
+	/* Must not fault, mmap_sem might be held. */
+	pagefault_disable();
+
+	if (WARN_ON_ONCE(!tsk->umcg_server))
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
+		goto die;
+
+	if (umcg_wake_next(tsk, next_tid))
+		goto die;
+
+	pagefault_enable();
+
+	/*
+	 * We're going to sleep; make sure to unpin the pages so that the
+	 * pins stay temporary.
+	 */
+	umcg_unpin_pages();
+
+	return;
+
+die:
+	pagefault_enable();
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* post-schedule() */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	/* nothing here, see umcg_sys_exit() */
+}
+
+/*
+ * Enqueue @tsk on its server's runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg-based singly linked list add such that list integrity is never
+ * violated.  Userspace *MUST* remove a worker from the list before changing
+ * its ->state.  As such, we must change state to RUNNABLE before enqueue;
+ * see the illustrative userspace fragment after this function.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_runnable(struct task_struct *tsk)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
+	do {
+		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
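+
+/*
+ * Illustrative only, not part of this patch: a userspace server could
+ * consume its runnable_workers_ptr list by atomically detaching the whole
+ * chain, e.g.:
+ *
+ *	u64 head = __atomic_exchange_n(&server->runnable_workers_ptr, 0,
+ *				       __ATOMIC_ACQ_REL);
+ *	for (struct umcg_task *w = (void *)head; w;
+ *	     w = (void *)w->runnable_workers_ptr)
+ *		userspace_enqueue_runnable(w);	// placeholder for the u-sched
+ *
+ * Detaching the chain first satisfies the "remove before changing ->state"
+ * rule above; the kernel's cmpxchg loop simply retries against the new head.
+ */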
+
+/*
+ * umcg_wait: Wait for ->state to become RUNNING
+ *
+ * Returns:
+ *   0: success
+ *   -EINTR: pending signal
+ *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
+ *   -ETIMEDOUT
+ *   -EFAULT
+ */
+int umcg_wait(u64 timo)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	struct hrtimer_sleeper timeout;
+	struct page *page = NULL;
+	u32 state;
+	int ret;
+
+	if (timo) {
+		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
+					      HRTIMER_MODE_ABS);
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
+					     tsk->timer_slack_ns);
+	}
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		/*
+		 * Faults can block and scribble our wait state.
+		 */
+		pagefault_disable();
+		if (get_user(state, &self->state)) {
+			pagefault_enable();
+
+			ret = -EFAULT;
+			if (page) {
+				unpin_user_page(page);
+				page = NULL;
+				break;
+			}
+
+			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
+				page = NULL;
+				break;
+			}
+
+			continue;
+		}
+
+		if (page) {
+			unpin_user_page(page);
+			page = NULL;
+		}
+		pagefault_enable();
+
+		state &= UMCG_TASK_MASK;
+		if (state != UMCG_TASK_RUNNABLE) {
+			ret = 0;
+			if (state == UMCG_TASK_RUNNING)
+				break;
+
+			ret = -EINVAL;
+			break;
+		}
+
+		if (timo)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		freezable_schedule();
+
+		ret = -ETIMEDOUT;
+		if (timo && !timeout.task)
+			break;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	if (timo) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+
+	return ret;
+}
+
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	long syscall = syscall_get_nr(tsk, regs);
+
+	if (syscall == __NR_umcg_wait)
+		return;
+
+	/*
+	 * sys_umcg_ctl() will get here without having called umcg_sys_enter();
+	 * as such it will look like a syscall that blocked.
+	 */
+
+	if (tsk->umcg_server) {
+		/*
+		 * Didn't block, we're done.
+		 */
+		umcg_unpin_pages();
+		return;
+	}
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
+		goto die_unpin;
+
+	if (umcg_enqueue_runnable(tsk))
+		goto die_unpin;
+
+	/* server might not be runnable, too bad */
+	if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
+		goto die_unpin;
+
+	umcg_unpin_pages();
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* notify_resume will continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+	current->flags |= PF_UMCG_WORKER;
+
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+void umcg_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u32 state, next_tid;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (get_user(state, &self->state))
+		goto die;
+
+	state &= UMCG_TASK_MASK | UMCG_TF_MASK;
+	if (state == UMCG_TASK_RUNNING)
+		goto done;
+
+	if (state & UMCG_TF_PREEMPT) {
+		umcg_pin_pages();
+
+		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
+				      UMCG_TASK_RUNNABLE, &next_tid))
+			goto die_unpin;
+
+		if (umcg_enqueue_runnable(tsk))
+			goto die_unpin;
+
+		if (umcg_wake_next(tsk, next_tid))
+			goto die_unpin;
+
+		umcg_unpin_pages();
+	}
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* we'll continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+done:
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        reserved, must be zero.
+ * @timo:         absolute timeout in nanoseconds, in the clock selected at
+ *                registration time; zero for no timeout.
+ *
+ * Returns:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task, the server or next.
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. the current task is not a
+ *                 UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
+{
+	struct task_struct *next, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	u32 next_tid;
+	int ret;
+
+	if (!self || flags)
+		return -EINVAL;
+
+	if (worker)
+		tsk->flags &= ~PF_UMCG_WORKER;
+
+	/* see umcg_sys_{enter,exit}() */
+	umcg_pin_pages();
+
+	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
+	if (ret)
+		goto unpin;
+
+	next = umcg_get_task(next_tid);
+	if (!next) {
+		ret = -ESRCH;
+		goto unblock;
+	}
+
+	if (worker) {
+		ret = umcg_enqueue_runnable(tsk);
+		if (ret)
+			goto put_task;
+	}
+
+	ret = umcg_wake_task(next);
+	if (ret)
+		goto put_task;
+
+	put_task_struct(next);
+	umcg_unpin_pages();
+
+	ret = umcg_wait(timo);
+	switch (ret) {
+	case -EINTR:	/* umcg_notify_resume() will continue the wait */
+	case 0:		/* all done */
+		ret = 0;
+		break;
+
+	default:
+		/*
+		 * If this fails you get to keep the pieces; you'll get stuck
+		 * in umcg_notify_resume().
+		 */
+		umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+		break;
+	}
+out:
+	if (worker)
+		tsk->flags |= PF_UMCG_WORKER;
+	return ret;
+
+put_task:
+	put_task_struct(next);
+unblock:
+	umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+unpin:
+	umcg_unpin_pages();
+	goto out;
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to its
+ *         server's runnable_workers_ptr list and put to sleep; the server
+ *         (identified by self->server_tid) will be woken, if it is RUNNABLE.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
+{
+	struct umcg_task ut;
+
+	if ((unsigned long)self % UMCG_TASK_ALIGN)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_worker_exit();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	switch (which_clock) {
+	case CLOCK_REALTIME:
+	case CLOCK_MONOTONIC:
+	case CLOCK_BOOTTIME:
+	case CLOCK_TAI:
+		current->umcg_clock = which_clock;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;	/* hook schedule() */
+		set_syscall_work(SYSCALL_UMCG);		/* hook syscall */
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_sys_exit() will transition to RUNNABLE and wait */
+
+	} else {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_notify_resume() would block if not RUNNING */
+	}
+
+	return 0;
+}
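+
+/*
+ * Illustrative only, not part of the ABI: a minimal userspace registration
+ * sequence could look like this (error handling omitted):
+ *
+ *	struct umcg_task *self = aligned_alloc(UMCG_TASK_ALIGN, sizeof(*self));
+ *	memset(self, 0, sizeof(*self));
+ *
+ *	// server thread:
+ *	self->state = UMCG_TASK_RUNNING;
+ *	syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, self, CLOCK_MONOTONIC);
+ *
+ *	// worker thread (with its own struct umcg_task):
+ *	self->state = UMCG_TASK_BLOCKED;
+ *	self->server_tid = server_tid;		// an already registered server
+ *	syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER, self,
+ *		CLOCK_MONOTONIC);
+ *
+ * The worker call only returns to userspace once a server switches it to
+ * RUNNING; see umcg_sys_exit() above.
+ */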
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */
Thomas Gleixner Nov. 26, 2021, 9:08 p.m. UTC | #10
On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> +
> +	if (timo) {
> +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +					      HRTIMER_MODE_ABS);
> +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +					     tsk->timer_slack_ns);
> +	}
> +
> +	for (;;) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		ret = -EINTR;
> +		if (signal_pending(current))
> +			break;
> +
> +		/*
> +		 * Faults can block and scribble our wait state.
> +		 */
> +		pagefault_disable();
> +		if (get_user(state, &self->state)) {
> +			pagefault_enable();
> +
> +			ret = -EFAULT;
> +			if (page) {
> +				unpin_user_page(page);
> +				page = NULL;
> +				break;
> +			}
> +
> +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +				page = NULL;
> +				break;
> +			}
> +
> +			continue;
> +		}
> +
> +		if (page) {
> +			unpin_user_page(page);
> +			page = NULL;
> +		}
> +		pagefault_enable();
> +
> +		state &= UMCG_TASK_MASK;
> +		if (state != UMCG_TASK_RUNNABLE) {
> +			ret = 0;
> +			if (state == UMCG_TASK_RUNNING)
> +				break;
> +
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		if (timo)
> +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +		freezable_schedule();

You can replace the whole hrtimer foo with

                if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
                                                    tsk->timer_slack_ns,
                                                    HRTIMER_MODE_ABS,
                                                    tsk->umcg_clock)) {
                	ret = -ETIMEDOUT;
                        break;
                }

Thanks,

        tglx
Thomas Gleixner Nov. 26, 2021, 9:11 p.m. UTC | #11
On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>
>> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>
>> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> +				bool may_fault)
>> +{
>> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>
> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
solve the HPET crap either.

Thanks,

        tglx
Peter Zijlstra Nov. 26, 2021, 9:52 p.m. UTC | #12
On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> >
> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> >
> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> >> +				bool may_fault)
> >> +{
> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> >
> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> > makes sense to use a time base that's accessible to userspace. Was
> > MONOTONIC_RAW considered?
> 
> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
> solve the HPET crap either.

But its ns are of equal size to sched_clock(), if both share TSC IIRC.
Whereas MONOTONIC, being subject to ntp rate stuff, has differently
sized ns.

The only time that's relevant though is when you're going to mix these
timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
interesting.

But yeah, not being able to sleep on it ruins the party.
Peter Zijlstra Nov. 26, 2021, 9:59 p.m. UTC | #13
On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> > +
> > +	if (timo) {
> > +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> > +					      HRTIMER_MODE_ABS);
> > +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> > +					     tsk->timer_slack_ns);
> > +	}
> > +
> > +	for (;;) {
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +
> > +		ret = -EINTR;
> > +		if (signal_pending(current))
> > +			break;
> > +
> > +		/*
> > +		 * Faults can block and scribble our wait state.
> > +		 */
> > +		pagefault_disable();
> > +		if (get_user(state, &self->state)) {
> > +			pagefault_enable();
> > +
> > +			ret = -EFAULT;
> > +			if (page) {
> > +				unpin_user_page(page);
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		if (page) {
> > +			unpin_user_page(page);
> > +			page = NULL;
> > +		}
> > +		pagefault_enable();
> > +
> > +		state &= UMCG_TASK_MASK;
> > +		if (state != UMCG_TASK_RUNNABLE) {
> > +			ret = 0;
> > +			if (state == UMCG_TASK_RUNNING)
> > +				break;
> > +
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +
> > +		if (timo)
> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> > +
> > +		freezable_schedule();
> 
> You can replace the whole hrtimer foo with
> 
>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>                                                     tsk->timer_slack_ns,
>                                                     HRTIMER_MODE_ABS,
>                                                     tsk->umcg_clock)) {
>                 	ret = -ETIMEOUT;
>                         break;
>                 }

That seems to lose the freezable crud.. then again, since we're
interruptible, that shouldn't matter. Lemme go do that.
Peter Zijlstra Nov. 26, 2021, 10:07 p.m. UTC | #14
On Fri, Nov 26, 2021 at 10:59:44PM +0100, Peter Zijlstra wrote:

> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.


---

--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -52,7 +52,7 @@ static int umcg_pin_pages(void)
 
 	server = umcg_get_task(server_tid);
 	if (!server)
-		return -EINVAL;
+		return -ESRCH;
 
 	if (pin_user_pages_fast((unsigned long)self, 1, 0,
 				&tsk->umcg_worker_page) != 1)
@@ -358,18 +358,10 @@ int umcg_wait(u64 timo)
 {
 	struct task_struct *tsk = current;
 	struct umcg_task __user *self = tsk->umcg_task;
-	struct hrtimer_sleeper timeout;
 	struct page *page = NULL;
 	u32 state;
 	int ret;
 
-	if (timo) {
-		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
-					      HRTIMER_MODE_ABS);
-		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
-					     tsk->timer_slack_ns);
-	}
-
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -415,22 +407,16 @@ int umcg_wait(u64 timo)
 			break;
 		}
 
-		if (timo)
-			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
-
-		freezable_schedule();
-
-		ret = -ETIMEDOUT;
-		if (timo && !timeout.task)
+		if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
+						    tsk->timer_slack_ns,
+						    HRTIMER_MODE_ABS,
+						    tsk->umcg_clock)) {
+			ret = -ETIMEDOUT;
 			break;
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
-	if (timo) {
-		hrtimer_cancel(&timeout.timer);
-		destroy_hrtimer_on_stack(&timeout.timer);
-	}
-
 	return ret;
 }
 
@@ -515,7 +501,8 @@ void umcg_notify_resume(struct pt_regs *
 		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
-		umcg_pin_pages();
+		if (umcg_pin_pages())
+			goto die;
 
 		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
 				      UMCG_TASK_RUNNABLE, &next_tid))
@@ -586,7 +573,9 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 		tsk->flags &= ~PF_UMCG_WORKER;
 
 	/* see umcg_sys_{enter,exit}() */
-	umcg_pin_pages();
+	ret = umcg_pin_pages();
+	if (ret)
+		return ret;
 
 	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
 	if (ret)
Peter Zijlstra Nov. 26, 2021, 10:16 p.m. UTC | #15
On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:

> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>  	 * Before returning to user space ensure that all pending work
>  	 * items have been completed.
>  	 */
> -	while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +	do {
>  		local_irq_enable_exit_to_user(ti_work);
>  
>  		if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>  		if (ti_work & _TIF_PATCH_PENDING)
>  			klp_update_patch_state(current);
>  
> +		/* must be before handle_signal_work(); terminates on sigpending */
> +		if (ti_work & _TIF_UMCG)
> +			umcg_notify_resume(regs);
> +
>  		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>  			handle_signal_work(regs, ti_work);
>  
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>  		tick_nohz_user_enter_prepare();
>  
>  		ti_work = READ_ONCE(current_thread_info()->flags);
> -	}
> +	} while (ti_work & EXIT_TO_USER_MODE_WORK);
>  
>  	/* Return the latest work state for arch_exit_to_user_mode() */
>  	return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>  	/* Flush pending rcuog wakeup before the last need_resched() check */
>  	tick_nohz_user_enter_prepare();
>  
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>  
>  	arch_exit_to_user_mode_prepare(regs, ti_work);

Thomas, since you're looking at this: I'm not quite sure I got this
right. The intent is that when _TIF_UMCG is set (and it is never cleared
until the task unregisters) umcg_notify_resume() is called at least once.

The thinking is that if umcg_wait() gets interrupted, we'll drop out,
handle the signal and then resume the wait, which can obviously happen
any number of times.

It's just that I'm never quite sure where signal crud happens; I'm
assuming handle_signal_work() simply mucks about with regs (sets sp and
ip etc.. to the signal stack) and drops out of kernel mode, and on
re-entry we do this whole merry cycle once again. But I never actually
dug that deep.
Thomas Gleixner Nov. 27, 2021, 12:45 a.m. UTC | #16
On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
>> > +		if (timo)
>> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
>> > +
>> > +		freezable_schedule();
>> 
>> You can replace the whole hrtimer foo with
>> 
>>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>>                                                     tsk->timer_slack_ns,
>>                                                     HRTIMER_MODE_ABS,
>>                                                     tsk->umcg_clock)) {
>>                 	ret = -ETIMEDOUT;
>>                         break;
>>                 }
>
> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.

We could add a freezable wrapper for that if necessary.

Thanks,

        tglx
Thomas Gleixner Nov. 27, 2021, 1:16 a.m. UTC | #17
On Fri, Nov 26 2021 at 23:16, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:
>>  
>> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>>  
>>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>
> Thomas, since you're looking at this. I'm not quite sure I got this
> right. The intent is that when _TIF_UMCG is set (and it is never cleared
> until the task unregisters) it is called at least once.

Right.

> The thinking is that if umcg_wait() gets interrupted, we'll drop out,
> handle the signal and then resume the wait, which can obviously happen
> any number of times.

Right.

> It's just that I'm never quite sure where signal crud happens; I'm
> assuming handle_signal_work() simply mucks about with regs (sets sp and
> ip etc.. to the signal stack) and drops out of kernel mode, and on
> re-entry we do this whole merry cycle once again. But I never actually
> dug that deep.

Yes. It sets up the signal frame and once the loop is left because there
are no more TIF flags to handle it drops back to user space into the
signal handler. That returns to the kernel via sys_[rt_]sigreturn()
which undoes the regs damage either by restoring the previous state or
fiddling it to restart the syscall instead of dropping back to user
space.

So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
have no real good idea how to avoid that yet, but let me think about it
some more.

Thanks,

        tglx
Peter Oskolkov Nov. 29, 2021, 12:29 a.m. UTC | #18
On Fri, Nov 26, 2021 at 9:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:
>
> > it looks like the overall approach did not raise many objections - is
> > it so? Have you finished reviewing the patch?
>
> I've been trying to make sense of it, and while doing so deleted a bunch
> of things and rewrote the rest.

Thanks a lot, Peter! If we can get this in, and work the kinks out
later, that would be great!

>
> Things that went *poof*:
>
>  - wait_wake_only
>  - server_tid_ptr (now: server_tid)
>  - state_ts (now: state,blocked_ts,runnable_ts)
>
> I've also changed next_tid to only be used as a context switch target,
> never to find the server to enqueue the runnable tasks on.
>
> All xchg() users seem to have disappeared.
>
> Signals should now be handled, after which it'll go back to waiting on
> RUNNING.
>
> The code could fairly easily be changed to work on 32bit, big-endian is
> the tricky bit, for now 64bit only.
>
> Anyway, I only *think* the below code will work (it compiles with gcc-10
> and gcc-11) but I've not yet come around to writing/updating the
> userspace part, so it might explode on first contact -- I'll try that
> next week if you don't beat me to it.

It'll take me some time to fully test this (got some other stuff to
look at at the moment); some notes are below. I'd prefer you to merge
whatever you believe is working, and to later adjust things that need
adjusting, rather than keep the endless stream of patchsets that go
nowhere.

>
> That said, the below code seems somewhat sensible to me (I would say,
> having written it :), but I'm fairly sure I killed some capabilities the
> other thing had (notably the first two items above).
>
> If you want either of them restored, can you please give a use-case for
> them? Because I cannot seem to think of any sane cases for either
> wait_wake_only or server_tid_ptr.

wait_wake_only is not needed if you have both next_tid and server_tid,
as your patch does. In my version of the patch, next_tid is the same as
server_tid, so the flag is needed to indicate to the kernel that
next_tid is the wakee, not the server.
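
For illustration, my reading of your patch is that a worker-to-worker
context switch then becomes just the following (untested sketch, assuming
updated uapi headers; umcg_ctx_switch() is a made-up helper name):

	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>

	/* Switch from the current RUNNING worker to the worker @target_tid. */
	static int umcg_ctx_switch(struct umcg_task *self, uint32_t target_tid)
	{
		/* The kernel reads next_tid when we go RUNNING -> RUNNABLE. */
		__atomic_store_n(&self->next_tid, target_tid, __ATOMIC_RELEASE);
		return syscall(__NR_umcg_wait, 0, 0ULL);	/* no timeout */
	}

i.e. the target is woken RUNNABLE -> RUNNING, and the caller is enqueued
on its server's runnable_workers_ptr list until a server runs it again.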

re: (idle_)server_tid_ptr: it seems that you assume that blocked
workers keep their servers, while in my patch they "lose them" once
they block, and so there should be a global idle server pointer to
wake the server in my scheme (if there is an idle one). The main
difference is that in my approach a server has only a single running
worker assigned to it, while in your approach it can have a number of
blocked/idle workers to take care of as well.

The main difference between our approaches, as I see it: in my
approach if a worker is running, its server is sleeping, period. If we
have N servers, and N running workers, there are no servers to wake
when a previously blocked worker finishes its blocking op. In your
approach, it seems that each of the N servers has a bunch of workers
pointing at it, with a single one of them running. If a previously blocked
worker wakes up, it wakes the server it was assigned to previously,
and so now we have more than N physical tasks/threads running: N
workers and the woken server. This is not ideal: if the process is
affined to only N CPUs, that means a worker will be preempted to let
the woken server run, which is somewhat against the goal of letting
the workers run more or less uninterrupted. This is not deal breaking,
but maybe something to keep in mind.

Another big concern I have is that you removed UMCG_TF_LOCKED. I
definitely needed it to guard workers during "sched work" in the
userspace in my approach. I'm not sure if the flag is absolutely
needed with your approach, but most likely it is - the kernel-side
scheduler does lock tasks and runqueues and disables interrupts and
migrations and other things so that the scheduling logic is not
hijacked by concurrent stuff. Why do you assume that the userspace
scheduling code does not need similar protections?

In summary, again, I'm fine with your patch/approach getting in,
provided things like UMCG_TF_LOCKED are considered later.




>
> Anyway, in large order it's very like what you did, but it's different
> in pretty much all details.
>
> Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
> (still nop), sys_exit and notify_resume.
>
> ---
> Subject: sched: User Mode Concurency Groups
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 26 17:24:27 CET 2021
>
> XXX split and changelog
>
> Originally-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -248,6 +248,7 @@ config X86
>         select HAVE_RSEQ
>         select HAVE_SYSCALL_TRACEPOINTS
>         select HAVE_UNSTABLE_SCHED_CLOCK
> +       select HAVE_UMCG                        if X86_64
>         select HAVE_USER_RETURN_NOTIFIER
>         select HAVE_GENERIC_VDSO
>         select HOTPLUG_SMT                      if SMP
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -371,6 +371,8 @@
>  447    common  memfd_secret            sys_memfd_secret
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
> +450    common  umcg_ctl                sys_umcg_ctl
> +451    common  umcg_wait               sys_umcg_wait
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -83,6 +83,7 @@ struct thread_info {
>  #define TIF_NEED_RESCHED       3       /* rescheduling necessary */
>  #define TIF_SINGLESTEP         4       /* reenable singlestep on user return*/
>  #define TIF_SSBD               5       /* Speculative store bypass disable */
> +#define TIF_UMCG               6       /* UMCG return to user hook */
>  #define TIF_SPEC_IB            9       /* Indirect branch speculation mitigation */
>  #define TIF_SPEC_L1D_FLUSH     10      /* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */
> @@ -107,6 +108,7 @@ struct thread_info {
>  #define _TIF_NEED_RESCHED      (1 << TIF_NEED_RESCHED)
>  #define _TIF_SINGLESTEP                (1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD              (1 << TIF_SSBD)
> +#define _TIF_UMCG              (1 << TIF_UMCG)
>  #define _TIF_SPEC_IB           (1 << TIF_SPEC_IB)
>  #define _TIF_SPEC_L1D_FLUSH    (1 << TIF_SPEC_L1D_FLUSH)
>  #define _TIF_USER_RETURN_NOTIFY        (1 << TIF_USER_RETURN_NOTIFY)
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -341,6 +341,24 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr))                           \
>                      : : label)
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm_volatile_goto("\n"                                          \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    _ASM_EXTABLE_UA(1b, %l[label])                     \
> +                    : CC_OUT(z) (success),                             \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new)                                \
> +                    : "memory", "cc"                                   \
> +                    : label);                                          \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  #ifdef CONFIG_X86_32
> @@ -411,6 +429,34 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr)),                          \
>                        [efault] "i" (-EFAULT), "0" (err))
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       int __err = 0;                                                  \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm volatile("\n"                                               \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    CC_SET(z)                                          \
> +                    "2:\n"                                             \
> +                    ".pushsection .fixup,\"ax\"\n"                     \
> +                    "3:        mov %[efault], %[errout]\n"             \
> +                    "          jmp 2b\n"                               \
> +                    ".popsection\n"                                    \
> +                    _ASM_EXTABLE_UA(1b, 3b)                            \
> +                    : CC_OUT(z) (success),                             \
> +                      [errout] "+r" (__err),                           \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new),                               \
> +                      [efault] "i" (-EFAULT)                           \
> +                    : "memory", "cc");                                 \
> +       if (unlikely(__err))                                            \
> +               goto label;                                             \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  /* FIXME: this hack is definitely wrong -AK */
> @@ -505,6 +551,21 @@ do {                                                                               \
>  } while (0)
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
> +extern void __try_cmpxchg_user_wrong_size(void);
> +
> +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({         \
> +       __typeof__(*(_ptr)) __ret;                                      \
> +       switch (sizeof(__ret)) {                                        \
> +       case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       default: __try_cmpxchg_user_wrong_size();                       \
> +       }                                                               \
> +       __ret;                                          })
> +
>  /*
>   * We want the unsafe accessors to always be inlined and use
>   * the error labels - thus the macro games.
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
>         current->fs->in_exec = 0;
>         current->in_execve = 0;
>         rseq_execve(current);
> +       umcg_execve(current);
>         acct_update_integrals(current);
>         task_numa_free(current, false);
>         return retval;
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -22,6 +22,10 @@
>  # define _TIF_UPROBE                   (0)
>  #endif
>
> +#ifndef _TIF_UMCG
> +# define _TIF_UMCG                     (0)
> +#endif
> +
>  /*
>   * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
>   */
> @@ -42,11 +46,13 @@
>                                  SYSCALL_WORK_SYSCALL_EMU |             \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  ARCH_SYSCALL_WORK_ENTER)
>  #define SYSCALL_WORK_EXIT      (SYSCALL_WORK_SYSCALL_TRACEPOINT |      \
>                                  SYSCALL_WORK_SYSCALL_TRACE |           \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  SYSCALL_WORK_SYSCALL_EXIT_TRAP |       \
>                                  ARCH_SYSCALL_WORK_EXIT)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -67,6 +67,7 @@ struct sighand_struct;
>  struct signal_struct;
>  struct task_delay_info;
>  struct task_group;
> +struct umcg_task;
>
>  /*
>   * Task state bitmask. NOTE! These bits are also
> @@ -1294,6 +1295,15 @@ struct task_struct {
>         unsigned long rseq_event_mask;
>  #endif
>
> +#ifdef CONFIG_UMCG
> +       clockid_t               umcg_clock;
> +       struct umcg_task __user *umcg_task;
> +       struct page             *umcg_worker_page;
> +       struct task_struct      *umcg_server;
> +       struct umcg_task __user *umcg_server_task;
> +       struct page             *umcg_server_page;
> +#endif
> +
>         struct tlbflush_unmap_batch     tlb_ubc;
>
>         union {
> @@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
>  #define PF_KTHREAD             0x00200000      /* I am a kernel thread */
>  #define PF_RANDOMIZE           0x00400000      /* Randomize virtual address space */
>  #define PF_SWAPWRITE           0x00800000      /* Allowed to write to swap */
> +
> +#ifdef CONFIG_UMCG
> +#define PF_UMCG_WORKER         0x01000000      /* UMCG worker */
> +#else
> +#define PF_UMCG_WORKER         0x00000000
> +#endif
> +
>  #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
>  #define PF_MCE_EARLY           0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMALLOC_PIN                0x10000000      /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
>  {
>  }
>
> +#endif
> +
> +#ifdef CONFIG_UMCG
> +
> +extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
> +extern void umcg_sys_exit(struct pt_regs *regs);
> +extern void umcg_notify_resume(struct pt_regs *regs);
> +extern void umcg_worker_exit(void);
> +extern void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +       if (tsk->umcg_task)
> +               umcg_clear_child(tsk);
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> +       if (current->flags & PF_UMCG_WORKER)
> +               umcg_worker_exit();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +extern void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else  /* CONFIG_UMCG */
> +
> +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +}
> +
> +static inline void umcg_sys_exit(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_notify_resume(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_clear_child(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +}
> +
>  #endif
>
>  #ifdef CONFIG_DEBUG_RSEQ
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -72,6 +72,7 @@ struct open_how;
>  struct mount_attr;
>  struct landlock_ruleset_attr;
>  enum landlock_rule_type;
> +struct umcg_task;
>
>  #include <linux/types.h>
>  #include <linux/aio_abi.h>
> @@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
>                 const void __user *rule_attr, __u32 flags);
>  asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
>  asmlinkage long sys_memfd_secret(unsigned int flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
>
>  /*
>   * Architecture-specific system calls
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -46,6 +46,7 @@ enum syscall_work_bit {
>         SYSCALL_WORK_BIT_SYSCALL_AUDIT,
>         SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
>         SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
> +       SYSCALL_WORK_BIT_SYSCALL_UMCG,
>  };
>
>  #define SYSCALL_WORK_SECCOMP           BIT(SYSCALL_WORK_BIT_SECCOMP)
> @@ -55,6 +56,7 @@ enum syscall_work_bit {
>  #define SYSCALL_WORK_SYSCALL_AUDIT     BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
>  #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
>  #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
> +#define SYSCALL_WORK_SYSCALL_UMCG      BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
>  #endif
>
>  #include <asm/thread_info.h>
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
>  #define __NR_futex_waitv 449
>  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>
> +#define __NR_umcg_ctl 450
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 451
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
>  #undef __NR_syscalls
> -#define __NR_syscalls 450
> +
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + *      sys_umcg_ctl()  - register/unregister UMCG tasks;
> + *      sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + */
> +
> +/*
> + * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
> + * The states represent the user space point of view.
> + *
> + *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
> + *   |                                           v |            |
> + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
> + *   ^                                           | ^            |
> + *   `--------------(sys_umcg_wait)--------------' `------------'
> + *
> + */
> +#define UMCG_TASK_NONE                 0x0000U
> +#define UMCG_TASK_RUNNING              0x0001U
> +#define UMCG_TASK_RUNNABLE             0x0002U
> +#define UMCG_TASK_BLOCKED              0x0003U
> +
> +#define UMCG_TASK_MASK                 0x00ffU
> +
> +/*
> + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
> + *
> + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
> + * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
> + * it. That is, it will wake next_tid/server_tid, transfer to RUNNABLE and
> + * enqueue on the server's runnable list.
> + *
> + */
> +#define UMCG_TF_PREEMPT                        0x0100U
> +
> +#define UMCG_TF_MASK                   0xff00U
> +
> +#define UMCG_TASK_ALIGN                        64
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +       /**
> +        * @state_ts: the current state of the UMCG task described by
> +        *            this struct, with a unique timestamp indicating
> +        *            when the last state change happened.
> +        *
> +        * Readable/writable by both the kernel and the userspace.
> +        *
> +        * UMCG task state:
> +        *   bits  0 -  7: task state;
> +        *   bits  8 - 15: state flags;
> +        *   bits 16 - 31: for userspace use;
> +        */
> +       __u32   state;                          /* r/w */
> +
> +       /**
> +        * @next_tid: the TID of the UMCG task that should be context-switched
> +        *            into in sys_umcg_wait(). Can be zero, in which case
> +        *            it'll switch to server_tid.
> +        *
> +        * @server_tid: the TID of the UMCG server that hosts this task,
> +        *              when RUNNABLE this task will get added to it's
> +        *              runnable_workers_ptr list.
> +        *
> +        * Read-only for the kernel, read/write for the userspace.
> +        */
> +       __u32   next_tid;                       /* r   */
> +       __u32   server_tid;                     /* r   */
> +
> +       __u32   __hole[1];
> +
> +       /*
> +        * Timestamps for when last we became BLOCKED, RUNNABLE, in CLOCK_MONOTONIC.
> +        */
> +       __u64   blocked_ts;                     /*   w */
> +       __u64   runnable_ts;                    /*   w */
> +
> +       /**
> +        * @runnable_workers_ptr: a single-linked list of runnable workers.
> +        *
> +        * Readable/writable by both the kernel and the userspace: the
> +        * kernel adds items to the list, userspace removes them.
> +        */
> +       __u64   runnable_workers_ptr;           /* r/w */
> +
> +       __u64   __zero[3];
> +
> +} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> +       UMCG_CTL_REGISTER       = 0x00001,
> +       UMCG_CTL_UNREGISTER     = 0x00002,
> +       UMCG_CTL_WORKER         = 0x10000,
> +};
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1693,6 +1693,21 @@ config MEMBARRIER
>
>           If unsure, say Y.
>
> +config HAVE_UMCG
> +       bool
> +
> +config UMCG
> +       bool "Enable User Managed Concurrency Groups API"
> +       depends on 64BIT
> +       depends on GENERIC_ENTRY
> +       depends on HAVE_UMCG
> +       default n
> +       help
> +         Enable User Managed Concurrency Groups API, which form the basis
> +         for an in-process M:N userspace scheduling framework.
> +         At the moment this is an experimental/RFC feature that is not
> +         guaranteed to be backward-compatible.
> +
>  config KALLSYMS
>         bool "Load all symbols for debugging/ksymoops" if EXPERT
>         default y
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/sched.h>
>
>  #include "common.h"
>
> @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
>         if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
>                 trace_sys_enter(regs, syscall);
>
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_enter(regs, syscall);
> +
>         syscall_enter_audit(regs, syscall);
>
>         return ret ? : syscall;
> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>          * Before returning to user space ensure that all pending work
>          * items have been completed.
>          */
> -       while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +       do {
>                 local_irq_enable_exit_to_user(ti_work);
>
>                 if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>                 if (ti_work & _TIF_PATCH_PENDING)
>                         klp_update_patch_state(current);
>
> +               /* must be before handle_signal_work(); terminates on sigpending */
> +               if (ti_work & _TIF_UMCG)
> +                       umcg_notify_resume(regs);
> +
>                 if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>                         handle_signal_work(regs, ti_work);
>
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>                 tick_nohz_user_enter_prepare();
>
>                 ti_work = READ_ONCE(current_thread_info()->flags);
> -       }
> +       } while (ti_work & EXIT_TO_USER_MODE_WORK);
>
>         /* Return the latest work state for arch_exit_to_user_mode() */
>         return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>         /* Flush pending rcuog wakeup before the last need_resched() check */
>         tick_nohz_user_enter_prepare();
>
> -       if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +       if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>                 ti_work = exit_to_user_mode_loop(regs, ti_work);
>
>         arch_exit_to_user_mode_prepare(regs, ti_work);
> @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
>         step = report_single_step(work);
>         if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
>                 arch_syscall_exit_tracehook(regs, step);
> +
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_exit(regs);
>  }
>
>  /*
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
>         if (unlikely(!tsk->pid))
>                 panic("Attempted to kill the idle task!");
>
> +       /* Turn off UMCG sched hooks. */
> +       if (unlikely(tsk->flags & PF_UMCG_WORKER))
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
>         /*
>          * If do_exit is called because this processes oopsed, it's possible
>          * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> @@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
>
>         io_uring_files_cancel();
>         exit_signals(tsk);  /* sets PF_EXITING */
> +       umcg_handle_exit();
>
>         /* sync mm's RSS info before statistics gathering */
>         if (tsk->mm)
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  obj-$(CONFIG_CPU_ISOLATION) += isolation.o
>  obj-$(CONFIG_PSI) += psi.o
>  obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
>   * Return: %true if @p->state changes (an actual wakeup was done),
>   *        %false otherwise.
>   */
> -static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
>         unsigned long flags;
>         int cpu, success = 0;
> @@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
>         p->wake_entry.u_flags = CSD_TYPE_TTWU;
>         p->migration_pending = NULL;
>  #endif
> +       umcg_clear_child(p);
>  }
>
>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
>          * If a worker goes to sleep, notify and ask workqueue whether it
>          * wants to wake up a task to maintain concurrency.
>          */
> -       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (task_flags & PF_WQ_WORKER)
>                         wq_worker_sleeping(tsk);
> +               else if (task_flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_sleeping(tsk);
>                 else
>                         io_wq_worker_sleeping(tsk);
>         }
> @@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
>
>  static void sched_update_worker(struct task_struct *tsk)
>  {
> -       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (tsk->flags & PF_WQ_WORKER)
>                         wq_worker_running(tsk);
> +               else if (tsk->flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_running(tsk);
>                 else
>                         io_wq_worker_running(tsk);
>         }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
>         if (wake_flags & WF_TTWU) {
>                 record_wakee(p);
>
> +               if ((wake_flags & WF_CURRENT_CPU) &&
> +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> +                       return cpu;
> +
>                 if (sched_energy_enabled()) {
>                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>                         if (new_cpu >= 0)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
>  }
>
>  /* Wake flags. The first three directly map to some SD flag value */
> -#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> -#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> -#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> -
> -#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
> -#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
> -#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
> +#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> +#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> +#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> +
> +#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
> +#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
> +#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
> +#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
>
>  #ifdef CONFIG_SMP
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
> @@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
>  extern void swake_up_all_locked(struct swait_queue_head *q);
>  extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
>
> +extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  extern int preempt_dynamic_mode;
>  extern int sched_dynamic_mode(const char *str);
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,744 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include <asm/syscall.h>
> +
> +#include "sched.h"
> +
> +static struct task_struct *umcg_get_task(u32 tid)
> +{
> +       struct task_struct *tsk = NULL;
> +
> +       if (tid) {
> +               rcu_read_lock();
> +               tsk = find_task_by_vpid(tid);
> +               if (tsk && current->mm == tsk->mm && tsk->umcg_task)
> +                       get_task_struct(tsk);
> +               else
> +                       tsk = NULL;
> +               rcu_read_unlock();
> +       }
> +
> +       return tsk;
> +}
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + *                 and its server.
> + */
> +static int umcg_pin_pages(void)
> +{
> +       struct task_struct *server = NULL, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       int server_tid;
> +
> +       if (tsk->umcg_worker_page ||
> +           tsk->umcg_server_page ||
> +           tsk->umcg_server_task ||
> +           tsk->umcg_server)
> +               return -EBUSY;
> +
> +       if (get_user(server_tid, &self->server_tid))
> +               return -EFAULT;
> +
> +       server = umcg_get_task(server_tid);
> +       if (!server)
> +               return -EINVAL;
> +
> +       if (pin_user_pages_fast((unsigned long)self, 1, 0,
> +                               &tsk->umcg_worker_page) != 1)
> +               goto clear_self;
> +
> +       /* must cache due to possible concurrent change vs access_ok() */
> +       tsk->umcg_server_task = server->umcg_task;
> +       if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
> +                               &tsk->umcg_server_page) != 1)
> +               goto clear_server;
> +
> +       tsk->umcg_server = server;
> +
> +       return 0;
> +
> +clear_server:
> +       tsk->umcg_server_task = NULL;
> +       tsk->umcg_server_page = NULL;
> +
> +       unpin_user_page(tsk->umcg_worker_page);
> +clear_self:
> +       tsk->umcg_worker_page = NULL;
> +       put_task_struct(server);
> +
> +       return -EFAULT;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> +       struct task_struct *tsk = current;
> +
> +       if (tsk->umcg_server) {
> +               unpin_user_page(tsk->umcg_worker_page);
> +               tsk->umcg_worker_page = NULL;
> +
> +               unpin_user_page(tsk->umcg_server_page);
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               put_task_struct(tsk->umcg_server);
> +               tsk->umcg_server = NULL;
> +       }
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> +       /*
> +        * This is either called for the current task, or for a newly forked
> +        * task that is not yet running, so we don't need strict atomicity
> +        * below.
> +        */
> +       if (tsk->umcg_task) {
> +               WRITE_ONCE(tsk->umcg_task, NULL);
> +               tsk->umcg_server = NULL;
> +
> +               /* These can be simple writes - see the comment above. */
> +               tsk->umcg_worker_page = NULL;
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +               clear_task_syscall_work(tsk, SYSCALL_UMCG);
> +               clear_tsk_thread_flag(tsk, TIF_UMCG);
> +       }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> +       umcg_clear_task(tsk);
> +}
> +
> +/* Called by both normally (unregister) and abnormally exiting workers. */
> +void umcg_worker_exit(void)
> +{
> +       umcg_unpin_pages();
> +       umcg_clear_task(current);
> +}
> +
> +/*
> + * Do a state transition, @from -> @to, and possibly read @next after that.
> + *
> + * Will clear UMCG_TF_PREEMPT.
> + *
> + * When @to == {BLOCKED,RUNNABLE}, update timestamps.
> + *
> + * Returns:
> + *   0: success
> + *   -EAGAIN: when self->state != @from
> + *   -EFAULT
> + */
> +static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
> +{
> +       struct umcg_task *self = tsk->umcg_task;
> +       u32 old, new;
> +       u64 now;
> +
> +       if (to >= UMCG_TASK_RUNNABLE) {
> +               switch (tsk->umcg_clock) {
> +               case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
> +               case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
> +               case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
> +               case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
> +               }
> +       }
> +
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(old, &self->state, Efault);
> +       do {
> +               if ((old & UMCG_TASK_MASK) != from)
> +                       goto fail;
> +
> +               new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
> +               new |= to & UMCG_TASK_MASK;
> +
> +       } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
> +
> +       if (to == UMCG_TASK_BLOCKED)
> +               unsafe_put_user(now, &self->blocked_ts, Efault);
> +       if (to == UMCG_TASK_RUNNABLE)
> +               unsafe_put_user(now, &self->runnable_ts, Efault);
> +
> +       if (next)
> +               unsafe_get_user(*next, &self->next_tid, Efault);
> +
> +       user_access_end();
> +       return 0;
> +
> +fail:
> +       user_access_end();
> +       return -EAGAIN;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/* Called from syscall enter path */
> +void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +       /* avoid recursion vs our own syscalls */
> +       if (syscall == __NR_umcg_wait ||
> +           syscall == __NR_umcg_ctl)
> +               return;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +static int umcg_wake_task(struct task_struct *tsk)
> +{
> +       int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +       if (ret)
> +               return ret;
> +
> +       try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
> +       return 0;
> +}
> +
> +/*
> + * Wake @next_tid or server.
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
> +{
> +       struct task_struct *next = NULL;
> +       int ret;
> +
> +       next = umcg_get_task(next_tid);
> +       /*
> +        * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
> +        * and 'spuriously' not wake @next_tid but instead try and wake the
> +        * server.
> +        *
> +        * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
> +        *
> +        * umcg_wake_task() can also fail due to next not having the right
> +        * state, then too will we try and wake the server.
> +        *
> +        * If we cannot wake the server due to state issues, too bad.
> +        */
> +       if (!next || umcg_wake_task(next)) {
> +               ret = umcg_wake_task(tsk->umcg_server);
> +               if (ret == -EFAULT)
> +                       goto out;
> +       }
> +       ret = 0;
> +out:
> +       if (next)
> +               put_task_struct(next);
> +
> +       return ret;
> +}
> +
> +/* pre-schedule() */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +       int next_tid;
> +
> +       /* Must not fault, mmap_sem might be held. */
> +       pagefault_disable();
> +
> +       if (WARN_ON_ONCE(!tsk->umcg_server))
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
> +               goto die;
> +
> +       if (umcg_wake_next(tsk, next_tid))
> +               goto die;
> +
> +       pagefault_enable();
> +
> +       /*
> +        * We're going to sleep, make sure to unpin the pages, this ensures
> +        * the pins are temporary.
> +        */
> +       umcg_unpin_pages();
> +
> +       return;
> +
> +die:
> +       pagefault_enable();
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/* post-schedule() */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +       /* nothing here, see umcg_sys_exit() */
> +}
> +
> +/*
> + * Enqueue @tsk on its server's runnable list
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * cmpxchg based single linked list add such that list integrity is never
> + * violated.  Userspace *MUST* remove it from the list before changing ->state.
> + * As such, we must change state to RUNNABLE before enqueue.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_enqueue_runnable(struct task_struct *tsk)
> +{
> +       struct umcg_task __user *server = tsk->umcg_server_task;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u64 self_ptr = (unsigned long)self;
> +       u64 first_ptr;
> +
> +       /*
> +        * umcg_pin_pages() did access_ok() on both pointers, use self here
> +        * only because __user_access_begin() isn't available in generic code.
> +        */
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
> +       do {
> +               unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
> +       } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
> +
> +       user_access_end();
> +       return 0;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/*
> + * umcg_wait: Wait for ->state to become RUNNING
> + *
> + * Returns:
> + *   0: success
> + *   -EINTR: pending signal
> + *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
> + *   -ETIMEDOUT
> + *   -EFAULT
> + */
> +int umcg_wait(u64 timo)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       struct hrtimer_sleeper timeout;
> +       struct page *page = NULL;
> +       u32 state;
> +       int ret;
> +
> +       if (timo) {
> +               hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +                                             HRTIMER_MODE_ABS);
> +               hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +                                            tsk->timer_slack_ns);
> +       }
> +
> +       for (;;) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +
> +               ret = -EINTR;
> +               if (signal_pending(current))
> +                       break;
> +
> +               /*
> +                * Faults can block and scribble our wait state.
> +                */
> +               pagefault_disable();
> +               if (get_user(state, &self->state)) {
> +                       pagefault_enable();
> +
> +                       ret = -EFAULT;
> +                       if (page) {
> +                               unpin_user_page(page);
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       continue;
> +               }
> +
> +               if (page) {
> +                       unpin_user_page(page);
> +                       page = NULL;
> +               }
> +               pagefault_enable();
> +
> +               state &= UMCG_TASK_MASK;
> +               if (state != UMCG_TASK_RUNNABLE) {
> +                       ret = 0;
> +                       if (state == UMCG_TASK_RUNNING)
> +                               break;
> +
> +                       ret = -EINVAL;
> +                       break;
> +               }
> +
> +               if (timo)
> +                       hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +               freezable_schedule();
> +
> +               ret = -ETIMEDOUT;
> +               if (timo && !timeout.task)
> +                       break;
> +       }
> +       __set_current_state(TASK_RUNNING);
> +
> +       if (timo) {
> +               hrtimer_cancel(&timeout.timer);
> +               destroy_hrtimer_on_stack(&timeout.timer);
> +       }
> +
> +       return ret;
> +}
> +
> +void umcg_sys_exit(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       long syscall = syscall_get_nr(tsk, regs);
> +
> +       if (syscall == __NR_umcg_wait)
> +               return;
> +
> +       /*
> +        * sys_umcg_ctl() will get here without having called umcg_sys_enter();
> +        * as such it will look like a syscall that blocked.
> +        */
> +
> +       if (tsk->umcg_server) {
> +               /*
> +                * Didn't block, we're done.
> +                */
> +               umcg_unpin_pages();
> +               return;
> +       }
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
> +               goto die_unpin;
> +
> +       if (umcg_enqueue_runnable(tsk))
> +               goto die_unpin;
> +
> +       /* server might not be runnable, too bad */
> +       if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
> +               goto die_unpin;
> +
> +       umcg_unpin_pages();
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +               /* notify_resume will continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       current->flags |= PF_UMCG_WORKER;
> +
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +void umcg_notify_resume(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u32 state, next_tid;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (get_user(state, &self->state))
> +               goto die;
> +
> +       state &= UMCG_TASK_MASK | UMCG_TF_MASK;
> +       if (state == UMCG_TASK_RUNNING)
> +               goto done;
> +
> +       if (state & UMCG_TF_PREEMPT) {
> +               umcg_pin_pages();
> +
> +               if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
> +                                     UMCG_TASK_RUNNABLE, &next_tid))
> +                       goto die_unpin;
> +
> +               if (umcg_enqueue_runnable(tsk))
> +                       goto die_unpin;
> +
> +               if (umcg_wake_next(tsk, next_tid))
> +                       goto die_unpin;
> +
> +               umcg_unpin_pages();
> +       }
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +               /* we'll continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +done:
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags:        zero or a value from enum umcg_wait_flag.
> + * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
> + *
> + *
> + *
> + * Returns:
> + * 0             - OK;
> + * -ETIMEDOUT    - the timeout expired;
> + * -EFAULT       - failed accessing struct umcg_task __user of the current
> + *                 task, the server or next.
> + * -ESRCH        - the task to wake was not found or is not a UMCG task;
> + * -EINVAL       - another error happened (e.g. the current task is not a
> + *                 UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
> +{
> +       struct task_struct *next, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       bool worker = tsk->flags & PF_UMCG_WORKER;
> +       u32 next_tid;
> +       int ret;
> +
> +       if (!self || flags)
> +               return -EINVAL;
> +
> +       if (worker)
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
> +       /* see umcg_sys_{enter,exit}() */
> +       umcg_pin_pages();
> +
> +       ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
> +       if (ret)
> +               goto unpin;
> +
> +       next = umcg_get_task(next_tid);
> +       if (!next) {
> +               ret = -ESRCH;
> +               goto unblock;
> +       }
> +
> +       if (worker) {
> +               ret = umcg_enqueue_runnable(tsk);
> +               if (ret)
> +                       goto put_task;
> +       }
> +
> +       ret = umcg_wake_task(next);
> +       if (ret)
> +               goto put_task;
> +
> +       put_task_struct(next);
> +       umcg_unpin_pages();
> +
> +       ret = umcg_wait(timo);
> +       switch (ret) {
> +       case -EINTR:    /* umcg_notify_resume() will continue the wait */
> +       case 0:         /* all done */
> +               ret = 0;
> +               break;
> +
> +       default:
> +               /*
> +                * If this fails you get to keep the pieces; you'll get stuck
> +                * in umcg_notify_resume().
> +                */
> +               umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +               break;
> +       }
> +out:
> +       if (worker)
> +               tsk->flags |= PF_UMCG_WORKER;
> +       return ret;
> +
> +put_task:
> +       put_task_struct(next);
> +unblock:
> +       umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +unpin:
> +       umcg_unpin_pages();
> +       goto out;
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags:       ORed values from enum umcg_ctl_flag; see below;
> + * @self:        a pointer to struct umcg_task that describes this
> + *               task and governs the behavior of sys_umcg_wait if
> + *               registering; must be NULL if unregistering.
> + * @which_clock: clockid used for state timestamps and sys_umcg_wait()
> + *               timeouts; one of CLOCK_REALTIME, CLOCK_MONOTONIC,
> + *               CLOCK_BOOTTIME, CLOCK_TAI.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + *         UMCG workers:
> + *              - @flags & UMCG_CTL_WORKER
> + *              - self->state must be UMCG_TASK_BLOCKED
> + *         UMCG servers:
> + *              - !(@flags & UMCG_CTL_WORKER)
> + *              - self->state must be UMCG_TASK_RUNNING
> + *
> + *         All tasks:
> + *              - self->next_tid must be zero
> + *
> + *         If the conditions above are met, sys_umcg_ctl() immediately returns
> + *         if the registered task is a server; a worker will be added to
> + *         runnable_workers_ptr, and the worker put to sleep; a runnable server
> + *         from runnable_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + *           is a UMCG worker, the userspace is responsible for waking its
> + *           server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0                - success
> + * -EFAULT          - failed to read @self
> + * -EINVAL          - some other error occurred
> + */
> +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
> +{
> +       struct umcg_task ut;
> +
> +       if ((unsigned long)self % UMCG_TASK_ALIGN)
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_UNREGISTER) {
> +               if (self || !current->umcg_task)
> +                       return -EINVAL;
> +
> +               if (current->flags & PF_UMCG_WORKER)
> +                       umcg_worker_exit();
> +               else
> +                       umcg_clear_task(current);
> +
> +               return 0;
> +       }
> +
> +       if (!(flags & UMCG_CTL_REGISTER))
> +               return -EINVAL;
> +
> +       switch (which_clock) {
> +       case CLOCK_REALTIME:
> +       case CLOCK_MONOTONIC:
> +       case CLOCK_BOOTTIME:
> +       case CLOCK_TAI:
> +               current->umcg_clock = which_clock;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       flags &= ~UMCG_CTL_REGISTER;
> +       if (flags && flags != UMCG_CTL_WORKER)
> +               return -EINVAL;
> +
> +       if (current->umcg_task || !self)
> +               return -EINVAL;
> +
> +       if (copy_from_user(&ut, self, sizeof(ut)))
> +               return -EFAULT;
> +
> +       if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_WORKER) {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               current->flags |= PF_UMCG_WORKER;       /* hook schedule() */
> +               set_syscall_work(SYSCALL_UMCG);         /* hook syscall */
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_sys_exit() will transition to RUNNABLE and wait */
> +
> +       } else {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_notify_resume() would block if not RUNNING */
> +       }
> +
> +       return 0;
> +}
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
>  COND_SYSCALL(landlock_add_rule);
>  COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
>  /* arch/example/kernel/sys_example.c */
>
>  /* mm/fadvise.c */
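
For orientation, below is a minimal userspace registration sketch against the
two syscalls above. It is illustrative only and not part of the patch: the
UMCG_* constants and struct umcg_task layout are assumed to come from the uapi
header added in a sibling patch, and error handling is omitted.

	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	/* assumed: <linux/umcg.h> from the sibling uapi patch */
	static __thread struct umcg_task self
			__attribute__((aligned(UMCG_TASK_ALIGN)));

	static long umcg_register_server(void)
	{
		self.state = UMCG_TASK_RUNNING;	/* servers register RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER,
			       &self, CLOCK_MONOTONIC);
	}

	static long umcg_register_worker(unsigned int server_tid)
	{
		self.state = UMCG_TASK_BLOCKED;	/* workers register BLOCKED */
		self.server_tid = server_tid;	/* server that will run us */
		/* does not return until a server makes this worker RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
			       &self, CLOCK_MONOTONIC);
	}
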
Peter Zijlstra Nov. 29, 2021, 3:05 p.m. UTC | #19
On Sat, Nov 27, 2021 at 01:45:20AM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> > On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> >> > +		if (timo)
> >> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> >> > +
> >> > +		freezable_schedule();
> >> 
> >> You can replace the whole hrtimer foo with
> >> 
> >>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
> >>                                                     tsk->timer_slack_ns,
> >>                                                     HRTIMER_MODE_ABS,
> >>                                                     tsk->umcg_clock)) {
> >>                 	ret = -ETIMEDOUT;
> >>                         break;
> >>                 }
> >
> > That seems to loose the freezable crud.. then again, since we're
> > interruptible, that shouldn't matter. Lemme go do that.
> 
> We could add a freezable wrapper for that if necessary.

I should just finish rewriting that freezer crap and then we can delete
it all :-) But I don't think that's needed in this case, as long as
we're interruptible we'll pass through the signal path which has a
try_to_freeze() in it.
Peter Zijlstra Nov. 29, 2021, 3:07 p.m. UTC | #20
On Sat, Nov 27, 2021 at 02:16:43AM +0100, Thomas Gleixner wrote:

> So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
> have no real good idea how to avoid that yet, but let me think about it
> some more.

Yeah, that, I couldn't come up with anything saner either.
Peter Zijlstra Nov. 29, 2021, 4:41 p.m. UTC | #21
On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:

> wait_wake_only is not needed if you have both next_tid and server_tid,
> as your patch has. In my version of the patch, next_tid is the same as
> server_tid, so the flag is needed to indicate to the kernel that
> next_tid is the wakee, not the server.

Ah, okay.

> re: (idle_)server_tid_ptr: it seems that you assume that blocked
> workers keep their servers, while in my patch they "lose them" once
> they block, and so there should be a global idle server pointer to
> wake the server in my scheme (if there is an idle one). The main
> difference is that in my approach a server has only a single, running,
> worker assigned to it, while in your approach it can have a number of
> blocked/idle workers to take care of as well.

Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.
fundamentally the RUNNABLE tasks need to go somewhere when all servers
are busy. So at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue this same, since we have wakeup balancing etc. And I've not really
thought about how to best do wakeup-balancing, also see below.

> The main difference between our approaches, as I see it: in my
> approach if a worker is running, its server is sleeping, period. If we
> have N servers, and N running workers, there are no servers to wake
> when a previously blocked worker finishes its blocking op. In your
> approach, it seems that N servers have each a bunch of workers
> pointing at them, and a single worker running. If a previously blocked
> worker wakes up, it wakes the server it was assigned to previously,

Right; it does that. It can check the ::state of its current task,
possibly set TF_PREEMPT or just go back to sleep.

> and so now we have more than N physical tasks/threads running: N
> workers and the woken server. This is not ideal: if the process is
> affined to only N CPUs, that means a worker will be preempted to let
> the woken server run, which is somewhat against the goal of letting
> the workers run more or less uninterrupted. This is not deal breaking,
> but maybe something to keep in mind.

I suppose it's easy enough to make this behaviour configurable though;
simply enqueue and not wake.... Hmm.. how would this worker know if the
server was 'busy' or not? The whole 'current' thing is a user-space
construct. I suppose that's what your pointer was for? Puts an actual
idle server in there, if there is one. Let me ponder that a bit.

However, do note this whole scheme fundamentally has some of that, the
moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
all tasks, they can consume however much time the syscall needs there.

Also, timeout on sys_umcg_wait() gets you the exact same situation (or
worse, multiple running workers).

> Another big concern I have is that you removed UMCG_TF_LOCKED. I

OOh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.

> definitely needed it to guard workers during "sched work" in the
> userspace in my approach. I'm not sure if the flag is absolutely
> needed with your approach, but most likely it is - the kernel-side
> scheduler does lock tasks and runqueues and disables interrupts and
> migrations and other things so that the scheduling logic is not
> hijacked by concurrent stuff. Why do you assume that the userspace
> scheduling code does not need similar protections?

I've not yet come across a case where this is needed. Migration for
instance is possible when RUNNABLE, simply write ::server_tid before
::state. Userspace just needs to make sure who actually owns the task,
but it can do that outside of this state.

But like I said; I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.
Peter Oskolkov Nov. 29, 2021, 5:34 p.m. UTC | #22
On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:
>
> > wait_wake_only is not needed if you have both next_tid and server_tid,
> > as your patch has. In my version of the patch, next_tid is the same as
> > server_tid, so the flag is needed to indicate to the kernel that
> > next_tid is the wakee, not the server.
>
> Ah, okay.
>
> > re: (idle_)server_tid_ptr: it seems that you assume that blocked
> > workers keep their servers, while in my patch they "lose them" once
> > they block, and so there should be a global idle server pointer to
> > wake the server in my scheme (if there is an idle one). The main
> > difference is that in my approach a server has only a single, running,
> > worker assigned to it, while in your approach it can have a number of
> > blocked/idle workers to take care of as well.
>
> Correct; I've been thinking in analogues of the way we schedule CPUs.
> Each CPU has a ready/run queue along with the current task.
> fundamentally the RUNNABLE tasks need to go somewhere when all servers
> are busy. So at that point the previous server is as good a place as
> any.
>
> Now, I sympathise with a blocked task not having a relation; I often
> argue this same, since we have wakeup balancing etc. And I've not really
> thought about how to best do wakeup-balancing, also see below.
>
> > The main difference between our approaches, as I see it: in my
> > approach if a worker is running, its server is sleeping, period. If we
> > have N servers, and N running workers, there are no servers to wake
> > when a previously blocked worker finishes its blocking op. In your
> > approach, it seems that N servers have each a bunch of workers
> > pointing at them, and a single worker running. If a previously blocked
> > worker wakes up, it wakes the server it was assigned to previously,
>
> Right; it does that. It can check the ::state of its current task,
> possibly set TF_PREEMPT or just go back to sleep.
>
> > and so now we have more than N physical tasks/threads running: N
> > workers and the woken server. This is not ideal: if the process is
> > affined to only N CPUs, that means a worker will be preempted to let
> > the woken server run, which is somewhat against the goal of letting
> > the workers run more or less uninterrupted. This is not deal breaking,
> > but maybe something to keep in mind.
>
> I suppose it's easy enough to make this behaviour configurable though;
> simply enqueue and not wake.... Hmm.. how would this worker know if the
> server was 'busy' or not? The whole 'current' thing is a user-space
> construct. I suppose that's what your pointer was for? Puts an actual
> idle server in there, if there is one. Let me ponder that a bit.

Yes, the idle_server_ptr was there to point to an idle server; this
naturally did wakeup balancing.

>
> However, do note this whole scheme fundamentally has some of that, the
> moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> all tasks, they can consume however much time the syscall needs there.
>
> Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> worse, multiple running workers).

It should not. Timed out workers should be added to the runnable list
and not become running unless a server chooses so. So sys_umcg_wait()
with a timeout should behave similarly to a normal sleep, in that the
server is woken upon the worker blocking, and upon the worker wakeup
the worker is added to the woken workers list and waits for a server
to run it. The only difference is that in a sleep the worker becomes
BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
time.

Why then have sys_umcg_wait() with a timeout at all, instead of
calling nanosleep()? Because the worker in sys_umcg_wait() can be
context-switched into by another worker, or made running by a server;
if the worker is in nanosleep(), it just sleeps.

>
> > Another big concern I have is that you removed UMCG_TF_LOCKED. I
>
> OOh yes, I forgot to mention that. I couldn't figure out what it was
> supposed to do.
>
> > definitely needed it to guard workers during "sched work" in the
> > userspace in my approach. I'm not sure if the flag is absolutely
> > needed with your approach, but most likely it is - the kernel-side
> > scheduler does lock tasks and runqueues and disables interrupts and
> > migrations and other things so that the scheduling logic is not
> > hijacked by concurrent stuff. Why do you assume that the userspace
> > scheduling code does not need similar protections?
>
> I've not yet come across a case where this is needed. Migration for
> instance is possible when RUNNABLE, simply write ::server_tid before
> ::state. Userspace just needs to make sure who actually owns the task,
> but it can do that outside of this state.
>
> But like I said; I've not yet done the userspace part (and I lost most
> of today trying to install a new machine), so perhaps I'll run into it
> soon enough.

The most obvious scenario where I needed locking is when worker A
wants to context switch into worker B, while another worker C wants to
context switch into worker A, and worker A pagefaults. This involves:

worker A context: worker A context switches into worker B:

- worker B::server_tid = worker A::server_tid
- worker A::server_tid = none
- worker A::state = runnable
- worker B::state = running
- worker A::next_tid = worker B
- worker A calls sys_umcg_wait()

worker B context: before the above completes, worker C wants to
context switch into worker A, with similar steps.

"interrupt context": in the middle of the mess above, worker A pagefaults

Too many moving parts. UMCG_TF_LOCKED helped me make this mess
manageable. Maybe without pagefaults clever ordering of the operations
listed above could make things work, but pagefaults mess things badly,
so some kind of "preempt_disable()" for the userspace scheduling code
was needed, and UMCG_TF_LOCKED was the solution I had.
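
For concreteness, the worker A -> worker B switch listed above, in rough
pseudo-code (atomics, error handling and the pagefault window deliberately
left out; a sketch, not tested code):

	/* worker A's userspace context, switching into worker B */
	B::server_tid = A::server_tid;	/* hand A's server over to B */
	A::server_tid = 0;
	A::state      = RUNNABLE;	/* A gives up the cpu */
	B::state      = RUNNING;	/* B is the one to run next */
	A::next_tid   = B.tid;		/* tell the kernel whom to wake */
	sys_umcg_wait();		/* ...unless A faults before getting here */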



>
>
Peter Zijlstra Nov. 29, 2021, 9:08 p.m. UTC | #23
On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > However, do note this whole scheme fundamentally has some of that, the
> > moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> > all tasks, they can consume however much time the syscall needs there.
> >
> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.

OK, that's somewhat subtle and I hadn't gotten that either.

Currently it returns -ETIMEDOUT in RUNNING state for both server and
worker callers.

Let me go fix that then.

> > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> >
> > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > supposed to do.
> >
> > > definitely needed it to guard workers during "sched work" in the
> > > userspace in my approach. I'm not sure if the flag is absolutely
> > > needed with your approach, but most likely it is - the kernel-side
> > > scheduler does lock tasks and runqueues and disables interrupts and
> > > migrations and other things so that the scheduling logic is not
> > > hijacked by concurrent stuff. Why do you assume that the userspace
> > > scheduling code does not need similar protections?
> >
> > I've not yet come across a case where this is needed. Migration for
> > instance is possible when RUNNABLE, simply write ::server_tid before
> > ::state. Userspace just needs to make sure who actually owns the task,
> > but it can do that outside of this state.
> >
> > But like I said; I've not yet done the userspace part (and I lost most
> > of today trying to install a new machine), so perhaps I'll run into it
> > soon enough.
> 
> The most obvious scenario where I needed locking is when worker A
> wants to context switch into worker B, while another worker C wants to
> context switch into worker A, and worker A pagefaults. This involves:
> 
> worker A context: worker A context switches into worker B:
> 
> - worker B::server_tid = worker A::server_tid
> - worker A::server_tid = none
> - worker A::state = runnable
> - worker B::state = running
> - worker A::next_tid = worker B
> - worker A calls sys_umcg_wait()
> 
> worker B context: before the above completes, worker C wants to
> context switch into worker A, with similar steps.
> 
> "interrupt context": in the middle of the mess above, worker A pagefaults
> 
> Too many moving parts. UMCG_TF_LOCKED helped me make this mess
> manageable. Maybe without pagefaults clever ordering of the operations
> listed above could make things work, but pagefaults mess things badly,
> so some kind of "preempt_disable()" for the userspace scheduling code
> was needed, and UMCG_TF_LOCKED was the solution I had.

I'm not sure I'm following. For this to be true A and C must be running
on a different server right?

So we have something like:

	S0 running A			S1 running B

Therefore:

	S0::state == RUNNABLE		S1::state == RUNNABLE
	A::server_tid == S0.tid		B::server_tid == S1.tid
	A::state == RUNNING		B::state == RUNNING

Now, you want A to switch to C, therefore C had better be with S0, eg we
have:

	C::server_tid == S0.tid
	C::state == RUNNABLE

So then A does:

	A::next_tid = C.tid;
	sys_umcg_wait();

Which will:

	pin(A);
	pin(S0);

	cmpxchg(A::state, RUNNING, RUNNABLE);

	next_tid = A::next_tid; // C

	enqueue(S0::runnable, A);

At which point B steals S0's runnable queue, and tries to make A go.

					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
					A::server_tid = S1.tid;
					B::next_tid = A.tid;
					sys_umcg_wait();

	wake(C)
	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*


Something like that, right?

What currently happens is that S0 goes back to S0 and S1 ends up in A.
That is, if, for any reason we fail to wake next_tid, we'll wake
server_tid.

So then S0 wakes up and gets to re-evaluate life. If it has another
worker it can go run that, otherwise it can try and steal a worker
somewhere or just idle out.

Now arguably, the only reason A->C can fault is because C is garbage, at
which point your program is malformed and it doesn't matter what
happens one way or the other.
Peter Zijlstra Nov. 29, 2021, 9:29 p.m. UTC | #24
On Mon, Nov 29, 2021 at 10:08:41PM +0100, Peter Zijlstra wrote:
> I'm not sure I'm following. For this to be true A and C must be running
> on a different server right?
> 
> So we have something like:
> 
> 	S0 running A			S1 running B
> 
> Therefore:
> 
> 	S0::state == RUNNABLE		S1::state == RUNNABLE
> 	A::server_tid == S0.tid		B::server_tid == S1.tid
> 	A::state == RUNNING		B::state == RUNNING
> 
> Now, you want A to switch to C, therefore C had better be with S0, eg we
> have:
> 
> 	C::server_tid == S0.tid
> 	C::state == RUNNABLE
> 
> So then A does:
> 
> 	A::next_tid = C.tid;
> 	sys_umcg_wait();
> 
> Which will:
> 
> 	pin(A);
> 	pin(S0);
> 
> 	cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> 	next_tid = A::next_tid; // C
> 
> 	enqueue(S0::runnable, A);
> 
> At which point B steals S0's runnable queue, and tries to make A go.
> 
> 					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
> 					A::server_tid = S1.tid;
> 					B::next_tid = A.tid;
> 					sys_umcg_wait();
> 
> 	wake(C)
> 	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*
> 
> 
> Something like that, right?

And note that there's an XXX in the code about exactly this case; it has
a question whether we want to add pin(next) to umcg_pin_pages().

That would not in fact help here, because sys_umcg_wait() is faultable
and the only reason it'll return -EFAULT is because, as stated below, C
is garbage. But it does make a difference for when we do something like:

	self->next_tid = someone;
	sys_something_we_expect_to_block();
	// handle not blocking

Because in that case userspace must have taken 'someone' from the
runnable queue and made it 'next', but then we'll not wake next but the
server, which then needs to figure out something went sideways.

So I'm tempted to add that optional 3rd pin, simply to reduce the
failure cases.

> What currently happens is that S0 goes back to S0 and S1 ends up in A.
> That is, if, for any reason we fail to wake next_tid, we'll wake
> server_tid.
> 
> So then S0 wakes up and gets to re-evaluate life. If it has another
> worker it can go run that, otherwise it can try and steal a worker
> somewhere or just idle out.
> 
> Now arguably, the only reason A->C can fault is because C is garbage, at
> which point your program is malformed and it doesn't matter what
> happens one way or the other.
Thomas Gleixner Nov. 29, 2021, 10:07 p.m. UTC | #25
On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
>> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>> >
>> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>> >
>> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> >> +				bool may_fault)
>> >> +{
>> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>> >
>> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
>> > makes sense to use a time base that's accessible to userspace. Was
>> > MONOTONIC_RAW considered?
>> 
>> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
>> solve the HPET crap either.
>
> But it's ns are of equal size to sched_clock(), if both share TSC IIRC.
> Whereas MONOTONIC, being subject to ntp rate stuff, has differently
> sized ns.

The size is the same, i.e. 1 bit per nanosecond :)

> The only time that's relevant though is when you're going to mix these
> timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> interesting.

Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
to be really nasty. Actually you can sleep on that clock, but that's a
completely different universe. If anything like that is desired then we
need to rewrite that posix CPU timer muck completely with all the bells
and whistels and race conditions attached to it. *Shudder*

Thanks,

        tglx
Peter Zijlstra Nov. 29, 2021, 10:22 p.m. UTC | #26
On Mon, Nov 29, 2021 at 11:07:07PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:

> The size is the same, i.e. 1 bit per nanosecond :)

:-)

> > The only time that's relevant though is when you're going to mix these
> > timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> > interesting.
> 
> Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
> to be really nasty. Actually you can sleep on that clock, but that's a
> completely different universe. If anything like that is desired then we
> need to rewrite that posix CPU timer muck completely with all the bells
> and whistels and race conditions attached to it. *Shudder*

Oh, I wasn't thinking anything as terrible as that. Sleeping on that
clock is fundamentally daft since it doesn't run when thats is
sleeping, consider trying to sleep on your own runtime :-)

I was only considering combining THREAD_CPUTIME timestamps with the
UMCG timestamps to compute how much unmanaged time there was, or other
such things.

Anyway, lets forget I bought this up and assume that for practical
purposes all [ns] are of equal length.
Peter Oskolkov Nov. 29, 2021, 11:38 p.m. UTC | #27
On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > >
> > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > supposed to do.
[...]
>
> So then A does:
>
>         A::next_tid = C.tid;
>         sys_umcg_wait();
>
> Which will:
>
>         pin(A);
>         pin(S0);
>
>         cmpxchg(A::state, RUNNING, RUNNABLE);

Hmm.... That's another difference between your patch and mine: my
approach was "the side that initiates the change updates the state".
So in my code the userspace changes the current task's state RUNNING
=> RUNNABLE and the next task's state, or the server's state, RUNNABLE
=> RUNNING before calling sys_umcg_wait(). The kernel changed worker
states to BLOCKED/RUNNABLE during block/wake detection, and marked
servers RUNNING when waking them during block/wake detection; but all
applicable state changes for sys_umcg_wait() happen in the userspace.

The reasoning behind this approach was:
- do in kernel only that which cannot be done in the userspace, to
make the kernel code smaller/simpler
- similar to how futexes work: futex_wait does not change the futex
value to the desired value, but just checks whether the futex value
matches the desired value
- similar to how futexes work, concurrent state changes can happen in
the userspace without calling into the kernel at all
    for example:
        - (a): worker A goes to sleep into sys_umcg_wait()
        - (b): worker B wants to context switch into worker A "a moment" later
        - due to preemption/interrupts/pagefaults/whatnot, (b) happens
in reality before (a)
    in my patchset, the situation above happily resolves in the
userspace so that worker A keeps running without ever calling
sys_umcg_wait().

Again, I don't think this is deal breaking, and your approach will
work, just a bit less efficiently in some cases :)

I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
A transfers its server to worker B that A intends to context switch
into, and then worker A pagefaults or gets interrupted before calling
sys_umcg_wait()? The server will be woken up and will see that it is
assigned to worker B; now what? If worker A is "locked" before the
whole thing starts, the pagefault/interrupt will not trigger
block/wake detection, worker A will keep RUNNING for all intended
purposes, and eventually will call sys_umcg_wait() as it had
intended...

[...]
Peter Zijlstra Dec. 6, 2021, 11:32 a.m. UTC | #28
Sorry, I haven't been feeling too well and as such procastinated on this
because thinking is required :/ Trying to pick up the bits.

On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > > >
> > > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > > supposed to do.
> [...]
> >
> > So then A does:
> >
> >         A::next_tid = C.tid;
> >         sys_umcg_wait();
> >
> > Which will:
> >
> >         pin(A);
> >         pin(S0);
> >
> >         cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> Hmm.... That's another difference between your patch and mine: my
> approach was "the side that initiates the change updates the state".
> So in my code the userspace changes the current task's state RUNNING
> => RUNNABLE and the next task's state,

I couldn't make that work for wakeups; when a thread blocks in a
random syscall there is no userspace to wake the next thread. And since
it seems required in this case, it's easier and more consistent to
always do it.

> or the server's state, RUNNABLE
> => RUNNING before calling sys_umcg_wait().

Yes, this is indeed required; I've found the same when trying to build
the userspace server loop. And yes, I'm starting to see where you're
coming from.

> I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
> A transfers its server to worker B that A intends to context switch

	S0 running A

Therefore:

	S0::state == RUNNABLE
	A::server_tid = S0.tid
	A::state == RUNNING

you want A to switch to B, therefore:

	B::state == RUNNABLE

if B is not yet on S0 then:

	B::server_tid = S0.tid;

finally:

0:
	A::next_tid = B.tid;
1:
	A::state = RUNNABLE:
2:
	sys_umcg_wait();
3:

> into, and then worker A pagefaults or gets interrupted before calling
> sys_umcg_wait()?

So the problem is tripping umcg_notify_resume() on the labels 1 and 2,
right? tripping it on 0 and 3 is trivially correct.

If we trip it on 1 and !(A::state & TF_PREEMPT), then nothing, since
::state == RUNNING we'll just continue onwards and all is well. That is,
nothing has happened yet.

However, if we trip it on 2: we're screwed. Because at that point
::state is scribbled.

> The server will be woken up and will see that it is
> assigned to worker B; now what? If worker A is "locked" before the
> whole thing starts, the pagefault/interrupt will not trigger
> block/wake detection, worker A will keep RUNNING for all intended
> purposes, and eventually will call sys_umcg_wait() as it had
> intended...

No, the failure case is different; umcg_notify_resume() will simply
block A until someone sets A::state == RUNNING and kicks it, which will
be no-one.

Now, the above situation is actually simple to fix, but it gets more
interesting when we're using sys_umcg_wait() to build wait primitives.
Because in that case we get stuff like:

	for (;;) {
		self->state = RUNNABLE;
		smp_mb();
		if (cond)
			break;
		sys_umcg_wait();
	}
	self->state = RUNNING;

And we really need to not block and also not do sys_umcg_wait() early.

So yes, I agree that we need a special case here that ensures
umcg_notify_resume() doesn't block. Let me ponder naming and comments.
Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Now, obviously if you do a random syscall anywhere around here, you get
to keep the pieces :-)



I've also added ::next_tid to the whole umcg_pin_pages() thing, and made
it so that ::next_tid gets cleared when it's been used. That way things
like:

	self->next_tid = pick_from_runqueue();
	sys_that_is_expected_to_sleep();
	if (self->next_tid) {
		return_to_runqueue(self->next_tid);
		self->next_tid = 0;
	}

Are much simpler to manage. Either it did sleep and ::next_tid is
consumed, or it didn't sleep and it needs to be returned to the
runqueue.
Peter Zijlstra Dec. 6, 2021, 11:47 a.m. UTC | #29
On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.
> 
> Why then have sys_umcg_wait() with a timeout at all, instead of
> calling nanosleep()? Because the worker in sys_umcg_wait() can be
> context-switched into by another worker, or made running by a server;
> if the worker is in nanosleep(), it just sleeps.

I've been trying to figure out the semantics of that timeout thing, and
I can't seem to make sense of it.

Consider two workers:

	S0 running A				S1 running B

therefore:

	S0::state == RUNNABLE			S1::state == RUNNABLE
	A::server_tid == S0.tid			B::server_tid == S1.tid
	A::state == RUNNING			B::state == RUNNING

Doing:

	self->state = RUNNABLE;			self->state = RUNNABLE;
	sys_umcg_wait(0);			sys_umcg_wait(10);
	  umcg_enqueue_runnable()		  umcg_enqueue_runnable()
	  umcg_wake()				  umcg_wake()
	  umcg_wait()				  umcg_wait()
						    hrtimer_start()

In both cases we get the exact same outcome:

	A::state == RUNNABLE			B::state == RUNNABLE
	S0::state == RUNNING			S1::state == RUNNING
	S0::runnable_ptr == &A			S1::runnable_ptr == &B


Which is, AFAICT, the exact state you wanted to achieve, except B now
has an active timer, but what do you want it to do when that goes off?

I'm tempted to say workers cannot have a timeout, and servers can use it
to wake themselves.
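
For instance (sketch only, against the interface in the patch as posted
below; umcg_self is a made-up TLS variable that is assumed to have been
registered via sys_umcg_ctl(); timestamp maintenance and error handling
are elided):

	#include <errno.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>			/* from this series */

	#ifndef __NR_umcg_wait
	#define __NR_umcg_wait 451		/* x86-64 number added below */
	#endif

	/* Assumed already registered as a server via sys_umcg_ctl(). */
	static __thread struct umcg_task umcg_self;

	static void server_idle_with_timeout(unsigned long long period_ns)
	{
		struct timespec ts;
		unsigned long long abs_timeout;

		/* umcg_idle_loop() arms an absolute CLOCK_REALTIME timer. */
		clock_gettime(CLOCK_REALTIME, &ts);
		abs_timeout = ts.tv_sec * 1000000000ull + ts.tv_nsec + period_ns;

		/* Real code would use an atomic op and keep the timestamp bits. */
		umcg_self.state_ts = UMCG_TASK_IDLE;

		if (syscall(__NR_umcg_wait, 0, abs_timeout) && errno == ETIMEDOUT) {
			/* Nobody woke us; do periodic housekeeping here. */
		}

		umcg_self.state_ts = UMCG_TASK_RUNNING;
	}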
Peter Zijlstra Dec. 6, 2021, 12:04 p.m. UTC | #30
On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:

> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Hurmph... OTOH, since self above hasn't actually done anything yet, it
isn't reported as runnable yet, and so for all intents and purposes the
userspace state thinks it's running (which is true); nobody should be
trying a concurrent wakeup, and there aren't any races.

Bah, now I'm confused again :-) Let me go think more.
Peter Zijlstra Dec. 13, 2021, 1:55 p.m. UTC | #31
On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> 
> Sorry, I haven't been feeling too well and as such procrastinated on this
> because thinking is required :/ Trying to pick up the bits.

*sigh* and yet another week gone... someone was unhappy about refcount_t.


> No, the failure case is different; umcg_notify_resume() will simply
> block A until someone sets A::state == RUNNING and kicks it, which will
> be no-one.
> 
> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.
> 
> Now, obviously if you do a random syscall anywhere around here, you get
> to keep the pieces :-)

Something like so I suppose..

--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -42,6 +42,32 @@
  *
  */
 #define UMCG_TF_PREEMPT			0x0100U
+/*
+ * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait()
+ *
+ * Enables server loops like (vs umcg_sys_exit()):
+ *
+ *   for(;;) {
+ *	self->status = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
+ *	// smp_mb() implied by xchg()
+ *
+ *	runnable_ptr = xchg(self->runnable_workers_ptr, NULL);
+ *	while (runnable_ptr) {
+ *		next = runnable_ptr->runnable_workers_ptr;
+ *
+ *		umcg_server_add_runnable(self, runnable_ptr);
+ *
+ *		runnable_ptr = next;
+ *	}
+ *
+ *	self->next = umcg_server_pick_next(self);
+ *	sys_umcg_wait(0, 0);
+ *   }
+ *
+ * without a signal or interrupt in between setting umcg_task::state and
+ * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume().
+ */
+#define UMCG_TF_COND_WAIT		0x0200U
 
 #define UMCG_TF_MASK			0xff00U
 
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -180,7 +180,7 @@ void umcg_worker_exit(void)
 /*
  * Do a state transition, @from -> @to, and possible read @next after that.
  *
- * Will clear UMCG_TF_PREEMPT.
+ * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT.
  *
  * When @to == {BLOCKED,RUNNABLE}, update timestamps.
  *
@@ -216,7 +216,8 @@ static int umcg_update_state(struct task
 		if ((old & UMCG_TASK_MASK) != from)
 			goto fail;
 
-		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new = old & ~(UMCG_TASK_MASK |
+			      UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT);
 		new |= to & UMCG_TASK_MASK;
 
 	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
@@ -567,11 +568,13 @@ void umcg_notify_resume(struct pt_regs *
 	if (state == UMCG_TASK_RUNNING)
 		goto done;
 
-	// XXX can get here when:
-	//
-	// self->state = RUNNABLE
-	// <signal>
-	// sys_umcg_wait();
+	/*
+	 * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call
+	 * sys_umcg_wait() and signals/interrupts shouldn't block
+	 * return-to-user.
+	 */
+	if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT))
+		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
 		if (umcg_pin_pages())
@@ -658,6 +661,13 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 	if (ret)
 		goto unblock;
 
+	/*
+	 * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE.
+	 */
+	ret = umcg_update_state(self, tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE);
+	if (ret)
+		goto unpin;
+
 	if (worker) {
 		ret = umcg_enqueue_runnable(tsk);
 		if (ret)
Peter Oskolkov Jan. 19, 2022, 5:26 p.m. UTC | #32
On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > worse, multiple running workers).
> >
> > It should not. Timed out workers should be added to the runnable list
> > and not become running unless a server chooses so. So sys_umcg_wait()
> > with a timeout should behave similarly to a normal sleep, in that the
> > server is woken upon the worker blocking, and upon the worker wakeup
> > the worker is added to the woken workers list and waits for a server
> > to run it. The only difference is that in a sleep the worker becomes
> > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > time.
> >
> > Why then have sys_umcg_wait() with a timeout at all, instead of
> > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > context-switched into by another worker, or made running by a server;
> > if the worker is in nanosleep(), it just sleeps.
>
> I've been trying to figure out the semantics of that timeout thing, and
> I can't seem to make sense of it.
>
> Consider two workers:
>
>         S0 running A                            S1 running B
>
> therefore:
>
>         S0::state == RUNNABLE                   S1::state == RUNNABLE
>         A::server_tid == S0.tid                 B::server_tid = S1.tid
>         A::state == RUNNING                     B::state == RUNNING
>
> Doing:
>
>         self->state = RUNNABLE;                 self->state = RUNNABLE;
>         sys_umcg_wait(0);                       sys_umcg_wait(10);
>           umcg_enqueue_runnable()                 umcg_enqueue_runnable()

sys_umcg_wait() should not enqueue the worker as runnable; workers are
enqueued to indicate wakeup events.

>           umcg_wake()                             umcg_wake()
>           umcg_wait()                             umcg_wait()
>                                                     hrtimer_start()
>
> In both cases we get the exact same outcome:
>
>         A::state == RUNNABLE                    B::state == RUNNABLE
>         S0::state == RUNNING                    S1::state == RUNNING
>         S0::runnable_ptr == &A                  S1::runnable_ptr = &B

So without sys_umcg_wait enqueueing into the queue, the state now is

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == NULL

>
>
> Which is, AFAICT, the exact state you wanted to achieve, except B now
> has an active timer, but what do you want it to do when that goes?

When the timer goes off, _then_ B is enqueued into the queue, so the
state becomes

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == &B

So worker timeouts in sys_umcg_wait are treated as wakeup events, with
the difference that when the worker is eventually scheduled by a
server, sys_umcg_wait returns with ETIMEDOUT.
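
To make that concrete, a worker-side sketch of these semantics might
look like the below (illustrative only, not necessarily what the patch
as posted implements; cond_signalled() and the state_ts/next_tid setup
are placeholders for the userspace scheduler's own protocol):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef __NR_umcg_wait
	#define __NR_umcg_wait 451	/* x86-64 number, per this series */
	#endif

	extern int cond_signalled(void);	/* made-up condition check */

	static int worker_timed_wait(unsigned long long abs_timeout)
	{
		for (;;) {
			if (cond_signalled())
				return 0;

			/* ... set state_ts to IDLE and next_tid per the protocol ... */

			if (syscall(__NR_umcg_wait, 0, abs_timeout) == 0)
				continue;	/* woken by a server; re-check */

			if (errno == ETIMEDOUT) {
				/*
				 * The timeout was treated as a wakeup event;
				 * a server has since scheduled us again, so
				 * report the timeout to the caller.
				 */
				return -ETIMEDOUT;
			}

			/* EINTR etc.: re-check the condition and retry. */
		}
	}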

>
> I'm tempted to say workers cannot have timeout, and servers can use it
> to wake themselves.
Peter Zijlstra Jan. 20, 2022, 11:07 a.m. UTC | #33
On Wed, Jan 19, 2022 at 09:26:41AM -0800, Peter Oskolkov wrote:
> On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > > worse, multiple running workers).
> > >
> > > It should not. Timed out workers should be added to the runnable list
> > > and not become running unless a server chooses so. So sys_umcg_wait()
> > > with a timeout should behave similarly to a normal sleep, in that the
> > > server is woken upon the worker blocking, and upon the worker wakeup
> > > the worker is added to the woken workers list and waits for a server
> > > to run it. The only difference is that in a sleep the worker becomes
> > > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > > time.
> > >
> > > Why then have sys_umcg_wait() with a timeout at all, instead of
> > > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > > context-switched into by another worker, or made running by a server;
> > > if the worker is in nanosleep(), it just sleeps.
> >
> > I've been trying to figure out the semantics of that timeout thing, and
> > I can't seem to make sense of it.
> >
> > Consider two workers:
> >
> >         S0 running A                            S1 running B
> >
> > therefore:
> >
> >         S0::state == RUNNABLE                   S1::state == RUNNABLE
> >         A::server_tid == S0.tid                 B::server_tid = S1.tid
> >         A::state == RUNNING                     B::state == RUNNING
> >
> > Doing:
> >
> >         self->state = RUNNABLE;                 self->state = RUNNABLE;
> >         sys_umcg_wait(0);                       sys_umcg_wait(10);
> >           umcg_enqueue_runnable()                 umcg_enqueue_runnable()
> 
> sys_umcg_wait() should not enqueue the worker as runnable; workers are
> enqueued to indicate wakeup events.

Oooh... I see.

> So worker timeouts in sys_umcg_wait are treated as wakeup events, with
> the difference that when the worker is eventually scheduled by a
> server, sys_umcg_wait returns with ETIMEDOUT.

Right.. OK, let me go fold and polish what I have now before I go change
things again though.
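
For completeness, registration against the uapi as posted below looks
roughly like this (sketch only; umcg_self is a made-up TLS variable, no
error handling, and idle_workers_ptr/idle_server_tid_ptr are assumed to
have been pointed at the userspace scheduler's shared structures before
a worker registers):

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>			/* from this series */

	#ifndef __NR_umcg_ctl
	#define __NR_umcg_ctl 450		/* x86-64 number added below */
	#endif

	/* Zero-initialized; 64-byte alignment comes from the uapi struct. */
	static __thread struct umcg_task umcg_self;

	static long umcg_register_server(void)
	{
		umcg_self.state_ts = UMCG_TASK_RUNNING;	/* servers register RUNNING */
		return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
	}

	static long umcg_register_worker(void)
	{
		umcg_self.state_ts = UMCG_TASK_BLOCKED;	/* workers register BLOCKED */
		return syscall(__NR_umcg_ctl,
			       UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &umcg_self);
	}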
diff mbox series

Patch

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..f09f96bb7f35 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@ 
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..1749f0f74fed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..dc9a8b8c5761 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@  struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;

 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,12 @@  struct task_struct {
 	unsigned long rseq_event_mask;
 #endif

+#ifdef CONFIG_UMCG
+	struct umcg_task __user	*umcg_task;
+	struct page		*pinned_umcg_worker_page;  /* self */
+	struct page		*pinned_umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	union {
@@ -1687,6 +1694,13 @@  extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2287,6 +2301,63 @@  static inline void rseq_execve(struct task_struct *t)

 #endif

+#ifdef CONFIG_UMCG
+
+void umcg_handle_resuming_worker(void);
+void umcg_handle_exiting_worker(void);
+void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
+static inline void umcg_handle_notify_resume(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_resuming_worker();
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_exiting_worker();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_notify_resume(void)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
+#endif
+
 #ifdef CONFIG_DEBUG_RSEQ

 void rseq_syscall(struct pt_regs *regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..424a4686be74 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@  struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;

 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@  asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..6d29b3896d4c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@  __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)

+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452

 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..cd9f60002821
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,137 @@ 
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ */
+#define UMCG_TASK_NONE			0ULL
+#define UMCG_TASK_RUNNING		1ULL
+#define UMCG_TASK_IDLE			2ULL
+#define UMCG_TASK_BLOCKED		3ULL
+
+/* UMCG task state flags, bits 6-7 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED			(1ULL << 6)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED		(1ULL << 7)
+
+/* The first six bits: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK		0x3fULL
+
+/* The full state mask: the first 18 bits. */
+#define UMCG_TASK_STATE_MASK_FULL	0x3ffffULL
+
+/*
+ * The number of bits reserved for UMCG state timestamp in
+ * struct umcg_task.state_ts.
+ */
+#define UMCG_STATE_TIMESTAMP_BITS	46
+
+/* The number of bits truncated from UMCG state timestamp. */
+#define UMCG_STATE_TIMESTAMP_GRANULARITY	4
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state_ts: the current state of the UMCG task described by
+	 *            this struct, with a unique timestamp indicating
+	 *            when the last state change happened.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  5: task state;
+	 *   bits  6 -  7: state flags;
+	 *   bits  8 - 12: reserved; must be zeroes;
+	 *   bits 13 - 17: for userspace use;
+	 *   bits 18 - 63: timestamp (see below).
+	 *
+	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+	__u64	state_ts;		/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero.
+	 *
+	 * Running UMCG workers must have next_tid set to point to IDLE
+	 * UMCG servers.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;		/* r   */
+
+	__u32	flags;			/* Reserved; must be zero. */
+
+	/**
+	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, the userspace removes them.
+	 */
+	__u64	idle_workers_ptr;	/* r/w */
+
+	/**
+	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+	 *                       Readonly.
+	 */
+	__u64	idle_server_tid_ptr;	/* r   */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY:      wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ *                            (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ *                            must be set.
+ */
+enum umcg_wait_flag {
+	UMCG_WAIT_WAKE_ONLY			= 1,
+	UMCG_WAIT_WF_CURRENT_CPU		= 2,
+};
+
+/* See Documentation/userspace-api/umcg.txt.*/
+#define UMCG_IDLE_NODE_PENDING (1ULL)
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 036b750e8d8a..365802b25100 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,16 @@  config MEMBARRIER

 	  If unsure, say Y.

+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on X86_64
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..62453772a0c7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,8 +171,10 @@  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);

-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			umcg_handle_notify_resume();
 			tracehook_notify_resume(regs);
+		}

 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..4bdd51c75aee 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@  void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");

+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@  void __noreturn do_exit(long code)

 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();

 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index c7421f2d05e1..c03eea9bc738 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@  obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5344aa0afe5a..26362cfcee84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4269,6 +4269,7 @@  static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6327,9 +6328,11 @@  static inline void sched_submit_work(struct task_struct *tsk)
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6347,9 +6350,11 @@  static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..8f43a9f786c1
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,949 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+
+/**
+ * get_user_nofault - get user value without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nofault(out, uaddr)			\
+({							\
+	int ret = -EFAULT;				\
+							\
+	if (access_ok((uaddr), sizeof(*(uaddr)))) {	\
+		pagefault_disable();			\
+							\
+		if (!__get_user((out), (uaddr)))	\
+			ret = 0;			\
+							\
+		pagefault_enable();			\
+	}						\
+	ret;						\
+})
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ *
+ * The pages are pinned when the worker exits to the userspace and unpinned
+ * when the worker is in sched_submit_work(), i.e. when the worker is
+ * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
+ * are pinned at any one time across the whole system.
+ *
+ * The pinning is needed so that going-to-sleep workers can access
+ * their and their servers' userspace umcg_task structs without page faults,
+ * as the code path can be executed in the context of a pagefault, with
+ * mm lock held.
+ */
+static int umcg_pin_pages(u32 server_tid)
+{
+	struct umcg_task __user *worker_ut = current->umcg_task;
+	struct umcg_task __user *server_ut = NULL;
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		server_ut = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!server_ut)
+		return -EINVAL;
+
+	tsk = current;
+
+	/* worker_ut is stable, don't need to repin */
+	if (!tsk->pinned_umcg_worker_page)
+		if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+					&tsk->pinned_umcg_worker_page))
+			return -EFAULT;
+
+	/* server_ut may change, need to repin */
+	if (tsk->pinned_umcg_server_page) {
+		unpin_user_page(tsk->pinned_umcg_server_page);
+		tsk->pinned_umcg_server_page = NULL;
+	}
+
+	if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+				&tsk->pinned_umcg_server_page))
+		return -EFAULT;
+
+	return 0;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->pinned_umcg_worker_page)
+		unpin_user_page(tsk->pinned_umcg_worker_page);
+	if (tsk->pinned_umcg_server_page)
+		unpin_user_page(tsk->pinned_umcg_server_page);
+
+	tsk->pinned_umcg_worker_page = NULL;
+	tsk->pinned_umcg_server_page = NULL;
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+
+		/* These can be simple writes - see the comment above. */
+		tsk->pinned_umcg_worker_page = NULL;
+		tsk->pinned_umcg_server_page = NULL;
+		tsk->flags &= ~PF_UMCG_WORKER;
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts   - points to the state_ts member of struct umcg_task to update;
+ * @expected   - the expected value of state_ts, including the timestamp;
+ * @desired    - the desired value of state_ts, state part only;
+ * @may_fault  - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+				bool may_fault)
+{
+	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+	/* Cut higher order bits. */
+	next_ts &= (1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1;
+
+	if (next_ts == curr_ts)
+		++next_ts;
+
+	/* Remove an old timestamp, if any. */
+	desired &= UMCG_TASK_STATE_MASK_FULL;
+
+	/* Set the new timestamp. */
+	desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+	if (may_fault)
+		return cmpxchg_user_64(state_ts, expected, desired);
+
+	return cmpxchg_user_64_nofault(state_ts, expected, desired);
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to
+ *         idle_workers_ptr, and the worker put to sleep; an idle server
+ *         from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+	struct umcg_task ut;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_handle_exiting_worker();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;
+
+		/* Trigger umcg_handle_resuming_worker() */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	} else {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+	}
+
+	return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ *                          upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+	u64 curr_state, next_state;
+	int ret;
+
+	if (get_user(curr_state, &self->state_ts))
+		return -EFAULT;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+		/* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+		next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+		next_state |= UMCG_TASK_BLOCKED;
+
+		ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
+		if (ret)
+			return ret;
+
+		return -ETIMEDOUT;
+	}
+
+	return 0;  /* Not really timed out. */
+}
+
+/*
+ * umcg_should_idle - return true if tasks with @state should block in
+ *                    umcg_idle_loop().
+ */
+static bool umcg_should_idle(u64 state)
+{
+	switch (state & UMCG_TASK_STATE_MASK) {
+	case UMCG_TASK_RUNNING:
+		return state & UMCG_TF_LOCKED;
+	case UMCG_TASK_IDLE:
+		return !(state & UMCG_TF_LOCKED);
+	case UMCG_TASK_BLOCKED:
+		return false;
+	default:
+		WARN_ONCE(true, "unknown UMCG task state");
+		return false;
+	}
+}
+
+/**
+ * umcg_idle_loop - sleep until !umcg_should_idle() or a timeout expires
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * freezable_schedule().
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ *       and because servers should not be running WITH attached workers,
+ *       the function returns only on fatal signal pending and ignores/flushes
+ *       all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+	int ret;
+	struct page *pinned_page = NULL;
+	struct hrtimer_sleeper timeout;
+	struct umcg_task __user *self = current->umcg_task;
+	const bool worker = current->flags & PF_UMCG_WORKER;
+
+	/* Clear PF_UMCG_WORKER to elide workqueue handlers. */
+	if (worker)
+		current->flags &= ~PF_UMCG_WORKER;
+
+	if (abs_timeout) {
+		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+				HRTIMER_MODE_ABS);
+
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+				current->timer_slack_ns);
+	}
+
+	while (true) {
+		u64 umcg_state;
+
+		/*
+		 * We need to read from userspace _after_ the task is marked
+		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
+		 * but faulting is not allowed; so we try a fast no-fault read,
+		 * and if it fails, pin the page temporarily.
+		 */
+retry_once:
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user_nofault(umcg_state, &self->state_ts)) {
+			set_current_state(TASK_RUNNING);
+
+			if (pinned_page)
+				goto out;
+			else if (1 != pin_user_pages_fast((unsigned long)self,
+						1, 0, &pinned_page))
+				goto out;
+
+			goto retry_once;
+		}
+
+		if (pinned_page) {
+			unpin_user_page(pinned_page);
+			pinned_page = NULL;
+		}
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state)) {
+			set_current_state(TASK_RUNNING);
+			goto out;
+		}
+
+		if (abs_timeout)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		if (!abs_timeout || timeout.task)
+			freezable_schedule();
+
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * Check for timeout before checking the state, as workers
+		 * are not going to return from freezable_schedule() unless
+		 * they are RUNNING.
+		 */
+		ret = -ETIMEDOUT;
+		if (abs_timeout && !timeout.task)
+			goto out;
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user(umcg_state, &self->state_ts))
+			goto out;
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state))
+			goto out;
+
+		ret = -EINTR;
+		if (fatal_signal_pending(current))
+			goto out;
+
+		if (signal_pending(current))
+			flush_signals(current);
+	}
+
+out:
+	if (pinned_page) {
+		unpin_user_page(pinned_page);
+		pinned_page = NULL;
+	}
+	if (abs_timeout) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+	if (worker) {
+		current->flags |= PF_UMCG_WORKER;
+
+		if (ret == -ETIMEDOUT)
+			ret = handle_timedout_worker(self);
+
+		/* Workers must go through workqueue handlers upon wakeup. */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	}
+	return ret;
+}
+
+/**
+ * umcg_wakeup_allowed - check whether @current can wake @tsk.
+ *
+ * Currently a placeholder that allows wakeups within a single process
+ * only (same mm). In the future the requirement will be relaxed (securely).
+ */
+static bool umcg_wakeup_allowed(struct task_struct *tsk)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
+		return true;
+
+	return false;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set. May be called
+ * cross-process.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ *       ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+	struct task_struct *next;
+
+	rcu_read_lock();
+	next = find_task_by_vpid(next_tid);
+	if (!next || !umcg_wakeup_allowed(next)) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	/* The result of ttwu below is ignored. */
+	try_to_wake_up(next, TASK_NORMAL, wake_flags);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+	int ret;
+
+	ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
+	if (ret)
+		return ret;
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        zero or a value from enum umcg_wait_flag.
+ * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY) (also see umcg_idle_loop and
+ * umcg_should_idle above).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ *
+ * Return:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task;
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. bad @flags, or the current
+ *                 task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+	struct umcg_task __user *self = current->umcg_task;
+	u32 next_tid;
+
+	if (!self)
+		return -EINVAL;
+
+	if (get_user(next_tid, &self->next_tid))
+		return -EFAULT;
+
+	if (flags & UMCG_WAIT_WAKE_ONLY) {
+		if (!next_tid || abs_timeout)
+			return -EINVAL;
+
+		flags &= ~UMCG_WAIT_WAKE_ONLY;
+		if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+			return -EINVAL;
+
+		return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+					WF_CURRENT_CPU : 0);
+	}
+
+	/* Unlock the worker, if locked. */
+	if (current->flags & PF_UMCG_WORKER) {
+		u64 umcg_state;
+
+		if (get_user(umcg_state, &self->state_ts))
+			return -EFAULT;
+
+		if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
+					&self->state_ts, &umcg_state,
+					umcg_state & ~UMCG_TF_LOCKED, true))
+			return -EFAULT;
+	}
+
+	if (next_tid)
+		return umcg_do_context_switch(next_tid, abs_timeout);
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, or
+ *       syscall exit to usermode loop, so all errors result in the
+ *       termination of the current task (via SIGKILL).
+ */
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_nofault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+
+	if (!ut_server)
+		goto out_rcu;
+
+	ret = -EFAULT;
+	if (get_user_nofault(state, &ut_server->state_ts))
+		goto out_rcu;
+
+	ret = -EAGAIN;
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		goto out_rcu;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			false);
+
+	if (ret)
+		goto out_rcu;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+
+out_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_may_fault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!ut_server)
+		return -EINVAL;
+
+	if (get_user(state, &ut_server->state_ts))
+		return -EFAULT;
+
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		return -EAGAIN;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			true);
+	if (ret)
+		return ret;
+
+	/*
+	 * umcg_ttwu will call find_task_by_vpid again; but we cannot
+	 * elide this, as we cannot do get_user() from an rcu-locked
+	 * code block.
+	 */
+	return umcg_ttwu(server_tid, WF_CURRENT_CPU);
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
+{
+	int ret = umcg_wake_idle_server_nofault(server_tid);
+
+	if (!ret)
+		return 0;
+
+	if (!may_fault || ret != -EFAULT)
+		return ret;
+
+	return umcg_wake_idle_server_may_fault(server_tid);
+}
+
+/*
+ * Called in sched_submit_work() context for UMCG workers. In the common case,
+ * the worker's state changes RUNNING => BLOCKED, and its server's state
+ * changes IDLE => RUNNING, and the server is ttwu-ed.
+ *
+ * Under some conditions (e.g. the worker is "locked", see
+ * Documentation/userspace-api/umcg.txt for more details), the
+ * function does nothing.
+ *
+ * The function is called with preempt disabled to make sure the retry_once
+ * logic below works correctly.
+ */
+static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+	bool retried = false;
+	u32 tid;
+	int ret;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
+		return;
+
+	/* If the worker has no server, do nothing. */
+	if (unlikely(!tsk->pinned_umcg_server_page))
+		return;
+
+	if (get_user_nofault(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	/*
+	 * The userspace is allowed to concurrently change a RUNNING worker's
+	 * state only once in a "short" period of time, so we retry state
+	 * change at most once. As this retry block is within a
+	 * preempt_disable region, "short" is truly short here.
+	 *
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+retry_once:
+	if (curr_state & UMCG_TF_LOCKED)
+		return;
+
+	if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
+			"Unexpected UMCG worker state."))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_BLOCKED;
+
+	ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
+	if (ret == -EAGAIN) {
+		if (retried)
+			goto die;
+
+		retried = true;
+		goto retry_once;
+	}
+	if (ret)
+		goto die;
+
+	smp_mb();  /* Order state read/write above and getting next_tid below. */
+	if (get_user_nofault(tid, &ut_worker->next_tid))
+		goto die;
+
+	*server_tid = tid;
+	return;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work(). Must not fault/sleep. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	u32 server_tid;
+
+	/*
+	 * Disable preemption so that retry_once in process_sleeping_worker
+	 * works properly.
+	 */
+	preempt_disable();
+	process_sleeping_worker(tsk, &server_tid);
+	preempt_enable();
+
+	if (server_tid) {
+		int ret = umcg_wake_idle_server_nofault(server_tid);
+
+		if (ret && ret != -EAGAIN)
+			goto die;
+	}
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	umcg_unpin_pages();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+	u64 __user *node = &ut_worker->idle_workers_ptr;
+	u64 __user *head_ptr;
+	u64 first = (u64)node;
+	u64 head;
+
+	if (get_user(head, node) || !head)
+		return false;
+
+	head_ptr = (u64 __user *)head;
+
+	/* Mark the worker as pending. */
+	if (put_user(UMCG_IDLE_NODE_PENDING, node))
+		return false;
+
+	/* Make the head point to the worker. */
+	if (xchg_user_64(head_ptr, &first))
+		return false;
+
+	/* Make the worker point to the previous head. */
+	if (put_user(first, node))
+		return false;
+
+	return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+	u64 server_tid_ptr;
+	u32 tid;
+
+	/* Empty result is OK. */
+	*server_tid = 0;
+
+	if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+		return false;
+
+	if (!server_tid_ptr)
+		return false;
+
+	tid = 0;
+	if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
+		return false;
+
+	*server_tid = tid;
+	return true;
+}
+
+/*
+ * Returns true to wait for the userspace to schedule this worker, false
+ * to return to the userspace.
+ *
+ * In the common case, a BLOCKED worker is marked IDLE and enqueued
+ * to idle_workers_ptr list. The idle server is woken (if present).
+ *
+ * If a RUNNING worker is preempted, this function will trigger, in which
+ * case the worker is moved to IDLE state and its server is woken.
+ *
+ * Sets @server_tid to point to the server to be woken if the worker
+ * is going to sleep; sets @server_tid to point to the server assigned
+ * to this RUNNING worker if the worker is to return to the userspace.
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+		return false;
+
+	if (fatal_signal_pending(tsk))
+		return false;
+
+	if (get_user(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+		u32 tid;
+
+		/* Wakeup: wait but don't enqueue. */
+		if (curr_state & UMCG_TF_LOCKED)
+			return true;
+
+		smp_mb();  /* Order getting state and getting server_tid */
+		if (get_user(tid, &ut_worker->next_tid))
+			goto die;
+
+		if (!tid)
+			/* RUNNING workers must have servers. */
+			goto die;
+
+		*server_tid = tid;
+
+		/* pass-through: RUNNING with a server. */
+		if (!(curr_state & UMCG_TF_PREEMPTED))
+			return false;
+
+		/*
+		 * Fallthrough to mark the worker IDLE: the worker is
+		 * PREEMPTED.
+		 */
+	} else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+			(curr_state & UMCG_TF_LOCKED)))
+		/* The worker prepares to sleep or to unregister. */
+		return false;
+
+	if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_IDLE;
+
+	if (umcg_update_state(&ut_worker->state_ts, &curr_state,
+			next_state, true))
+		goto die;
+
+	if (!enqueue_idle_worker(ut_worker))
+		goto die;
+
+	smp_mb();  /* Order enqueuing the worker with getting the server. */
+	if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+		goto die;
+
+	return true;
+
+die:
+	pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
+	force_sig(SIGKILL);
+	return false;
+}
+
+/*
+ * Called from sched_update_worker(): defer all work until later, as
+ * sched_update_worker() may be called with in-kernel locks held.
+ */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+}
+
+/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
+void umcg_handle_resuming_worker(void)
+{
+	u32 server_tid;
+
+	/* Avoid recursion by removing PF_UMCG_WORKER */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	do {
+		bool should_wait;
+
+		should_wait = process_waking_worker(current, &server_tid);
+		if (!should_wait)
+			break;
+
+		if (server_tid) {
+			int ret = umcg_wake_idle_server(server_tid, true);
+
+			if (ret && ret != -EAGAIN)
+				goto die;
+		}
+
+		umcg_idle_loop(0);
+	} while (true);
+
+	if (!server_tid)
+		/* No server => no reason to pin pages. */
+		umcg_unpin_pages();
+	else if (umcg_pin_pages(server_tid))
+		goto die;
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..82d233aa2648 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@  COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);

+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */

 /* mm/fadvise.c */