
[RFC,v2,4/5] sched: UMCG: add a blocked worker list

Message ID 20220113233940.3608440-5-posk@google.com (mailing list archive)
State New
Series User Managed Concurrency Groups

Commit Message

Peter Oskolkov Jan. 13, 2022, 11:39 p.m. UTC
The original idea of a UMCG server was that it was used as a proxy
for a CPU, so if a worker associated with the server was RUNNING,
the server itself was never allowed to be RUNNING as well;
when umcg_wait() returned for a server, it meant that its worker
became BLOCKED.

In the new (old?) "per server runqueues" model implemented in
the previous patch in this patchset, servers are woken when
a previously blocked worker on their runqueue finishes its blocking
operation, even if the currently RUNNING worker continues running.

Since a server may now run while a worker assigned to it is
running, the original idea of having at most a single worker
RUNNING per server, as a means to control the number of running
workers, is not really enforced anymore; the server, woken by a
worker doing a BLOCKED=>RUNNABLE transition, may then call
sys_umcg_wait() with a second/third/etc. worker to run.

Support this scenario by adding a blocked worker list:
when a worker transitions RUNNING=>BLOCKED, not only its server
is woken, but the worker is also added to the blocked worker list
of its server.
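
As an illustration (not part of this patch), a minimal userspace sketch
of a server draining its blocked worker list after being woken. Only the
two list fields of the uapi struct added below are assumed; the struct
is abbreviated and the note_worker_blocked() hook is hypothetical:

/*
 * Illustrative only: detach the whole blocked worker list with one
 * atomic exchange and walk it.  The kernel pushes entries with cmpxchg,
 * so a concurrent push simply lands on the now-empty head.
 */
#include <stdint.h>

struct umcg_task {
	/* ... state, tids and timestamps elided ... */
	uint64_t blocked_workers_ptr;
	uint64_t runnable_workers_ptr;
	/* ... */
};

void note_worker_blocked(struct umcg_task *worker);	/* hypothetical hook */

static void drain_blocked_workers(struct umcg_task *server)
{
	uint64_t head = __atomic_exchange_n(&server->blocked_workers_ptr, 0,
					    __ATOMIC_ACQ_REL);

	while (head) {
		struct umcg_task *w = (struct umcg_task *)(uintptr_t)head;

		head = w->blocked_workers_ptr;	/* next entry in the list */
		w->blocked_workers_ptr = 0;
		note_worker_blocked(w);
	}
}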

This change introduces the following benefits:
- block detection now behaves similarly to wake detection;
  without this patch, worker wakeups added wakees to the list
  and woke the server, while worker blocks only woke the server
  without adding blocked workers to a list, forcing servers
  to explicitly check the worker's state;
- if the blocked worker woke sufficiently quickly, the server
  woken on the block event would observe its worker now as
  RUNNABLE, so the block event had to be inferred rather than
  explicitly signalled by the worker being added to the blocked
  worker list;
- it is now possible for a single server to control several
  RUNNING workers, which makes writing userspace schedulers
  simpler for smaller processes that do not need to scale beyond
  one "server";
- if the userspace wants to keep at most a single RUNNING worker
  per server, and have multiple servers with their own runqueues,
  this model is also naturally supported here.

So this change basically decouples block/wake detection from
M:N threading in the sense that the number of servers now
does not have to be M or N, but is more driven by the scalability
needs of the userspace application.

Why keep this server/worker model at all then, and not use
something like io_uring to deliver block/wake events to the
userspace? The main benefit of this model is that servers
are woken synchronously on-cpu when an event happens, while
io_uring is more of an asynchronous event framework, so latencies
in this model are potentially better.

In addition, "multiple runqueues" type of scheduling is much easier
to implement with this method than with io_uring.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/umcg.h | 10 ++++-
 kernel/sched/umcg.c       | 90 ++++++++++++++++++++++++++++-----------
 2 files changed, 75 insertions(+), 25 deletions(-)

Comments

Peter Zijlstra Jan. 17, 2022, 9:19 a.m. UTC | #1
On Thu, Jan 13, 2022 at 03:39:39PM -0800, Peter Oskolkov wrote:
> The original idea of a UMCG server was that it was used as a proxy
> for a CPU, so if a worker associated with the server was RUNNING,
> the server itself was never allowed to be RUNNING as well;
> when umcg_wait() returned for a server, it meant that its worker
> became BLOCKED.
> 
> In the new (old?) "per server runqueues" model implemented in
> the previous patch in this patchset, servers are woken when
> a previously blocked worker on their runqueue finishes its blocking
> operation, even if the currently RUNNING worker continues running.
> 
> Since a server may now run while a worker assigned to it is
> running, the original idea of having at most a single worker
> RUNNING per server, as a means to control the number of running
> workers, is not really enforced anymore; the server, woken by a
> worker doing a BLOCKED=>RUNNABLE transition, may then call
> sys_umcg_wait() with a second/third/etc. worker to run.
> 
> Support this scenario by adding a blocked worker list:
> when a worker transitions RUNNING=>BLOCKED, not only its server
> is woken, but the worker is also added to the blocked worker list
> of its server.
> 
> This change introduces the following benefits:
> - block detection now behaves similarly to wake detection;
>   without this patch, worker wakeups added wakees to the list
>   and woke the server, while worker blocks only woke the server
>   without adding blocked workers to a list, forcing servers
>   to explicitly check the worker's state;
> - if the blocked worker woke sufficiently quickly, the server
>   woken on the block event would observe its worker now as
>   RUNNABLE, so the block event had to be inferred rather than
>   explicitly signalled by the worker being added to the blocked
>   worker list;
> - it is now possible for a single server to control several
>   RUNNING workers, which makes writing userspace schedulers
>   simpler for smaller processes that do not need to scale beyond
>   one "server";
> - if the userspace wants to keep at most a single RUNNING worker
>   per server, and have multiple servers with their own runqueues,
>   this model is also naturally supported here.
> 
> So this change basically decouples block/wake detection from
> M:N threading in the sense that the number of servers now
> does not have to be M or N, but is more driven by the scalability
> needs of the userspace application.

So I don't object to having this blocking list, we had that early on in
the discussions.

*However*, combined with WF_CURRENT_CPU this 1:N userspace model doesn't
really make sense, also combined with Proxy-Exec (if we ever get that
sorted) it will fundamentally not work.

More consideration is needed I think...
Peter Oskolkov Jan. 18, 2022, 5:16 p.m. UTC | #2
On Mon, Jan 17, 2022 at 1:19 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jan 13, 2022 at 03:39:39PM -0800, Peter Oskolkov wrote:

[...]

> >
> > So this change basically decouples block/wake detection from
> > M:N threading in the sense that the number of servers now
> > does not have to be M or N, but is more driven by the scalability
> > needs of the userspace application.
>
> So I don't object to having this blocking list, we had that early on in
> the discussions.
>
> *However*, combined with WF_CURRENT_CPU this 1:N userspace model doesn't
> really make sense, also combined with Proxy-Exec (if we ever get that
> sorted) it will fundamentally not work.
>
> More consideration is needed I think...

I was not very clear here. The intent of this change is not to make
1:N a good general approach, but to make "several running workers per
single server" a viable option.

My guess, based on some numbers/benchmarks from another project, is
that having a single server/runqueue per four or eight running
workers, properly aligned with (= affined to) an AMD chiplet, will be
the most performant solution, compared to both a runqueue per single
running worker and a global runqueue. On Intel this will probably
look like a single runqueue per core (2 running workers/HT threads).

So in this model a "server" represents a runqueue.
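
As a purely illustrative sketch (not from this series), pinning one
server thread to the CPUs of one 8-wide chiplet, so that the server's
runqueue effectively belongs to that chiplet:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/*
 * Pin @server to the @cpus_per_chiplet CPUs that make up chiplet
 * @chiplet.  Assumes CPUs are numbered contiguously per chiplet.
 */
static int pin_server_to_chiplet(pthread_t server, int chiplet,
				 int cpus_per_chiplet)
{
	cpu_set_t set;
	int i;

	CPU_ZERO(&set);
	for (i = 0; i < cpus_per_chiplet; i++)
		CPU_SET(chiplet * cpus_per_chiplet + i, &set);

	return pthread_setaffinity_np(server, sizeof(set), &set);
}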

I'll reply to other active umcg discussions shortly.
Peter Zijlstra Jan. 27, 2022, 3:37 p.m. UTC | #3
On Thu, Jan 13, 2022 at 03:39:39PM -0800, Peter Oskolkov wrote:

> This change introduces the following benefits:
> - block detection now behaves similarly to wake detection;
>   without this patch, worker wakeups added wakees to the list
>   and woke the server, while worker blocks only woke the server
>   without adding blocked workers to a list, forcing servers
>   to explicitly check the worker's state;

> - if the blocked worker woke sufficiently quickly, the server
>   woken on the block event would observe its worker now as
>   RUNNABLE, so the block event had to be inferred rather than
>   explicitly signalled by the worker being added to the blocked
>   worker list;

This I think is missing the point, there is no race if the server checks
curr->state == RUNNING.
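
(A sketch of that check, for illustration only; the constant value and
the flag masking below are assumptions loosely based on this series'
uapi header, not a definitive reading of it:)

#include <stdint.h>

#define UMCG_TASK_RUNNING	1	/* assumed value */
#define UMCG_TASK_STATE_MASK	0xff	/* assumed: low bits hold the state */

/* Server-side check: is the currently assigned worker still RUNNING? */
static int worker_still_running(const uint32_t *worker_state)
{
	uint32_t state = __atomic_load_n(worker_state, __ATOMIC_ACQUIRE);

	return (state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING;
}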

> - it is now possible for a single server to control several
>   RUNNING workers, which makes writing userspace schedulers
>   simpler for smaller processes that do not need to scale beyond
>   one "server";

How about something like so on top?

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1298,6 +1298,7 @@ struct task_struct {
 
 #ifdef CONFIG_UMCG
 	/* setup by sys_umcg_ctrl() */
+	u32			umcg_flags;
 	clockid_t		umcg_clock;
 	struct umcg_task __user	*umcg_task;
 
--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -119,6 +119,8 @@ struct umcg_task {
 	 *
 	 * Readable/writable by both the kernel and the userspace: the
 	 * kernel adds items to the list, userspace removes them.
+	 *
+	 * Only used with UMCG_CTL_MULTI.
 	 */
 	__u64	blocked_workers_ptr;		/* r/w */
 
@@ -147,11 +149,13 @@ enum umcg_wait_flag {
  * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
  * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
  * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ * @UMCG_CTL_MULTI:	 allow 1:n worker relations, enables blocked_workers_ptr
  */
 enum umcg_ctl_flag {
 	UMCG_CTL_REGISTER	= 0x00001,
 	UMCG_CTL_UNREGISTER	= 0x00002,
 	UMCG_CTL_WORKER		= 0x10000,
+	UMCG_CTL_MULTI		= 0x20000,
 };
 
 #endif /* _UAPI_LINUX_UMCG_H */
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -335,7 +335,7 @@ static inline int umcg_enqueue_runnable(
 }
 
 /*
- * Enqueue @tsk on it's server's blocked list
+ * Enqueue @tsk on it's server's blocked list OR ensure @tsk == server::next_tid
  *
  * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
  *
@@ -346,10 +346,34 @@ static inline int umcg_enqueue_runnable(
  * Returns:
  *   0: success
  *   -EFAULT
+ *   -ESRCH	server::next_tid is not a valid UMCG task
+ *   -EINVAL	server::next_tid doesn't match @tsk
  */
 static inline int umcg_enqueue_blocked(struct task_struct *tsk)
 {
-	return umcg_enqueue(tsk, true /* blocked */);
+	struct task_struct *next;
+	u32 next_tid;
+	int ret;
+
+	if (tsk->umcg_server->umcg_flags & UMCG_CTL_MULTI)
+		return umcg_enqueue(tsk, true /* blocked */);
+
+	/*
+	 * When !MULTI, ensure this worker is the current worker,
+	 * ensuring the 1:1 relation.
+	 */
+	if (get_user(next_tid, &tsk->umcg_server_task->next_tid))
+		return -EFAULT;
+
+	next = umcg_get_task(next_tid);
+	if (!next)
+		return -ESRCH;
+
+	ret = (next == tsk) ? 0 : -EINVAL;
+
+	put_task_struct(next);
+
+	return ret;
 }
 
 /* pre-schedule() */
@@ -911,6 +934,8 @@ static int umcg_register(struct umcg_tas
 		return -EINVAL;
 	}
 
+	current->umcg_flags = flags;
+
 	if (current->umcg_task || !self)
 		return -EINVAL;
 
@@ -1061,7 +1086,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st
 
 	flags &= ~UMCG_CTL_CMD;
 
-	if (flags & ~(UMCG_CTL_WORKER))
+	if (flags & ~(UMCG_CTL_WORKER|UMCG_CTL_MULTI))
 		return -EINVAL;
 
 	switch (cmd) {
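
(For completeness, a hedged usage sketch of the proposed flag; the
syscall number below is a placeholder and the registration flow is an
assumption based on this series, the flag values mirror the diff above:)

#include <unistd.h>
#include <sys/syscall.h>
#include <time.h>

#ifndef __NR_umcg_ctl
#define __NR_umcg_ctl		450	/* placeholder, not an upstream number */
#endif
#define UMCG_CTL_REGISTER	0x00001
#define UMCG_CTL_MULTI		0x20000

struct umcg_task;			/* as in the series' uapi header */

/* Register the calling thread as a server allowed 1:n worker relations. */
static long umcg_register_multi_server(struct umcg_task *self)
{
	return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_MULTI,
		       self, CLOCK_MONOTONIC);
}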
Peter Oskolkov Jan. 27, 2022, 5:20 p.m. UTC | #4
On Thu, Jan 27, 2022 at 7:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jan 13, 2022 at 03:39:39PM -0800, Peter Oskolkov wrote:
>
> > This change introduces the following benefits:
> > - block detection now behaves similarly to wake detection;
> >   without this patch, worker wakeups added wakees to the list
> >   and woke the server, while worker blocks only woke the server
> >   without adding blocked workers to a list, forcing servers
> >   to explicitly check the worker's state;
>
> > - if the blocked worker woke sufficiently quickly, the server
> >   woken on the block event would observe its worker now as
> >   RUNNABLE, so the block event had to be inferred rather than
> >   explicitly signalled by the worker being added to the blocked
> >   worker list;
>
> This I think is missing the point, there is no race if the server checks
> curr->state == RUNNING.
>
> > - it is now possible for a single server to control several
> >   RUNNING workers, which makes writing userspace schedulers
> >   simpler for smaller processes that do not need to scale beyond
> >   one "server";
>
> How about something like so on top?

This will work, I think. Thanks!

----------

On a more general note, it looks like the original desire to keep state in
the userspace memory (TLS) instead of in task_struct has led to a lot
of pain and complexity due to the difficulty of updating the userspace from
non-preemptible/sched contexts. And a bunch of stuff still trickled down
to task_struct.

Is it too late to revisit the design? If all state is kept in task_struct,
most of the complexity in the patchset will go away.

The only extra thing will be the fact that the kernel will maintain
the list of blocked/runnable workers, and so there will be an additional
syscall to get it out of the kernel and into the userspace. But all the pain
of pinning pages and related mm changes will go away...

>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1298,6 +1298,7 @@ struct task_struct {
>
>  #ifdef CONFIG_UMCG
>         /* setup by sys_umcg_ctrl() */
> +       u32                     umcg_flags;
>         clockid_t               umcg_clock;
>         struct umcg_task __user *umcg_task;
>
> --- a/include/uapi/linux/umcg.h
> +++ b/include/uapi/linux/umcg.h
> @@ -119,6 +119,8 @@ struct umcg_task {
>          *
>          * Readable/writable by both the kernel and the userspace: the
>          * kernel adds items to the list, userspace removes them.
> +        *
> +        * Only used with UMCG_CTL_MULTI.
>          */
>         __u64   blocked_workers_ptr;            /* r/w */
>
> @@ -147,11 +149,13 @@ enum umcg_wait_flag {
>   * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
>   * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
>   * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
> + * @UMCG_CTL_MULTI:     allow 1:n worker relations, enables blocked_workers_ptr
>   */
>  enum umcg_ctl_flag {
>         UMCG_CTL_REGISTER       = 0x00001,
>         UMCG_CTL_UNREGISTER     = 0x00002,
>         UMCG_CTL_WORKER         = 0x10000,
> +       UMCG_CTL_MULTI          = 0x20000,
>  };
>
>  #endif /* _UAPI_LINUX_UMCG_H */
> --- a/kernel/sched/umcg.c
> +++ b/kernel/sched/umcg.c
> @@ -335,7 +335,7 @@ static inline int umcg_enqueue_runnable(
>  }
>
>  /*
> - * Enqueue @tsk on it's server's blocked list
> + * Enqueue @tsk on it's server's blocked list OR ensure @tsk == server::next_tid
>   *
>   * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
>   *
> @@ -346,10 +346,34 @@ static inline int umcg_enqueue_runnable(
>   * Returns:
>   *   0: success
>   *   -EFAULT
> + *   -ESRCH    server::next_tid is not a valid UMCG task
> + *   -EINVAL   server::next_tid doesn't match @tsk
>   */
>  static inline int umcg_enqueue_blocked(struct task_struct *tsk)
>  {
> -       return umcg_enqueue(tsk, true /* blocked */);
> +       struct task_struct *next;
> +       u32 next_tid;
> +       int ret;
> +
> +       if (tsk->umcg_server->umcg_flags & UMCG_CTL_MULTI)
> +               return umcg_enqueue(tsk, true /* blocked */);
> +
> +       /*
> +        * When !MULTI, ensure this worker is the current worker,
> +        * ensuring the 1:1 relation.
> +        */
> +       if (get_user(next_tid, &tsk->umcg_server_task->next_tid))
> +               return -EFAULT;
> +
> +       next = umcg_get_task(next_tid);
> +       if (!next)
> +               return -ESRCH;
> +
> +       ret = (next == tsk) ? 0 : -EINVAL;
> +
> +       put_task_struct(next);
> +
> +       return ret;
>  }
>
>  /* pre-schedule() */
> @@ -911,6 +934,8 @@ static int umcg_register(struct umcg_tas
>                 return -EINVAL;
>         }
>
> +       current->umcg_flags = flags;
> +
>         if (current->umcg_task || !self)
>                 return -EINVAL;
>
> @@ -1061,7 +1086,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st
>
>         flags &= ~UMCG_CTL_CMD;
>
> -       if (flags & ~(UMCG_CTL_WORKER))
> +       if (flags & ~(UMCG_CTL_WORKER|UMCG_CTL_MULTI))
>                 return -EINVAL;
>
>         switch (cmd) {

Patch

diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
index a994bbb062d5..93fccb44283b 100644
--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -116,6 +116,14 @@  struct umcg_task {
 	__u64	blocked_ts;			/*   w */
 	__u64   runnable_ts;			/*   w */
 
+	/**
+	 * @blocked_workers_ptr: a single-linked list of blocked workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	blocked_workers_ptr;		/* r/w */
+
 	/**
 	 * @runnable_workers_ptr: a single-linked list of runnable workers.
 	 *
@@ -124,7 +132,7 @@  struct umcg_task {
 	 */
 	__u64	runnable_workers_ptr;		/* r/w */
 
-	__u64	__zero[3];
+	__u64	__zero[2];
 
 } __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
 
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
index 9a8755045285..b85dec6b82e4 100644
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -343,6 +343,67 @@  static int umcg_wake(struct task_struct *tsk)
 	return umcg_wake_server(tsk);
 }
 
+/*
+ * Enqueue @tsk on it's server's blocked or runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based single linked list add such that list integrity is never
+ * violated.  Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to BLOCKED or RUNNABLE before enqueue.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_worker(struct task_struct *tsk, bool blocked)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, blocked ? &server->blocked_workers_ptr :
+			&server->runnable_workers_ptr, Efault);
+	do {
+		unsafe_put_user(first_ptr, blocked ? &self->blocked_workers_ptr :
+				&self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(blocked ? &server->blocked_workers_ptr :
+				&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/*
+ * Enqueue @tsk on it's server's blocked list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based single linked list add such that list integrity is never
+ * violated.  Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to BLOCKED before enqueue.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_blocked(struct task_struct *tsk)
+{
+	return umcg_enqueue_worker(tsk, true /* blocked */);
+}
+
 /* pre-schedule() */
 void umcg_wq_worker_sleeping(struct task_struct *tsk)
 {
@@ -357,6 +418,9 @@  void umcg_wq_worker_sleeping(struct task_struct *tsk)
 	if (umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED))
 		UMCG_DIE_PF("state");
 
+	if (umcg_enqueue_blocked(tsk))
+		UMCG_DIE_PF("enqueue");
+
 	if (umcg_wake(tsk))
 		UMCG_DIE_PF("wake");
 
@@ -390,29 +454,7 @@  void umcg_wq_worker_running(struct task_struct *tsk)
  */
 static int umcg_enqueue_runnable(struct task_struct *tsk)
 {
-	struct umcg_task __user *server = tsk->umcg_server_task;
-	struct umcg_task __user *self = tsk->umcg_task;
-	u64 self_ptr = (unsigned long)self;
-	u64 first_ptr;
-
-	/*
-	 * umcg_pin_pages() did access_ok() on both pointers, use self here
-	 * only because __user_access_begin() isn't available in generic code.
-	 */
-	if (!user_access_begin(self, sizeof(*self)))
-		return -EFAULT;
-
-	unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
-	do {
-		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
-	} while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
-
-	user_access_end();
-	return 0;
-
-Efault:
-	user_access_end();
-	return -EFAULT;
+	return umcg_enqueue_worker(tsk, false /* !blocked */);
 }
 
 /*
@@ -821,7 +863,7 @@  SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t
 	if (copy_from_user(&ut, self, sizeof(ut)))
 		return -EFAULT;
 
-	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1])
 		return -EINVAL;
 
 	rcu_read_lock();