[1/1] psi: eliminate kthread_worker from psi trigger scheduling mechanism

Message ID 20200528195442.190116-1-surenb@google.com (mailing list archive)
State New, archived

Commit Message

Suren Baghdasaryan May 28, 2020, 7:54 p.m. UTC
Each psi group requires a dedicated kthread_delayed_work and
kthread_worker. Since no other work can be performed using psi_group's
kthread_worker, the same result can be obtained using a task_struct and
a timer directly. This makes psi triggering simpler by removing lists
and locks involved with kthread_worker usage and eliminates the need for
poll_scheduled atomic use in the hot path.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
This patch is meant to address Peter's request in [1] to pull
kthread_queue_delayed_work() out from under rq->lock. This should also address
the lockdep warning about the possibility of a circular dependency described in [2].

[1]: https://lore.kernel.org/lkml/20200428163125.GC16910@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/CAJuCfpG4NkhpQvZjgXZ_3gm6Hf1QgN_eUOQ8iX9Cv1k9whLwSQ@mail.gmail.com
---
 include/linux/psi_types.h |   7 ++-
 kernel/sched/psi.c        | 113 +++++++++++++++++++++-----------------
 2 files changed, 68 insertions(+), 52 deletions(-)

Comments

Peter Zijlstra June 4, 2020, 1:12 p.m. UTC | #1
On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> Each psi group requires a dedicated kthread_delayed_work and
> kthread_worker. Since no other work can be performed using psi_group's
> kthread_worker, the same result can be obtained using a task_struct and
> a timer directly. This makes psi triggering simpler by removing lists
> and locks involved with kthread_worker usage and eliminates the need for
> poll_scheduled atomic use in the hot path.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> This patch is meant to address Peter's request in [1] to pull
> kthread_queue_delayed_work() out from under rq->lock. This should also address
> the lockdep warning about possibility of a circular dependency described in [2]

I think you could've just fixed kthread_queue_delayed_work(), that code
is sub-optimal.

But I suppose this works too.
Suren Baghdasaryan June 4, 2020, 7:20 p.m. UTC | #2
On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > Each psi group requires a dedicated kthread_delayed_work and
> > kthread_worker. Since no other work can be performed using psi_group's
> > kthread_worker, the same result can be obtained using a task_struct and
> > a timer directly. This makes psi triggering simpler by removing lists
> > and locks involved with kthread_worker usage and eliminates the need for
> > poll_scheduled atomic use in the hot path.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > This patch is meant to address Peter's request in [1] to pull
> > kthread_queue_delayed_work() out from under rq->lock. This should also address
> > the lockdep warning about possibility of a circular dependency described in [2]
>
> I think you could've just fixed kthread_queue_delayed_work(), that code
> is sub-optimal.

Ok, let me look into it some more. My understanding was that the
worker->lock in kthread_queue_delayed_work() was needed to synchronize
worker->delayed_work_list access. But maybe I'm missing something... I
assume you are talking about optimizing this beyond what
https://lkml.org/lkml/2020/5/4/1148 was doing?

BTW, any objections to taking https://lkml.org/lkml/2020/5/4/1148?
It's not the ultimate fix, but it is an improvement, since it moves
some of the operations that were unnecessarily under worker->lock out
of it.

>
> But I suppose this works too.

In PSI's case there is always one work for each worker, so the
delayed_work_list and work_list are not needed and therefore I can
replace kthread_worker machinery with a task and a timer.
I think I can simplify this a bit further. For example,
group->poll_wakeup doesn't have to be an atomic. Originally I wanted
to avoid the race where poll_timer_fn sets it, psi_poll_worker resets
it, and a wakeup is missed as a result. However, if psi_poll_worker
resets it before calling psi_poll_work, then there is no harm in
missing a wakeup, because we called psi_poll_work and did the
required work anyway.
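
Something like this in the worker loop is what I have in mind
(untested sketch; poll_wakeup becomes a plain int and the cmpxchg in
the patch below goes away):

	while (true) {
		wait_event_interruptible(group->poll_wait,
				READ_ONCE(group->poll_wakeup) ||
				kthread_should_stop());
		if (kthread_should_stop())
			break;

		/* Reset BEFORE doing the work: a wakeup that races
		 * with this store at worst re-runs psi_poll_work once.
		 */
		WRITE_ONCE(group->poll_wakeup, 0);
		psi_poll_work(group);
	}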

One question about this patch that I'm not sure about and wanted to
ask you, Peter, is whether it's OK to call mod_timer from within a
hotpath (while holding rq->lock). As I described in the additional
comment, there is a window between my timer_pending check and the
call to mod_timer, so mod_timer might be called concurrently from
both psi_poll_work (the psi poll work handler) and psi_task_change
(the hotpath, under rq->lock). I see that mod_timer takes the
base->lock spinlock, and IIUC such a race could block the hotpath and
is therefore unacceptable. If this is true, I'll need to revive the
poll_scheduled atomic to close this race, and then I can change
mod_timer into add_timer.
WDYT? And sorry for my ignorance if this is a trivial question. I'm
not sure about the rules when it comes to rq->locks.
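
If the race is a problem, the fallback I have in mind looks roughly
like the current code, with add_timer instead of mod_timer (untested
sketch):

	static void psi_schedule_poll_work(struct psi_group *group,
					   unsigned long delay)
	{
		struct task_struct *task;

		/* cmpxchg makes us the only scheduler of the timer,
		 * so the hotpath never contends on the timer base
		 * lock */
		if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0)
			return;

		rcu_read_lock();
		task = rcu_dereference(group->poll_task);
		if (likely(task)) {
			group->poll_timer.expires = jiffies + delay;
			add_timer(&group->poll_timer);
		} else {
			atomic_set(&group->poll_scheduled, 0);
		}
		rcu_read_unlock();
	}

(with psi_poll_work clearing poll_scheduled before doing its work, as
the old code did)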

Thanks,
Suren.

Suren Baghdasaryan June 9, 2020, 2:56 a.m. UTC | #3
On Thu, Jun 4, 2020 at 12:20 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > > Each psi group requires a dedicated kthread_delayed_work and
> > > kthread_worker. Since no other work can be performed using psi_group's
> > > kthread_worker, the same result can be obtained using a task_struct and
> > > a timer directly. This makes psi triggering simpler by removing lists
> > > and locks involved with kthread_worker usage and eliminates the need for
> > > poll_scheduled atomic use in the hot path.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > This patch is meant to address Peter's request in [1] to pull
> > > kthread_queue_delayed_work() out from under rq->lock. This should also address
> > > the lockdep warning about possibility of a circular dependency described in [2]
> >
> > I think you could've just fixed kthread_queue_delayed_work(), that code
> > is sub-optimal.

After some more staring into kthread code I think I understand what
Peter's comment meant about delayed_work_list.
worker->delayed_work_list seems to be unnecessary because each
kthread_delayed_work has its own timer which will add the work into
worker->work_list when the time comes. So there is no need to store
the delayed work in an intermediate worker->delayed_work_list.
However I think kthread_destroy_worker() has an issue if it's called
while worker->delayed_work_list is non-empty. The issue is that
kthread_destroy_worker() does not stop all the
kthread_delayed_work->timers scheduled on the
worker->delayed_work_list. So if such a timer fires after a call to
kthread_destroy_worker(), the timer's handler will dereference the
already-destroyed worker.
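
The handler I mean is kthread_delayed_work_timer_fn() (abridged from
kernel/kthread.c; the WARN_ONs and the work->canceling check are
dropped here):

	void kthread_delayed_work_timer_fn(struct timer_list *t)
	{
		struct kthread_delayed_work *dwork =
				from_timer(dwork, t, timer);
		struct kthread_work *work = &dwork->work;
		struct kthread_worker *worker = work->worker;
		unsigned long flags;

		/* use-after-free here if the worker is already gone */
		raw_spin_lock_irqsave(&worker->lock, flags);
		/* move the work from worker->delayed_work_list to
		 * worker->work_list */
		list_del_init(&work->node);
		kthread_insert_work(worker, work, &worker->work_list);
		raw_spin_unlock_irqrestore(&worker->lock, flags);
	}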

If I'm right and this is indeed an issue then I think we do need
worker->delayed_work_list to cancel all the scheduled timers. The
issue can be avoided if we assume that the caller will always call
kthread_cancel_delayed_work_sync() for each delayed_work scheduled on
worker->delayed_work_list before calling kthread_destroy_worker(). If
that's what we expect, I think this expectation should be reflected in
the comments, and a WARN_ON(!list_empty(&worker->delayed_work_list))
should be added in kthread_destroy_worker(). WDYT?
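
Concretely, something like this (current kthread_destroy_worker()
body abridged, proposed line marked):

	void kthread_destroy_worker(struct kthread_worker *worker)
	{
		struct task_struct *task = worker->task;

		if (WARN_ON(!task))
			return;

		kthread_flush_worker(worker);
		kthread_stop(task);
		/* proposed: catch delayed works whose timers are
		 * still live */
		WARN_ON(!list_empty(&worker->delayed_work_list));
		WARN_ON(!list_empty(&worker->work_list));
		kfree(worker);
	}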

>
> Ok, let me look into it some more. My understanding was that the
> worker->lock in kthread_queue_delayed_work() was needed to synchronize
> worker->delayed_work_list access. But maybe I'm missing something... I
> assume you are talking about optimizing this beyond what
> https://lkml.org/lkml/2020/5/4/1148 was doing?
>
> BTW, any objections against taking https://lkml.org/lkml/2020/5/4/1148
> ? It's not the ultimate fix but it is an improvement since it gets
> some of the operations that were unnecessarily under worker->lock out
> of it.
>
> >
> > But I suppose this works too.
>
> In PSI's case there is always one work for each worker, so the
> delayed_work_list and work_list are not needed and therefore I can
> replace kthread_worker machinery with a task and a timer.
> I think I can simplify this a bit further. For example
> group->poll_wakeup doesn't have to be an atomic. Originally I wanted
> to avoid a possibility of a race when poll_timer_fn sets it and
> psi_poll_worker resets it and as a result misses a wakeup, however if
> psi_poll_worker resets it before calling psi_poll_work then there is
> no harm in missing a wakeup because we called psi_poll_work and did
> the required work anyway.
>
> One question about this patch I'm not sure about and wanted to ask you
> Peter is whether it's ok to call mod_timer from within a hotpath
> (while holding rq->lock). As I described in the additional comment,
> there is a possibility of a race between when I check timer_pending
> and the call to mod_timer, so it's possible that mod_timer might be
> called both from psi_poll_work (psi poll work handler) and from
> psi_task_change (hotpath under rq->lock). I see that mod_timer takes
> base->lock spinlock, and IIUC such a race might block the hotpath and
> therefore is unacceptable. If this is true I'll need to revive the
> poll_scheduled atomic to close this race and then I can change
> mod_timer into add_timer.
> WDYT? And sorry for my ignorance if this is a trivial question. I'm
> not sure about the rules when it comes to rq->locks.
>
> Thanks,
> Suren.
>
Suren Baghdasaryan June 16, 2020, 3:39 p.m. UTC | #4
On Mon, Jun 8, 2020 at 7:56 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 4, 2020 at 12:20 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > > > Each psi group requires a dedicated kthread_delayed_work and
> > > > kthread_worker. Since no other work can be performed using psi_group's
> > > > kthread_worker, the same result can be obtained using a task_struct and
> > > > a timer directly. This makes psi triggering simpler by removing lists
> > > > and locks involved with kthread_worker usage and eliminates the need for
> > > > poll_scheduled atomic use in the hot path.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > > This patch is meant to address Peter's request in [1] to pull
> > > > kthread_queue_delayed_work() out from under rq->lock. This should also address
> > > > the lockdep warning about possibility of a circular dependency described in [2]
> > >
> > > I think you could've just fixed kthread_queue_delayed_work(), that code
> > > is sub-optimal.
>
> After some more staring into kthread code I think I understand what
> Peter's comment meant about delayed_work_list.
> worker->delayed_work_list seems to be unnecessary because each
> kthread_delayed_work has its own timer which will add the work into
> worker->work_list when the time comes. So there is no need to store
> the delayed work in an intermediate worker->delayed_work_list.
> However I think kthread_destroy_worker() has an issue if it's called
> while worker->delayed_work_list is non-empty. The issue is that
> kthread_destroy_worker() does not stop all the
> kthread_delayed_work->timers scheduled on the
> worker->delayed_work_list. So if such a timer fires after a call to
> kthread_destroy_worker(), timer's handler will dereference the already
> destroyed worker.
>
> If I'm right and this is indeed an issue then I think we do need
> worker->delayed_work_list to cancel all the scheduled timers. The
> issue can be avoided if we assume that the caller will always call
> kthread_cancel_delayed_work_sync() for each delayed_work scheduled on
> worker->delayed_work_list before calling kthread_destroy_worker(). If
> that's what we expect I think this expectation should be reflected in
> the comments and a WARN_ON(!list_empty(&worker->delayed_work_list)) be
> added in kthread_destroy_worker(). WDYT?
>
> >
> > Ok, let me look into it some more. My understanding was that the
> > worker->lock in kthread_queue_delayed_work() was needed to synchronize
> > worker->delayed_work_list access. But maybe I'm missing something... I
> > assume you are talking about optimizing this beyond what
> > https://lkml.org/lkml/2020/5/4/1148 was doing?
> >
> > BTW, any objections against taking https://lkml.org/lkml/2020/5/4/1148
> > ? It's not the ultimate fix but it is an improvement since it gets
> > some of the operations that were unnecessarily under worker->lock out
> > of it.
> >
> > >
> > > But I suppose this works too.
> >
> > In PSI's case there is always one work for each worker, so the
> > delayed_work_list and work_list are not needed and therefore I can
> > replace kthread_worker machinery with a task and a timer.
> > I think I can simplify this a bit further. For example
> > group->poll_wakeup doesn't have to be an atomic. Originally I wanted
> > to avoid a possibility of a race when poll_timer_fn sets it and
> > psi_poll_worker resets it and as a result misses a wakeup, however if
> > psi_poll_worker resets it before calling psi_poll_work then there is
> > no harm in missing a wakeup because we called psi_poll_work and did
> > the required work anyway.
> >
> > One question about this patch I'm not sure about and wanted to ask you
> > Peter is whether it's ok to call mod_timer from within a hotpath
> > (while holding rq->lock). As I described in the additional comment,
> > there is a possibility of a race between when I check timer_pending
> > and the call to mod_timer, so it's possible that mod_timer might be
> > called both from psi_poll_work (psi poll work handler) and from
> > psi_task_change (hotpath under rq->lock). I see that mod_timer takes
> > base->lock spinlock, and IIUC such a race might block the hotpath and
> > therefore is unacceptable. If this is true I'll need to revive the
> > poll_scheduled atomic to close this race and then I can change
> > mod_timer into add_timer.
> > WDYT? And sorry for my ignorance if this is a trivial question. I'm
> > not sure about the rules when it comes to rq->locks.

Thanks for taking this patch, Peter. I just wanted to double-check
whether you considered the race I mentioned in the above paragraph and
decided it's a non-issue. If it is an issue, I can send a small
follow-up patch to close the race, or I can send a new version of the
whole patch with the fix if that's preferable. Please LMK.

> >
> > Thanks,
> > Suren.
> >

Patch

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4b7258495a04..b95f3211566a 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -153,9 +153,10 @@  struct psi_group {
 	unsigned long avg[NR_PSI_STATES - 1][3];
 
 	/* Monitor work control */
-	atomic_t poll_scheduled;
-	struct kthread_worker __rcu *poll_kworker;
-	struct kthread_delayed_work poll_work;
+	struct task_struct __rcu *poll_task;
+	struct timer_list poll_timer;
+	wait_queue_head_t poll_wait;
+	atomic_t poll_wakeup;
 
 	/* Protects data used by the monitor */
 	struct mutex trigger_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8f45cdb6463b..e53b711bd643 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -190,7 +190,6 @@  static void group_init(struct psi_group *group)
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
-	atomic_set(&group->poll_scheduled, 0);
 	mutex_init(&group->trigger_lock);
 	INIT_LIST_HEAD(&group->triggers);
 	memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
@@ -199,7 +198,7 @@  static void group_init(struct psi_group *group)
 	memset(group->polling_total, 0, sizeof(group->polling_total));
 	group->polling_next_update = ULLONG_MAX;
 	group->polling_until = 0;
-	rcu_assign_pointer(group->poll_kworker, NULL);
+	rcu_assign_pointer(group->poll_task, NULL);
 }
 
 void __init psi_init(void)
@@ -547,47 +546,38 @@  static u64 update_triggers(struct psi_group *group, u64 now)
 	return now + group->poll_min_period;
 }
 
-/*
- * Schedule polling if it's not already scheduled. It's safe to call even from
- * hotpath because even though kthread_queue_delayed_work takes worker->lock
- * spinlock that spinlock is never contended due to poll_scheduled atomic
- * preventing such competition.
- */
+/* Schedule polling if it's not already scheduled. */
 static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
 {
-	struct kthread_worker *kworker;
+	struct task_struct *task;
 
-	/* Do not reschedule if already scheduled */
-	if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0)
+	/*
+	 * Do not reschedule if already scheduled.
+	 * Possible race with a timer scheduled after this check but before
+	 * mod_timer below can be tolerated because group->polling_next_update
+	 * will keep updates on schedule.
+	 */
+	if (timer_pending(&group->poll_timer))
 		return;
 
 	rcu_read_lock();
 
-	kworker = rcu_dereference(group->poll_kworker);
+	task = rcu_dereference(group->poll_task);
 	/*
-	 * kworker might be NULL in case psi_trigger_destroy races with
+	 * task might be NULL in case psi_trigger_destroy races with
 	 * psi_task_change (hotpath) which can't use locks
 	 */
-	if (likely(kworker))
-		kthread_queue_delayed_work(kworker, &group->poll_work, delay);
-	else
-		atomic_set(&group->poll_scheduled, 0);
+	if (likely(task))
+		mod_timer(&group->poll_timer, jiffies + delay);
 
 	rcu_read_unlock();
 }
 
-static void psi_poll_work(struct kthread_work *work)
+static void psi_poll_work(struct psi_group *group)
 {
-	struct kthread_delayed_work *dwork;
-	struct psi_group *group;
 	u32 changed_states;
 	u64 now;
 
-	dwork = container_of(work, struct kthread_delayed_work, work);
-	group = container_of(dwork, struct psi_group, poll_work);
-
-	atomic_set(&group->poll_scheduled, 0);
-
 	mutex_lock(&group->trigger_lock);
 
 	now = sched_clock();
@@ -623,6 +613,35 @@  static void psi_poll_work(struct kthread_work *work)
 	mutex_unlock(&group->trigger_lock);
 }
 
+static int psi_poll_worker(void *data)
+{
+	struct psi_group *group = (struct psi_group *)data;
+	struct sched_param param = {
+		.sched_priority = 1,
+	};
+
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
+
+	while (true) {
+		wait_event_interruptible(group->poll_wait,
+				atomic_cmpxchg(&group->poll_wakeup, 1, 0) ||
+				kthread_should_stop());
+		if (kthread_should_stop())
+			break;
+
+		psi_poll_work(group);
+	}
+	return 0;
+}
+
+static void poll_timer_fn(struct timer_list *t)
+{
+	struct psi_group *group = from_timer(group, t, poll_timer);
+
+	atomic_set(&group->poll_wakeup, 1);
+	wake_up_interruptible(&group->poll_wait);
+}
+
 static void record_times(struct psi_group_cpu *groupc, int cpu,
 			 bool memstall_tick)
 {
@@ -1099,22 +1118,20 @@  struct psi_trigger *psi_trigger_create(struct psi_group *group,
 
 	mutex_lock(&group->trigger_lock);
 
-	if (!rcu_access_pointer(group->poll_kworker)) {
-		struct sched_param param = {
-			.sched_priority = 1,
-		};
-		struct kthread_worker *kworker;
+	if (!rcu_access_pointer(group->poll_task)) {
+		struct task_struct *task;
 
-		kworker = kthread_create_worker(0, "psimon");
-		if (IS_ERR(kworker)) {
+		task = kthread_create(psi_poll_worker, group, "psimon");
+		if (IS_ERR(task)) {
 			kfree(t);
 			mutex_unlock(&group->trigger_lock);
-			return ERR_CAST(kworker);
+			return ERR_CAST(task);
 		}
-		sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
-		kthread_init_delayed_work(&group->poll_work,
-				psi_poll_work);
-		rcu_assign_pointer(group->poll_kworker, kworker);
+		atomic_set(&group->poll_wakeup, 0);
+		init_waitqueue_head(&group->poll_wait);
+		wake_up_process(task);
+		timer_setup(&group->poll_timer, poll_timer_fn, 0);
+		rcu_assign_pointer(group->poll_task, task);
 	}
 
 	list_add(&t->node, &group->triggers);
@@ -1132,7 +1149,7 @@  static void psi_trigger_destroy(struct kref *ref)
 {
 	struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount);
 	struct psi_group *group = t->group;
-	struct kthread_worker *kworker_to_destroy = NULL;
+	struct task_struct *task_to_destroy = NULL;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1158,13 +1175,13 @@  static void psi_trigger_destroy(struct kref *ref)
 			period = min(period, div_u64(tmp->win.size,
 					UPDATES_PER_WINDOW));
 		group->poll_min_period = period;
-		/* Destroy poll_kworker when the last trigger is destroyed */
+		/* Destroy poll_task when the last trigger is destroyed */
 		if (group->poll_states == 0) {
 			group->polling_until = 0;
-			kworker_to_destroy = rcu_dereference_protected(
-					group->poll_kworker,
+			task_to_destroy = rcu_dereference_protected(
+					group->poll_task,
 					lockdep_is_held(&group->trigger_lock));
-			rcu_assign_pointer(group->poll_kworker, NULL);
+			rcu_assign_pointer(group->poll_task, NULL);
 		}
 	}
 
@@ -1172,25 +1189,23 @@  static void psi_trigger_destroy(struct kref *ref)
 
 	/*
 	 * Wait for both *trigger_ptr from psi_trigger_replace and
-	 * poll_kworker RCUs to complete their read-side critical sections
-	 * before destroying the trigger and optionally the poll_kworker
+	 * poll_task RCUs to complete their read-side critical sections
+	 * before destroying the trigger and optionally the poll_task
 	 */
 	synchronize_rcu();
 	/*
-	 * Destroy the kworker after releasing trigger_lock to prevent a
+	 * Destroy the poll_task after releasing trigger_lock to prevent a
 	 * deadlock while waiting for psi_poll_work to acquire trigger_lock
 	 */
-	if (kworker_to_destroy) {
+	if (task_to_destroy) {
 		/*
 		 * After the RCU grace period has expired, the worker
-		 * can no longer be found through group->poll_kworker.
+		 * can no longer be found through group->poll_task.
 		 * But it might have been already scheduled before
 		 * that - deschedule it cleanly before destroying it.
 		 */
-		kthread_cancel_delayed_work_sync(&group->poll_work);
-		atomic_set(&group->poll_scheduled, 0);
-
-		kthread_destroy_worker(kworker_to_destroy);
+		del_timer_sync(&group->poll_timer);
+		kthread_stop(task_to_destroy);
 	}
 	kfree(t);
 }