
[v3,0/3] epoll: introduce round robin wakeup mode

Message ID 20150305000225.GA27592@gmail.com (mailing list archive)
State New, archived

Commit Message

Ingo Molnar March 5, 2015, 12:02 a.m. UTC
* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:
> 
> > 
> > >
> > > I don't really understand the need for rotation/round-robin.  We can
> > > solve the thundering herd via exclusive wakeups, but what is the point
> > > in choosing to wake the task which has been sleeping for the longest
> > > time?  Why is that better than waking the task which has been sleeping
> > > for the *least* time?  That's probably faster as that task's data is
> > > more likely to still be in cache.
> > >
> > > The changelogs talks about "starvation" but they don't really say what
> > > this term means in this context, nor why it is a bad thing.
> > >
> 
> I'm still not getting it.
> 
> > So the idea with the 'rotation' is to try and distribute the
> > workload more evenly across the worker threads.
> 
> Why?
> 
> > We currently
> > tend to wake up the 'head' of the queue over and over and
> > thus the workload for us is not evenly distributed.
> 
> What's wrong with that?
> 
> > In fact, we
> > have a workload where we have to remove all the epoll sets
> > and then re-add them in a different order to improve the situation.
> 
> Why?

So my guess would be (but Jason would know this more precisely) that 
by spreading the workload to more tasks in a FIFO manner, the 
individual tasks can move between CPUs better and fill in available 
CPU bandwidth better, increasing concurrency.

With the current LIFO distribution of wakeups, the 'busiest' threads 
will get many wakeups (potentially from different CPUs), making them 
cache-hot, which may interfere with them easily migrating across CPUs.

So while technically both approaches have similar concurrency, the 
more 'spread out' task hierarchy schedules in a more consistent 
manner.

But ... this is just a wild guess, and even if my description is 
accurate, it should still be backed by robust measurements and 
observations before we extend the ABI.

This hypothesis could be tested with the patch below: if, with the 
patch applied, the performance difference between FIFO and LIFO epoll 
wakeups disappears, then the root cause is the cache-hotness code in 
the scheduler.

Thanks,

	Ingo

---


Comments

Jason Baron March 5, 2015, 3:53 a.m. UTC | #1
On 03/04/2015 07:02 PM, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:
>>
>>>> I don't really understand the need for rotation/round-robin.  We can
>>>> solve the thundering herd via exclusive wakeups, but what is the point
>>>> in choosing to wake the task which has been sleeping for the longest
>>>> time?  Why is that better than waking the task which has been sleeping
>>>> for the *least* time?  That's probably faster as that task's data is
>>>> more likely to still be in cache.
>>>>
>>>> The changelogs talks about "starvation" but they don't really say what
>>>> this term means in this context, nor why it is a bad thing.
>>>>
>> I'm still not getting it.
>>
>>> So the idea with the 'rotation' is to try and distribute the
>>> workload more evenly across the worker threads.
>> Why?
>>
>>> We currently
>>> tend to wake up the 'head' of the queue over and over and
>>> thus the workload for us is not evenly distributed.
>> What's wrong with that?
>>
>>> In fact, we
>>> have a workload where we have to remove all the epoll sets
>>> and then re-add them in a different order to improve the situation.
>> Why?
> So my guess would be (but Jason would know this more precisely) that 
> spreading the workload to more tasks in a FIFO manner, the individual 
> tasks can move between CPUs better, and fill in available CPU 
> bandwidth better, increasing concurrency.
>
> With the current LIFO distribution of wakeups, the 'busiest' threads 
> will get many wakeups (potentially from different CPUs), making them 
> cache-hot, which may interfere with them easily migrating across CPUs.
>
> So while technically both approaches have similar concurrency, the 
> more 'spread out' task hierarchy schedules in a more consistent 
> manner.
>
> But ... this is just a wild guess and even if my description is 
> accurate then it should still be backed by robust measurements and 
> observations, before we extend the ABI.
>
> This hypothesis could be tested by the patch below: with the patch 
> applied if the performance difference between FIFO and LIFO epoll 
> wakeups disappears, then the root cause is the cache-hotness code in 
> the scheduler.
>
>

So what I think you are describing here fits the model where you have
a single epoll fd (returned by epoll_create()), which is then attached
to wakeup fds. So that can be thought of as having a single 'event'
queue (the single epoll fd), where multiple threads are competing to
grab events via epoll_wait(), and things are currently LIFO there, as
you describe.

However, the use-case I was trying to get at is where you have
multiple epoll fds (or event queues), and really just one thread
doing epoll_wait() against each epoll fd. So instead of having all
threads compete for all events, we have divided the events up into
separate queues.

Now, the 'problematic' case is where there is an event source that is
shared among all of these epoll fds - such as a listen socket or a
pipe. There are two distinct issues in this case that this series is
trying to address.

1) All epoll fds will receive a wakeup (and hence so will the threads
that are potentially blocked there, although they may not return to
user-space if the event has already been consumed). I think the test
case I posted shows this pretty clearly -
http://lwn.net/Articles/632590/. The number of context switches
without adding them to the wait queue exclusively is 50x the case
where they are added exclusively. That's a lot of extra cpu usage.

2) We are using the wakeup in this case to 'assign' work more
permanently to the thread. That is, in the case of a listen socket
we then add the connected socket to the woken-up thread's local set
of epoll events. So the load persists past the wakeup. And in this
case, doing the round-robin wakeups simply allows us to access more
cpu bandwidth. (I'm also looking into potentially using cpu affinity
to do the wakeups as well, as you suggested.)

Thanks,

-Jason

Ingo Molnar March 5, 2015, 9:15 a.m. UTC | #2
* Jason Baron <jbaron@akamai.com> wrote:

> 2) We are using the wakeup in this case to 'assign' work more 
> permanently to the thread. That is, in the case of a listen socket 
> we then add the connected socket to the woken up threads local set 
> of epoll events. So the load persists past the wake up. And in this 
> case, doing the round robin wakeups, simply allows us to access more 
> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
> to do the wakeups as well as you suggested.)

So this is the part that I still don't understand.

What difference do LIFO versus FIFO wakeups make to CPU utilization? 
A thread waiting for work is idle, no matter whether it ran most 
recently or least recently.

Once an idle worker thread is woken it will compute its own work, for 
whatever time it needs to, and won't be bothered by epoll again until 
it has finished its work and starts waiting again.

So regardless of the wakeup order it's in principle the same 
bandwidth utilization, modulo caching artifacts [*] and scheduling 
artifacts [**]:

[*]  Caching artifacts: in that sense Andrew's point stands: given 
     multiple equivalent choices it's more beneficial to pick a thread 
     that was most recently used (and is thus most cache-hot - i.e. 
     the current wakeup behavior), versus a thread that was least 
     recently used (and is thus the most cache-cold - i.e. the 
     round-robin wakeup you introduce).

[**] The hack patch I posted in my previous reply.

Thanks,

	Ingo

Jason Baron March 5, 2015, 8:24 p.m. UTC | #3
On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more 
>> permanently to the thread. That is, in the case of a listen socket 
>> we then add the connected socket to the woken up threads local set 
>> of epoll events. So the load persists past the wake up. And in this 
>> case, doing the round robin wakeups, simply allows us to access more 
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
>> to do the wakeups as well as you suggested.)
> So this is the part that I still don't understand.
>
> What difference does LIFO versus FIFO wakeups make to CPU utilization: 
> a thread waiting for work is idle, no matter whether it ran most 
> recently or least recently.
>
> Once an idle worker thread is woken it will compute its own work, for 
> whatever time it needs to, and won't be bothered by epoll again until 
> it finished its work and starts waiting again.
>
> So regardless the wakeup order it's the same principal bandwidth 
> utilization, modulo caching artifacts [*] and modulo scheduling 
> artifacts [**]:

So just adding the wakeup source as 'exclusive' would, I think, give
much of the desired behavior, as you point out. In the first patch
posting I separated 'exclusive' from 'rotate' (where 'rotate' depended
on 'exclusive'), since the idle threads, rather than the busy threads,
will tend to get assigned the new work, and the workload naturally
spreads out (modulo the artifacts you mentioned).

However, I added the 'rotate' because the work I'm assigning via the
wakeup persists past the wakeup point. So without the rotate I might
end up always assigning a lot of work to, say, the first thread if
it's always idle, and then get a large burst of work queued to it at
some later point. The rotate is intended to address this case.

To use some pseudo-code in hopes of clarifying things, each
thread is roughly doing:

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd...);

while(1) {

    epoll_wait(epfd...);
    fd = accept(listen_fd...);
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd...);
    ...do any additional desired fd processing...
}

So since the work persists past the wakeup point (after
the 'fd' has been assigned to the epfd set of the local
thread), I am trying to balance out future load.
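
Fleshed out a bit, the loop above might look roughly like the
following (illustrative sketch only, not our actual application code;
error handling is trimmed and the 64-event batch size is arbitrary):

#include <sys/epoll.h>
#include <sys/socket.h>

static void worker_loop(int listen_fd)
{
    struct epoll_event ev, events[64];
    int epfd = epoll_create1(0);

    /* Each worker owns its own epoll fd and adds the shared listen
     * socket to it, so the listen socket's wakeup picks a worker. */
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int i, n = epoll_wait(epfd, events, 64, -1);

        for (i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* The wakeup is also the work assignment: the
                 * accepted connection stays in this thread's set. */
                int conn = accept(listen_fd, NULL, NULL);

                if (conn < 0)
                    continue;
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                /* ...do any additional desired fd processing... */
            }
        }
    }
}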

This is an issue that current userspace has to address in various
ways. In our case, we periodically remove all the epfds from the
listen socket and then re-add them in a different order. Another
alternative, suggested by Eric, was to have dedicated thread(s) do
the assignment. So these approaches can work to an extent, but they
seem, at least to me, to complicate userspace somewhat. And at least
in our case, they do not provide as good balancing as this approach.
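
As a rough sketch of that remove/re-add workaround (the rebalance()
helper below is hypothetical, purely to illustrate the idea):
periodically delete the shared listen socket from every epoll set and
re-add it in a shifted order, so a different set ends up at the head
of its wait queue each time:

#include <sys/epoll.h>

/* Hypothetical helper, illustrative only: rotate which epoll set sits
 * first on listen_fd's wait queue by re-adding listen_fd in a shifted
 * order. */
static void rebalance(int listen_fd, const int *epfds, int nsets, int shift)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int i;

    for (i = 0; i < nsets; i++)
        epoll_ctl(epfds[i], EPOLL_CTL_DEL, listen_fd, NULL);

    for (i = 0; i < nsets; i++) {
        ev.data.fd = listen_fd;
        epoll_ctl(epfds[(i + shift) % nsets], EPOLL_CTL_ADD,
                  listen_fd, &ev);
    }
}

Since epoll is level-triggered by default, a connection that arrives
while listen_fd is momentarily out of the sets is still reported once
it has been re-added.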

So I am trying to use epoll in a special way to do work assignment.
I think the model is different from the standard waker/wakee model.
To that end, in this v3 version I've attempted to keep all the
changes contained within epoll to reflect that fact.

Thanks,

-Jason


>
> [*]  Caching artifacts: in that sense Andrew's point stands: given 
>      multiple equivalent choices it's more beneficial to pick a thread 
>      that was most recently used (and is thus most cache-hot - i.e. 
>      the current wakeup behavior), versus a thread that was least 
>      recently used (and is thus the most cache-cold - i.e. the 
>      round-robin wakeup you introduce).
>
> [**] The hack patch I posted in my previous reply.
>
> Thanks,
>
> 	Ingo
>

Jason Baron March 7, 2015, 12:35 p.m. UTC | #4
On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more 
>> permanently to the thread. That is, in the case of a listen socket 
>> we then add the connected socket to the woken up threads local set 
>> of epoll events. So the load persists past the wake up. And in this 
>> case, doing the round robin wakeups, simply allows us to access more 
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
>> to do the wakeups as well as you suggested.)
> So this is the part that I still don't understand.

Here's maybe another way to frame this. Epoll sets add a waiter to
the event source's wait queue in a fixed order, at the time the epoll
sets are added (via EPOLL_CTL_ADD). This order does not change,
modulo adds/dels, which are usually uncommon. So if we don't want to
wake all threads when, say, an interrupt occurs at some random point,
we can either:

1) Walk the list, wake up the first epoll set that has idle
threads (queued via epoll_wait()) and return.

or:

2) Walk the list and wake up the first epoll set that has idle
threads, but then 'rotate' or move this epoll set to the tail
of the queue before returning.
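
Roughly, option 2) amounts to something like this (illustrative
sketch only - the structure, field names and omitted locking are made
up for clarity; this is not the code from the series):

#include <linux/list.h>
#include <linux/types.h>
#include <linux/wait.h>

/* Illustrative only: one entry per epoll set waiting on the shared
 * event source. */
struct ep_waiter {
	struct list_head	link;	/* position on the source's list */
	wait_queue_head_t	wq;	/* threads blocked in epoll_wait() */
	bool			has_idle_thread;
};

/* Wake the first epoll set that has an idle thread, then move that
 * set to the tail so the next wakeup from this source picks a
 * different set.  (The real code would hold the appropriate lock
 * around this walk.) */
static void wake_one_and_rotate(struct list_head *waiters)
{
	struct ep_waiter *w;

	list_for_each_entry(w, waiters, link) {
		if (!w->has_idle_thread)
			continue;
		wake_up(&w->wq);
		list_move_tail(&w->link, waiters);
		break;
	}
}

Option 1) is the same walk without the list_move_tail(), which is why
the sets at the head of the list keep winning.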

So because the epoll sets are in a fixed order, there is an extreme
bias toward picking the same epoll sets over and over, regardless of
the order in which threads return to wait via epoll_wait(). So I
think the rotate makes sense for the case where I am trying to assign
work to threads and that work may persist past the wakeup point, and
for cases where the threads can finish all their work before
returning to epoll_wait().

Thanks,

-Jason

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef30470..89af04e946d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5354,7 +5354,7 @@  static int task_hot(struct task_struct *p, struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (p->sched_class != &fair_sched_class)
+	if (1 || p->sched_class != &fair_sched_class)
 		return 0;
 
 	if (unlikely(p->policy == SCHED_IDLE))