| Message ID | 20150305000225.GA27592@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
On 03/04/2015 07:02 PM, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:
>>
>>>> I don't really understand the need for rotation/round-robin. We can
>>>> solve the thundering herd via exclusive wakeups, but what is the point
>>>> in choosing to wake the task which has been sleeping for the longest
>>>> time? Why is that better than waking the task which has been sleeping
>>>> for the *least* time? That's probably faster as that task's data is
>>>> more likely to still be in cache.
>>>>
>>>> The changelogs talks about "starvation" but they don't really say what
>>>> this term means in this context, nor why it is a bad thing.
>>>>
>> I'm still not getting it.
>>
>>> So the idea with the 'rotation' is to try and distribute the
>>> workload more evenly across the worker threads.
>>
>> Why?
>>
>>> We currently tend to wake up the 'head' of the queue over and over
>>> and thus the workload for us is not evenly distributed.
>>
>> What's wrong with that?
>>
>>> In fact, we have a workload where we have to remove all the epoll
>>> sets and then re-add them in a different order to improve the
>>> situation.
>>
>> Why?
>
> So my guess would be (but Jason would know this more precisely) that
> spreading the workload to more tasks in a FIFO manner, the individual
> tasks can move between CPUs better, and fill in available CPU
> bandwidth better, increasing concurrency.
>
> With the current LIFO distribution of wakeups, the 'busiest' threads
> will get many wakeups (potentially from different CPUs), making them
> cache-hot, which may interfere with them easily migrating across CPUs.
>
> So while technically both approaches have similar concurrency, the
> more 'spread out' task hierarchy schedules in a more consistent
> manner.
>
> But ... this is just a wild guess and even if my description is
> accurate then it should still be backed by robust measurements and
> observations, before we extend the ABI.
>
> This hypothesis could be tested by the patch below: with the patch
> applied, if the performance difference between FIFO and LIFO epoll
> wakeups disappears, then the root cause is the cache-hotness code in
> the scheduler.

So what I think you are describing here fits the model where you have a
single epoll fd (returned by epoll_create()), which is then attached to
wakeup fds. That can be thought of as having a single 'event' queue (the
single epoll fd), where multiple threads compete to grab events via
epoll_wait(), and things are currently LIFO there as you describe.

However, the use case I was trying to get at is where you have multiple
epoll fds (or event queues), and really just one thread doing
epoll_wait() against a single epoll fd. So instead of having all threads
competing for all events, we have divided the events up into separate
queues. The 'problematic' case is where there is an event source that is
shared among all of these epoll fds - such as a listen socket or a pipe.

There are two distinct issues in this case that this series is trying to
address.

1) All epoll fds will receive a wakeup (and hence the threads that are
potentially blocking there, although they may not return to user-space
if the event has already been consumed). I think the test case I posted
shows this pretty clearly - http://lwn.net/Articles/632590/. The number
of context switches without the exclusive add to the wait queue is 50x
the case where the waiters are added exclusively. That's a lot of extra
cpu usage.

2) We are using the wakeup in this case to 'assign' work more
permanently to the thread. That is, in the case of a listen socket we
then add the connected socket to the woken-up thread's local set of
epoll events. So the load persists past the wakeup. And in this case,
doing the round-robin wakeups simply allows us to access more cpu
bandwidth. (I'm also looking into potentially using cpu affinity to do
the wakeups, as you suggested.)

Thanks,

-Jason
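For reference, here is a minimal sketch of the "multiple epoll fds, one
shared wakeup source" model described above. This is illustrative only,
not the LWN test case: the port number, NR_THREADS, and the absence of
error handling are arbitrary choices. Each thread owns a private epoll
set, but all of them watch the same listen socket, so a single incoming
connection wakes every thread even though only one accept() can succeed.

/*
 * Sketch only: port 8080, NR_THREADS and the lack of error handling
 * are arbitrary.  One connection wakes all NR_THREADS epoll_wait()ers.
 */
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 8

static int listen_fd;

static void *worker(void *arg)
{
	long id = (long)arg;
	struct epoll_event ev = { .events = EPOLLIN }, out;
	int epfd = epoll_create1(0);		/* private epoll set */

	/* Every thread adds the same shared wakeup source. */
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	for (;;) {
		/* All threads block here; one connection wakes them all. */
		if (epoll_wait(epfd, &out, 1, -1) < 1)
			continue;
		printf("thread %ld woke up\n", id);

		int conn = accept(listen_fd, NULL, NULL);
		if (conn < 0)
			continue;		/* another thread won the race */
		close(conn);
	}
	return NULL;
}

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8080),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	pthread_t tid[NR_THREADS];

	listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
	bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(listen_fd, 128);

	for (long i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Connecting to the port once should show all NR_THREADS threads printing
a wakeup while only one accept() succeeds; that is the extra
context-switch cost measured in the test case above, and it is what an
exclusive-style add to the wait queue is meant to avoid.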
* Jason Baron <jbaron@akamai.com> wrote:

> 2) We are using the wakeup in this case to 'assign' work more
> permanently to the thread. That is, in the case of a listen socket
> we then add the connected socket to the woken-up thread's local set
> of epoll events. So the load persists past the wakeup. And in this
> case, doing the round-robin wakeups simply allows us to access more
> cpu bandwidth. (I'm also looking into potentially using cpu affinity
> to do the wakeups, as you suggested.)

So this is the part that I still don't understand.

What difference do LIFO versus FIFO wakeups make to CPU utilization: a
thread waiting for work is idle, no matter whether it ran most recently
or least recently.

Once an idle worker thread is woken it will compute its own work, for
whatever time it needs to, and won't be bothered by epoll again until it
has finished its work and starts waiting again.

So regardless of the wakeup order it's the same principal bandwidth
utilization, modulo caching artifacts [*] and modulo scheduling
artifacts [**]:

[*] Caching artifacts: in that sense Andrew's point stands: given
multiple equivalent choices it's more beneficial to pick a thread that
was most recently used (and is thus most cache-hot - i.e. the current
wakeup behavior), versus a thread that was least recently used (and is
thus the most cache-cold - i.e. the round-robin wakeup you introduce).

[**] The hack patch I posted in my previous reply.

Thanks,

	Ingo
On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more
>> permanently to the thread. That is, in the case of a listen socket
>> we then add the connected socket to the woken-up thread's local set
>> of epoll events. So the load persists past the wakeup. And in this
>> case, doing the round-robin wakeups simply allows us to access more
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity
>> to do the wakeups, as you suggested.)
>
> So this is the part that I still don't understand.
>
> What difference do LIFO versus FIFO wakeups make to CPU utilization:
> a thread waiting for work is idle, no matter whether it ran most
> recently or least recently.
>
> Once an idle worker thread is woken it will compute its own work, for
> whatever time it needs to, and won't be bothered by epoll again until
> it has finished its work and starts waiting again.
>
> So regardless of the wakeup order it's the same principal bandwidth
> utilization, modulo caching artifacts [*] and modulo scheduling
> artifacts [**]:

So just adding the wakeup source as 'exclusive', I think, would give
much of the desired behavior, as you point out. In the first patch
posting I separated 'exclusive' from 'rotate' (where rotate depends on
exclusive), since the idle threads will tend to get assigned the new
work instead of the busy threads, as you point out, and the workload
naturally spreads out (modulo the artifacts you mentioned).

However, I added the 'rotate' because the work I'm assigning via the
wakeup persists past the wakeup point. So without the rotate I might end
up assigning a lot of work to, say, always the first thread if it's
always idle, and then I might get a large burst of work queued to it at
some later point. The rotate is intended to address this case.

To use some pseudo-code in hopes of clarifying things, each thread is
roughly doing:

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd...);

while (1) {
	epoll_wait(epfd...);
	fd = accept(listen_fd...);
	epoll_ctl(epfd, EPOLL_CTL_ADD, fd...);
	...do any additional desired fd processing...
}

So since the work persists past the wakeup point (after the 'fd' has
been assigned to the epfd set of the local thread), I am trying to
balance out future load. This is an issue that current userspace has to
address in various ways. In our case, we periodically remove all epfds
from the listen socket and then re-add them in a different order.
Another alternative, suggested by Eric, was to have dedicated thread(s)
do the assignment. These approaches can work to an extent, but they
seem, at least to me, to complicate userspace somewhat, and at least in
our case they do not provide as good balancing as this approach.

So I am trying to use epoll in a special way to do work assignment. I
think the model is different from the standard waker/wakee model, so to
that end, in this v3 version I've attempted to isolate all the changes
within epoll to reflect that fact.

Thanks,

-Jason

> [*] Caching artifacts: in that sense Andrew's point stands: given
> multiple equivalent choices it's more beneficial to pick a thread
> that was most recently used (and is thus most cache-hot - i.e.
> the current wakeup behavior), versus a thread that was least
> recently used (and is thus the most cache-cold - i.e. the
> round-robin wakeup you introduce).
>
> [**] The hack patch I posted in my previous reply.
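For concreteness, here is one way the pseudo-code above could look as
compilable C. It is an illustrative sketch, not code from the series or
from the application it describes: the thread argument is assumed to be
the shared, non-blocking listen socket (set up as in the earlier
sketch), and MAX_EVENTS and the read-only connection handling are
arbitrary choices.

/*
 * Illustrative rendering of the pseudo-code above; listen_fd is passed
 * in as the thread argument and MAX_EVENTS is an arbitrary choice.
 */
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

static void *worker(void *arg)
{
	int listen_fd = (int)(long)arg;		/* shared listen socket */
	struct epoll_event ev, events[MAX_EVENTS];
	int epfd = epoll_create1(0);		/* this thread's private set */
	char buf[4096];

	ev.events = EPOLLIN;
	ev.data.fd = listen_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	for (;;) {
		int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

		for (int i = 0; i < n; i++) {
			if (events[i].data.fd == listen_fd) {
				/*
				 * The wakeup on the listen socket "assigns"
				 * the new connection to this thread: it joins
				 * this thread's epoll set and is serviced
				 * here from now on, so the load persists past
				 * the wakeup itself.
				 */
				int fd = accept4(listen_fd, NULL, NULL,
						 SOCK_NONBLOCK);
				if (fd < 0)
					continue;	/* another thread raced us */
				ev.events = EPOLLIN;
				ev.data.fd = fd;
				epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
			} else {
				/* ...any additional desired fd processing... */
				ssize_t len = read(events[i].data.fd, buf,
						   sizeof(buf));
				if (len == 0)
					close(events[i].data.fd);
			}
		}
	}
	return NULL;
}

With this structure, whichever thread the kernel chooses to wake for the
listen socket also ends up owning the connection afterwards; that is why
the wakeup policy effectively doubles as the load-balancing policy being
argued about in this sub-thread.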
On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more
>> permanently to the thread. That is, in the case of a listen socket
>> we then add the connected socket to the woken-up thread's local set
>> of epoll events. So the load persists past the wakeup. And in this
>> case, doing the round-robin wakeups simply allows us to access more
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity
>> to do the wakeups, as you suggested.)
>
> So this is the part that I still don't understand.

Here's maybe another way to frame this. Epoll sets add a waiter to the
wait queue in a fixed order, at the time the epoll sets are added (via
EPOLL_CTL_ADD). This order does not change, modulo adds/deletes, which
are usually not common. So if we don't want to wake all threads when,
say, an interrupt occurs at some random point, we can either:

1) Walk the list, wake up the first epoll set that has idle threads
(queued via epoll_wait()), and return.

or:

2) Walk the list and wake up the first epoll set that has idle threads,
but then 'rotate' or move this epoll set to the tail of the queue before
returning.

Because the epoll sets are in a fixed order, there is an extreme bias
toward picking the same epoll sets over and over, regardless of the
order in which threads return to wait via epoll_wait(). So I think the
rotate makes sense for the case where I am trying to assign work that
persists past the wakeup point, and also for cases where the threads can
finish all their work before returning to epoll_wait().

Thanks,

-Jason
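To make that bias concrete, here is a small userspace toy model of the
two options just described. It is not the kernel wait-queue code (the
type names and list handling are invented for illustration); it only
shows that with a fixed-order list and all waiters idle, option 1 keeps
selecting the same waiter while option 2 naturally round-robins.

/*
 * Toy model (userspace, not the kernel wait-queue implementation) of
 * the two wakeup policies described above.  Waiters sit on a list in
 * the fixed order in which the epoll sets were added; "idle" stands
 * for a thread blocked in epoll_wait().
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SETS 4

struct waiter {
	int id;
	bool idle;
	struct waiter *next;
};

/* Option 1: wake the first idle waiter; the list order never changes. */
static struct waiter *wake_first(struct waiter **head)
{
	for (struct waiter *w = *head; w; w = w->next)
		if (w->idle)
			return w;
	return NULL;
}

/* Option 2: wake the first idle waiter, then move it to the tail. */
static struct waiter *wake_and_rotate(struct waiter **head)
{
	struct waiter **pprev = head, *w;

	for (w = *head; w; pprev = &w->next, w = w->next) {
		if (!w->idle)
			continue;
		*pprev = w->next;		/* unlink */
		struct waiter **tail = pprev;
		while (*tail)
			tail = &(*tail)->next;
		*tail = w;			/* relink at the tail */
		w->next = NULL;
		return w;
	}
	return NULL;
}

int main(void)
{
	struct waiter w[NR_SETS], *head = &w[0];

	for (int i = 0; i < NR_SETS; i++) {
		w[i].id = i;
		w[i].idle = true;
		w[i].next = (i + 1 < NR_SETS) ? &w[i + 1] : NULL;
	}

	/* With everyone idle, option 1 picks waiter 0 every time ... */
	for (int i = 0; i < 4; i++)
		printf("wake_first      -> %d\n", wake_first(&head)->id);

	/* ... while option 2 spreads the wakeups round-robin. */
	for (int i = 0; i < 4; i++)
		printf("wake_and_rotate -> %d\n", wake_and_rotate(&head)->id);
	return 0;
}

This matches the case described above where the assigned work persists
past the wakeup: the woken thread is quickly back in epoll_wait() and
looks idle again, so without the rotate the fixed-order walk keeps
handing new connections to the same epoll set.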
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef30470..89af04e946d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5354,7 +5354,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (p->sched_class != &fair_sched_class)
+	if (1 || p->sched_class != &fair_sched_class)
 		return 0;
 
 	if (unlikely(p->policy == SCHED_IDLE))