From patchwork Tue Feb 17 19:33:42 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jason Baron X-Patchwork-Id: 5841241 Return-Path: X-Original-To: patchwork-linux-fsdevel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 394399F380 for ; Tue, 17 Feb 2015 19:34:16 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 50ACA20166 for ; Tue, 17 Feb 2015 19:34:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 626CB20165 for ; Tue, 17 Feb 2015 19:34:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754424AbbBQTdv (ORCPT ); Tue, 17 Feb 2015 14:33:51 -0500 Received: from prod-mail-xrelay07.akamai.com ([72.246.2.115]:48399 "EHLO prod-mail-xrelay07.akamai.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753212AbbBQTdn (ORCPT ); Tue, 17 Feb 2015 14:33:43 -0500 Received: from prod-mail-xrelay07.akamai.com (localhost.localdomain [127.0.0.1]) by postfix.imss70 (Postfix) with ESMTP id 490B64763B; Tue, 17 Feb 2015 19:33:42 +0000 (GMT) Received: from prod-mail-relay07.akamai.com (prod-mail-relay07.akamai.com [172.17.121.112]) by prod-mail-xrelay07.akamai.com (Postfix) with ESMTP id 313824761C; Tue, 17 Feb 2015 19:33:42 +0000 (GMT) Received: from localhost (bos-lpjec.kendall.corp.akamai.com [172.28.13.37]) by prod-mail-relay07.akamai.com (Postfix) with ESMTP id 2D4E580047; Tue, 17 Feb 2015 19:33:42 +0000 (GMT) To: peterz@infradead.org, mingo@redhat.com, viro@zeniv.linux.org.uk Cc: akpm@linux-foundation.org, normalperson@yhbt.net, davidel@xmailserver.org, mtk.manpages@gmail.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Message-Id: <7956874bfdc7403f37afe8a75e50c24221039bd2.1424200151.git.jbaron@akamai.com> In-Reply-To: References: From: Jason Baron Subject: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Date: Tue, 17 Feb 2015 19:33:42 +0000 (GMT) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, T_RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Epoll file descriptors that are added to a shared wakeup source are always added in a non-exclusive manner. That means that when we have multiple epoll fds attached to a shared wakeup source they are all woken up. This can lead to excessive cpu usage and uneven load distribution. This patch introduces two new 'events' flags that are intended to be used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event source in an exclusive manner such that the minimum number of threads are woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can also be added to the 'events' flag, such that we round robin through the set of waiting threads. An implementation note is that in the epoll wakeup routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful wakeup, only when there are current waiters. The idea is to use this additional heuristic in order minimize wakeup latencies. Signed-off-by: Jason Baron --- fs/eventpoll.c | 25 ++++++++++++++++++++----- include/uapi/linux/eventpoll.h | 6 ++++++ 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index d77f944..382c832 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -92,7 +92,8 @@ */ /* Epoll private bits inside the event mask */ -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET) +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \ + EPOLLEXCLUSIVE | EPOLLROUNDROBIN) /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k unsigned long flags; struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; + int ewake = 0; if ((unsigned long)key & POLLFREE) { ep_pwq_from_wait(wait)->whead = NULL; @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. */ - if (waitqueue_active(&ep->wq)) + if (waitqueue_active(&ep->wq)) { + ewake = 1; wake_up_locked(&ep->wq); + } if (waitqueue_active(&ep->poll_wait)) pwake++; @@ -1078,6 +1082,8 @@ out_unlock: if (pwake) ep_poll_safewake(&ep->poll_wait); + if (epi->event.events & EPOLLROUNDROBIN) + return ewake; return 1; } @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); pwq->whead = whead; pwq->base = epi; - add_wait_queue(whead, &pwq->wait); + if (epi->event.events & EPOLLROUNDROBIN) + add_wait_queue_rr(whead, &pwq->wait); + else if (epi->event.events & EPOLLEXCLUSIVE) + add_wait_queue_exclusive(whead, &pwq->wait); + else + add_wait_queue(whead, &pwq->wait); list_add_tail(&pwq->llink, &epi->pwqlist); epi->nwait++; } else { @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event) { - int error; - int full_check = 0; + int error, full_check = 0, wait_flags = 0; struct fd f, tf; struct eventpoll *ep; struct epitem *epi; @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (f.file == tf.file || !is_file_epoll(f.file)) goto error_tgt_fput; + wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN); + if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) && + ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file)))))) + goto error_tgt_fput; + /* * At this point it is safe to assume that the "private_data" contains * our own data structure. diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index bc81fb2..10260a1 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -26,6 +26,12 @@ #define EPOLL_CTL_DEL 2 #define EPOLL_CTL_MOD 3 +/* Balance wakeups for a shared event source */ +#define EPOLLROUNDROBIN (1 << 27) + +/* Add exclusively */ +#define EPOLLEXCLUSIVE (1 << 28) + /* * Request the handling of system wakeup events so as to prevent system suspends * from happening while those events are being processed.