From patchwork Wed Jan 9 16:40:14 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754527
From: Roman Penyaev
Subject: [RFC PATCH 04/15] epoll: move private helpers from a header to the source
Date: Wed, 9 Jan 2019 17:40:14 +0100
Message-Id: <20190109164025.24554-5-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>
References: <20190109164025.24554-1-rpenyaev@suse.de>
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

These helpers will access the private eventpoll structure in future
patches, so keep them close to their callers. No functional change here.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c                 | 13 +++++++++++++
 include/uapi/linux/eventpoll.h | 12 ------------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4a0e98d87fcc..2cc183e86a29 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -466,6 +466,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+#ifdef CONFIG_PM_SLEEP
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
+		epev->events &= ~EPOLLWAKEUP;
+}
+#else
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	epev->events &= ~EPOLLWAKEUP;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 /**
  * ep_call_nested - Perform a bound (possibly) nested call, by checking
  * that the recursion limit is not exceeded, and that
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..39dfc29f0f52 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,16 +79,4 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
-#ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
-}
-#else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	epev->events &= ~EPOLLWAKEUP;
-}
-#endif
 #endif /* _UAPI_LINUX_EVENTPOLL_H */
From patchwork Wed Jan 9 16:40:15 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754523
From: Roman Penyaev
Subject: [RFC PATCH 05/15] epoll: introduce user header structure and user index for polling from userspace
Date: Wed, 9 Jan 2019 17:40:15 +0100
Message-Id: <20190109164025.24554-6-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

This patch introduces the main user structures: the user header and the
user index.

The header describes the current state of the epoll, the head and tail of
the index ring, and holds the epoll items at the end of the structure. The
index table is a ring controlled by the head and tail fields of the user
header; the ring consists of u32 indices pointing to items in the header
which have become ready for polling.

Userspace has to call epoll_create1(EPOLL_USERPOLL) in order to start
polling from the user side.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c                 | 107 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/eventpoll.h |   3 +-
 2 files changed, 106 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2cc183e86a29..9ec682b6488f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -178,6 +178,42 @@ struct epitem {
 	struct epoll_event event;
 };
 
+#define EPOLL_USER_HEADER_SIZE  128
+#define EPOLL_USER_HEADER_MAGIC 0xeb01eb01
+
+enum {
+	EPOLL_USER_POLL_INACTIVE = 0, /* user poll deactivated */
+	EPOLL_USER_POLL_ACTIVE   = 1, /* user can continue busy polling */
+
+	/*
+	 * Always keep some slots ahead to be able to consume new events
+	 * from many threads, i.e. if N threads consume ring from userspace,
+	 * we have to keep N free slots ahead to avoid ring overlap.
+	 *
+	 * Probably this number should be reported to userspace in header.
+	 */
+	EPOLL_USER_EXTRA_INDEX_NR = 16 /* how many extra indices to keep in ring */
+};
+
+struct user_epitem {
+	unsigned int ready_events;
+	struct epoll_event event;
+};
+
+struct user_header {
+	unsigned int magic;         /* epoll user header magic */
+	unsigned int state;         /* epoll ring state */
+	unsigned int header_length; /* length of the header + items */
+	unsigned int index_length;  /* length of the index ring */
+	unsigned int max_items_nr;  /* max num of items slots */
+	unsigned int max_index_nr;  /* max num of items indices, always pow2 */
+	unsigned int head;          /* updated by userland */
+	unsigned int tail;          /* updated by kernel */
+	unsigned int padding[24];   /* Header size is 128 bytes */
+
+	struct user_epitem items[];
+};
+
 /*
  * This structure is stored inside the "private_data" member of the file
  * structure and represents the main data structure for the eventpoll
@@ -222,6 +258,36 @@ struct eventpoll {
 	struct file *file;
 
+	/* User header with array of items */
+	struct user_header *user_header;
+
+	/* User index, which acts as a ring of coming events */
+	unsigned int *user_index;
+
+	/* Actual length of user header, always aligned on page */
+	unsigned int header_length;
+
+	/* Actual length of user index, always aligned on page */
+	unsigned int index_length;
+
+	/* Number of event items */
+	unsigned int items_nr;
+
+	/* Items bitmap, is used to get a free bit for new registered epi */
+	unsigned long *items_bm;
+
+	/* Removed items bitmap, is used to postpone bit put */
+	unsigned long *removed_items_bm;
+
+	/* Length of both items bitmaps, always aligned on page */
+	unsigned int items_bm_length;
+
+	/*
+	 * Where events are routed: to kernel lists or to user ring.
+	 * Always false for epfd created without EPOLL_USERPOLL.
+	 */
+	bool events_to_uring;
+
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
@@ -876,6 +942,10 @@ static void ep_free(struct eventpoll *ep)
 	mutex_destroy(&ep->mtx);
 	free_uid(ep->user);
 	wakeup_source_unregister(ep->ws);
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	vfree(ep->removed_items_bm);
 	kfree(ep);
 }
@@ -1028,7 +1098,7 @@ void eventpoll_release_file(struct file *file)
 	mutex_unlock(&epmutex);
 }
 
-static int ep_alloc(struct eventpoll **pep)
+static int ep_alloc(struct eventpoll **pep, int flags)
 {
 	int error;
 	struct user_struct *user;
@@ -1040,6 +1110,31 @@ static int ep_alloc(struct eventpoll **pep, int flags)
 	if (unlikely(!ep))
 		goto free_uid;
 
+	if (flags & EPOLL_USERPOLL) {
+		ep->user_header = vmalloc_user(PAGE_SIZE);
+		ep->user_index = vmalloc_user(PAGE_SIZE);
+		ep->items_bm = vzalloc(PAGE_SIZE);
+		ep->removed_items_bm = vzalloc(PAGE_SIZE);
+		ep->events_to_uring = true;
+		if (!ep->user_header || !ep->user_index)
+			goto free_ep;
+		if (!ep->items_bm || !ep->removed_items_bm)
+			goto free_ep;
+
+		ep->header_length = PAGE_SIZE;
+		ep->index_length = PAGE_SIZE;
+		ep->items_bm_length = PAGE_SIZE;
+
+		*ep->user_header = (typeof(*ep->user_header)) {
+			.magic = EPOLL_USER_HEADER_MAGIC,
+			.state = EPOLL_USER_POLL_ACTIVE,
+			.header_length = ep->header_length,
+			.index_length = ep->index_length,
+			.max_items_nr = ep_max_items_nr(ep),
+			.max_index_nr = ep_max_index_nr(ep),
+		};
+	}
+
 	mutex_init(&ep->mtx);
 	rwlock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
@@ -1053,6 +1148,12 @@ static int ep_alloc(struct eventpoll **pep, int flags)
 
 	return 0;
 
+free_ep:
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	vfree(ep->removed_items_bm);
+	kfree(ep);
 free_uid:
 	free_uid(user);
 	return error;
@@ -2066,12 +2167,12 @@ static int do_epoll_create(int flags)
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
 
-	if (flags & ~EPOLL_CLOEXEC)
+	if (flags & ~(EPOLL_CLOEXEC | EPOLL_USERPOLL))
 		return -EINVAL;
 	/*
 	 * Create the internal data structure ("struct eventpoll").
 	 */
-	error = ep_alloc(&ep);
+	error = ep_alloc(&ep, flags);
 	if (error < 0)
 		return error;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 39dfc29f0f52..b0a565f6c6c3 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -20,7 +20,8 @@
 #include
 
 /* Flags for epoll_create1.  */
-#define EPOLL_CLOEXEC  O_CLOEXEC
+#define EPOLL_CLOEXEC  O_CLOEXEC
+#define EPOLL_USERPOLL 1
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
From patchwork Wed Jan 9 16:40:16 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754529
From: Roman Penyaev
Subject: [RFC PATCH 06/15] epoll: introduce various helpers for user structure length calculations
Date: Wed, 9 Jan 2019 17:40:16 +0100
Message-Id: <20190109164025.24554-7-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

Helpers for length-from-number and number-from-length calculations.
Among them:

ep_polled_by_user()             - returns true if epoll was created with
                                  EPOLL_USERPOLL
ep_user_ring_events_available() - returns true if there is something in
                                  the user ring buffer

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9ec682b6488f..ae288f62aa4c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -438,6 +438,65 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
 	spin_lock_init(&ncalls->lock);
 }
 
+static inline unsigned int to_items_length(unsigned int nr)
+{
+	struct eventpoll *ep;
+
+	return (sizeof(*ep->user_header) +
+		(nr << ilog2(sizeof(ep->user_header->items[0]))));
+}
+
+static inline unsigned int to_index_length(unsigned int nr)
+{
+	struct eventpoll *ep;
+
+	return nr << ilog2(sizeof(*ep->user_index));
+}
+
+static inline unsigned int to_items_bm_length(unsigned int nr)
+{
+	return ALIGN(nr, 8) >> 3;
+}
+
+static inline unsigned int to_items_nr(unsigned int len)
+{
+	struct eventpoll *ep;
+
+	return (len - sizeof(*ep->user_header)) >>
+		ilog2(sizeof(ep->user_header->items[0]));
+}
+
+static inline unsigned int to_items_bm_nr(unsigned int len)
+{
+	return len << 3;
+}
+
+static inline unsigned int ep_max_items_nr(struct eventpoll *ep)
+{
+	return to_items_nr(ep->header_length);
+}
+
+static inline unsigned int ep_max_index_nr(struct eventpoll *ep)
+{
+	return ep->index_length >> ilog2(sizeof(*ep->user_index));
+}
+
+static inline unsigned int ep_max_items_bm_nr(struct eventpoll *ep)
+{
+	return to_items_bm_nr(ep->items_bm_length);
+}
+
+static inline bool ep_polled_by_user(struct eventpoll *ep)
+{
+	return !!ep->user_header;
+}
+
+static inline bool ep_user_ring_events_available(struct eventpoll *ep)
+{
+	return ep_polled_by_user(ep) &&
+		ep->user_header->head != ep->user_header->tail;
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
From patchwork Wed Jan 9 16:40:17 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754525
From: Roman Penyaev
Subject: [RFC PATCH 07/15] epoll: extend epitem struct with new members for polling from userspace
Date: Wed, 9 Jan 2019 17:40:17 +0100
Message-Id: <20190109164025.24554-8-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

New members are added to struct epitem:

->bit           every epitem has an element inside the user items array;
                this bit is actually the index position in that array and
                also a bit inside ep->items_bm

->ready_events  events received in the period when the descriptor can't be
                polled from userspace and ep->rdllist is used for keeping
                the list of ready items

->work          work for offloading polling from task context if epfd is
                polled from userspace but the driver does not provide
                pollflags on wakeup

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 21 +++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ae288f62aa4c..637b463587c1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -9,6 +9,8 @@
  *
  *  Davide Libenzi
  *
+ *  Polling from userspace support by Roman Penyaev
+ *  (C) Copyright 2019 SUSE, All Rights Reserved
  */
 
 #include
@@ -42,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 
 /*
@@ -176,6 +179,18 @@ struct epitem {
 
 	/* The structure that describe the interested events and the source fd */
 	struct epoll_event event;
+
+	/* Bit in user bitmap for user polling */
+	unsigned int bit;
+
+	/*
+	 * Collect ready events for the period when descriptor is polled by user
+	 * but events are routed to klists.
+	 */
+	__poll_t ready_events;
+
+	/* Work for offloading event callback */
+	struct work_struct work;
 };
 
 #define EPOLL_USER_HEADER_SIZE 128
@@ -2557,12 +2572,6 @@ static int __init eventpoll_init(void)
 	ep_nested_calls_init(&poll_safewake_ncalls);
 #endif
 
-	/*
-	 * We can have many thousands of epitems, so prevent this from
-	 * using an extra cache line on 64-bit (and smaller) CPUs
-	 */
-	BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
-
 	/* Allocates slab cache used to allocate "struct epitem" items */
 	epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
 		0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
From patchwork Wed Jan 9 16:40:18 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754519
From: Roman Penyaev
Subject: [RFC PATCH 08/15] epoll: some sanity flags checks for epoll syscalls for polled epfd from userspace
Date: Wed, 9 Jan 2019 17:40:18 +0100
Message-Id: <20190109164025.24554-9-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

There are various limitations if epfd is polled by user:

1. Always expect the EPOLLET flag (edge-triggered behavior).

2. No support for EPOLLWAKEUP: events are consumed from userspace, thus
   there is no way to call __pm_relax().

3. No support for EPOLLEXCLUSIVE: if a device does not pass pollflags to
   wake_up(), there is no way to call poll() from the context under
   spinlock, thus special work is scheduled to offload polling. In this
   specific case we can't support exclusive wakeups, because we do not
   know the actual result of the scheduled work.

4. No support for nesting of epoll descriptors polled from userspace:
   there is no real good reason to scan ready events of the user ring
   from the kernel, so just do not do that.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 78 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 22 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 637b463587c1..bdaec59a847e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -607,13 +607,17 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 #ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
+	if (epev->events & EPOLLWAKEUP) {
+		if (!capable(CAP_BLOCK_SUSPEND) || ep_polled_by_user(ep))
+			epev->events &= ~EPOLLWAKEUP;
+	}
 }
 #else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
 	epev->events &= ~EPOLLWAKEUP;
 }
@@ -1054,6 +1058,7 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 		return vfs_poll(epi->ffd.file, pt) & epi->event.events;
 
 	ep = epi->ffd.file->private_data;
+	WARN_ON(ep_polled_by_user(ep));
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
 	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
@@ -1094,6 +1099,13 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	struct eventpoll *ep = file->private_data;
 	int depth = 0;
 
+	if (ep_polled_by_user(ep))
+		/*
+		 * We do not support polling of descriptor which is polled
+		 * by user.
+		 */
+		return 0;
+
 	/* Insert inside our poll wait queue */
 	poll_wait(file, &ep->poll_wait, wait);
@@ -2324,10 +2336,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (!file_can_poll(tf.file))
 		goto error_tgt_fput;
 
-	/* Check if EPOLLWAKEUP is allowed */
-	if (ep_op_has_event(op))
-		ep_take_care_of_epollwakeup(&epds);
-
 	/*
 	 * We have to check that the file structure underneath the file descriptor
 	 * the user passed to us _is_ an eventpoll file. And also we do not permit
@@ -2337,10 +2345,25 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	/*
+	 * Do not support scanning of ready events of epoll, which is pollable
+	 * by userspace.
+	 */
+	if (is_file_epoll(tf.file) && ep_polled_by_user(tf.file->private_data))
+		goto error_tgt_fput;
+
+	/*
+	 * At this point it is safe to assume that the "private_data" contains
+	 * our own data structure.
+	 */
+	ep = f.file->private_data;
+
 	/*
 	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
-	 * Also, we do not currently supported nested exclusive wakeups.
+	 * Also, we do not currently supported nested exclusive wakeups
+	 * and EPOLLEXCLUSIVE is not supported for epoll which is polled
+	 * from userspace.
 	 */
 	if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
 		if (op == EPOLL_CTL_MOD)
@@ -2348,13 +2371,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			goto error_tgt_fput;
 		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
 				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
 			goto error_tgt_fput;
+		if (ep_polled_by_user(ep))
+			goto error_tgt_fput;
 	}
 
-	/*
-	 * At this point it is safe to assume that the "private_data" contains
-	 * our own data structure.
-	 */
-	ep = f.file->private_data;
+	if (ep_op_has_event(op)) {
+		if (ep_polled_by_user(ep) && !(epds.events & EPOLLET))
+			/* Polled by user has only edge triggered behaviour */
+			goto error_tgt_fput;
+
+		/* Check if EPOLLWAKEUP is allowed */
+		ep_take_care_of_epollwakeup(ep, &epds);
+	}
 
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
@@ -2456,14 +2484,6 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	struct fd f;
 	struct eventpoll *ep;
 
-	/* The maximum number of event must be greater than zero */
-	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
-		return -EINVAL;
-
-	/* Verify that the area passed by the user is writeable */
-	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
-		return -EFAULT;
-
 	/* Get the "struct file *" for the eventpoll file */
 	f = fdget(epfd);
 	if (!f.file)
@@ -2482,6 +2502,20 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	 * our own data structure.
 	 */
 	ep = f.file->private_data;
+	if (!ep_polled_by_user(ep)) {
+		/* The maximum number of event must be greater than zero */
+		if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
+			goto error_fput;
+
+		/* Verify that the area passed by the user is writeable */
+		error = -EFAULT;
+		if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
+			goto error_fput;
+	} else {
+		/* Use ring instead */
+		if (maxevents != 0 || events != NULL)
+			goto error_fput;
+	}
 
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
From patchwork Wed Jan 9 16:40:19 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754511
From: Roman Penyaev
Subject: [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling from userspace
Date: Wed, 9 Jan 2019 17:40:19 +0100
Message-Id: <20190109164025.24554-10-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

ep_vrealloc*()               - reallocate user header, user index or
                               bitmap memory

ep_get_bit()                 - gets a free bit from the bitmap; if a free
                               bit is not found, the bitmap is expanded by
                               PAGE_SIZE

ep_expand_user_is_required() - helper which returns true if an expand of
                               one of the memory chunks is required

ep_shrink_user_is_required() - helper which returns the new size if a
                               shrink of one of the memory chunks is
                               required

ep_expand_user_*()           - expand user header or user index

ep_shrink_user_*()           - shrink user header, user index or bitmaps.
                               An important part of shrinking is moving
                               the sparse bits at the end of the bitmap to
                               its beginning, in order to free the pages
                               at the end.

ep_route_events_to_*()       - route events to klists or to the uring.
                               Should be called under the write lock, when
                               all events are stopped.

ep_free_user_item()          - marks an item inside the user pointer as
                               freed, i.e. atomically exchanges
                               ready_events to 0. Also puts the item bit,
                               or postpones that to the period when the
                               user comes back to the kernel.

ep_add_event_to_uring()      - adds a new event to the user ring. First
                               marks the user item as ready, and if the
                               item was observed as not ready, fills in
                               the user index.

ep_transfer_events_and_shrunk_uring() - shrinks the uring if needed and
                               transfers events from klists to the uring
                               under the write lock.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 420 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 420 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index bdaec59a847e..36c451c26681 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -929,6 +929,238 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static int ep_vrealloc(void **pptr, unsigned int size)
+{
+	void *old = *pptr, *new;
+
+	new = vrealloc(old, size);
+	if (unlikely(!new))
+		return -ENOMEM;
+	if (unlikely(new == old))
+		return 0;
+
+	*pptr = new;
+	vfree(old);
+
+	return 0;
+}
+
+static int ep_vrealloc_bm(struct eventpoll *ep, unsigned int bm_len)
+{
+	unsigned long *bm, *removed_bm;
+
+	/* Reallocate all at once */
+	bm = vrealloc(ep->items_bm, bm_len);
+	removed_bm = vrealloc(ep->removed_items_bm, bm_len);
+
+	if (unlikely(!bm || !removed_bm)) {
+		vfree(bm);
+		vfree(removed_bm);
+
+		return -ENOMEM;
+	}
+	ep->items_bm = bm;
+	ep->removed_items_bm = removed_bm;
+	ep->items_bm_length = bm_len;
+
+	return 0;
+}
+
+static int ep_get_bit(struct eventpoll *ep)
+{
+	unsigned int max_nr;
+	int bit, start_bit;
+	bool was_set;
+
+	lockdep_assert_held(&ep->mtx);
+
+	start_bit = 0;
+again:
+	max_nr = ep_max_items_bm_nr(ep);
+	bit = find_next_zero_bit(ep->items_bm, max_nr, start_bit);
+	if (bit >= max_nr) {
+		unsigned int bm_len;
+		int rc;
+
+		start_bit = max_nr;
+		bm_len = ep->items_bm_length + PAGE_SIZE;
+
+		rc = ep_vrealloc_bm(ep, bm_len);
+		if (unlikely(rc))
+			return rc;
+
+		goto again;
+	}
+
+	was_set = test_and_set_bit(bit, ep->items_bm);
+	WARN_ON(was_set);
+
+	return bit;
+}
+
+static inline bool ep_expand_user_items_is_required(struct eventpoll *ep)
+{
+	return (ep->items_nr >= ep_max_items_nr(ep));
+}
+
+static inline bool ep_expand_user_index_is_required(struct eventpoll *ep)
+{
+	return (ep->items_nr + EPOLL_USER_EXTRA_INDEX_NR)
+		>= ep_max_index_nr(ep);
+}
+
+static inline bool ep_expand_user_is_required(struct eventpoll *ep)
+{
+	return ep_expand_user_items_is_required(ep) ||
+		ep_expand_user_index_is_required(ep);
+}
+
+static inline unsigned int ep_shrunk_user_index_length(struct eventpoll *ep)
+{
+	unsigned int len, nr;
+
+	nr = ep->items_nr + EPOLL_USER_EXTRA_INDEX_NR;
+	len = PAGE_ALIGN(to_index_length(nr) + (PAGE_SIZE >> 1));
+	if (len < ep->index_length)
+		return len;
+
+	return 0;
+}
+
+static inline unsigned int ep_shrunk_user_items_length(struct eventpoll *ep)
+{
+	unsigned int len;
+
+	len = PAGE_ALIGN(to_items_length(ep->items_nr) + (PAGE_SIZE >> 1));
+	if (len < ep->header_length)
+		return len;
+
+	return 0;
+}
+
+static inline unsigned int ep_shrunk_items_bm_length(struct eventpoll *ep)
+{
+	unsigned int len;
+
+	len = PAGE_ALIGN(to_items_bm_length(ep->items_nr) + (PAGE_SIZE >> 1));
+	if (len < ep->items_bm_length)
+		return len;
+
+	return 0;
+}
+
+static inline bool ep_shrink_user_is_required(struct eventpoll *ep)
+{
+	return ep_shrunk_user_items_length(ep) != 0 ||
+		ep_shrunk_user_index_length(ep) != 0 ||
+		ep_shrunk_items_bm_length(ep) != 0;
+}
+
+static inline void ep_route_events_to_klists(struct eventpoll *ep)
+{
+	WARN_ON(!ep_polled_by_user(ep));
+	ep->events_to_uring = false;
+	ep->user_header->state = EPOLL_USER_POLL_INACTIVE;
+	/* Make sure userspace sees INACTIVE state ASAP */
+	smp_wmb();
+}
+
+static inline void ep_route_events_to_uring(struct eventpoll *ep)
+{
+	WARN_ON(!ep_polled_by_user(ep));
+	ep->events_to_uring = true;
+	/* Commit all previous writes to user header */
+	smp_wmb();
+	ep->user_header->state = EPOLL_USER_POLL_ACTIVE;
+}
+
+static inline bool ep_events_routed_to_klists(struct eventpoll *ep)
+{
+	return !ep->events_to_uring;
+}
+
+static inline bool ep_events_routed_to_uring(struct eventpoll *ep)
+{
+	return ep->events_to_uring;
+}
+
+static inline bool ep_free_user_item(struct epitem *epi)
+{
+	struct eventpoll *ep = epi->ep;
+	struct user_epitem
*uitem;
+
+	bool events_to_klist = false;
+
+	lockdep_assert_held(&ep->mtx);
+
+	ep->items_nr--;
+
+	uitem = &ep->user_header->items[epi->bit];
+
+	/* First drop the item events passed from userland */
+	memset(&uitem->event, 0, sizeof(uitem->event));
+
+	/*
+	 * If the event is not signaled yet and has already been consumed by
+	 * userspace, it is safe to reuse the bit immediately, i.e. just put
+	 * it.  If userspace has not yet consumed this event, we set the bit
+	 * in the removed bitmap in order to put it later.
+	 */
+	if (xchg(&uitem->ready_events, 0)) {
+		set_bit(epi->bit, ep->removed_items_bm);
+		events_to_klist = true;
+	} else {
+		/*
+		 * Should not be reordered with the memset above, thus unlock
+		 * semantics.
+		 */
+		clear_bit_unlock(epi->bit, ep->items_bm);
+		events_to_klist = ep_shrink_user_is_required(ep);
+	}
+
+	return events_to_klist;
+}
+
+static bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
+{
+	struct eventpoll *ep = epi->ep;
+	struct user_epitem *uitem;
+	bool added = false;
+
+	if (WARN_ON(!pollflags))
+		return false;
+
+	uitem = &ep->user_header->items[epi->bit];
+	if (!__atomic_fetch_or(&uitem->ready_events, pollflags,
+			       __ATOMIC_ACQUIRE)) {
+		unsigned int i, *item_idx, index_mask;
+
+		/*
+		 * The item was not ready before, thus we have to insert
+		 * a new index into the ring.
+		 */
+		index_mask = ep_max_index_nr(ep) - 1;
+		i = __atomic_fetch_add(&ep->user_header->tail, 1,
+				       __ATOMIC_ACQUIRE);
+		item_idx = &ep->user_index[i & index_mask];
+
+		/* Signal with a bit, which is > 0 */
+		*item_idx = epi->bit + 1;
+
+		/*
+		 * We want the index update to be flushed from the CPU write
+		 * buffer and immediately visible on the userspace side, to
+		 * avoid long busy loops.
+		 */
+		smp_wmb();
+
+		added = true;
+	}
+
+	return added;
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
@@ -1695,6 +1927,44 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) wakeup_source_unregister(ws); } +static int ep_expand_user_items(struct eventpoll *ep) +{ + unsigned int len; + int rc; + + if (!ep_expand_user_items_is_required(ep)) + /* Expanding is not needed */ + return 0; + + len = ep->header_length + PAGE_SIZE; + rc = ep_vrealloc((void **)&ep->user_header, len); + if (unlikely(rc)) + return rc; + + ep->header_length = len; + + return 0; +} + +static int ep_expand_user_index(struct eventpoll *ep) +{ + unsigned int len; + int rc; + + if (!ep_expand_user_index_is_required(ep)) + /* Expanding is not needed */ + return 0; + + len = ep->index_length + PAGE_SIZE; + rc = ep_vrealloc((void **)&ep->user_index, len); + if (unlikely(rc)) + return rc; + + ep->index_length = len; + + return 0; +} + /* * Must be called with "mtx" held. */ @@ -2010,6 +2280,156 @@ static inline struct timespec64 ep_set_mstimeout(long ms) return timespec64_add_safe(now, ts); } +static int ep_shrink_user_index(struct eventpoll *ep) +{ + unsigned int len; + int rc; + + len = ep_shrunk_user_index_length(ep); + if (!len) + /* Shrinking is not needed */ + return 0; + + rc = ep_vrealloc((void **)&ep->user_index, len); + if (unlikely(rc)) + return rc; + + ep->index_length = len; + + return 0; +} + +static int ep_shrink_user_items_and_bm(struct eventpoll *ep) +{ + unsigned int header_len, bm_len; + unsigned int bit, last_bit = UINT_MAX; + int rc; + + struct rb_node *rbp; + struct epitem *epi; + + lockdep_assert_held(&ep->mtx); + + header_len = ep_shrunk_user_items_length(ep); + bm_len = ep_shrunk_items_bm_length(ep); + if (!header_len && !bm_len) + /* Shrinking is not needed */ + return 0; + + /* + * Find left most last bit + */ + if (header_len) + last_bit = to_items_nr(header_len); + if (bm_len) + last_bit = min(last_bit, to_items_bm_nr(header_len)); + + if (WARN_ON(last_bit <= ep->items_nr)) + return -EINVAL; + + /* + * Find bits from the right and move them to the left in 
order to + * free space on the right. + * + * This is not nice, because O(n), but frankly this operation should + * be quite rare. If not - let's switch to idr or something similar + * (but that obviously will consume more memory). + * + */ + bit = 0; + for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) { + epi = rb_entry(rbp, struct epitem, rbn); + + if (epi->bit >= last_bit) { + /* Find first available bit from left */ + bit = find_next_zero_bit(ep->items_bm, last_bit, bit); + if (WARN_ON(bit >= last_bit)) + return -EINVAL; + + /* Clear old bit from right */ + clear_bit(epi->bit, ep->items_bm); + + /* + * Set item bit and advance an iterator for the + * following find_next_zero_bit() call. + */ + epi->bit = bit++; + } + } + + /* + * Reallocate memory and commit lengths + */ + if (header_len) { + rc = ep_vrealloc((void **)&ep->user_header, header_len); + if (unlikely(rc)) + return rc; + + ep->header_length = header_len; + } + if (bm_len) { + rc = ep_vrealloc_bm(ep, bm_len); + if (unlikely(rc)) + return rc; + } + + return 0; +} + +static inline void ep_put_postponed_user_items_bits(struct eventpoll *ep) +{ + size_t sz, i; + + lockdep_assert_held(&ep->mtx); + + sz = ep->items_bm_length >> ilog2(sizeof(ep->items_bm[0])); + for (i = 0; i < sz; i++) { + ep->items_bm[i] &= ~(ep->removed_items_bm[i]); + ep->removed_items_bm[i] = 0ul; + } +} + +static int ep_transfer_events_and_shrink_uring(struct eventpoll *ep) +{ + struct epitem *epi, *tmp; + int rc = 0; + + mutex_lock(&ep->mtx); + if (ep_events_routed_to_uring(ep)) + /* A bit late */ + goto unlock; + + /* Here at this point we are sure uring is empty */ + ep_put_postponed_user_items_bits(ep); + + rc = ep_shrink_user_index(ep); + if (unlikely(rc)) + goto unlock; + + rc = ep_shrink_user_items_and_bm(ep); + if (unlikely(rc)) + goto unlock; + + /* Commit lengths to userspace, but state is not yet ACTIVE */ + ep->user_header->index_length = ep->index_length; + ep->user_header->header_length = ep->header_length; 
+
+	write_lock_irq(&ep->lock);
+	/* Atomically transfer events from klists to uring */
+	list_for_each_entry_safe(epi, tmp, &ep->rdllist, rdllink) {
+		ep_add_event_to_uring(epi, epi->ready_events);
+		list_del_init(&epi->rdllink);
+		epi->ready_events = 0;
+	}
+	ep_route_events_to_uring(ep);
+	write_unlock_irq(&ep->lock);
+
+unlock:
+	mutex_unlock(&ep->mtx);
+
+	return rc;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  * event buffer.

From patchwork Wed Jan 9 16:40:20 2019
From: Roman Penyaev
Subject: [RFC PATCH 10/15] epoll: support polling from userspace for ep_insert()
Date: Wed, 9 Jan 2019 17:40:20 +0100
Message-Id: <20190109164025.24554-11-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

When the epfd is polled by userspace and a new item is inserted:

1. Get a free bit for the new item.

2. If an expand of the user items or the user index is required, route all
   events to the kernel lists and do the expand.

3. If events are ready for the newly inserted item, add the event to the
   uring; if events have just been routed to klists, add the item to the
   rdllist instead.

4. On the error path mark the user item as freed, and route events to
   klists if the ready event has not yet been observed by userspace.  That
   is needed to postpone the bit put, otherwise a newly allocated bit
   would corrupt the user item.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E.
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 65 insertions(+), 9 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 36c451c26681..4618db9c077c 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1977,6 +1977,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, struct epitem *epi; struct ep_pqueue epq; + lockdep_assert_held(&ep->mtx); lockdep_assert_irqs_enabled(); user_watches = atomic_long_read(&ep->user->epoll_watches); @@ -2002,6 +2003,43 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, RCU_INIT_POINTER(epi->ws, NULL); } + if (ep_polled_by_user(ep)) { + struct user_epitem *uitem; + int bit; + + bit = ep_get_bit(ep); + if (unlikely(bit < 0)) { + error = bit; + goto error_get_bit; + } + epi->bit = bit; + ep->items_nr++; + + if (ep_expand_user_is_required(ep)) { + /* + * Expand of user header or user index is required, + * thus reroute all events to klists and then safely + * vrealloc() the memory. 
+ */ + write_lock_irq(&ep->lock); + ep_route_events_to_klists(ep); + write_unlock_irq(&ep->lock); + + error = ep_expand_user_items(ep); + if (unlikely(error)) + goto error_expand; + + error = ep_expand_user_index(ep); + if (unlikely(error)) + goto error_expand; + } + + /* Now fill-in user item */ + uitem = &ep->user_header->items[epi->bit]; + uitem->ready_events = 0; + uitem->event = *event; + } + /* Initialize the poll table using the queue callback */ epq.epi = epi; init_poll_funcptr(&epq.pt, ep_ptable_queue_proc); @@ -2046,16 +2084,23 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* record NAPI ID of new item if present */ ep_set_busy_poll_napi_id(epi); - /* If the file is already "ready" we drop it inside the ready list */ - if (revents && !ep_is_linked(epi)) { - list_add_tail(&epi->rdllink, &ep->rdllist); - ep_pm_stay_awake(epi); + if (revents) { + bool added = false; - /* Notify waiting tasks that events are available */ - if (waitqueue_active(&ep->wq)) - wake_up(&ep->wq); - if (waitqueue_active(&ep->poll_wait)) - pwake++; + if (ep_events_routed_to_uring(ep)) + added = ep_add_event_to_uring(epi, revents); + else if (!ep_is_linked(epi)) { + list_add_tail(&epi->rdllink, &ep->rdllist); + ep_pm_stay_awake(epi); + added = true; + } + if (added) { + /* Notify waiting tasks that events are available */ + if (waitqueue_active(&ep->wq)) + wake_up(&ep->wq); + if (waitqueue_active(&ep->poll_wait)) + pwake++; + } } write_unlock_irq(&ep->lock); @@ -2089,6 +2134,17 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, list_del_init(&epi->rdllink); write_unlock_irq(&ep->lock); + if (ep_polled_by_user(ep)) { +error_expand: + /* + * No need to check return value: if events are routed to + * klists, that is done by code above, where we've expanded + * memory, but here, on rollback, we do not care. 
+	 */
+		(void)ep_free_user_item(epi);
+	}
+
+error_get_bit:
 	wakeup_source_unregister(ep_wakeup_source(epi));
 
 error_create_wakeup_source:

From patchwork Wed Jan 9 16:40:21 2019
From: Roman Penyaev
Subject: [RFC PATCH 11/15] epoll: offload polling to a work in case of epfd polled from userspace
Date: Wed, 9 Jan 2019 17:40:21 +0100
Message-Id: <20190109164025.24554-12-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

Not every device reports pollflags on wake_up(), expecting that it will be
polled later.  vfs_poll() can't be called from ep_poll_callback(), because
ep_poll_callback() is called under a spinlock.  Obviously userspace can't
call vfs_poll() either, thus epoll has to offload vfs_poll() to a work and
then call ep_poll_callback() with the pollflags in hand.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 111 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 87 insertions(+), 24 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4618db9c077c..2af849e6c7a5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1624,9 +1624,8 @@ static inline bool chain_epi_lockless(struct epitem *epi)
 }
 
 /*
- * This is the callback that is passed to the wait queue wakeup
- * mechanism. It is called by the stored file descriptors when they
- * have events to report.
+ * This is the callback that is called directly from wake queue wakeup or
+ * from a work.
* * This callback takes a read lock in order not to content with concurrent * events from another file descriptors, thus all modifications to ->rdllist @@ -1641,14 +1640,11 @@ static inline bool chain_epi_lockless(struct epitem *epi) * queues are used should be detected accordingly. This is detected using * cmpxchg() operation. */ -static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) { - int pwake = 0; - struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; - __poll_t pollflags = key_to_poll(key); + int pwake = 0, ewake = 0; unsigned long flags; - int ewake = 0; read_lock_irqsave(&ep->lock, flags); @@ -1666,12 +1662,32 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v /* * Check the events coming with the callback. At this stage, not * every device reports the events in the "key" parameter of the - * callback. We need to be able to handle both cases here, hence the - * test for "key" != NULL before the event match test. + * callback (for ep_poll_callback() case special worker is used). + * We need to be able to handle both cases here, hence the test + * for "key" != NULL before the event match test. */ if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; + if (ep_polled_by_user(ep)) { + __poll_t revents; + + if (ep_events_routed_to_uring(ep)) { + ep_add_event_to_uring(epi, pollflags); + goto wakeup; + } + + WARN_ON(!pollflags); + revents = (epi->event.events & ~EP_PRIVATE_BITS) & pollflags; + + /* + * Keep active events up-to-date for further transfer from + * klists to uring. 
+ */ + __atomic_fetch_or(&epi->ready_events, revents, + __ATOMIC_RELAXED); + } + /* * If we are transferring events to userspace, we can hold no locks * (because we're accessing user memory, and because of linux f_op->poll() @@ -1679,6 +1695,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v * chained in ep->ovflist and requeued later on. */ if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) { + WARN_ON(ep_polled_by_user(ep)); if (epi->next == EP_UNACTIVE_PTR && chain_epi_lockless(epi)) ep_pm_stay_awake_rcu(epi); @@ -1691,6 +1708,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v ep_pm_stay_awake_rcu(epi); } +wakeup: /* * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. @@ -1727,23 +1745,67 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; - if (pollflags & POLLFREE) { - /* - * If we race with ep_remove_wait_queue() it can miss - * ->whead = NULL and do another remove_wait_queue() after - * us, so we can't use __remove_wait_queue(). - */ - list_del_init(&wait->entry); + return ewake; +} + +static void ep_poll_callback_work(struct work_struct *work) +{ + struct epitem *epi = container_of(work, typeof(*epi), work); + __poll_t pollflags; + poll_table pt; + + WARN_ON(!ep_polled_by_user(epi->ep)); + + init_poll_funcptr(&pt, NULL); + pollflags = ep_item_poll(epi, &pt, 1); + + (void)ep_poll_callback(epi, pollflags); +} + +/* + * This is the callback that is passed to the wait queue wakeup + * mechanism. It is called by the stored file descriptors when they + * have events to report. 
+ */ +static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode, + int sync, void *key) +{ + + struct epitem *epi = ep_item_from_wait(wait); + struct eventpoll *ep = epi->ep; + __poll_t pollflags = key_to_poll(key); + int rc; + + if (!ep_polled_by_user(ep) || pollflags) { + rc = ep_poll_callback(epi, pollflags); + + if (pollflags & POLLFREE) { + /* + * If we race with ep_remove_wait_queue() it can miss + * ->whead = NULL and do another remove_wait_queue() + * after us, so we can't use __remove_wait_queue(). + */ + list_del_init(&wait->entry); + /* + * ->whead != NULL protects us from the race with + * ep_free() or ep_remove(), ep_remove_wait_queue() + * takes whead->lock held by the caller. Once we nullify + * it, nothing protects ep/epi or even wait. + */ + smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); + } + } else { + schedule_work(&epi->work); + /* - * ->whead != NULL protects us from the race with ep_free() - * or ep_remove(), ep_remove_wait_queue() takes whead->lock - * held by the caller. Once we nullify it, nothing protects - * ep/epi or even wait. + * Here on this path we are absolutely sure that for file + * descriptors* which are pollable from userspace we do not + * support EPOLLEXCLUSIVE, so it is safe to return 1. 
 		 */
-		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+		rc = 1;
 	}
 
-	return ewake;
+	return rc;
 }
 
 /*
@@ -1757,7 +1819,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 	struct eppoll_entry *pwq;
 
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
-		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
+		init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup);
 		pwq->whead = whead;
 		pwq->base = epi;
 		if (epi->event.events & EPOLLEXCLUSIVE)
@@ -1990,6 +2052,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	INIT_LIST_HEAD(&epi->rdllink);
 	INIT_LIST_HEAD(&epi->fllink);
 	INIT_LIST_HEAD(&epi->pwqlist);
+	INIT_WORK(&epi->work, ep_poll_callback_work);
 	epi->ep = ep;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;

From patchwork Wed Jan 9 16:40:22 2019
From: Roman Penyaev
Subject: [RFC PATCH 12/15] epoll: support polling from userspace for ep_remove()
Date: Wed, 9 Jan 2019 17:40:22 +0100
Message-Id: <20190109164025.24554-13-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

When the epfd is polled from userspace and an item is being removed:

1. Mark the user item as freed.  If userspace has not yet consumed the
   ready event, route all events to the kernel lists.

2. If a shrink is required, route all events to the kernel lists.

3. On unregistration of epoll entries do not forget to flush the item
   worker, which can have just been submitted from ep_poll_callback().

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2af849e6c7a5..7732a8029a1c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -780,6 +780,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		ep_remove_wait_queue(pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
+	if (ep_polled_by_user(ep)) {
+		/*
+		 * Events polled by user require offloading to a work,
+		 * thus we have to be sure everything which was queued
+		 * has run to a completion.
+		 */
+		flush_work(&epi->work);
+	}
 }
 
 /* call only when ep->mtx is held */
@@ -1168,6 +1176,7 @@ static bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
 static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 {
 	struct file *file = epi->ffd.file;
+	bool events_to_klists = false;
 
 	lockdep_assert_irqs_enabled();
@@ -1183,9 +1192,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
+	if (ep_polled_by_user(ep))
+		events_to_klists = ep_free_user_item(epi);
+
 	write_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
+	if (events_to_klists)
+		ep_route_events_to_klists(ep);
 	write_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));

From patchwork Wed Jan 9 16:40:23 2019
From: Roman Penyaev
Subject: [RFC PATCH 13/15] epoll: support polling from userspace for ep_modify()
Date: Wed, 9 Jan 2019 17:40:23 +0100
Message-Id: <20190109164025.24554-14-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>

When the epfd is polled from userspace and an item is being modified:

1. Update the user item with the new pointer or poll flags.

2. Add an event to the user ring if needed.
Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7732a8029a1c..2b38a3d884e8 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2237,6 +2237,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 		     const struct epoll_event *event)
 {
+	struct user_epitem *uitem;
+	__poll_t revents;
 	int pwake = 0;
 	poll_table pt;
 
@@ -2251,6 +2253,13 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 */
 	epi->event.events = event->events; /* need barrier below */
 	epi->event.data = event->data; /* protected by mtx */
+
+	/* Update user item, barrier is below */
+	if (ep_polled_by_user(ep)) {
+		uitem = &ep->user_header->items[epi->bit];
+		uitem->event = *event;
+	}
+
 	if (epi->event.events & EPOLLWAKEUP) {
 		if (!ep_has_wakeup_source(epi))
 			ep_create_wakeup_source(epi);
@@ -2284,12 +2293,19 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
	 */
-	if (ep_item_poll(epi, &pt, 1)) {
+	revents = ep_item_poll(epi, &pt, 1);
+	if (revents) {
+		bool added = false;
+
 		write_lock_irq(&ep->lock);
-		if (!ep_is_linked(epi)) {
+		if (ep_events_routed_to_uring(ep))
+			added = ep_add_event_to_uring(epi, revents);
+		else if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
-
+			added = true;
+		}
+		if (added) {
 			/* Notify waiting tasks that events are available */
 			if (waitqueue_active(&ep->wq))
 				wake_up(&ep->wq);

From patchwork Wed Jan 9 16:40:24 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754507
From: Roman Penyaev
Subject: [RFC PATCH 14/15] epoll: support polling from userspace for ep_poll()
Date: Wed, 9 Jan 2019 17:40:24 +0100
Message-Id: <20190109164025.24554-15-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>
References: <20190109164025.24554-1-rpenyaev@suse.de>

When epfd is polled from userspace and the user calls epoll_wait():

1. If the user ring is not fully consumed (i.e. head != tail), return
   -ESTALE, indicating that action on the user side is required.

2. If events were routed to klists, memory was probably expanded or a
   shrink is still required. Shrink if needed and transfer all
   collected events from the kernel lists to the uring.

3. Ensure with a WARN that ep_send_events() cannot be called from
   ep_poll() when epfd is pollable from userspace.

4. Wait for events on the wait queue; always return -ESTALE when woken
   up, indicating that events have to be consumed from the user ring.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2b38a3d884e8..5de640fcf28b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -523,7 +523,8 @@ static inline bool ep_user_ring_events_available(struct eventpoll *ep)
 static inline int ep_events_available(struct eventpoll *ep)
 {
 	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
+		ep_user_ring_events_available(ep);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -2411,6 +2412,8 @@ static int ep_send_events(struct eventpoll *ep,
 {
 	struct ep_send_events_data esed;
 
+	WARN_ON(ep_polled_by_user(ep));
+
 	esed.maxevents = maxevents;
 	esed.events = events;
 
@@ -2607,6 +2610,24 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	lockdep_assert_irqs_enabled();
 
+	if (ep_polled_by_user(ep)) {
+		if (ep_user_ring_events_available(ep))
+			/* Firstly all events from ring have to be consumed */
+			return -ESTALE;
+
+		if (ep_events_routed_to_klists(ep)) {
+			res = ep_transfer_events_and_shrink_uring(ep);
+			if (unlikely(res < 0))
+				return res;
+			if (res)
+				/*
+				 * Events were transferred from klists to
+				 * user ring
+				 */
+				return -ESTALE;
+		}
+	}
+
 	if (timeout > 0) {
 		struct timespec64 end_time = ep_set_mstimeout(timeout);
 
@@ -2695,14 +2716,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	__set_current_state(TASK_RUNNING);
 
 send_events:
-	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
-	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
+	if (!res && eavail) {
+		if (!ep_polled_by_user(ep)) {
+			/*
+			 * Try to transfer events to user space. In case we get
+			 * 0 events and there's still timeout left over, we go
+			 * trying again in search of more luck.
+			 */
+			res = ep_send_events(ep, events, maxevents);
+			if (!res && !timed_out)
+				goto fetch_events;
+		} else {
+			/* User has to deal with the ring himself */
+			res = -ESTALE;
+		}
+	}
 
 	if (waiter) {
 		spin_lock_irq(&ep->wq.lock);

From patchwork Wed Jan 9 16:40:25 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 10754501
From: Roman Penyaev
Subject: [RFC PATCH 15/15] epoll: support mapping for epfd when polled from userspace
Date: Wed, 9 Jan 2019 17:40:25 +0100
Message-Id: <20190109164025.24554-16-rpenyaev@suse.de>
In-Reply-To: <20190109164025.24554-1-rpenyaev@suse.de>
References: <20190109164025.24554-1-rpenyaev@suse.de>

The user has to mmap the vmalloc'ed user_header and user_index pointers
in order to consume events from userspace.

Support mapping with the possibility to mremap() in the future, i.e.
the vma does not have the VM_DONTEXPAND flag set. The user mmaps two
pointers, header and index, and can expand both by calling mremap().
Expanding is supported by the fault callback, where a page is mapped
with all appropriate size checks.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Cc: Al Viro
Cc: "Paul E. McKenney"
Cc: Linus Torvalds
Cc: Andrea Parri
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5de640fcf28b..2849b238f80b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1388,11 +1388,96 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static vm_fault_t ep_eventpoll_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t off = vmf->address - vma->vm_start;
+	vm_fault_t ret;
+	int rc;
+
+	mutex_lock(&ep->mtx);
+	ret = VM_FAULT_SIGBUS;
+	if (!vma->vm_pgoff) {
+		if (ep->header_length < (off + PAGE_SIZE))
+			goto unlock_and_out;
+
+		rc = remap_vmalloc_range_partial(vma, vmf->address,
+						 ep->user_header + off,
+						 PAGE_SIZE);
+	} else {
+		if (ep->index_length < (off + PAGE_SIZE))
+			goto unlock_and_out;
+
+		rc = remap_vmalloc_range_partial(vma, vmf->address,
+						 ep->user_index + off,
+						 PAGE_SIZE);
+	}
+	if (likely(!rc)) {
+		/* Success path */
+		vma->vm_flags &= ~VM_DONTEXPAND;
+		ret = VM_FAULT_NOPAGE;
+	}
+unlock_and_out:
+	mutex_unlock(&ep->mtx);
+
+	return ret;
+}
+
+static const struct vm_operations_struct eventpoll_vm_ops = {
+	.fault = ep_eventpoll_fault,
+};
+
+static int ep_eventpoll_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t size;
+	int rc;
+
+	if (!ep_polled_by_user(ep))
+		return -ENOTSUPP;
+
+	mutex_lock(&ep->mtx);
+	rc = -ENXIO;
+	size = vma->vm_end - vma->vm_start;
+	if (!vma->vm_pgoff && size > ep->header_length)
+		goto unlock_and_out;
+	if (vma->vm_pgoff && ep->header_length != (vma->vm_pgoff << PAGE_SHIFT))
+		/*
+		 * Index ring starts exactly after header. In future vm_pgoff
+		 * is not used, only as indication what kernel ptr is mapped.
+		 */
+		goto unlock_and_out;
+	if (vma->vm_pgoff && size > ep->index_length)
+		goto unlock_and_out;
+
+	/*
+	 * vm_pgoff is used *only* for indication, what is mapped: user header
+	 * or user index ring.
+	 */
+	if (!vma->vm_pgoff)
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_header, size);
+	else
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_index, size);
+
+	if (likely(!rc)) {
+		vma->vm_flags &= ~VM_DONTEXPAND;
+		vma->vm_ops = &eventpoll_vm_ops;
+	}
+unlock_and_out:
+	mutex_unlock(&ep->mtx);
+
+	return rc;
+}
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= ep_show_fdinfo,
 #endif
+	.mmap		= ep_eventpoll_mmap,
 	.release	= ep_eventpoll_release,
 	.poll		= ep_eventpoll_poll,
 	.llseek		= noop_llseek,
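The fault handler above clears VM_DONTEXPAND precisely so that the mapping can later be grown with mremap(), with the new pages populated lazily by ep_eventpoll_fault(). A minimal userspace sketch of that grow pattern, using a plain anonymous mapping rather than a real epoll fd (this uring-style epoll uapi was never merged), might look like:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Grow an existing mapping in place if possible, or let the kernel move
 * it (MREMAP_MAYMOVE).  With this patch applied, the epoll header/index
 * vmas would permit exactly this kind of expansion.
 */
static void *grow_mapping(void *old, size_t old_len, size_t new_len)
{
	return mremap(old, old_len, new_len, MREMAP_MAYMOVE);
}
```

Note that mremap() preserves the old contents: a consumer that expands the index ring after the kernel reports a larger required length keeps every event already published in the previously mapped pages.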