From patchwork Mon Jun 24 14:41:38 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013575 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 238A1924 for ; Mon, 24 Jun 2019 14:43:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 15F1B28950 for ; Mon, 24 Jun 2019 14:43:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 09B3A28958; Mon, 24 Jun 2019 14:43:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8CFCA28958 for ; Mon, 24 Jun 2019 14:43:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727968AbfFXOmC (ORCPT ); Mon, 24 Jun 2019 10:42:02 -0400 Received: from mx2.suse.de ([195.135.220.15]:50314 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726396AbfFXOmB (ORCPT ); Mon, 24 Jun 2019 10:42:01 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id A7811AD72; Mon, 24 Jun 2019 14:42:00 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 01/14] epoll: move private helpers from a header to the source Date: Mon, 24 Jun 2019 16:41:38 +0200 Message-Id: <20190624144151.22688-2-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Those helpers will access private eventpoll structure in future patches, so keep those helpers close to callers. Nothing important here. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 13 +++++++++++++ include/uapi/linux/eventpoll.h | 12 ------------ 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 8bc064630be0..622b6c9ef8c9 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -460,6 +460,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) #endif /* CONFIG_NET_RX_BUSY_POLL */ +#ifdef CONFIG_PM_SLEEP +static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +{ + if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) + epev->events &= ~EPOLLWAKEUP; +} +#else +static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +{ + epev->events &= ~EPOLLWAKEUP; +} +#endif /* CONFIG_PM_SLEEP */ + /** * ep_call_nested - Perform a bound (possibly) nested call, by checking * that the recursion limit is not exceeded, and that diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 8a3432d0f0dc..39dfc29f0f52 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -79,16 +79,4 @@ struct epoll_event { __u64 data; } EPOLL_PACKED; -#ifdef CONFIG_PM_SLEEP -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) -{ - if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) - epev->events &= ~EPOLLWAKEUP; -} -#else -static inline void 
ep_take_care_of_epollwakeup(struct epoll_event *epev) -{ - epev->events &= ~EPOLLWAKEUP; -} -#endif #endif /* _UAPI_LINUX_EVENTPOLL_H */ From patchwork Mon Jun 24 14:41:39 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013571 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B1936112C for ; Mon, 24 Jun 2019 14:43:23 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A304E28BF2 for ; Mon, 24 Jun 2019 14:43:23 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 97ECB28C07; Mon, 24 Jun 2019 14:43:23 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 320A028C03 for ; Mon, 24 Jun 2019 14:43:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728401AbfFXOmC (ORCPT ); Mon, 24 Jun 2019 10:42:02 -0400 Received: from mx2.suse.de ([195.135.220.15]:50316 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726402AbfFXOmC (ORCPT ); Mon, 24 Jun 2019 10:42:02 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id C9EE7AE32; Mon, 24 Jun 2019 14:42:00 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 02/14] epoll: introduce user structures for polling from 
userspace Date: Mon, 24 Jun 2019 16:41:39 +0200 Message-Id: <20190624144151.22688-3-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This one introduces structures of user items array: struct epoll_uheader - describes inserted epoll items. struct epoll_uitem - single epoll item visible to userspace. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 11 +++++++++++ include/uapi/linux/eventpoll.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 622b6c9ef8c9..6d7a5fe4a831 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -4,6 +4,7 @@ * Copyright (C) 2001,...,2009 Davide Libenzi * * Davide Libenzi + * Polling from userspace support by Roman Penyaev */ #include @@ -104,6 +105,16 @@ #define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry)) +/* + * That is around 1.3mb of allocated memory for one epfd. What is more + * important is ->index_length, which should be ^2, so do not increase + * max items number to avoid size doubling of user index. + * + * Before increasing the value see add_event_to_uring() and especially + * cnt_to_advance() functions and change them accordingly. 
+ */ +#define EP_USERPOLL_MAX_ITEMS_NR 65536 + struct epoll_filefd { struct file *file; int fd; diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 39dfc29f0f52..3317901b19c4 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -79,4 +79,33 @@ struct epoll_event { __u64 data; } EPOLL_PACKED; +#define EPOLL_USERPOLL_HEADER_MAGIC 0xeb01eb01 +#define EPOLL_USERPOLL_HEADER_SIZE 128 + +/* + * Item, shared with userspace. Unfortunately we can't embed epoll_event + * structure, because it is badly aligned on all 64-bit archs, except + * x86-64 (see EPOLL_PACKED). sizeof(epoll_uitem) == 16 + */ +struct epoll_uitem { + __poll_t ready_events; + __poll_t events; + __u64 data; +}; + +/* + * Header, shared with userspace. sizeof(epoll_uheader) == 128 + */ +struct epoll_uheader { + __u32 magic; /* epoll user header magic */ + __u32 header_length; /* length of the header + items */ + __u32 index_length; /* length of the index ring, always pow2 */ + __u32 max_items_nr; /* max number of items */ + __u32 head; /* updated by userland */ + __u32 tail; /* updated by kernel */ + + struct epoll_uitem items[] + __attribute__((__aligned__(EPOLL_USERPOLL_HEADER_SIZE))); +}; + #endif /* _UAPI_LINUX_EVENTPOLL_H */ From patchwork Mon Jun 24 14:41:40 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013537 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CF432924 for ; Mon, 24 Jun 2019 14:42:07 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BD39628925 for ; Mon, 24 Jun 2019 14:42:07 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B0E8B28BF8; Mon, 24 Jun 2019 14:42:07 +0000 (UTC) 
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5A70128BFC for ; Mon, 24 Jun 2019 14:42:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729085AbfFXOmF (ORCPT ); Mon, 24 Jun 2019 10:42:05 -0400 Received: from mx2.suse.de ([195.135.220.15]:50338 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727661AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 2F40DAE5C; Mon, 24 Jun 2019 14:42:01 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 03/14] epoll: allocate user header and user events ring for polling from userspace Date: Mon, 24 Jun 2019 16:41:40 +0200 Message-Id: <20190624144151.22688-4-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

This patch allocates the user header and the user events ring according to the maximum number of items, passed as a parameter. The length of the user events (index) ring is always a power of two. Pages that will be shared between kernel and userspace are accounted through the user->locked_vm counter. Architectures with a reduced set of atomic ops, namely arc-plat-eznps, sparc32 and parisc, are not supported.
These architectures have a single atomic op (something like xchg) and the other ops are emulated in the kernel with the help of a spinlock, thus it is impossible to safely share a variable with userspace and expect that it will not be corrupted and will be observed correctly. For these archs -EOPNOTSUPP is returned. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 160 +++++++++++++++++++++++++++++++-- include/uapi/linux/eventpoll.h | 3 +- 2 files changed, 155 insertions(+), 8 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 6d7a5fe4a831..b6682365d970 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -228,6 +228,33 @@ struct eventpoll { struct file *file; + /* User header with array of items */ + struct epoll_uheader *user_header; + + /* User index, which acts as a ring of coming events */ + unsigned int *user_index; + + /* Actual length of user header, always aligned on page */ + unsigned int header_length; + + /* Actual length of user index, always pow2 */ + unsigned int index_length; + + /* Maximum possible event items */ + unsigned int max_items_nr; + + /* Items bitmap, is used to get a free bit for new registered epi */ + unsigned long *items_bm; + + /* Length of both items bitmaps, always aligned on page */ + unsigned int items_bm_length; + + /* + * Counter to support atomic and lockless ->tail updates. + * See add_event_to_uring() for details of counter layout. 
+ */ + atomic64_t shadow_cnt; + /* used to optimize loop detection check */ int visited; struct list_head visited_list_link; @@ -377,6 +404,44 @@ static void ep_nested_calls_init(struct nested_calls *ncalls) spin_lock_init(&ncalls->lock); } +static inline unsigned int ep_to_items_length(unsigned int nr) +{ + struct epoll_uheader *user_header; + + return PAGE_ALIGN(struct_size(user_header, items, nr)); +} + +static inline unsigned int ep_to_index_length(unsigned int nr) +{ + struct eventpoll *ep; + unsigned int size; + + size = roundup_pow_of_two(nr << ilog2(sizeof(*ep->user_index))); + return max_t(typeof(size), size, PAGE_SIZE); +} + +static inline unsigned int ep_to_items_bm_length(unsigned int nr) +{ + return PAGE_ALIGN(ALIGN(nr, 8) >> 3); +} + +static inline bool ep_userpoll_supported(void) +{ + /* + * These architectures have a single atomic op (something like + * xchg) and others ops are fudged in kernel with a support of + * a spinlock, thus it is impossible to safely share a variable + * with a userspace and expect that this variable will not to + * be corrupted or observed correctly. The problematic variable + * is ->ready_events, which has to be atomically cleared on + * userspace, but on the kernel side cmpxchg() is called, which + * uses a spinlock as a method of synchronization. + */ + return !(IS_ENABLED(CONFIG_ARC_PLAT_EZNPS) || + IS_ENABLED(CONFIG_SPARC32) || + IS_ENABLED(CONFIG_PARISC)); +} + /** * ep_events_available - Checks if ready events might be available. 
* @@ -832,6 +897,38 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) return 0; } +static int ep_account_mem(struct eventpoll *ep, struct user_struct *user) +{ + unsigned long nr_pages, page_limit, cur_pages, new_pages; + + nr_pages = ep->header_length >> PAGE_SHIFT; + nr_pages += ep->index_length >> PAGE_SHIFT; + + /* Don't allow more pages than we can safely lock */ + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + do { + cur_pages = atomic_long_read(&user->locked_vm); + new_pages = cur_pages + nr_pages; + if (new_pages > page_limit && !capable(CAP_IPC_LOCK)) + return -ENOMEM; + } while (atomic_long_cmpxchg(&user->locked_vm, cur_pages, + new_pages) != cur_pages); + + return 0; +} + +static void ep_unaccount_mem(struct eventpoll *ep, struct user_struct *user) +{ + unsigned long nr_pages; + + nr_pages = ep->header_length >> PAGE_SHIFT; + nr_pages += ep->index_length >> PAGE_SHIFT; + if (nr_pages) + /* When polled by user */ + atomic_long_sub(nr_pages, &user->locked_vm); +} + static void ep_free(struct eventpoll *ep) { struct rb_node *rbp; @@ -879,8 +976,12 @@ static void ep_free(struct eventpoll *ep) mutex_unlock(&epmutex); mutex_destroy(&ep->mtx); - free_uid(ep->user); wakeup_source_unregister(ep->ws); + vfree(ep->user_header); + vfree(ep->user_index); + vfree(ep->items_bm); + ep_unaccount_mem(ep, ep->user); + free_uid(ep->user); kfree(ep); } @@ -1033,7 +1134,7 @@ void eventpoll_release_file(struct file *file) mutex_unlock(&epmutex); } -static int ep_alloc(struct eventpoll **pep) +static int ep_alloc(struct eventpoll **pep, int flags, size_t max_items) { int error; struct user_struct *user; @@ -1045,6 +1146,44 @@ static int ep_alloc(struct eventpoll **pep) if (unlikely(!ep)) goto free_uid; + if (flags & EPOLL_USERPOLL) { + BUILD_BUG_ON(sizeof(*ep->user_header) != + EPOLL_USERPOLL_HEADER_SIZE); + BUILD_BUG_ON(sizeof(ep->user_header->items[0]) != 16); + + error = -EOPNOTSUPP; + if (!ep_userpoll_supported()) + goto free_ep; + + error = -EINVAL; 
+ if (!max_items || max_items > EP_USERPOLL_MAX_ITEMS_NR) + goto free_ep; + + ep->max_items_nr = max_items; + ep->header_length = ep_to_items_length(max_items); + ep->index_length = ep_to_index_length(max_items); + ep->items_bm_length = ep_to_items_bm_length(max_items); + + error = ep_account_mem(ep, user); + if (error) + goto free_ep; + + error = -ENOMEM; + ep->user_header = vmalloc_user(ep->header_length); + ep->user_index = vmalloc_user(ep->index_length); + ep->items_bm = vzalloc(ep->items_bm_length); + if (!ep->user_header || !ep->user_index || !ep->items_bm) + goto unaccount_mem; + + *ep->user_header = (typeof(*ep->user_header)) { + .magic = EPOLL_USERPOLL_HEADER_MAGIC, + .header_length = ep->header_length, + .index_length = ep->index_length, + .max_items_nr = ep->max_items_nr, + }; + } + + atomic64_set(&ep->shadow_cnt, 0); mutex_init(&ep->mtx); rwlock_init(&ep->lock); init_waitqueue_head(&ep->wq); @@ -1058,6 +1197,13 @@ static int ep_alloc(struct eventpoll **pep) return 0; +unaccount_mem: + ep_unaccount_mem(ep, user); + vfree(ep->user_header); + vfree(ep->user_index); + vfree(ep->items_bm); +free_ep: + kfree(ep); free_uid: free_uid(user); return error; @@ -2062,7 +2208,7 @@ static void clear_tfile_check_list(void) /* * Open an eventpoll file descriptor. */ -static int do_epoll_create(int flags) +static int do_epoll_create(int flags, size_t size) { int error, fd; struct eventpoll *ep = NULL; @@ -2071,12 +2217,12 @@ static int do_epoll_create(int flags) /* Check the EPOLL_* constant for consistency. */ BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC); - if (flags & ~EPOLL_CLOEXEC) + if (flags & ~(EPOLL_CLOEXEC | EPOLL_USERPOLL)) return -EINVAL; /* * Create the internal data structure ("struct eventpoll"). 
*/ - error = ep_alloc(&ep); + error = ep_alloc(&ep, flags, size); if (error < 0) return error; /* @@ -2107,7 +2253,7 @@ static int do_epoll_create(int flags) SYSCALL_DEFINE1(epoll_create1, int, flags) { - return do_epoll_create(flags); + return do_epoll_create(flags, 0); } SYSCALL_DEFINE1(epoll_create, int, size) @@ -2115,7 +2261,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) if (size <= 0) return -EINVAL; - return do_epoll_create(0); + return do_epoll_create(0, 0); } /* diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 3317901b19c4..efd58e9177c2 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -20,7 +20,8 @@ #include /* Flags for epoll_create1. */ -#define EPOLL_CLOEXEC O_CLOEXEC +#define EPOLL_CLOEXEC O_CLOEXEC +#define EPOLL_USERPOLL 1 /* Valid opcodes to issue to sys_epoll_ctl() */ #define EPOLL_CTL_ADD 1 From patchwork Mon Jun 24 14:41:41 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013559 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 983EE112C for ; Mon, 24 Jun 2019 14:43:09 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 88D8428BEA for ; Mon, 24 Jun 2019 14:43:09 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7D45D28BF9; Mon, 24 Jun 2019 14:43:09 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ECD0A28BF2 for ; Mon, 24 Jun 2019 14:43:08 
+0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729083AbfFXOnE (ORCPT ); Mon, 24 Jun 2019 10:43:04 -0400 Received: from mx2.suse.de ([195.135.220.15]:50354 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727851AbfFXOmD (ORCPT ); Mon, 24 Jun 2019 10:42:03 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 8F052AE96; Mon, 24 Jun 2019 14:42:01 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 04/14] epoll: some sanity flags checks for epoll syscalls for polling from userspace Date: Mon, 24 Jun 2019 16:41:41 +0200 Message-Id: <20190624144151.22688-5-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

There are several limitations if an epfd is polled by the user:

 1. The EPOLLET flag (edge-triggered behavior) is always expected.

 2. No support for EPOLLWAKEUP:
    events are consumed from userspace, thus there is no way to call
    __pm_relax().

 3. No support for EPOLLEXCLUSIVE:
    if a device does not pass pollflags to wake_up(), there is no way to
    call poll() from the context under the spinlock, thus special work is
    scheduled to offload polling. In this specific case we can't support
    exclusive wakeups, because we do not know the actual result of the
    scheduled work.

 4. epoll_wait() for an epfd created with the EPOLL_USERPOLL flag accepts
    events as NULL and maxevents as 0. No other values are accepted.

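The four rules above can be modelled outside the kernel as a small C sketch. Every name in it (sketch_check_ctl_events(), sketch_check_wait_args() and the SK_* constants) is hypothetical and not part of this series. The SK_* values mirror the corresponding EPOLL* bits of the uapi header, and reporting the rejected cases as -EINVAL is an assumption about how the error surfaces to the caller. The regular (not user-polled) path is reduced to the bare minimum and skips the capability, upper bound and access_ok() checks.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copies of the event bits; the names are hypothetical. */
#define SK_EPOLLEXCLUSIVE (1u << 28)
#define SK_EPOLLWAKEUP    (1u << 29)
#define SK_EPOLLET        (1u << 31)

/*
 * Model of the epoll_ctl() checks for an operation that carries an event
 * mask (ADD or MOD). Returns 0 if the mask is acceptable, -EINVAL
 * (assumed error code) otherwise. For a user-polled epfd the mask may be
 * modified in place.
 */
static int sketch_check_ctl_events(bool polled_by_user, uint32_t *events)
{
	if (!polled_by_user)
		return 0; /* regular epfd: checks not modelled here */

	/* Rule 3: exclusive wakeups are rejected. */
	if (*events & SK_EPOLLEXCLUSIVE)
		return -EINVAL;

	/* Rule 1: only edge-triggered behaviour is accepted. */
	if (!(*events & SK_EPOLLET))
		return -EINVAL;

	/* Rule 2: EPOLLWAKEUP is not an error, the bit is just dropped. */
	*events &= ~SK_EPOLLWAKEUP;
	return 0;
}

/*
 * Model of the epoll_wait() argument checks. A user-polled epfd reads
 * events from the shared ring, so it must pass events == NULL and
 * maxevents == 0.
 */
static int sketch_check_wait_args(bool polled_by_user, const void *events,
				  int maxevents)
{
	if (polled_by_user)
		return (events == NULL && maxevents == 0) ? 0 : -EINVAL;

	/* Regular epfd: only the lower bound is modelled. */
	return maxevents > 0 ? 0 : -EINVAL;
}
```

With this model, a mask of SK_EPOLLET | SK_EPOLLWAKEUP on a user-polled epfd is accepted with the wakeup bit cleared, while a level-triggered or exclusive mask is refused.
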
Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 72 +++++++++++++++++++++++++++++++++++--------------- 1 file changed, 50 insertions(+), 22 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index b6682365d970..4087efb1fbf3 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -442,6 +442,15 @@ static inline bool ep_userpoll_supported(void) IS_ENABLED(CONFIG_PARISC)); } +static inline bool ep_polled_by_user(struct eventpoll *ep) +{ + /* + * Hint compiler to optimize 'if' branches out and exclude code + * if polling from userspace is not supported. + */ + return ep_userpoll_supported() && !!ep->user_header; +} + /** * ep_events_available - Checks if ready events might be available. * @@ -537,13 +546,17 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) #endif /* CONFIG_NET_RX_BUSY_POLL */ #ifdef CONFIG_PM_SLEEP -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep, + struct epoll_event *epev) { - if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) - epev->events &= ~EPOLLWAKEUP; + if (epev->events & EPOLLWAKEUP) { + if (!capable(CAP_BLOCK_SUSPEND) || ep_polled_by_user(ep)) + epev->events &= ~EPOLLWAKEUP; + } } #else -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep, + struct epoll_event *epev) { epev->events &= ~EPOLLWAKEUP; } @@ -2300,10 +2313,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (!file_can_poll(tf.file)) goto error_tgt_fput; - /* Check if EPOLLWAKEUP is allowed */ - if (ep_op_has_event(op)) - ep_take_care_of_epollwakeup(&epds); - /* * We have to check that the file structure underneath the file descriptor * the user passed to us _is_ an eventpoll file. 
And also we do not permit @@ -2313,10 +2322,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (f.file == tf.file || !is_file_epoll(f.file)) goto error_tgt_fput; + /* + * At this point it is safe to assume that the "private_data" contains + * our own data structure. + */ + ep = f.file->private_data; + /* * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only, * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. - * Also, we do not currently supported nested exclusive wakeups. + * Also, we do not currently supported nested exclusive wakeups + * and EPOLLEXCLUSIVE is not supported for epoll which is polled + * from userspace. */ if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) { if (op == EPOLL_CTL_MOD) @@ -2324,13 +2341,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) || (epds.events & ~EPOLLEXCLUSIVE_OK_BITS))) goto error_tgt_fput; + if (ep_polled_by_user(ep)) + goto error_tgt_fput; } - /* - * At this point it is safe to assume that the "private_data" contains - * our own data structure. 
- */ - ep = f.file->private_data; + if (ep_op_has_event(op)) { + if (ep_polled_by_user(ep) && !(epds.events & EPOLLET)) + /* Polled by user has only edge triggered behaviour */ + goto error_tgt_fput; + + /* Check if EPOLLWAKEUP is allowed */ + ep_take_care_of_epollwakeup(ep, &epds); + } /* * When we insert an epoll file descriptor, inside another epoll file @@ -2432,14 +2454,6 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events, struct fd f; struct eventpoll *ep; - /* The maximum number of event must be greater than zero */ - if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) - return -EINVAL; - - /* Verify that the area passed by the user is writeable */ - if (!access_ok(events, maxevents * sizeof(struct epoll_event))) - return -EFAULT; - /* Get the "struct file *" for the eventpoll file */ f = fdget(epfd); if (!f.file) @@ -2458,6 +2472,20 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events, * our own data structure. */ ep = f.file->private_data; + if (!ep_polled_by_user(ep)) { + /* The maximum number of event must be greater than zero */ + if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) + goto error_fput; + + /* Verify that the area passed by the user is writeable */ + error = -EFAULT; + if (!access_ok(events, maxevents * sizeof(struct epoll_event))) + goto error_fput; + } else { + /* Use ring instead */ + if (maxevents != 0 || events != NULL) + goto error_fput; + } /* Time to fish for events ... 
*/ error = ep_poll(ep, events, maxevents, timeout); From patchwork Mon Jun 24 14:41:42 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013557 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B0189924 for ; Mon, 24 Jun 2019 14:43:01 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A16AE28BF8 for ; Mon, 24 Jun 2019 14:43:01 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 95BCA28C03; Mon, 24 Jun 2019 14:43:01 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D4DDD28C07 for ; Mon, 24 Jun 2019 14:43:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730110AbfFXOnA (ORCPT ); Mon, 24 Jun 2019 10:43:00 -0400 Received: from mx2.suse.de ([195.135.220.15]:50380 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726396AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id CD1ECAEBF; Mon, 24 Jun 2019 14:42:01 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 05/14] epoll: offload polling to a work in case of epfd polled from userspace Date: Mon, 24 Jun 2019 16:41:42 +0200 Message-Id: 
<20190624144151.22688-6-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Not every device reports pollflags on wake_up(), expecting that it will be polled later. vfs_poll() can't be called from ep_poll_callback(), because ep_poll_callback() is called under the spinlock. Obviously userspace can't call vfs_poll(), thus epoll has to offload vfs_poll() to a work and then to call ep_poll_callback() with pollflags in a hand. In order not to bloat the size of `struct epitem` with `struct work_struct` new `struct uepitem` includes original epi and work. Thus for ep pollable from user new `struct uepitem` will be allocated. This will be done in the following patches. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 131 ++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 107 insertions(+), 24 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 4087efb1fbf3..f2a2be93bc4b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -38,6 +38,7 @@ #include #include #include +#include #include /* @@ -184,6 +185,18 @@ struct epitem { struct epoll_event event; }; +/* + * The only purpose of this structure is not to inflate the size of the + * 'struct epitem' if polling from user is not used. 
+ */ +struct uepitem { + /* Includes ep item */ + struct epitem epi; + + /* Work for offloading event callback */ + struct work_struct work; +}; + /* * This structure is stored inside the "private_data" member of the file * structure and represents the main data structure for the eventpoll @@ -313,6 +326,9 @@ static struct nested_calls poll_loop_ncalls; /* Slab cache used to allocate "struct epitem" */ static struct kmem_cache *epi_cache __read_mostly; +/* Slab cache used to allocate "struct uepitem" */ +static struct kmem_cache *uepi_cache __read_mostly; + /* Slab cache used to allocate "struct eppoll_entry" */ static struct kmem_cache *pwq_cache __read_mostly; @@ -391,6 +407,12 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p) return container_of(p, struct ep_pqueue, pt)->epi; } +/* Get the "struct uepitem" from an generic epitem. */ +static inline struct uepitem *uep_item_from_epi(struct epitem *p) +{ + return container_of(p, struct uepitem, epi); +} + /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */ static inline int ep_op_has_event(int op) { @@ -719,6 +741,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi) ep_remove_wait_queue(pwq); kmem_cache_free(pwq_cache, pwq); } + if (ep_polled_by_user(ep)) { + /* + * Events polled by user require offloading to a work, + * thus we have to be sure everything which was queued + * has run to a completion. + */ + flush_work(&uep_item_from_epi(epi)->work); + } } /* call only when ep->mtx is held */ @@ -1369,9 +1399,8 @@ static inline bool chain_epi_lockless(struct epitem *epi) } /* - * This is the callback that is passed to the wait queue wakeup - * mechanism. It is called by the stored file descriptors when they - * have events to report. + * This is the callback that is called directly from wake queue wakeup or + * from a work. 
* * This callback takes a read lock in order not to content with concurrent * events from another file descriptors, thus all modifications to ->rdllist @@ -1386,14 +1415,11 @@ static inline bool chain_epi_lockless(struct epitem *epi) * queues are used should be detected accordingly. This is detected using * cmpxchg() operation. */ -static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) { - int pwake = 0; - struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; - __poll_t pollflags = key_to_poll(key); + int pwake = 0, ewake = 0; unsigned long flags; - int ewake = 0; read_lock_irqsave(&ep->lock, flags); @@ -1411,8 +1437,9 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v /* * Check the events coming with the callback. At this stage, not * every device reports the events in the "key" parameter of the - * callback. We need to be able to handle both cases here, hence the - * test for "key" != NULL before the event match test. + * callback (for ep_poll_callback() case special worker is used). + * We need to be able to handle both cases here, hence the test + * for "key" != NULL before the event match test. */ if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; @@ -1472,23 +1499,68 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; - if (pollflags & POLLFREE) { - /* - * If we race with ep_remove_wait_queue() it can miss - * ->whead = NULL and do another remove_wait_queue() after - * us, so we can't use __remove_wait_queue(). 
- */ - list_del_init(&wait->entry); + return ewake; +} + +static void ep_poll_callback_work(struct work_struct *work) +{ + struct uepitem *uepi = container_of(work, typeof(*uepi), work); + struct epitem *epi = &uepi->epi; + __poll_t pollflags; + poll_table pt; + + WARN_ON(!ep_polled_by_user(epi->ep)); + + init_poll_funcptr(&pt, NULL); + pollflags = ep_item_poll(epi, &pt, 1); + if (pollflags) + (void)ep_poll_callback(epi, pollflags); +} + +/* + * This is the callback that is passed to the wait queue wakeup + * mechanism. It is called by the stored file descriptors when they + * have events to report. + */ +static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode, + int sync, void *key) +{ + + struct epitem *epi = ep_item_from_wait(wait); + struct eventpoll *ep = epi->ep; + __poll_t pollflags = key_to_poll(key); + int rc; + + if (!ep_polled_by_user(ep) || pollflags) { + rc = ep_poll_callback(epi, pollflags); + + if (pollflags & POLLFREE) { + /* + * If we race with ep_remove_wait_queue() it can miss + * ->whead = NULL and do another remove_wait_queue() + * after us, so we can't use __remove_wait_queue(). + */ + list_del_init(&wait->entry); + /* + * ->whead != NULL protects us from the race with + * ep_free() or ep_remove(), ep_remove_wait_queue() + * takes whead->lock held by the caller. Once we nullify + * it, nothing protects ep/epi or even wait. + */ + smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); + } + } else { + queue_work(system_highpri_wq, &uep_item_from_epi(epi)->work); + /* - * ->whead != NULL protects us from the race with ep_free() - * or ep_remove(), ep_remove_wait_queue() takes whead->lock - * held by the caller. Once we nullify it, nothing protects - * ep/epi or even wait. + * Here on this path we are absolutely sure that for file + * descriptors which are pollable from userspace we do not + * support EPOLLEXCLUSIVE, so it is safe to return 1. 
*/ - smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); + rc = 1; } - return ewake; + return rc; } /* @@ -1502,7 +1574,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, struct eppoll_entry *pwq; if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) { - init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); + init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup); pwq->whead = whead; pwq->base = epi; if (epi->event.events & EPOLLEXCLUSIVE) @@ -2575,6 +2647,10 @@ static int __init eventpoll_init(void) /* * We can have many thousands of epitems, so prevent this from * using an extra cache line on 64-bit (and smaller) CPUs + * + * 'struct uepitem' is used for polling from userspace, it is + * slightly bigger than 128b because of the work struct, thus + * it is excluded from the check below. */ BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128); @@ -2582,6 +2658,13 @@ static int __init eventpoll_init(void) epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); + /* + * Allocates slab cache used to allocate "struct uepitem" items, + * used only for polling from user. 
+ */ + uepi_cache = kmem_cache_create("eventpoll_uepi", sizeof(struct uepitem), + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); + /* Allocates slab cache used to allocate "struct eppoll_entry" */ pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL); From patchwork Mon Jun 24 14:41:43 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013553 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5C5EB924 for ; Mon, 24 Jun 2019 14:42:48 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4B32028958 for ; Mon, 24 Jun 2019 14:42:48 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3FC3A28C0C; Mon, 24 Jun 2019 14:42:48 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6F9D828958 for ; Mon, 24 Jun 2019 14:42:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729916AbfFXOmq (ORCPT ); Mon, 24 Jun 2019 10:42:46 -0400 Received: from mx2.suse.de ([195.135.220.15]:50386 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728770AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 3D3D8AEC7; Mon, 24 Jun 2019 14:42:02 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , 
Andrew Morton , Al Viro , Linus Torvalds , Peter Zijlstra , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 06/14] epoll: introduce helpers for adding/removing events to uring Date: Mon, 24 Jun 2019 16:41:43 +0200 Message-Id: <20190624144151.22688-7-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Both add and remove events are lockless and can be called in parallel. ep_add_event_to_uring(): o user item is marked atomically as ready o if on the previous step the user item was observed as not ready, then a new entry is created for the index uring. ep_remove_user_item(): o user item is marked as EPOLLREMOVED only if it was ready, thus userspace will observe the previously added entry in the index uring and the correct "removed" state of the item.
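For readers following along, the two state transitions described above can be sketched in userspace C11. This is only an illustration under stated assumptions: SKETCH_EPOLLREMOVED, set_unless_zero(), or_with_mask(), sketch_add_event() and sketch_remove_item() are hypothetical names, <stdatomic.h> stands in for the kernel's cmpxchg(), and the index uring itself is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for EPOLLREMOVED (bit 27 in this patch). */
#define SKETCH_EPOLLREMOVED (1u << 27)

/* Store `flags` only if the current value is non-zero; return the old value. */
static uint32_t set_unless_zero(_Atomic uint32_t *ptr, uint32_t flags)
{
	uint32_t val = atomic_load(ptr);

	while (val != 0) {
		/* On failure `val` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ptr, &val, flags))
			break;
	}
	return val;
}

/* Clear the bits in `mask`, OR in `flags`; return the old value. */
static uint32_t or_with_mask(_Atomic uint32_t *ptr, uint32_t flags,
			     uint32_t mask)
{
	uint32_t val = atomic_load(ptr);

	while (!atomic_compare_exchange_weak(ptr, &val, (val & ~mask) | flags))
		; /* retry with the refreshed `val` */
	return val;
}

/* Returns true when the caller would have to create a new index entry. */
static bool sketch_add_event(_Atomic uint32_t *ready_events, uint32_t pollflags)
{
	assert(pollflags != 0);
	return or_with_mask(ready_events, pollflags, SKETCH_EPOLLREMOVED) == 0;
}

/* Mark the item as removed only if it was already signaled. */
static void sketch_remove_item(_Atomic uint32_t *ready_events)
{
	(void)set_unless_zero(ready_events, SKETCH_EPOLLREMOVED);
}
```

The point of the sketch is the return value of sketch_add_event(): only the caller that saw the item go from 0 to non-zero creates an index entry, so an item that was removed while signaled and then reused keeps its single, not yet consumed, entry.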
Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: Peter Zijlstra Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 240 +++++++++++++++++++++++++++++++++ include/uapi/linux/eventpoll.h | 3 + 2 files changed, 243 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index f2a2be93bc4b..3b1f6a210247 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -195,6 +195,9 @@ struct uepitem { /* Work for offloading event callback */ struct work_struct work; + + /* Bit in user bitmap for user polling */ + unsigned int bit; }; /* @@ -447,6 +450,11 @@ static inline unsigned int ep_to_items_bm_length(unsigned int nr) return PAGE_ALIGN(ALIGN(nr, 8) >> 3); } +static inline unsigned int ep_max_index_nr(struct eventpoll *ep) +{ + return ep->index_length >> ilog2(sizeof(*ep->user_index)); +} + static inline bool ep_userpoll_supported(void) { /* @@ -898,6 +906,238 @@ static void epi_rcu_free(struct rcu_head *head) kmem_cache_free(epi_cache, epi); } +#define set_unless_zero_atomically(ptr, flags) \ +({ \ + typeof(ptr) _ptr = (ptr); \ + typeof(flags) _flags = (flags); \ + typeof(*_ptr) _old, _val = READ_ONCE(*_ptr); \ + \ + for (;;) { \ + if (!_val) \ + break; \ + _old = cmpxchg(_ptr, _val, _flags); \ + if (_old == _val) \ + break; \ + _val = _old; \ + } \ + _val; \ +}) + +static inline void ep_remove_user_item(struct epitem *epi) +{ + struct uepitem *uepi = uep_item_from_epi(epi); + struct eventpoll *ep = epi->ep; + struct epoll_uitem *uitem; + + lockdep_assert_held(&ep->mtx); + + /* Event should not have any attached queues */ + WARN_ON(!list_empty(&epi->pwqlist)); + + uitem = &ep->user_header->items[uepi->bit]; + + /* + * User item can be in two states: signaled (read_events is set + * and userspace has not yet consumed this event) and not signaled + * (no events yet fired or already consumed by userspace). 
+ * We reset ready_events to EPOLLREMOVED only if ready_events is + * in signaled state (we expect that userspace will come soon and + * fetch this event). In case of not signaled leave read_events + * as 0. + * + * Why it is important to mark read_events as EPOLLREMOVED in case + * of already signaled state? ep_insert() op can be immediately + * called after ep_remove(), thus the same bit can be reused and + * then new event comes, which corresponds to the same entry inside + * user items array. For this particular case ep_add_event_to_uring() + * does not allocate a new index entry, but simply masks EPOLLREMOVED, + * and userspace uses old index entry, but meanwhile old user item + * has been removed, new item has been added and event updated. + */ + set_unless_zero_atomically(&uitem->ready_events, EPOLLREMOVED); + clear_bit(uepi->bit, ep->items_bm); +} + +#define or_with_mask_atomically(ptr, flags, mask) \ +({ \ + typeof(ptr) _ptr = (ptr); \ + typeof(flags) _flags = (flags); \ + typeof(flags) _mask = (mask); \ + typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr); \ + \ + for (;;) { \ + _new = (_val & ~_mask) | _flags; \ + _old = cmpxchg(_ptr, _val, _new); \ + if (_old == _val) \ + break; \ + _val = _old; \ + } \ + _val; \ +}) + +static inline unsigned int cnt_to_monotonic(unsigned long long cnt) +{ + /* + * Monotonic counter is the index inside the uring, so + * should be big enough to hold all possible event items. + */ + BUILD_BUG_ON(EP_USERPOLL_MAX_ITEMS_NR > BIT(32)); + + return (cnt >> 32); +} + +static inline unsigned int cnt_to_advance(unsigned long long cnt) +{ + /* + * In worse barely possible case each registered event + * item signals completion in parallel. In order not + * to overflow the counter keep it equal or bigger + * than max number of items. 
+ */ + BUILD_BUG_ON(EP_USERPOLL_MAX_ITEMS_NR > BIT(16)); + + return (cnt >> 16) & 0xffff; +} + +static inline unsigned int cnt_to_refs(unsigned long long cnt) +{ + /* + * Counter should be big enough to hold references of all + * possible CPUs which can add events in parallel. + * Although, of course, this will never happen. + */ + BUILD_BUG_ON(NR_CPUS > BIT(16)); + + return (cnt & 0xffff); +} + +#define MONOTONIC_MASK ((1ull<<32)-1) +#define SINGLE_COUNTER ((1ull<<32)|(1ull<<16)|1ull) + +/** + * add_event_to_uring() - adds event to the uring locklessly. + * + * The most important here is a layout of ->shadow_cnt, which includes + * three counters which all of them should be increased atomically, all + * at once. The layout can be represented as the following: + * + * struct counter_t { + * unsigned long long monotonic :32; + * unsigned long long advance :16; + * unsigned long long refs :16; + * }; + * + * 'monotonic' - Monotonically increases on each event insertion, + * never decreases. Used as an index for an event + * in the uring. + * + * 'advance' - Represents number of events on which user ->tail + * has to be advanced. Monotonically increases if + * events are coming in parallel from different cpus + * and reference number keeps > 1. + * + * 'refs' - Represents reference number, i.e. number of cpus + * inserting events in parallel. Once there is a + * last inserter (the reference is 1), it should + * zero out 'advance' member and advance the tail + * for the userspace. + * + * What this is all about? The main problem is that since event can + * be inserted from many cpus in parallel, we can't advance the tail + * if previous insertion has not been fully completed. The idea to + * solve this is simple: the last one advances the tail. Who is + * exactly the last? Who detects the reference number is equal to 1. 
+ */ +static inline void add_event_to_uring(struct uepitem *uepi) +{ + struct eventpoll *ep = uepi->epi.ep; + + unsigned int *item_idx, idx, index_mask, advance; + unsigned long long old, cnt; + + index_mask = ep_max_index_nr(ep) - 1; + /* Increase all three subcounters at once */ + cnt = atomic64_add_return_acquire(SINGLE_COUNTER, &ep->shadow_cnt); + + idx = cnt_to_monotonic(cnt) - 1; + item_idx = &ep->user_index[idx & index_mask]; + + /* Add a bit to the uring */ + WRITE_ONCE(*item_idx, uepi->bit); + + do { + old = cnt; + if (cnt_to_refs(cnt) == 1) { + /* We are the last, we will advance the tail */ + advance = cnt_to_advance(cnt); + WARN_ON(!advance); + /* Zero out all fields except monotonic counter */ + cnt &= ~MONOTONIC_MASK; + } else { + /* Someone else will advance, only drop the ref */ + advance = 0; + cnt -= 1; + } + } while ((cnt = atomic64_cmpxchg_release(&ep->shadow_cnt, + old, cnt)) != old); + + if (advance) { + /* + * Advance the tail executing `tail += advance` operation, + * but since tail is shared with userspace, we can't use + * kernel atomic_t for just atomic add, so use cmpxchg(). + * Sigh. + * + * We can race here with another cpu which also advances the + * tail. This is absolutely ok, since the tail is advanced + * in one direction and eventually addition is commutative. 
+ */ + unsigned int old, tail = READ_ONCE(ep->user_header->tail); + + do { + old = tail; + } while ((tail = cmpxchg(&ep->user_header->tail, + old, old + advance)) != old); + } +} + +static inline bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags) +{ + struct uepitem *uepi = uep_item_from_epi(epi); + struct eventpoll *ep = epi->ep; + struct epoll_uitem *uitem; + bool added = false; + + if (WARN_ON(!pollflags)) + return false; + + uitem = &ep->user_header->items[uepi->bit]; + /* + * Can be represented as: + * + * was_ready = uitem->ready_events; + * uitem->ready_events &= ~EPOLLREMOVED; + * uitem->ready_events |= pollflags; + * if (!was_ready) { + * // create index entry + * } + * + * See the big comment inside ep_remove_user_item(), why it is + * important to mask EPOLLREMOVED. + */ + if (!or_with_mask_atomically(&uitem->ready_events, + pollflags, EPOLLREMOVED)) { + /* + * Item was not ready before, thus we have to insert + * new index to the ring. + */ + add_event_to_uring(uepi); + added = true; + } + + return added; +} + /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. 
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index efd58e9177c2..d3246a02dc2b 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -42,6 +42,9 @@ #define EPOLLMSG (__force __poll_t)0x00000400 #define EPOLLRDHUP (__force __poll_t)0x00002000 +/* User item marked as removed for EPOLL_USERPOLL */ +#define EPOLLREMOVED ((__force __poll_t)(1U << 27)) + /* Set exclusive wakeup mode for the target file descriptor */ #define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28)) From patchwork Mon Jun 24 14:41:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013549 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EFA08924 for ; Mon, 24 Jun 2019 14:42:41 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E102128BF9 for ; Mon, 24 Jun 2019 14:42:41 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D599128C09; Mon, 24 Jun 2019 14:42:41 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 74BCC28BF9 for ; Mon, 24 Jun 2019 14:42:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729039AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 Received: from mx2.suse.de ([195.135.220.15]:50396 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728852AbfFXOmD (ORCPT ); Mon, 24 Jun 2019 10:42:03 -0400 X-Virus-Scanned: by 
amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 80B88AEC3; Mon, 24 Jun 2019 14:42:02 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 07/14] epoll: call ep_add_event_to_uring() from ep_poll_callback() Date: Mon, 24 Jun 2019 16:41:44 +0200 Message-Id: <20190624144151.22688-8-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP ep_poll_callback() is called each time an fd calls wakeup() on the epfd, so account the new event in the user ring. The tricky part here is EPOLLONESHOT. Since we are lockless we have to deal with ep_poll_callback() calls running in parallel, thus use cmpxchg to clear the public event bits and filter out a concurrent call from another cpu. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 3b1f6a210247..cc4612e28e03 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1565,6 +1565,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, } #endif /* CONFIG_CHECKPOINT_RESTORE */ +/** + * Atomically clear public event bits and return %true if the old value has + * public event bits set. + */ +static inline bool ep_clear_public_event_bits(struct epitem *epi) +{ + __poll_t old, flags; + + /* + * Here we race with ourselves and with ep_modify(), which can + * change the event bits.
In order not to override events updated + * by ep_modify() we have to do cmpxchg. + */ + + old = READ_ONCE(epi->event.events); + do { + flags = old; + } while ((old = cmpxchg(&epi->event.events, flags, + flags & EP_PRIVATE_BITS)) != flags); + + return flags & ~EP_PRIVATE_BITS; +} + /** * Adds a new entry to the tail of the list in a lockless way, i.e. * multiple CPUs are allowed to call this function concurrently. @@ -1684,6 +1707,20 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; + if (ep_polled_by_user(ep)) { + /* + * For polled descriptor from user we have to disable events on + * callback path in case of one-shot. + */ + if ((epi->event.events & EPOLLONESHOT) && + !ep_clear_public_event_bits(epi)) + /* Race is lost, another callback has cleared events */ + goto out_unlock; + + ep_add_event_to_uring(epi, pollflags); + goto wakeup; + } + /* * If we are transferring events to userspace, we can hold no locks * (because we're accessing user memory, and because of linux f_op->poll() @@ -1703,6 +1740,7 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) ep_pm_stay_awake_rcu(epi); } +wakeup: /* * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. 
From patchwork Mon Jun 24 14:41:45 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013547 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A45BA186E for ; Mon, 24 Jun 2019 14:42:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 94EB4288EA for ; Mon, 24 Jun 2019 14:42:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8922F28BF8; Mon, 24 Jun 2019 14:42:37 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0152A28C0A for ; Mon, 24 Jun 2019 14:42:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729578AbfFXOmg (ORCPT ); Mon, 24 Jun 2019 10:42:36 -0400 Received: from mx2.suse.de ([195.135.220.15]:50404 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728868AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id CFEBDAEFE; Mon, 24 Jun 2019 14:42:02 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 08/14] epoll: support polling from userspace for ep_insert() Date: Mon, 24 Jun 2019 16:41:45 +0200 Message-Id: <20190624144151.22688-9-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When epfd is polled by userspace and a new item is inserted, a free bit has to be taken from the bitmap and the corresponding user item filled in accordingly. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 118 +++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 100 insertions(+), 18 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index cc4612e28e03..d0c057b73ea1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -906,6 +906,23 @@ static void epi_rcu_free(struct rcu_head *head) kmem_cache_free(epi_cache, epi); } +static inline int ep_get_bit(struct eventpoll *ep) +{ + bool was_set; + int bit; + + lockdep_assert_held(&ep->mtx); + + bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr); + if (bit >= ep->max_items_nr) + return -ENOSPC; + + was_set = test_and_set_bit(bit, ep->items_bm); + WARN_ON(was_set); + + return bit; +} + #define set_unless_zero_atomically(ptr, flags) \ ({ \ typeof(ptr) _ptr = (ptr); \ @@ -2022,6 +2039,33 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) wakeup_source_unregister(ws); } +static inline struct epitem *epi_alloc(struct eventpoll *ep) +{ + struct epitem *epi; + + if (ep_polled_by_user(ep)) { + struct uepitem *uepi; + + uepi = kmem_cache_alloc(uepi_cache, GFP_KERNEL); + if (likely(uepi)) + epi = &uepi->epi; + else + epi = NULL; + } else { + epi = kmem_cache_alloc(epi_cache, GFP_KERNEL); + } + + return epi; +} + +static inline void epi_free(struct eventpoll *ep, struct epitem *epi) +{ + if (ep_polled_by_user(ep)) + kmem_cache_free(uepi_cache, uep_item_from_epi(epi)); + else
+ kmem_cache_free(epi_cache, epi); +} + /* * Must be called with "mtx" held. */ @@ -2034,29 +2078,55 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, struct epitem *epi; struct ep_pqueue epq; + lockdep_assert_held(&ep->mtx); lockdep_assert_irqs_enabled(); user_watches = atomic_long_read(&ep->user->epoll_watches); if (unlikely(user_watches >= max_user_watches)) return -ENOSPC; - if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL))) + epi = epi_alloc(ep); + if (unlikely(!epi)) return -ENOMEM; /* Item initialization follow here ... */ INIT_LIST_HEAD(&epi->rdllink); INIT_LIST_HEAD(&epi->fllink); INIT_LIST_HEAD(&epi->pwqlist); + RCU_INIT_POINTER(epi->ws, NULL); epi->ep = ep; ep_set_ffd(&epi->ffd, tfile, fd); epi->event = *event; epi->nwait = 0; epi->next = EP_UNACTIVE_PTR; - if (epi->event.events & EPOLLWAKEUP) { + + if (ep_polled_by_user(ep)) { + struct uepitem *uepi = uep_item_from_epi(epi); + struct epoll_uitem *uitem; + int bit; + + INIT_WORK(&uepi->work, ep_poll_callback_work); + + bit = ep_get_bit(ep); + if (unlikely(bit < 0)) { + error = bit; + goto error_get_bit; + } + uepi->bit = bit; + + /* + * Now fill-in user item. Do not touch ready_events, since + * it can be EPOLLREMOVED (has been set by previous user + * item), thus user index entry can be not yet consumed + * by userspace. See ep_remove_user_item() and + * ep_add_event_to_uring() for details. 
+ */ + uitem = &ep->user_header->items[uepi->bit]; + uitem->events = event->events; + uitem->data = event->data; + } else if (epi->event.events & EPOLLWAKEUP) { error = ep_create_wakeup_source(epi); if (error) goto error_create_wakeup_source; - } else { - RCU_INIT_POINTER(epi->ws, NULL); } /* Initialize the poll table using the queue callback */ @@ -2103,16 +2173,23 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* record NAPI ID of new item if present */ ep_set_busy_poll_napi_id(epi); - /* If the file is already "ready" we drop it inside the ready list */ - if (revents && !ep_is_linked(epi)) { - list_add_tail(&epi->rdllink, &ep->rdllist); - ep_pm_stay_awake(epi); + if (revents) { + bool added = false; - /* Notify waiting tasks that events are available */ - if (waitqueue_active(&ep->wq)) - wake_up(&ep->wq); - if (waitqueue_active(&ep->poll_wait)) - pwake++; + if (ep_polled_by_user(ep)) { + added = ep_add_event_to_uring(epi, revents); + } else if (!ep_is_linked(epi)) { + list_add_tail(&epi->rdllink, &ep->rdllist); + ep_pm_stay_awake(epi); + added = true; + } + if (added) { + /* Notify waiting tasks that events are available */ + if (waitqueue_active(&ep->wq)) + wake_up(&ep->wq); + if (waitqueue_active(&ep->poll_wait)) + pwake++; + } } write_unlock_irq(&ep->lock); @@ -2141,15 +2218,20 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, * list, since that is used/cleaned only inside a section bound by "mtx". * And ep_insert() is called with "mtx" held. 
*/ - write_lock_irq(&ep->lock); - if (ep_is_linked(epi)) - list_del_init(&epi->rdllink); - write_unlock_irq(&ep->lock); + if (ep_polled_by_user(ep)) { + ep_remove_user_item(epi); + } else { + write_lock_irq(&ep->lock); + if (ep_is_linked(epi)) + list_del_init(&epi->rdllink); + write_unlock_irq(&ep->lock); + } wakeup_source_unregister(ep_wakeup_source(epi)); +error_get_bit: error_create_wakeup_source: - kmem_cache_free(epi_cache, epi); + epi_free(ep, epi); return error; } From patchwork Mon Jun 24 14:41:46 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013555 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B11A0924 for ; Mon, 24 Jun 2019 14:43:00 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9E9F1288EA for ; Mon, 24 Jun 2019 14:43:00 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 9C0B728C0A; Mon, 24 Jun 2019 14:43:00 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0A101288EA for ; Mon, 24 Jun 2019 14:42:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729171AbfFXOmq (ORCPT ); Mon, 24 Jun 2019 10:42:46 -0400 Received: from mx2.suse.de ([195.135.220.15]:50414 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729005AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from 
relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 1A55FAF3B; Mon, 24 Jun 2019 14:42:03 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 09/14] epoll: support polling from userspace for ep_remove() Date: Mon, 24 Jun 2019 16:41:46 +0200 Message-Id: <20190624144151.22688-10-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On ep_remove() simply mark a user item with EPOLLREMOVED if the item was ready (i.e. has some bits set). That will prevent further user index entry creation on item ->bit reuse. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 34 ++++++++++++++++++++++++++++------ 1 file changed, 28 insertions(+), 6 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index d0c057b73ea1..df96569d3b5a 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -900,12 +900,30 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, return res; } -static void epi_rcu_free(struct rcu_head *head) +static void epi_rcu_free_cb(struct rcu_head *head) { struct epitem *epi = container_of(head, struct epitem, rcu); kmem_cache_free(epi_cache, epi); } +static void uepi_rcu_free_cb(struct rcu_head *head) +{ + struct epitem *epi = container_of(head, struct epitem, rcu); + kmem_cache_free(uepi_cache, uep_item_from_epi(epi)); +} + +static void epi_rcu_free(struct eventpoll *ep, struct epitem *epi) +{ + /* + * Check if `ep` is polled by user here, in this function, not + * in the callback,
in order not to control lifetime of the 'ep'. + */ + if (ep_polled_by_user(ep)) + call_rcu(&epi->rcu, uepi_rcu_free_cb); + else + call_rcu(&epi->rcu, epi_rcu_free_cb); +} + static inline int ep_get_bit(struct eventpoll *ep) { bool was_set; @@ -1177,10 +1195,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) rb_erase_cached(&epi->rbn, &ep->rbr); - write_lock_irq(&ep->lock); - if (ep_is_linked(epi)) - list_del_init(&epi->rdllink); - write_unlock_irq(&ep->lock); + if (ep_polled_by_user(ep)) { + ep_remove_user_item(epi); + } else { + write_lock_irq(&ep->lock); + if (ep_is_linked(epi)) + list_del_init(&epi->rdllink); + write_unlock_irq(&ep->lock); + } wakeup_source_unregister(ep_wakeup_source(epi)); /* @@ -1190,7 +1212,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make * use of the rbn field. */ - call_rcu(&epi->rcu, epi_rcu_free); + epi_rcu_free(ep, epi); atomic_long_dec(&ep->user->epoll_watches); From patchwork Mon Jun 24 14:41:47 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 11013551 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 41A21112C for ; Mon, 24 Jun 2019 14:42:46 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3243628C09 for ; Mon, 24 Jun 2019 14:42:46 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 267BF28C0A; Mon, 24 Jun 2019 14:42:46 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org 
(vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ADFB528C07 for ; Mon, 24 Jun 2019 14:42:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729854AbfFXOml (ORCPT ); Mon, 24 Jun 2019 10:42:41 -0400 Received: from mx2.suse.de ([195.135.220.15]:50422 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727742AbfFXOmE (ORCPT ); Mon, 24 Jun 2019 10:42:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 5A878AF3F; Mon, 24 Jun 2019 14:42:03 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 10/14] epoll: support polling from userspace for ep_modify() Date: Mon, 24 Jun 2019 16:41:47 +0200 Message-Id: <20190624144151.22688-11-rpenyaev@suse.de> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de> References: <20190624144151.22688-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When epfd is polled from userspace and item is being modified: 1. Update user item with new pointer or poll flags. 2. Add event to user ring if needed. 
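
The two steps above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `struct model_uitem`, `struct model_ep`, `model_add_event()` and `model_modify()` are hypothetical names, the layout is not the uapi layout, and the `polled` argument stands in for whatever polling the file returns. The rule "queue a ring entry only when the item had no pending bits" is an assumption made for the model, not a quote of ep_add_event_to_uring().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_ITEMS 8u	/* hypothetical capacity of the model */

/* Hypothetical, simplified stand-in for a user item; not the uapi layout. */
struct model_uitem {
	uint32_t ready_events;	/* pending bits, cleared by the consumer */
	uint32_t events;	/* interest mask */
	uint64_t data;		/* user cookie */
};

struct model_ep {
	struct model_uitem items[MODEL_NR_ITEMS];
	uint32_t index[MODEL_NR_ITEMS];	/* ring of item bits */
	uint32_t head;			/* consumer position */
	uint32_t tail;			/* producer position */
};

/*
 * Step 2: publish ready bits. A ring entry is queued only when the item
 * had no pending bits, so each item owns at most one slot at a time
 * (provided the consumer clears the bits when it pops the entry).
 * Returns true if a new entry was queued.
 */
static bool model_add_event(struct model_ep *ep, uint32_t bit, uint32_t revents)
{
	struct model_uitem *uitem = &ep->items[bit];
	bool was_idle = (uitem->ready_events == 0);

	uitem->ready_events |= revents;
	if (!was_idle)
		return false;

	ep->index[ep->tail % MODEL_NR_ITEMS] = bit;
	ep->tail++;
	return true;
}

/* Models both steps of the modify path for one item. */
static bool model_modify(struct model_ep *ep, uint32_t bit,
			 uint32_t events, uint64_t data, uint32_t polled)
{
	struct model_uitem *uitem;
	uint32_t revents;

	assert(bit < MODEL_NR_ITEMS);
	uitem = &ep->items[bit];

	/* Step 1: update the user-visible item with the new mask and data. */
	uitem->events = events;
	uitem->data = data;

	/* Step 2: queue an event only if something of interest is ready. */
	revents = polled & events;
	if (!revents)
		return false;
	return model_add_event(ep, bit, revents);
}
```

In this model a second modify of an already pending item still refreshes `events` and `data` but does not create a duplicate ring entry, which matches the "if needed" wording of step 2.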

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Al Viro
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index df96569d3b5a..f94608ca9f7a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2265,6 +2265,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 		     const struct epoll_event *event)
 {
+	__poll_t revents;
 	int pwake = 0;
 	poll_table pt;
 
@@ -2276,10 +2277,24 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * Set the new event interest mask before calling f_op->poll();
 	 * otherwise we might miss an event that happens between the
 	 * f_op->poll() call and the new event set registering.
+	 *
+	 * Use xchg() here because we can race with ep_clear_public_event_bits()
+	 * for the case when events are polled from userspace. Internally
+	 * ep_clear_public_event_bits() uses cmpxchg(), thus on some archs
+	 * we can't mix normal writes and cmpxchg().
 	 */
-	epi->event.events = event->events; /* need barrier below */
+	(void) xchg(&epi->event.events, event->events);
 	epi->event.data = event->data; /* protected by mtx */
-	if (epi->event.events & EPOLLWAKEUP) {
+
+	/* Update user item, barrier is below */
+	if (ep_polled_by_user(ep)) {
+		struct uepitem *uepi = uep_item_from_epi(epi);
+		struct epoll_uitem *uitem;
+
+		uitem = &ep->user_header->items[uepi->bit];
+		WRITE_ONCE(uitem->events, event->events);
+		WRITE_ONCE(uitem->data, event->data);
+	} else if (epi->event.events & EPOLLWAKEUP) {
 		if (!ep_has_wakeup_source(epi))
 			ep_create_wakeup_source(epi);
 	} else if (ep_has_wakeup_source(epi)) {
@@ -2312,12 +2327,19 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
 	 */
-	if (ep_item_poll(epi, &pt, 1)) {
+	revents = ep_item_poll(epi, &pt, 1);
+	if (revents) {
+		bool added = false;
+
 		write_lock_irq(&ep->lock);
-		if (!ep_is_linked(epi)) {
+		if (ep_polled_by_user(ep))
+			added = ep_add_event_to_uring(epi, revents);
+		else if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
-
+			added = true;
+		}
+		if (added) {
 			/* Notify waiting tasks that events are available */
 			if (waitqueue_active(&ep->wq))
 				wake_up(&ep->wq);

From patchwork Mon Jun 24 14:41:48 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 11013543
From: Roman Penyaev
Cc: Roman Penyaev, Andrew Morton, Al Viro, Linus Torvalds,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 11/14] epoll: support polling from userspace for ep_poll()
Date: Mon, 24 Jun 2019 16:41:48 +0200
Message-Id: <20190624144151.22688-12-rpenyaev@suse.de>
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de>
References: <20190624144151.22688-1-rpenyaev@suse.de>

The rule of thumb for an epfd polled from userspace is simple: the epfd
has events if ->head != ->tail; individual items are not traversed.

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Al Viro
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 69 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f94608ca9f7a..1d4a76ff5dff 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -481,6 +481,13 @@ static inline bool ep_polled_by_user(struct eventpoll *ep)
 	return ep_userpoll_supported() && !!ep->user_header;
 }
 
+static inline bool ep_uring_events_available(struct eventpoll *ep)
+{
+	return ep_polled_by_user(ep) &&
+		READ_ONCE(ep->user_header->head) !=
+		READ_ONCE(ep->user_header->tail);
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
@@ -492,7 +499,8 @@ static inline bool ep_polled_by_user(struct eventpoll *ep)
 static inline int ep_events_available(struct eventpoll *ep)
 {
 	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
+		ep_uring_events_available(ep);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -1330,7 +1338,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 				 int depth)
 {
-	struct eventpoll *ep;
+	struct eventpoll *ep, *tep;
 	bool locked;
 
 	pt->_key = epi->event.events;
@@ -1339,6 +1347,26 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 
 	ep = epi->ffd.file->private_data;
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
+
+	tep = epi->ffd.file->private_data;
+	if (ep_polled_by_user(tep)) {
+		/*
+		 * The behaviour differs comparing to full scan of ready
+		 * list for original epoll. If descriptor is pollable
+		 * from userspace we don't do scan of all ready user items:
+		 * firstly because we can't do reverse search of epi by
+		 * uitem bit, secondly this is simply waste of time for
+		 * edge triggered descriptors (user code should be prepared
+		 * to deal with EAGAIN returned from read() or write() on
+		 * inserted file descriptor) and thirdly once event is put
+		 * into user index ring do not touch it from kernel, what
+		 * we do is mark it as EPOLLREMOVED on ep_remove() and
+		 * that's it.
+		 */
+		return ep_uring_events_available(tep) ?
+			EPOLLIN | EPOLLRDNORM : 0;
+	}
+
 	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
 
 	return ep_scan_ready_list(epi->ffd.file->private_data,
@@ -1381,6 +1409,12 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	/* Insert inside our poll wait queue */
 	poll_wait(file, &ep->poll_wait, wait);
 
+	if (ep_polled_by_user(ep)) {
+		/* Please read detailed comments inside ep_item_poll() */
+		return ep_uring_events_available(ep) ?
+			EPOLLIN | EPOLLRDNORM : 0;
+	}
+
 	/*
 	 * Proceed to find out if wanted events are really available inside
 	 * the ready list.
@@ -2445,6 +2479,8 @@ static int ep_send_events(struct eventpoll *ep,
 {
 	struct ep_send_events_data esed;
 
+	WARN_ON(ep_polled_by_user(ep));
+
 	esed.maxevents = maxevents;
 	esed.events = events;
 
@@ -2491,6 +2527,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	lockdep_assert_irqs_enabled();
 
+	if (ep_polled_by_user(ep)) {
+		if (ep_uring_events_available(ep))
+			/* Firstly all events from ring have to be consumed */
+			return -ESTALE;
+	}
+
 	if (timeout > 0) {
 		struct timespec64 end_time = ep_set_mstimeout(timeout);
 
@@ -2579,14 +2621,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	__set_current_state(TASK_RUNNING);
 
 send_events:
-	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
-	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
+	if (!res && eavail) {
+		if (!ep_polled_by_user(ep)) {
+			/*
+			 * Try to transfer events to user space. In case we get
+			 * 0 events and there's still timeout left over, we go
+			 * trying again in search of more luck.
+			 */
+			res = ep_send_events(ep, events, maxevents);
+			if (!res && !timed_out)
+				goto fetch_events;
+		} else {
+			/* User has to deal with the ring himself */
+			res = -ESTALE;
+		}
+	}
 
 	if (waiter) {
 		spin_lock_irq(&ep->wq.lock);

From patchwork Mon Jun 24 14:41:49 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 11013541
From: Roman Penyaev
Cc: Roman Penyaev, Andrew Morton, Al Viro, Linus Torvalds,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 12/14] epoll: support mapping for epfd when polled from userspace
Date: Mon, 24 Jun 2019 16:41:49 +0200
Message-Id: <20190624144151.22688-13-rpenyaev@suse.de>
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de>
References: <20190624144151.22688-1-rpenyaev@suse.de>

The user has to mmap the vmalloc'ed user_header and user_index regions
in order to consume events from userspace. Also, the vma is not copied
on fork().

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Al Viro
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1d4a76ff5dff..34239564bdfb 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1447,11 +1447,47 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static int ep_eventpoll_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t size;
+	int rc;
+
+	if (!ep_polled_by_user(ep))
+		return -EOPNOTSUPP;
+
+	size = vma->vm_end - vma->vm_start;
+	if (!vma->vm_pgoff && size > ep->header_length)
+		return -ENXIO;
+	if (vma->vm_pgoff && ep->header_length != (vma->vm_pgoff << PAGE_SHIFT))
+		/* Index ring starts exactly after the header */
+		return -ENXIO;
+	if (vma->vm_pgoff && size > ep->index_length)
+		return -ENXIO;
+
+	/*
+	 * vm_pgoff is used *only* for indication, what is mapped: user header
+	 * or user index ring. Sizes are checked above.
+	 */
+	if (!vma->vm_pgoff)
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_header, size);
+	else
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_index, size);
+	if (likely(!rc))
+		/* No copies for forks(), please */
+		vma->vm_flags |= VM_DONTCOPY;
+
+	return rc;
+}
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= ep_show_fdinfo,
 #endif
+	.mmap		= ep_eventpoll_mmap,
 	.release	= ep_eventpoll_release,
 	.poll		= ep_eventpoll_poll,
 	.llseek		= noop_llseek,

From patchwork Mon Jun 24 14:41:50 2019
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 11013545
From: Roman Penyaev
Cc: Roman Penyaev, Andrew Morton, Al Viro, Arnd Bergmann, Linus Torvalds,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 13/14] epoll: implement epoll_create2() syscall
Date: Mon, 24 Jun 2019 16:41:50 +0200
Message-Id: <20190624144151.22688-14-rpenyaev@suse.de>
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de>
References: <20190624144151.22688-1-rpenyaev@suse.de>

epoll_create2() is needed to accept EPOLL_USERPOLL flags and size, i.e.
this patch wires up polling from userspace.
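
For illustration only, a userspace caller could wrap the new syscall roughly as below, in the same way the selftest in this series does. This is a sketch under stated assumptions: `uepoll_create()` and `uepoll_supported()` are hypothetical helper names, and the installed uapi headers are assumed to come from a kernel with this series applied. The value of EPOLL_USERPOLL is defined elsewhere in the series and is intentionally not hard-coded here. Because the syscall number is only proposed by this patch, the wrapper reports ENOSYS rather than guessing a number when `__NR_epoll_create2` is missing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (the name is not from the series). `flags` is
 * expected to carry EPOLL_USERPOLL and `size` the requested number of
 * user items; both are passed through unchanged.
 */
static long uepoll_create(int flags, size_t size)
{
#ifdef __NR_epoll_create2
	return syscall(__NR_epoll_create2, flags, size);
#else
	/* Headers predate the series: do not guess the syscall number. */
	(void)flags;
	(void)size;
	errno = ENOSYS;
	return -1;
#endif
}

/* Returns 1 if the headers advertise epoll_create2(), 0 otherwise. */
static int uepoll_supported(void)
{
#ifdef __NR_epoll_create2
	return 1;
#else
	return 0;
#endif
}
```

A caller would check `uepoll_supported()` (or the ENOSYS error) and fall back to epoll_create1() and regular epoll_wait() when the userspace ring is not available.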
Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Al Viro Cc: Arnd Bergmann Cc: Linus Torvalds Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- arch/alpha/kernel/syscalls/syscall.tbl | 2 ++ arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/ia64/kernel/syscalls/syscall.tbl | 2 ++ arch/m68k/kernel/syscalls/syscall.tbl | 2 ++ arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 2 ++ arch/mips/kernel/syscalls/syscall_n64.tbl | 2 ++ arch/mips/kernel/syscalls/syscall_o32.tbl | 2 ++ arch/parisc/kernel/syscalls/syscall.tbl | 2 ++ arch/powerpc/kernel/syscalls/syscall.tbl | 2 ++ arch/s390/kernel/syscalls/syscall.tbl | 2 ++ arch/sh/kernel/syscalls/syscall.tbl | 2 ++ arch/sparc/kernel/syscalls/syscall.tbl | 2 ++ arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/eventpoll.c | 5 +++++ include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 4 +++- kernel/sys_ni.c | 1 + 22 files changed, 40 insertions(+), 2 deletions(-) diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 1db9bbcfb84e..a1d7b695063d 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -474,3 +474,5 @@ 542 common fsmount sys_fsmount 543 common fspick sys_fspick 544 common pidfd_open sys_pidfd_open +# 546 common clone3 sys_clone3 +547 common epoll_create2 sys_epoll_create2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index ff45d8807cb8..1497f3c87d54 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -449,3 +449,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 
e4e0523102e2..7fb8d77e2340 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -44,7 +44,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 437 +#define __NR_compat_syscalls 438 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 10f16b0175ca..eb467e639352 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -890,6 +890,8 @@ __SYSCALL(__NR_fspick, sys_fspick) __SYSCALL(__NR_pidfd_open, sys_pidfd_open) #define __NR_clone3 436 __SYSCALL(__NR_clone3, sys_clone3) +#define __NR_epoll_create2 437 +__SYSCALL(__NR_epoll_create2, sys_epoll_create2) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index ecc44926737b..5cecff901853 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -355,3 +355,5 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 9a3eb2558568..29d944e2e9d6 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -434,3 +434,5 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 9181f181f76d..fad83841b16b 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -441,3 +441,4 @@ 433 common fspick sys_fspick 434 common pidfd_open 
sys_pidfd_open 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 97035e19ad03..661d63d7ea84 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -373,3 +373,5 @@ 432 n32 fsmount sys_fsmount 433 n32 fspick sys_fspick 434 n32 pidfd_open sys_pidfd_open +# 436 n32 clone3 sys_clone3 +437 n32 epoll_create2 sys_epoll_create2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index d7292722d3b0..4a7f270ef126 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -349,3 +349,5 @@ 432 n64 fsmount sys_fsmount 433 n64 fspick sys_fspick 434 n64 pidfd_open sys_pidfd_open +# 436 n64 clone3 sys_clone3 +437 n64 epoll_create2 sys_epoll_create2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index dba084c92f14..db87225a9dc5 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -422,3 +422,5 @@ 432 o32 fsmount sys_fsmount 433 o32 fspick sys_fspick 434 o32 pidfd_open sys_pidfd_open +# 436 o32 clone3 sys_clone3 +437 o32 epoll_create2 sys_epoll_create2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 5022b9e179c2..25374a2c4e16 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -431,3 +431,5 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index f2c3bda2d39f..9f48d8a737ab 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -516,3 +516,5 @@ 
432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 6ebacfeaf853..0cef54850a1c 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -437,3 +437,5 @@ 432 common fsmount sys_fsmount sys_fsmount 433 common fspick sys_fspick sys_fspick 434 common pidfd_open sys_pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 sys_epoll_create2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 834c9c7d79fa..42769d6ec8fe 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -437,3 +437,5 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index c58e71f21129..bf86fa32913d 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -480,3 +480,5 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open +# 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 6a99dbbf7e04..26d64f09b097 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -440,3 +440,4 @@ 433 i386 fspick sys_fspick __ia32_sys_fspick 434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open 436 i386 clone3 sys_clone3 __ia32_sys_clone3 +437 i386 epoll_create2 sys_epoll_create2 __ia32_sys_epoll_create2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl index 68227c7f71e5..4b58d8694693 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -357,6 +357,7 @@ 433 common fspick __x64_sys_fspick 434 common pidfd_open __x64_sys_pidfd_open 436 common clone3 __x64_sys_clone3/ptregs +437 common epoll_create2 __x64_sys_epoll_create2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index d5a1fd2c96c7..eef4367d433e 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -406,3 +406,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 436 common clone3 sys_clone3 +437 common epoll_create2 sys_epoll_create2 diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 34239564bdfb..c3379bb4209b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2825,6 +2825,11 @@ static int do_epoll_create(int flags, size_t size) return error; } +SYSCALL_DEFINE2(epoll_create2, int, flags, size_t, size) +{ + return do_epoll_create(flags, size); +} + SYSCALL_DEFINE1(epoll_create1, int, flags) { return do_epoll_create(flags, 0); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index b01d54a5732e..655ac0ebfdf9 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -357,6 +357,7 @@ asmlinkage long sys_eventfd2(unsigned int count, int flags); /* fs/eventpoll.c */ asmlinkage long sys_epoll_create1(int flags); +asmlinkage long sys_epoll_create2(int flags, size_t size); asmlinkage long sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event); asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 7c7c14a2e097..59c9dea64565 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -848,9 +848,11 @@ 
__SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 436
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_epoll_create2 437
+__SYSCALL(__NR_epoll_create2, sys_epoll_create2)
 #undef __NR_syscalls
-#define __NR_syscalls 437
+#define __NR_syscalls 438
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 34b76895b81e..090ff3aa7568 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -65,6 +65,7 @@ COND_SYSCALL(eventfd2);
 /* fs/eventfd.c */
 COND_SYSCALL(epoll_create1);
+COND_SYSCALL(epoll_create2);
 COND_SYSCALL(epoll_ctl);
 COND_SYSCALL(epoll_pwait);
 COND_SYSCALL_COMPAT(epoll_pwait);

From patchwork Mon Jun 24 14:41:51 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Roman Penyaev
X-Patchwork-Id: 11013539
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6BF71112C for ; Mon, 24 Jun 2019 14:42:18 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 59F3228BF3 for ; Mon, 24 Jun 2019 14:42:18 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 44CC628C0B; Mon, 24 Jun 2019 14:42:18 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EFFAE28BF3 for ; Mon, 24 Jun 2019 14:42:16 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729146AbfFXOmH (ORCPT ); Mon, 24 Jun 2019 10:42:07 -0400
Received: from mx2.suse.de ([195.135.220.15]:50414 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729088AbfFXOmG (ORCPT ); Mon, 24 Jun 2019 10:42:06 -0400
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id D1A7AAFAB; Mon, 24 Jun 2019 14:42:04 +0000 (UTC)
From: Roman Penyaev
Cc: Roman Penyaev , Andrew Morton , Al Viro , Linus Torvalds , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 14/14] kselftest: add uepoll-test which tests polling from userspace
Date: Mon, 24 Jun 2019 16:41:51 +0200
Message-Id: <20190624144151.22688-15-rpenyaev@suse.de>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190624144151.22688-1-rpenyaev@suse.de>
References: <20190624144151.22688-1-rpenyaev@suse.de>
MIME-Version: 1.0
To: unlisted-recipients:; (no To-header on input)
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Signed-off-by: Roman Penyaev
Cc: Andrew Morton
Cc: Al Viro
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 tools/testing/selftests/Makefile | 1 +
 tools/testing/selftests/uepoll/.gitignore | 1 +
 tools/testing/selftests/uepoll/Makefile | 16 +
 .../uepoll/atomic-builtins-support.c | 13 +
 tools/testing/selftests/uepoll/uepoll-test.c | 603 ++++++++++++++++++
 5 files changed, 634 insertions(+)
 create mode 100644 tools/testing/selftests/uepoll/.gitignore
 create mode 100644 tools/testing/selftests/uepoll/Makefile
 create mode 100644 tools/testing/selftests/uepoll/atomic-builtins-support.c
 create mode 100644 tools/testing/selftests/uepoll/uepoll-test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..ff87ac3400fe 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -52,6 +52,7 @@ TARGETS += timers
 endif
 TARGETS += tmpfs
 TARGETS += tpm2
+TARGETS += uepoll
 TARGETS += user
 TARGETS += vm
 TARGETS += x86
diff --git a/tools/testing/selftests/uepoll/.gitignore b/tools/testing/selftests/uepoll/.gitignore
new file mode 100644
index 000000000000..8eedec333023
--- /dev/null
+++ b/tools/testing/selftests/uepoll/.gitignore
@@ -0,0 +1 @@
+uepoll-test
diff --git a/tools/testing/selftests/uepoll/Makefile b/tools/testing/selftests/uepoll/Makefile
new file mode 100644
index 000000000000..cc1b2009197d
--- /dev/null
+++ b/tools/testing/selftests/uepoll/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CC := $(CROSS_COMPILE)gcc
+CFLAGS += -O2 -g -I../../../../usr/include/ -lnuma -lpthread
+
+BUILTIN_SUPPORT := $(shell $(CC) -o /dev/null ./atomic-builtins-support.c >/dev/null 2>&1; echo $$?)
+
+ifeq "$(BUILTIN_SUPPORT)" "0"
+    TEST_GEN_PROGS := uepoll-test
+else
+    $(warning WARNING:)
+    $(warning WARNING: uepoll compilation is skipped, gcc atomic builtins are not supported!)
+    $(warning WARNING:)
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/uepoll/atomic-builtins-support.c b/tools/testing/selftests/uepoll/atomic-builtins-support.c
new file mode 100644
index 000000000000..d9ded39ec497
--- /dev/null
+++ b/tools/testing/selftests/uepoll/atomic-builtins-support.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Just a test to check if gcc supports atomic builtins
+ */
+unsigned long long v, vv, vvv;
+
+int main(void)
+{
+	vv = __atomic_load_n(&v, __ATOMIC_ACQUIRE);
+	vvv = __atomic_exchange_n(&vv, 0, __ATOMIC_ACQUIRE);
+
+	return __atomic_add_fetch(&vvv, 1, __ATOMIC_RELAXED);
+}
diff --git a/tools/testing/selftests/uepoll/uepoll-test.c b/tools/testing/selftests/uepoll/uepoll-test.c
new file mode 100644
index 000000000000..1cefdcc1e25b
--- /dev/null
+++ b/tools/testing/selftests/uepoll/uepoll-test.c
@@ -0,0 +1,603 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * uepoll-test.c - Test cases for epoll_create2(), namely pollable
+ *                 epoll from userspace.
Copyright (c) 2019 Roman Penyaev
+ */
+
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+
+#include
+#include
+
+#define BUILD_BUG_ON(condition) ((void)sizeof(char [1 - 2*!!(condition)]))
+#define READ_ONCE(v) (*(volatile typeof(v)*)&(v))
+
+#define ITERS 1000000ull
+
+/*
+ * Add main epoll functions manually, because sys/epoll.h conflicts
+ * with linux/eventpoll.h.
+ */
+extern int epoll_create1(int __flags);
+extern int epoll_ctl(int __epfd, int __op, int __fd,
+		     struct epoll_event *__event);
+extern int epoll_wait(int __epfd, struct epoll_event *__events,
+		      int __maxevents, int __timeout);
+
+static inline long epoll_create2(int flags, size_t size)
+{
+	return syscall(__NR_epoll_create2, flags, size);
+}
+
+struct thread_ctx {
+	pthread_t thread;
+	int efd;
+};
+
+struct cpu_map {
+	unsigned int nr;
+	unsigned int map[];
+};
+
+static volatile unsigned int thr_ready;
+static volatile unsigned int start;
+
+static inline unsigned int max_index_nr(struct epoll_uheader *header)
+{
+	return header->index_length >> 2;
+}
+
+static int is_cpu_online(int cpu)
+{
+	char buf[64];
+	char online;
+	FILE *f;
+	int rc;
+
+	snprintf(buf, sizeof(buf), "/sys/devices/system/cpu/cpu%d/online", cpu);
+	f = fopen(buf, "r");
+	if (!f)
+		return 1;
+
+	rc = fread(&online, 1, 1, f);
+	assert(rc == 1);
+	fclose(f);
+
+	return (char)online == '1';
+}
+
+static struct cpu_map *cpu_map__new(void)
+{
+	struct cpu_map *cpu;
+	struct bitmask *bm;
+
+	int i, bit, cpus_nr;
+
+	cpus_nr = numa_num_possible_cpus();
+	cpu = calloc(1, sizeof(*cpu) + sizeof(cpu->map[0]) * cpus_nr);
+	if (!cpu)
+		return NULL;
+
+	bm = numa_all_cpus_ptr;
+	assert(bm);
+
+	for (bit = 0, i = 0; bit < bm->size; bit++) {
+		if (numa_bitmask_isbitset(bm, bit) && is_cpu_online(bit)) {
+			cpu->map[i++] = bit;
+		}
+	}
+	cpu->nr = i;
+
+	return cpu;
+}
+
+static void
cpu_map__put(struct cpu_map *cpu)
+{
+	free(cpu);
+}
+
+static inline unsigned long long nsecs(void)
+{
+	struct timespec ts = {0, 0};
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ((unsigned long long)ts.tv_sec * 1000000000ull) + ts.tv_nsec;
+}
+
+static void *thread_work(void *arg)
+{
+	struct thread_ctx *ctx = arg;
+	uint64_t ucnt = 1;
+	unsigned int i;
+	int rc;
+
+	__atomic_add_fetch(&thr_ready, 1, __ATOMIC_RELAXED);
+
+	while (!start)
+		;
+
+	for (i = 0; i < ITERS; i++) {
+		rc = write(ctx->efd, &ucnt, sizeof(ucnt));
+		assert(rc == sizeof(ucnt));
+	}
+
+	return NULL;
+}
+
+static inline _Bool read_event(struct epoll_uheader *header,
+			       unsigned int *index, unsigned int idx,
+			       struct epoll_event *event)
+{
+	struct epoll_uitem *item;
+	unsigned int *item_idx_ptr;
+	unsigned int indeces_mask;
+
+	indeces_mask = max_index_nr(header) - 1;
+	if (indeces_mask & max_index_nr(header)) {
+		assert(0);
+		/* Should be pow2, corrupted header? */
+		return 0;
+	}
+
+	item_idx_ptr = &index[idx & indeces_mask];
+
+	/* Load index */
+	idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE);
+	if (idx >= header->max_items_nr) {
+		assert(0);
+		/* Corrupted index? */
+		return 0;
+	}
+
+	item = &header->items[idx];
+
+	/*
+	 * Fetch data first; if the event is cleared by the kernel, we drop
+	 * the data and return false.
+	 */
+	event->data = (__u64) item->data;
+	event->events = __atomic_exchange_n(&item->ready_events, 0,
+					    __ATOMIC_RELEASE);
+
+	return (event->events & ~EPOLLREMOVED);
+}
+
+static int uepoll_wait(struct epoll_uheader *header, unsigned int *index,
+		       int epfd, struct epoll_event *events, int maxevents)
+
+{
+	/*
+	 * Before entering kernel we do busy wait for ~1ms, naively assuming
+	 * each iteration costs 1 cycle, 1 ns.
+	 */
+	unsigned int spins = 1000000;
+	unsigned int tail;
+	int i;
+
+	assert(maxevents > 0);
+
+again:
+	/*
+	 * Cache the tail because we don't want to refetch it on each iteration
+	 * and then catch live event updates, i.e. we don't want the user
+	 * @events array to consist of events from the same fds.
+	 */
+	tail = READ_ONCE(header->tail);
+
+	if (header->head == tail) {
+		if (spins--)
+			/* Busy loop a bit */
+			goto again;
+
+		i = epoll_wait(epfd, NULL, 0, -1);
+		assert(i < 0);
+		if (errno != ESTALE)
+			return i;
+
+		tail = READ_ONCE(header->tail);
+		assert(header->head != tail);
+	}
+
+	for (i = 0; header->head != tail && i < maxevents; header->head++) {
+		if (read_event(header, index, header->head, &events[i]))
+			i++;
+		else
+			/* Event can't be removed under us */
+			assert(0);
+	}
+
+	return i;
+}
+
+static void uepoll_mmap(int epfd, struct epoll_uheader **_header,
+			unsigned int **_index)
+{
+	struct epoll_uheader *header;
+	unsigned int *index, len;
+
+	BUILD_BUG_ON(sizeof(*header) != EPOLL_USERPOLL_HEADER_SIZE);
+	BUILD_BUG_ON(sizeof(header->items[0]) != 16);
+
+	len = sysconf(_SC_PAGESIZE);
+again:
+	header = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, epfd, 0);
+	if (header == MAP_FAILED)
+		ksft_exit_fail_msg("Failed mmap(header)\n");
+
+	if (header->header_length != len) {
+		unsigned int tmp_len = len;
+
+		len = header->header_length;
+		munmap(header, tmp_len);
+		goto again;
+	}
+	assert(header->magic == EPOLL_USERPOLL_HEADER_MAGIC);
+
+	index = mmap(NULL, header->index_length, PROT_WRITE|PROT_READ,
+		     MAP_SHARED, epfd, header->header_length);
+	if (index == MAP_FAILED)
+		ksft_exit_fail_msg("Failed mmap(index)\n");
+
+	*_header = header;
+	*_index = index;
+}
+
+static void uepoll_munmap(struct epoll_uheader *header,
+			  unsigned int *index)
+{
+	int rc;
+
+	rc = munmap(index, header->index_length);
+	if (rc)
+		ksft_exit_fail_msg("Failed munmap(index)\n");
+
+	rc = munmap(header, header->header_length);
+	if (rc)
+		ksft_exit_fail_msg("Failed munmap(header)\n");
+}
+
+static int do_bench(struct cpu_map *cpu, unsigned int nthreads)
+{
+	struct epoll_event ev, events[nthreads];
+	struct thread_ctx threads[nthreads];
+	pthread_attr_t thrattr;
+	struct thread_ctx *ctx;
+	int rc, epfd,
nfds;
+	cpu_set_t cpuset;
+	unsigned int i;
+
+	struct epoll_uheader *header;
+	unsigned int *index;
+
+	unsigned long long epoll_calls = 0, epoll_nsecs;
+	unsigned long long ucnt, ucnt_sum = 0, eagains = 0;
+
+	thr_ready = 0;
+	start = 0;
+
+	epfd = epoll_create2(EPOLL_USERPOLL, nthreads);
+	if (epfd < 0)
+		ksft_exit_fail_msg("Failed epoll_create2()\n");
+
+	for (i = 0; i < nthreads; i++) {
+		ctx = &threads[i];
+
+		ctx->efd = eventfd(0, EFD_NONBLOCK);
+		if (ctx->efd < 0)
+			ksft_exit_fail_msg("Failed eventfd()\n");
+
+		ev.events = EPOLLIN | EPOLLET;
+		ev.data = (uintptr_t) ctx;
+		rc = epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->efd, &ev);
+		if (rc)
+			ksft_exit_fail_msg("Failed epoll_ctl()\n");
+
+		CPU_ZERO(&cpuset);
+		CPU_SET(cpu->map[i % cpu->nr], &cpuset);
+
+		pthread_attr_init(&thrattr);
+		rc = pthread_attr_setaffinity_np(&thrattr, sizeof(cpu_set_t),
+						 &cpuset);
+		if (rc) {
+			errno = rc;
+			ksft_exit_fail_msg("Failed pthread_attr_setaffinity_np()\n");
+		}
+
+		rc = pthread_create(&ctx->thread, &thrattr, thread_work, ctx);
+		if (rc) {
+			errno = rc;
+			ksft_exit_fail_msg("Failed pthread_create()\n");
+		}
+	}
+
+	/* Mmap all pointers */
+	uepoll_mmap(epfd, &header, &index);
+
+	while (thr_ready != nthreads)
+		;
+
+	/* Signal start for all threads */
+	start = 1;
+
+	epoll_nsecs = nsecs();
+	while (1) {
+		nfds = uepoll_wait(header, index, epfd, events, nthreads);
+		if (nfds < 0)
+			ksft_exit_fail_msg("Failed uepoll_wait()\n");
+
+		epoll_calls++;
+
+		for (i = 0; i < (unsigned int)nfds; ++i) {
+			ctx = (void *)(uintptr_t) events[i].data;
+			rc = read(ctx->efd, &ucnt, sizeof(ucnt));
+			if (rc < 0) {
+				assert(errno == EAGAIN);
+				continue;
+			}
+			assert(rc == sizeof(ucnt));
+			ucnt_sum += ucnt;
+			if (ucnt_sum == nthreads * ITERS)
+				goto end;
+		}
+	}
+end:
+	epoll_nsecs = nsecs() - epoll_nsecs;
+
+	for (i = 0; i < nthreads; i++) {
+		ctx = &threads[i];
+		pthread_join(ctx->thread, NULL);
+	}
+	uepoll_munmap(header, index);
+	close(epfd);
+
+	ksft_print_msg("%7d %8lld %8lld\n",
+		       nthreads,
+		       ITERS*nthreads/(epoll_nsecs/1000/1000),
+		       epoll_nsecs/1000/1000);
+
+	return 0;
+}
+
+/**
+ * uepoll loop
+ */
+TEST(uepoll_basics)
+{
+	unsigned int i, nthreads_arr[] = {8, 16, 32, 64};
+	struct cpu_map *cpu;
+
+	cpu = cpu_map__new();
+	if (!cpu) {
+		errno = ENOMEM;
+		ksft_exit_fail_msg("Failed cpu_map__new()\n");
+	}
+
+	ksft_print_msg("threads events/ms run-time ms\n");
+	for (i = 0; i < ARRAY_SIZE(nthreads_arr); i++)
+		do_bench(cpu, nthreads_arr[i]);
+
+	cpu_map__put(cpu);
+}
+
+/**
+ * Checks different flags and args
+ */
+TEST(uepoll_args)
+{
+	struct epoll_event ev;
+	int epfd, evfd, rc;
+
+	/* Fail */
+	epfd = epoll_create2(EPOLL_USERPOLL, (1<<16)+1);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(epfd, -1);
+
+	/* Fail */
+	epfd = epoll_create2(EPOLL_USERPOLL, 0);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(epfd, -1);
+
+	/* Success */
+	epfd = epoll_create2(EPOLL_USERPOLL, (1<<16));
+	ASSERT_GE(epfd, 0);
+
+	/* Success */
+	evfd = eventfd(0, EFD_NONBLOCK);
+	ASSERT_GE(evfd, 0);
+
+	/* Fail, expect EPOLLET */
+	ev.events = EPOLLIN;
+	rc = epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(rc, -1);
+
+	/* Fail, no support for EPOLLEXCLUSIVE */
+	ev.events = EPOLLIN | EPOLLET | EPOLLEXCLUSIVE;
+	rc = epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(rc, -1);
+
+	/* Success */
+	ev.events = EPOLLIN | EPOLLET;
+	rc = epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
+	ASSERT_EQ(rc, 0);
+
+	/* Fail, expect events and maxevents as zeroes */
+	rc = epoll_wait(epfd, &ev, 1, -1);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(rc, -1);
+
+	/* Fail, expect events as zero */
+	rc = epoll_wait(epfd, &ev, 0, -1);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(rc, -1);
+
+	/* Fail, expect maxevents as zero */
+	rc = epoll_wait(epfd, NULL, 1, -1);
+	ASSERT_EQ(errno, EINVAL);
+	ASSERT_EQ(rc, -1);
+
+	/* Success */
+	rc = epoll_wait(epfd, NULL, 0, 0);
+	ASSERT_EQ(rc, 0);
+
+	close(epfd);
+	close(evfd);
+}
+
+static void *signal_eventfd_work(void *arg)
+{
+	unsigned long long cnt;
+
+	int rc, evfd = *(int *)arg;
+
+	sleep(1);
+	cnt = 1;
+	rc = write(evfd, &cnt, sizeof(cnt));
+	assert(rc == 8);
+
+	return NULL;
+}
+
+/**
+ * Nested poll
+ */
+TEST(uepoll_poll)
+{
+	int epfd, uepfd, evfd, rc;
+
+	struct epoll_event ev;
+	pthread_t thread;
+
+	/* Success */
+	uepfd = epoll_create2(EPOLL_USERPOLL, 128);
+	ASSERT_GE(uepfd, 0);
+
+	/* Success */
+	epfd = epoll_create2(0, 0);
+	ASSERT_GE(epfd, 0);
+
+	/* Success */
+	evfd = eventfd(0, EFD_NONBLOCK);
+	ASSERT_GE(evfd, 0);
+
+	/* Success */
+	ev.events = EPOLLIN | EPOLLET;
+	rc = epoll_ctl(uepfd, EPOLL_CTL_ADD, evfd, &ev);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	ev.events = EPOLLIN | EPOLLET;
+	rc = epoll_ctl(epfd, EPOLL_CTL_ADD, uepfd, &ev);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = epoll_wait(epfd, &ev, 1, 0);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = pthread_create(&thread, NULL, signal_eventfd_work, &evfd);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = epoll_wait(epfd, &ev, 1, 5000);
+	ASSERT_EQ(rc, 1);
+
+	close(uepfd);
+	close(epfd);
+	close(evfd);
+}
+
+/**
+ * One shot
+ */
+TEST(uepoll_epolloneshot)
+{
+	int epfd, evfd, rc;
+	unsigned long long cnt;
+
+	struct epoll_uheader *header;
+	unsigned int *index;
+
+	struct epoll_event ev;
+	pthread_t thread;
+
+	/* Success */
+	epfd = epoll_create2(EPOLL_USERPOLL, 128);
+	ASSERT_GE(epfd, 0);
+
+	/* Mmap all pointers */
+	uepoll_mmap(epfd, &header, &index);
+
+	/* Success */
+	evfd = eventfd(0, EFD_NONBLOCK);
+	ASSERT_GE(evfd, 0);
+
+	/* Success */
+	ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
+	rc = epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = pthread_create(&thread, NULL, signal_eventfd_work, &evfd);
+	ASSERT_EQ(rc, 0);
+
+	/* Fail, expect -ESTALE */
+	rc = epoll_wait(epfd, NULL, 0, 3000);
+	ASSERT_EQ(errno, ESTALE);
+	ASSERT_EQ(rc, -1);
+
+	/* Success */
+	rc = uepoll_wait(header, index, epfd, &ev, 1);
+	ASSERT_EQ(rc, 1);
+
+	/* Success */
+	rc = pthread_create(&thread, NULL, signal_eventfd_work, &evfd);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = epoll_wait(epfd, NULL, 0, 3000);
+	ASSERT_EQ(rc, 0);
+
+
+	/* Success */
+	ev.events = EPOLLIN | EPOLLET;
+	rc = epoll_ctl(epfd, EPOLL_CTL_MOD, evfd, &ev);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = pthread_create(&thread, NULL, signal_eventfd_work, &evfd);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = uepoll_wait(header, index, epfd, &ev, 1);
+	ASSERT_EQ(rc, 1);
+
+	/* Success */
+	rc = pthread_create(&thread, NULL, signal_eventfd_work, &evfd);
+	ASSERT_EQ(rc, 0);
+
+	/* Success */
+	rc = uepoll_wait(header, index, epfd, &ev, 1);
+	ASSERT_EQ(rc, 1);
+
+	uepoll_munmap(header, index);
+	close(epfd);
+	close(evfd);
+}
+
+TEST_HARNESS_MAIN
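
Note for readers of the test above: read_event() and uepoll_wait() consume a ring whose index array has a power-of-two length, masking the running head counter to pick a slot and then claiming the item's ready events with an atomic exchange. The sketch below reproduces only that pattern in isolation, so it can be compiled without the patched kernel headers. Everything named demo_* or DEMO_* is hypothetical and invented for illustration; these are not the kernel's struct epoll_uheader or struct epoll_uitem, and the producer side here is a single-threaded stand-in for what the kernel does, under the assumption that it publishes the index slot before advancing the tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the kernel's uepoll structures. */
#define DEMO_INDEX_NR 8u		/* assumed ring size, must be a power of two */

struct demo_item {
	uint32_t ready_events;		/* event bits set by the producer */
	uint64_t data;			/* user cookie attached to the item */
};

struct demo_ring {
	uint32_t head;			/* consumer position, only grows */
	uint32_t tail;			/* producer position, only grows */
	uint32_t index[DEMO_INDEX_NR];	/* ring slot -> item number */
	struct demo_item items[DEMO_INDEX_NR];
};

/* True when n is a non-zero power of two, so (n - 1) is a valid mask. */
static bool is_pow2(uint32_t n)
{
	return n && !(n & (n - 1));
}

/* Producer stand-in: mark the item ready, publish its slot, advance tail. */
static void demo_publish(struct demo_ring *r, uint32_t item, uint32_t events)
{
	uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);

	assert(item < DEMO_INDEX_NR);
	__atomic_fetch_or(&r->items[item].ready_events, events,
			  __ATOMIC_RELAXED);
	__atomic_store_n(&r->index[tail & (DEMO_INDEX_NR - 1)], item,
			 __ATOMIC_RELEASE);
	__atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
}

/* Consumer: returns true and fills *events and *data if an event was claimed. */
static bool demo_consume(struct demo_ring *r, uint32_t *events, uint64_t *data)
{
	uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
	uint32_t idx;

	if (r->head == tail)
		return false;		/* ring is empty */

	/* Mask the ever-growing head counter down to a slot number. */
	idx = __atomic_load_n(&r->index[r->head & (DEMO_INDEX_NR - 1)],
			      __ATOMIC_ACQUIRE);
	r->head++;
	if (idx >= DEMO_INDEX_NR)
		return false;		/* corrupted index */

	/* Fetch the data first, then claim the events in a single exchange. */
	*data = r->items[idx].data;
	*events = __atomic_exchange_n(&r->items[idx].ready_events, 0,
				      __ATOMIC_ACQ_REL);
	return *events != 0;
}
```

To try it, zero-initialise a struct demo_ring, set items[n].data, call demo_publish(&ring, n, bits), and then demo_consume() until it returns false. The exchange leaves ready_events at zero, so the same event cannot be claimed a second time.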