
[0/8] simplify ep_poll

Message ID 20201106231635.3528496-1-soheil.kdev@gmail.com

Message

Soheil Hassas Yeganeh Nov. 6, 2020, 11:16 p.m. UTC
From: Soheil Hassas Yeganeh <soheil@google.com>

This patch series is a follow-up based on the suggestions and feedback from Linus:
https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

The first patch in the series is a fix for the epoll race in the
presence of timeouts, so that it can be cleanly backported to all
affected stable kernels.

The rest of the patch series simplifies the ep_poll() implementation.
Some of these simplifications also result in minor performance
enhancements.  We have run these changes through selftests and internal
benchmarks for a few days, and they show minor (1-2%) performance
improvements.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Soheil Hassas Yeganeh (8):
  epoll: check for events when removing a timed out thread from the wait
    queue
  epoll: simplify signal handling
  epoll: pull fatal signal checks into ep_send_events()
  epoll: move eavail next to the list_empty_careful check
  epoll: simplify and optimize busy loop logic
  epoll: pull all code between fetch_events and send_event into the loop
  epoll: replace gotos with a proper loop
  epoll: eliminate unnecessary lock for zero timeout

 fs/eventpoll.c | 159 +++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 79 deletions(-)
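
As a rough illustration of the kind of measurement behind the 1-2% numbers
above, here is a minimal userspace sketch that hammers the zero-timeout path
touched by "epoll: eliminate unnecessary lock for zero timeout".  The eventfd
setup and iteration count are arbitrary choices for illustration, not the
internal benchmarks mentioned above:

/* Build with: gcc -O2 -o epoll_zero_timeout epoll_zero_timeout.c */
#include <stdio.h>
#include <time.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out[16];
	int epfd = epoll_create1(0);
	int efd = eventfd(0, EFD_NONBLOCK);
	long i, iters = 10 * 1000 * 1000;
	struct timespec t0, t1;

	if (epfd < 0 || efd < 0 ||
	    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0) {
		perror("setup");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		/* timeout == 0: check the ready list without sleeping */
		if (epoll_wait(epfd, out, 16, 0) < 0) {
			perror("epoll_wait");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per zero-timeout epoll_wait()\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}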

Comments

Linus Torvalds Nov. 6, 2020, 11:35 p.m. UTC | #1
On Fri, Nov 6, 2020 at 3:17 PM Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
>
> The first patch in the series is a fix for the epoll race in
> presence of timeouts, so that it can be cleanly backported to all
> affected stable kernels.
>
> The rest of the patch series simplify the ep_poll() implementation.
> Some of these simplifications result in minor performance enhancements
> as well.  We have kept these changes under self tests and internal
> benchmarks for a few days, and there are minor (1-2%) performance
> enhancements as a result.

From just looking at the patches (not the end result - I didn't
actually apply them), it looks sane to me.

             Linus
Andrew Morton Nov. 8, 2020, 1:43 a.m. UTC | #2
On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> This patch series is a follow up based on the suggestions and feedback by Linus:
> https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

Al Viro has been playing in here as well - see the below, from
linux-next.

I think I'll leave it to you folks to sort this out, please.


commit 57804b1cc4616780c72a2d0930d1bd0d5bd3ed4c
Author:     Al Viro <viro@zeniv.linux.org.uk>
AuthorDate: Mon Aug 31 13:41:30 2020 -0400
Commit:     Al Viro <viro@zeniv.linux.org.uk>
CommitDate: Sun Oct 25 20:02:01 2020 -0400

    lift locking/unlocking ep->mtx out of ep_{start,done}_scan()
    
    get rid of depth/ep_locked arguments there and document
    the kludge in ep_item_poll() that has lead to ep_locked existence in
    the first place
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ac996b959e94..f9c567af1f5f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -554,20 +554,13 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
 	rcu_read_unlock();
 }
 
-static void ep_start_scan(struct eventpoll *ep,
-			  int depth, bool ep_locked,
-			  struct list_head *txlist)
-{
-	lockdep_assert_irqs_enabled();
-
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() and epoll_ctl().
-	 */
-
-	if (!ep_locked)
-		mutex_lock_nested(&ep->mtx, depth);
 
+/*
+ * ep->mutex needs to be held because we could be hit by
+ * eventpoll_release_file() and epoll_ctl().
+ */
+static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
+{
 	/*
 	 * Steal the ready list, and re-init the original one to the
 	 * empty list. Also, set ep->ovflist to NULL so that events
@@ -576,6 +569,7 @@ static void ep_start_scan(struct eventpoll *ep,
 	 * because we want the "sproc" callback to be able to do it
 	 * in a lockless way.
 	 */
+	lockdep_assert_irqs_enabled();
 	write_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
@@ -583,7 +577,6 @@ static void ep_start_scan(struct eventpoll *ep,
 }
 
 static void ep_done_scan(struct eventpoll *ep,
-			 int depth, bool ep_locked,
 			 struct list_head *txlist)
 {
 	struct epitem *epi, *nepi;
@@ -624,9 +617,6 @@ static void ep_done_scan(struct eventpoll *ep,
 	list_splice(txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);
-
-	if (!ep_locked)
-		mutex_unlock(&ep->mtx);
 }
 
 static void epi_rcu_free(struct rcu_head *head)
@@ -763,11 +753,16 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 
 	ep = epi->ffd.file->private_data;
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
-	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
 
-	ep_start_scan(ep, depth, locked, &txlist);
+	// kludge: ep_insert() calls us with ep->mtx already locked
+	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
+	if (!locked)
+		mutex_lock_nested(&ep->mtx, depth);
+	ep_start_scan(ep, &txlist);
 	res = ep_read_events_proc(ep, &txlist, depth + 1);
-	ep_done_scan(ep, depth, locked, &txlist);
+	ep_done_scan(ep, &txlist);
+	if (!locked)
+		mutex_unlock(&ep->mtx);
 	return res & epi->event.events;
 }
 
@@ -809,9 +804,11 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	 * Proceed to find out if wanted events are really available inside
 	 * the ready list.
 	 */
-	ep_start_scan(ep, 0, false, &txlist);
+	mutex_lock(&ep->mtx);
+	ep_start_scan(ep, &txlist);
 	res = ep_read_events_proc(ep, &txlist, 1);
-	ep_done_scan(ep, 0, false, &txlist);
+	ep_done_scan(ep, &txlist);
+	mutex_unlock(&ep->mtx);
 	return res;
 }
 
@@ -1573,15 +1570,13 @@ static int ep_send_events(struct eventpoll *ep,
 
 	init_poll_funcptr(&pt, NULL);
 
-	ep_start_scan(ep, 0, false, &txlist);
+	mutex_lock(&ep->mtx);
+	ep_start_scan(ep, &txlist);
 
 	/*
 	 * We can loop without lock because we are passed a task private list.
-	 * Items cannot vanish during the loop because ep_scan_ready_list() is
-	 * holding "mtx" during this call.
+	 * Items cannot vanish during the loop we are holding ep->mtx.
 	 */
-	lockdep_assert_held(&ep->mtx);
-
 	list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
 		struct wakeup_source *ws;
 		__poll_t revents;
@@ -1609,9 +1604,8 @@ static int ep_send_events(struct eventpoll *ep,
 
 		/*
 		 * If the event mask intersect the caller-requested one,
-		 * deliver the event to userspace. Again, ep_scan_ready_list()
-		 * is holding ep->mtx, so no operations coming from userspace
-		 * can change the item.
+		 * deliver the event to userspace. Again, we are holding ep->mtx,
+		 * so no operations coming from userspace can change the item.
 		 */
 		revents = ep_item_poll(epi, &pt, 1);
 		if (!revents)
@@ -1645,7 +1639,8 @@ static int ep_send_events(struct eventpoll *ep,
 			ep_pm_stay_awake(epi);
 		}
 	}
-	ep_done_scan(ep, 0, false, &txlist);
+	ep_done_scan(ep, &txlist);
+	mutex_unlock(&ep->mtx);
 
 	return res;
 }
Soheil Hassas Yeganeh Nov. 8, 2020, 4:45 a.m. UTC | #3
On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
>
> > From: Soheil Hassas Yeganeh <soheil@google.com>
> >
> > This patch series is a follow up based on the suggestions and feedback by Linus:
> > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
>
> Al Viro has been playing in here as well - see the below, from
> linux-next.
>
> I think I'll leave it to you folks to sort this out, please.

Thank you, Andrew, for pointing that out!  Sorry that I didn't notice Al
Viro's nice cleanups.

The changes are all orthogonal and apply cleanly, except for "epoll: pull
fatal signal checks into ep_send_events()".  The conflict there is trivial,
and the following patch should apply cleanly to linux-next/master (I
didn't move the initialization of `res = 0` after the new branch, to
keep it simple).

FWIW, I also stress-tested the patch series applied on top of
linux-next/master for a couple of hours.

Could you please let me know whether I should send a V2 of the patch
series for linux-next? Thanks!

Subject: [PATCH linux-next] epoll: pull fatal signal checks into ep_send_events()

To simplify the code, pull the fatal-signal check into
ep_send_events().  ep_send_events() is called only from ep_poll().

Note that, previously, we always checked for fatal signals, whereas now
the check happens only if eavail is true.  This should be fine because
the goal of that check is to quickly return from epoll_wait() when
there is a pending fatal signal.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Change-Id: I68bbaf02db564e64815bcd8ed3c1cd272cfe832f
---
 fs/eventpoll.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06fb0de4bcc7..42f6bfc5f24e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1625,6 +1625,14 @@ static int ep_send_events(struct eventpoll *ep,
        poll_table pt;
        int res = 0;

+       /*
+        * Always short-circuit for fatal signals to allow threads to make a
+        * timely exit without the chance of finding more events available and
+        * fetching repeatedly.
+        */
+       if (fatal_signal_pending(current))
+               return -EINTR;
+
        init_poll_funcptr(&pt, NULL);

        mutex_lock(&ep->mtx);
@@ -1846,15 +1854,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
        }

 send_events:
-       if (fatal_signal_pending(current)) {
-               /*
-                * Always short-circuit for fatal signals to allow
-                * threads to make a timely exit without the chance of
-                * finding more events available and fetching
-                * repeatedly.
-                */
-               return -EINTR;
-       }
        /*
         * Try to transfer events to user space. In case we get 0 events and
         * there's still timeout left over, we go trying again in search of

--
2.29.2.222.g5d2a92d10f8-goog
Davidlohr Bueso Nov. 9, 2020, 6:59 p.m. UTC | #4
On Sat, 07 Nov 2020, Soheil Hassas Yeganeh wrote:
>FWIW, I also stress-tested the patch series applied on top of
>linux-next/master for a couple of hours.

Out of curiosity, what exactly did you use for testing?

Thanks,
Davidlohr
Soheil Hassas Yeganeh Nov. 9, 2020, 7:28 p.m. UTC | #5
On Mon, Nov 9, 2020 at 2:22 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Sat, 07 Nov 2020, Soheil Hassas Yeganeh wrote:
> >FWIW, I also stress-tested the patch series applied on top of
> >linux-next/master for a couple of hours.
>
> Out of curiosity, what exactly did you use for testing?

That's a good question. I ran two flavors of tests:
- The epoll_wakeup_test selftest in a tight loop on two different
CPUs. This includes the newly added epoll61 test case for the timeout
race.
- Internal production loadtests (i.e., multi-node benchmarks).
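
For illustration only, here is a rough userspace sketch of that kind of
tight-loop stress.  The real coverage comes from epoll_wakeup_test (including
epoll61) under tools/testing/selftests/filesystems/epoll; the eventfd setup,
1ms timeout, and thread count below are made-up approximations of the race
being exercised, not the selftest itself:

/* Build with: gcc -O2 -pthread -o epoll_timeout_stress epoll_timeout_stress.c */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

static int epfd, efd;

static void *waiter(void *arg)
{
	struct epoll_event ev;
	uint64_t v;

	(void)arg;
	for (;;) {
		/* short timeout so that expiry races against the wakeups below */
		if (epoll_wait(epfd, &ev, 1, 1) > 0)
			read(efd, &v, sizeof(v));	/* may race; errors ignored */
	}
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t t1, t2;
	uint64_t one = 1;

	epfd = epoll_create1(0);
	efd = eventfd(0, EFD_NONBLOCK);
	epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

	/* two waiters, mirroring "a tight loop on two different CPUs" */
	pthread_create(&t1, NULL, waiter, NULL);
	pthread_create(&t2, NULL, waiter, NULL);

	for (;;) {
		write(efd, &one, sizeof(one));	/* fire a wakeup */
		usleep(500);			/* roughly half the wait timeout */
	}
}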

Thanks,
Soheil

> Thanks,
> Davidlohr
Andrew Morton Nov. 10, 2020, 10:05 p.m. UTC | #6
On Sat, 7 Nov 2020 23:45:30 -0500 Soheil Hassas Yeganeh <soheil@google.com> wrote:

> On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
> >
> > > From: Soheil Hassas Yeganeh <soheil@google.com>
> > >
> > > This patch series is a follow up based on the suggestions and feedback by Linus:
> > > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
> >
> > Al Viro has been playing in here as well - see the below, from
> > linux-next.
> >
> > I think I'll leave it to you folks to sort this out, please.
> 
> Thank you Andrew for pointing that out!  Sorry that I didn't notice Al
> Viro's nice clean ups.
> 
> The changes are all orthogonal and apply cleanly except "epoll: pull
> fatal signal checks into ep_send_events()".   The conflict is trivial
> and the following patch should cleanly apply to linux-next/master (I
> didn't move the initialization of `res = 0` after the new branch to
> keep it simple).
> 
> FWIW, I also stress-tested the patch series applied on top of
> linux-next/master for a couple of hours.
> 
> Could you please let me know whether I should send a V2 of the patch
> series for linux-next? Thanks!

That worked, thanks.  I'll include all this in the next drop for
linux-next.
Soheil Hassas Yeganeh Nov. 11, 2020, 3:37 a.m. UTC | #7
On Tue, Nov 10, 2020 at 5:05 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 7 Nov 2020 23:45:30 -0500 Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> > On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
> > >
> > > > From: Soheil Hassas Yeganeh <soheil@google.com>
> > > >
> > > > This patch series is a follow up based on the suggestions and feedback by Linus:
> > > > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
> > >
> > > Al Viro has been playing in here as well - see the below, from
> > > linux-next.
> > >
> > > I think I'll leave it to you folks to sort this out, please.
> >
> > Thank you Andrew for pointing that out!  Sorry that I didn't notice Al
> > Viro's nice clean ups.
> >
> > The changes are all orthogonal and apply cleanly except "epoll: pull
> > fatal signal checks into ep_send_events()".   The conflict is trivial
> > and the following patch should cleanly apply to linux-next/master (I
> > didn't move the initialization of `res = 0` after the new branch to
> > keep it simple).
> >
> > FWIW, I also stress-tested the patch series applied on top of
> > linux-next/master for a couple of hours.
> >
> > Could you please let me know whether I should send a V2 of the patch
> > series for linux-next? Thanks!
>
> That worked, thanks.  I'll include all this in the next drop for
> linux-next.

Awesome! Thanks very much, Andrew!