Message ID | 20190925015603.10939-1-r@hev.cc (mailing list archive)
---|---
State | New, archived
Series | [RESEND,v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote:

> From: Heiher <r@hev.cc>
>
> Take the case where we have:
>
>        t0
>         | (ew)
>        e0
>         | (et)
>        e1
>         | (lt)
>        s0
>
> t0: thread 0
> e0: epoll fd 0
> e1: epoll fd 1
> s0: socket fd 0
> ew: epoll_wait
> et: edge-trigger
> lt: level-trigger
>
> We only need to wakeup nested epoll fds if something has been queued to
> the overflow list, since the ep_poll() traverses the rdllist during
> recursive poll and thus events on the overflow list may not be visible
> yet.
>
> Test code:

Looks sane to me.  Do you have any performance testing results which
show a benefit?

epoll maintainership isn't exactly a hive of activity nowadays :(
Roman, would you please have time to review this?
On 2019-09-28 04:29, Andrew Morton wrote:
> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote:
> [...]
>
> Looks sane to me.  Do you have any performance testing results which
> show a benefit?
>
> epoll maintainership isn't exactly a hive of activity nowadays :(
> Roman, would you please have time to review this?

Yes, I can revisit this once more next week.

Heiher, mind preparing a patchset with your test suite and making it
part of kselftest?  I hope nobody has any objections.

--
Roman
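[Editor's note: a kselftest integration as requested above would follow the usual selftest layout; the sketch below assumes a hypothetical directory and program name, following the standard `lib.mk` / `TEST_GEN_PROGS` conventions.]

```make
# Hypothetical tools/testing/selftests/filesystems/epoll/Makefile
CFLAGS += -g -Wall
LDLIBS += -lpthread
TEST_GEN_PROGS := epoll_nested_wakeup_test

include ../../lib.mk
```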
On 2019-09-28 04:29, Andrew Morton wrote:
> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote:
> [...]
>
> epoll maintainership isn't exactly a hive of activity nowadays :(
> Roman, would you please have time to review this?

So here is my observation: the current patch does not fix the described
problem (double wakeup) for the case when the new event comes exactly
to the ->ovflist.  According to the patch this is the desired intention:

	/*
	 * We only need to wakeup nested epoll fds if something has been queued
	 * to the overflow list, since the ep_poll() traverses the rdllist
	 * during recursive poll and thus events on the overflow list may not be
	 * visible yet.
	 */
	if (nepi != NULL)
		pwake++;

	....

	if (pwake == 2)
		ep_poll_safewake(&ep->poll_wait);

but this actually means that we repeat the same behavior (double wakeup),
only now for the case when the event comes to the ->ovflist.

How to reproduce?  Can be easily done (ok, not so easy, but it is possible
to try): to the given userspace test we need to add one more socket and
immediately fire the event:

	e.events = EPOLLIN;
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
		goto out;

	/*
	 * Signal any fd to let epoll_wait() call ep_scan_ready_list()
	 * in order to "catch" it there and add the new event to ->ovflist.
	 */
	if (write(s2fd[1], "w", 1) != 1)
		goto out;

That is done in order to let the following epoll_wait() call invoke
ep_scan_ready_list(), where we can "catch" and insert the new event
exactly into the ->ovflist.  In order to insert the event into exactly
the right list I introduce an artificial delay.

The modified test and kernel patch are below.  Here is the output of the
testing tool with some debug lines from the kernel:

	# ~/devel/test/edge-bug
	[   59.263178] ### sleep 2
	>> write to sock
	[   61.318243] ### done sleep
	[   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
	[   61.321204] ### sleep 2
	[   63.398325] ### done sleep
	error: What?! Again?!

The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
(see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
wanted to achieve, so eventually ep_poll_safewake() is called again,
which leads to the double wakeup.

In my opinion the current patch as it is should be dropped; it does not
fix the described problem but just hides it.

--
Roman

######### USERSPACE ##########

#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <stdio.h>
#include <pthread.h>

static void *do_thread(void *arg)
{
	int s = *(int *)arg;

	sleep(1);
	printf(">> write to sock\n");
	write(s, "w", 1);

	return NULL;
}

int main(int argc, char *argv[])
{
	int s1fd[2];
	int s2fd[2];
	int efd[2];
	struct epoll_event e;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, s1fd) < 0)
		goto out;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, s2fd) < 0)
		goto out;

	efd[0] = epoll_create(1);
	if (efd[0] < 0)
		goto out;

	efd[1] = epoll_create(1);
	if (efd[1] < 0)
		goto out;

	e.events = EPOLLIN;
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s1fd[0], &e) < 0)
		goto out;

	e.events = EPOLLIN;
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
		goto out;

	e.events = EPOLLIN | EPOLLET;
	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
		goto out;

	/*
	 * Signal any fd to let epoll_wait() call ep_scan_ready_list()
	 * in order to "catch" it there and add the new event to ->ovflist.
	 */
	if (write(s2fd[1], "w", 1) != 1)
		goto out;

	pthread_t thr;
	pthread_create(&thr, NULL, do_thread, &s1fd[1]);

	if (epoll_wait(efd[0], &e, 1, 0) != 1)
		goto out;

	pthread_join(thr, NULL);

	if (epoll_wait(efd[0], &e, 1, 0) != 0) {
		printf("error: What?! Again?!\n");
		goto out;
	}

	return 0;

out:
	return -1;
}

######### KERNEL ##########

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8bc064630be0..edba7ab45083 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,8 @@
 #include <linux/rculist.h>
 #include <net/busy_poll.h>
 
+static bool is_send_events_call;
+
 /*
  * LOCKING:
  * There are three level of locking required by epoll :
@@ -672,6 +674,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	__poll_t res;
 	int pwake = 0;
 	struct epitem *epi, *nepi;
+	unsigned events_in_rdllist = 0;
+	unsigned events_in_ovflist = 0;
 	LIST_HEAD(txlist);
 
 	lockdep_assert_irqs_enabled();
@@ -693,23 +697,52 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * in a lockless way.
 	 */
 	write_lock_irq(&ep->lock);
+
+	/* XXX Count events */
+	if (!strcmp("edge-bug", current->comm) && depth) {
+		struct list_head *l;
+		list_for_each(l, &ep->rdllist)
+			events_in_rdllist++;
+	}
 	list_splice_init(&ep->rdllist, &txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
 	write_unlock_irq(&ep->lock);
 
+	if (!strcmp("edge-bug", current->comm) && depth && is_send_events_call) {
+		/*
+		 * XXX Introduce delay to let userspace fire event
+		 * XXX directly to ovflist.
+		 */
+		pr_err("### sleep 2\n");
+		msleep(2000);
+		pr_err("### done sleep\n");
+	}
+
 	/*
 	 * Now call the callback function.
 	 */
 	res = (*sproc)(ep, &txlist, priv);
 
 	write_lock_irq(&ep->lock);
+	nepi = READ_ONCE(ep->ovflist);
+	/*
+	 * We only need to wakeup nested epoll fds if something has been queued
+	 * to the overflow list, since the ep_poll() traverses the rdllist
+	 * during recursive poll and thus events on the overflow list may not be
+	 * visible yet.
+	 */
+	if (nepi != NULL)
+		pwake++;
 	/*
 	 * During the time we spent inside the "sproc" callback, some
 	 * other events might have been queued by the poll callback.
 	 * We re-insert them inside the main ready-list here.
 	 */
-	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL;
+	for (; (epi = nepi) != NULL;
 	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+		/* XXX Count events */
+		events_in_ovflist++;
 		/*
 		 * We need to check if the item is already in the list.
 		 * During the "sproc" callback execution time, items are
@@ -754,8 +787,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	mutex_unlock(&ep->mtx);
 
 	/* We have to call this outside the lock */
-	if (pwake)
+	if (pwake == 2) {
+		pr_err("!!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=%d, events_in_ovflist=%d\n",
+		       events_in_rdllist, events_in_ovflist);
 		ep_poll_safewake(&ep->poll_wait);
+	}
 
 	return res;
 }
@@ -1925,9 +1961,16 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 * there's still timeout left over, we go trying again in search of
 	 * more luck.
 	 */
+
+	/* XXX Catch only ep_scan_ready_list() called from here */
+	if (!strcmp("edge-bug", current->comm))
+		is_send_events_call = 1;
 	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
+	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out) {
+		is_send_events_call = 0;
 		goto fetch_events;
+	}
+	is_send_events_call = 0;
 
 	if (waiter) {
 		spin_lock_irq(&ep->wq.lock);
On 9/30/19 7:55 AM, Roman Penyaev wrote:
> On 2019-09-28 04:29, Andrew Morton wrote:
> [...]
>
> So here is my observation: current patch does not fix the described
> problem (double wakeup) for the case, when new event comes exactly
> to the ->ovflist.
> [...]
>
> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
> wanted to achieve, so eventually ep_poll_safewake() is called again
> which leads to double wakeup.
>
> In my opinion current patch as it is should be dropped, it does not
> fix the described problem but just hides it.

Yes, there are 2 wakeups in the test case you describe, but if the
second event (write to s1fd) gets queued after the first call to
epoll_wait(), we are going to get 2 wakeups anyway.  So yes, there may
be a slightly bigger window with this patch for 2 wakeups, but it's
small, and I tried to be conservative with the patch - I'd rather get an
occasional 2nd wakeup than miss one.  Trying to debug missing wakeups
isn't always fun...

That said, the reason for propagating events that end up on the overflow
list was to prevent the race of the wakee not seeing events because they
were still on the overflow list.  In the testcase, imagine if there was a
thread doing epoll_wait() on efd[0], and then a write happens on s1fd.
I thought it was possible then that a 2nd thread doing epoll_wait() on
efd[1] wakes up, checks efd[0] and sees no events, because they are
still potentially on the overflow list.  However, I think that case is
not possible, because the thread doing epoll_wait() on efd[0] is going to
hold the ep->mtx, and thus when the thread wakes up on efd[1], it's going
to have to be ordered, because it's also grabbing the ep->mtx associated
with efd[0].

So I think it's safe to do the following if we want to go further than
the proposed patch, which is what you suggested earlier in the thread
(minus keeping the wakeup on ep->wq).

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index c4159bc..61d653d1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 			      void *priv, int depth, bool ep_locked)
 {
 	__poll_t res;
-	int pwake = 0;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
 
@@ -741,23 +740,16 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 
 	if (!list_empty(&ep->rdllist)) {
 		/*
-		 * Wake up (if active) both the eventpoll wait list and
-		 * the ->poll() wait list (delayed after we release the lock).
+		 * Wake up (if active) the eventpoll wait list
 		 */
 		if (waitqueue_active(&ep->wq))
 			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
 	}
 	write_unlock_irq(&ep->lock);
 
 	if (!ep_locked)
 		mutex_unlock(&ep->mtx);
 
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
 	return res;
 }

Thanks,

-Jason
On 2019-10-03 18:13, Jason Baron wrote:
> On 9/30/19 7:55 AM, Roman Penyaev wrote:
> [...]

Hi Jason,

> Yes, there are 2 wakeups in the test case you describe, but if the
> second event (write to s1fd) gets queued after the first call to
> epoll_wait(), we are going to get 2 wakeups anyways.

Yes, exactly; for this reason I print out the number of events observed
on the first wait: there should be 1 (rdllist) and 1 (ovflist), otherwise
this is another case, when the second event comes exactly after the first
wait, which is a legitimate wakeup.

> So yes, there may be a slightly bigger window with this patch for 2
> wakeups, but its small and I tried to be conservative with the patch -
> I'd rather get an occasional 2nd wakeup then miss one.  Trying to
> debug missing wakeups isn't always fun...
> [...]
>
> So I think its safe to do the following if we want to go further than
> the proposed patch, which is what you suggested earlier in the thread
> (minus keeping the wakeup on ep->wq).

Then I do not understand why we need to keep the ep->wq wakeup?
@wq and @poll_wait are almost the same, with only one difference:
@wq is used when you sleep on it inside epoll_wait(), and the other
is used when you nest an epoll fd inside an epoll fd.  Either you wake
both up or you don't do it at all.

ep_poll_callback() does the wakeup explicitly, ep_insert() and ep_modify()
do the wakeup explicitly, so what are the cases when we need to do wakeups
from ep_scan_ready_list()?

I would still remove the whole branch:

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 			      void *priv, int depth, bool ep_locked)
 {
 	__poll_t res;
-	int pwake = 0;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
 
@@ -738,26 +737,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 */
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
-
-	if (!list_empty(&ep->rdllist)) {
-		/*
-		 * Wake up (if active) both the eventpoll wait list and
-		 * the ->poll() wait list (delayed after we release the lock).
-		 */
-		if (waitqueue_active(&ep->wq))
-			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
-	}
 	write_unlock_irq(&ep->lock);
 
 	if (!ep_locked)
 		mutex_unlock(&ep->mtx);
 
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
 	return res;
 }

--
Roman
On 10/7/19 6:54 AM, Roman Penyaev wrote:
> On 2019-10-03 18:13, Jason Baron wrote:
> [...]
>
> Then I do not understand why we need to keep ep->wq wakeup?
> @wq and @poll_wait are almost the same with only one difference:
> @wq is used when you sleep on it inside epoll_wait() and the other
> is used when you nest epoll fd inside epoll fd.  Either you wake
> both up or you don't at all.
>
> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
> do wakeup explicitly, so what are the cases when we need to do wakeups
> from ep_scan_ready_list()?

Hi Roman,

So the reason I was saying not to drop the ep->wq wakeup was that I was
thinking about a usecase where you have multiple threads, say thread A
and thread B, which are doing epoll_wait() on the same epfd.  Now, the
threads both call epoll_wait() and are added as exclusive to ep->wq.
Now a bunch of events happen and thread A is woken up.  However, thread
A may only process a subset of the events due to its 'maxevents'
parameter.  In that case, I was thinking that the wakeup on ep->wq might
be helpful, because in the absence of subsequent events, thread B can
now start processing the rest, instead of waiting for the next event to
be queued.

However, I was thinking about the state of things before:
86c0517 fs/epoll: deal with wait_queue only once

Before that patch, thread A would have been removed from ep->wq before
the wakeup call, thus waking up thread B.  However, now that thread A
stays on the queue during the call to ep_send_events(), I believe the
wakeup would only target thread A, which doesn't help since it's already
checking for events.  So given the state of things I think you are right
in that it's not needed.  However, I wonder if not removing from the
ep->wq affects the multi-threaded case I described.  It's been around
since 5.0, so probably not, but it would be a more subtle performance
difference.

Thanks,

-Jason
On 2019-10-07 18:42, Jason Baron wrote:
> On 10/7/19 6:54 AM, Roman Penyaev wrote:
> [...]
>
> So the reason I was saying not to drop the ep->wq wakeup was that I was
> thinking about a usecase where you have multi-threads say thread A and
> thread B which are doing epoll_wait() on the same epfd.  Now, the
> threads both call epoll_wait() and are added as exclusive to ep->wq.
> Now a bunch of events happen and thread A is woken up.  However, thread
> A may only process a subset of the events due to its 'maxevents'
> parameter.  In that case, I was thinking that the wakeup on ep->wq
> might be helpful, because in the absence of subsequent events, thread B
> can now start processing the rest, instead of waiting for the next
> event to be queued.
> [...]
>
> So given the state of things I think you are right in that its not
> needed.  However, I wonder if not removing from the ep->wq affects the
> multi-threaded case I described.  Its been around since 5.0, so
> probably not, but it would be a more subtle performance difference.

Now I understand what you mean.  You want to prevent "idling" of events
while thread A is busy with a small portion of the events (the portion
equal to maxevents).  On the next iteration thread A will pick up the
rest, no doubt, but it would be nice to give thread B a chance to deal
with the rest immediately.  Ok, makes sense.

But what if we make this wakeup explicit only when we have more events
to process?
(nothing is tested, just a guess) @@ -255,6 +255,7 @@ struct ep_pqueue { struct ep_send_events_data { int maxevents; struct epoll_event __user *events; + bool have_more; int res; }; @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head } static int ep_send_events(struct eventpoll *ep, - struct epoll_event __user *events, int maxevents) + struct epoll_event __user *events, int maxevents, + bool *have_more) { - struct ep_send_events_data esed; - - esed.maxevents = maxevents; - esed.events = events; + struct ep_send_events_data esed = { + .maxevents = maxevents, + .events = events, + }; ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false); + *have_more = esed.have_more; + return esed.res; } @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, { int res = 0, eavail, timed_out = 0; u64 slack = 0; - bool waiter = false; + bool waiter = false, have_more; wait_queue_entry_t wait; ktime_t expires, *to = NULL; @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, * more luck. */ if (!res && eavail && - !(res = ep_send_events(ep, events, maxevents)) && !timed_out) + !(res = ep_send_events(ep, events, maxevents, &have_more)) && + !timed_out) goto fetch_events; if (waiter) { @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, __remove_wait_queue(&ep->wq, &wait); spin_unlock_irq(&ep->wq.lock); } + /* + * We were not able to process all the events, so immediately + * wakeup other waiter. + */ + if (res > 0 && have_more && waitqueue_active(&ep->wq)) + wake_up(&ep->wq); return res; } PS. So what we decide with the original patch? Remove the whole branch? -- Roman
On 10/7/19 2:30 PM, Roman Penyaev wrote: > On 2019-10-07 18:42, Jason Baron wrote: >> On 10/7/19 6:54 AM, Roman Penyaev wrote: >>> On 2019-10-03 18:13, Jason Baron wrote: >>>> On 9/30/19 7:55 AM, Roman Penyaev wrote: >>>>> On 2019-09-28 04:29, Andrew Morton wrote: >>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote: >>>>>> >>>>>>> From: Heiher <r@hev.cc> >>>>>>> >>>>>>> Take the case where we have: >>>>>>> >>>>>>> t0 >>>>>>> | (ew) >>>>>>> e0 >>>>>>> | (et) >>>>>>> e1 >>>>>>> | (lt) >>>>>>> s0 >>>>>>> >>>>>>> t0: thread 0 >>>>>>> e0: epoll fd 0 >>>>>>> e1: epoll fd 1 >>>>>>> s0: socket fd 0 >>>>>>> ew: epoll_wait >>>>>>> et: edge-trigger >>>>>>> lt: level-trigger >>>>>>> >>>>>>> We only need to wakeup nested epoll fds if something has been queued >>>>>>> to the >>>>>>> overflow list, since the ep_poll() traverses the rdllist during >>>>>>> recursive poll >>>>>>> and thus events on the overflow list may not be visible yet. >>>>>>> >>>>>>> Test code: >>>>>> >>>>>> Look sane to me. Do you have any performance testing results which >>>>>> show a benefit? >>>>>> >>>>>> epoll maintainership isn't exactly a hive of activity nowadays :( >>>>>> Roman, would you please have time to review this? >>>>> >>>>> So here is my observation: current patch does not fix the described >>>>> problem (double wakeup) for the case, when new event comes exactly >>>>> to the ->ovflist. According to the patch this is the desired >>>>> intention: >>>>> >>>>> /* >>>>> * We only need to wakeup nested epoll fds if something has been >>>>> queued >>>>> * to the overflow list, since the ep_poll() traverses the rdllist >>>>> * during recursive poll and thus events on the overflow list may >>>>> not be >>>>> * visible yet. >>>>> */ >>>>> if (nepi != NULL) >>>>> pwake++; >>>>> >>>>> .... 
>>>>> >>>>> if (pwake == 2) >>>>> ep_poll_safewake(&ep->poll_wait); >>>>> >>>>> >>>>> but this actually means that we repeat the same behavior (double >>>>> wakeup) >>>>> but only for the case, when event comes to the ->ovflist. >>>>> >>>>> How to reproduce? Can be easily done (ok, not so easy but it is >>>>> possible >>>>> to try): to the given userspace test we need to add one more socket >>>>> and >>>>> immediately fire the event: >>>>> >>>>> e.events = EPOLLIN; >>>>> if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0) >>>>> goto out; >>>>> >>>>> /* >>>>> * Signal any fd to let epoll_wait() to call ep_scan_ready_list() >>>>> * in order to "catch" it there and add new event to ->ovflist. >>>>> */ >>>>> if (write(s2fd[1], "w", 1) != 1) >>>>> goto out; >>>>> >>>>> That is done in order to let the following epoll_wait() call to invoke >>>>> ep_scan_ready_list(), where we can "catch" and insert new event >>>>> exactly >>>>> to the ->ovflist. In order to insert event exactly in the correct list >>>>> I introduce artificial delay. >>>>> >>>>> Modified test and kernel patch is below. Here is the output of the >>>>> testing tool with some debug lines from kernel: >>>>> >>>>> # ~/devel/test/edge-bug >>>>> [ 59.263178] ### sleep 2 >>>>> >> write to sock >>>>> [ 61.318243] ### done sleep >>>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); >>>>> events_in_rdllist=1, events_in_ovflist=1 >>>>> [ 61.321204] ### sleep 2 >>>>> [ 63.398325] ### done sleep >>>>> error: What?! Again?! >>>>> >>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events >>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we >>>>> wanted to achieve, so eventually ep_poll_safewake() is called again >>>>> which leads to double wakeup. >>>>> >>>>> In my opinion current patch as it is should be dropped, it does not >>>>> fix the described problem but just hides it. 
>>>>> >>>>> -- >>> >>> Hi Jason, >>> >>>> >>>> Yes, there are 2 wakeups in the test case you describe, but if the >>>> second event (write to s1fd) gets queued after the first call to >>>> epoll_wait(), we are going to get 2 wakeups anyways. >>> >>> Yes, exactly, for this reason I print out the number of events observed >>> on first wait, there should be 1 (rdllist) and 1 (ovflist), otherwise >>> this is another case, when second event comes exactly after first >>> wait, which is legitimate wakeup. >>> >>>> So yes, there may >>>> be a slightly bigger window with this patch for 2 wakeups, but its >>>> small >>>> and I tried to be conservative with the patch - I'd rather get an >>>> occasional 2nd wakeup then miss one. Trying to debug missing wakeups >>>> isn't always fun... >>>> >>>> That said, the reason for propagating events that end up on the >>>> overflow >>>> list was to prevent the race of the wakee not seeing events because >>>> they >>>> were still on the overflow list. In the testcase, imagine if there >>>> was a >>>> thread doing epoll_wait() on efd[0], and then a write happends on s1fd. >>>> I thought it was possible then that a 2nd thread doing epoll_wait() on >>>> efd[1], wakes up, checks efd[0] and sees no events because they are >>>> still potentially on the overflow list. However, I think that case is >>>> not possible because the thread doing epoll_wait() on efd[0] is >>>> going to >>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its >>>> going >>>> to have to be ordered because its also grabbing the ep->mtx associated >>>> with efd[0]. >>>> >>>> So I think its safe to do the following if we want to go further than >>>> the proposed patch, which is what you suggested earlier in the thread >>>> (minus keeping the wakeup on ep->wq). >>> >>> Then I do not understand why we need to keep ep->wq wakeup? 
>>> @wq and @poll_wait are almost the same with only one difference: >>> @wq is used when you sleep on it inside epoll_wait() and the other >>> is used when you nest epoll fd inside epoll fd. Either you wake >>> both up either you don't this at all. >>> >>> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify() >>> do wakeup explicitly, so what are the cases when we need to do wakeups >>> from ep_scan_ready_list()? >> >> Hi Roman, >> >> So the reason I was saying not to drop the ep->wq wakeup was that I was >> thinking about a usecase where you have multi-threads say thread A and >> thread B which are doing epoll_wait() on the same epfd. Now, the threads >> both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch >> of events happen and thread A is woken up. However, thread A may only >> process a subset of the events due to its 'maxevents' parameter. In that >> case, I was thinking that the wakeup on ep->wq might be helpful, because >> in the absence of subsequent events, thread B can now start processing >> the rest, instead of waiting for the next event to be queued. >> >> However, I was thinking about the state of things before: >> 86c0517 fs/epoll: deal with wait_queue only once >> >> Before that patch, thread A would have been removed from eq->wq before >> the wakeup call, thus waking up thread B. However, now that thread A >> stays on the queue during the call to ep_send_events(), I believe the >> wakeup would only target thread A, which doesn't help since its already >> checking for events. So given the state of things I think you are right >> in that its not needed. However, I wonder if not removing from the >> ep->wq affects the multi-threaded case I described. Its been around >> since 5.0, so probably not, but it would be a more subtle performance >> difference. > > Now I understand what you mean. 
You want to prevent "idling" of events, > while thread A is busy with the small portion of events (portion is equal > to maxevents). On next iteration thread A will pick up the rest, no > doubts, > but would be nice to give a chance to thread B immediately to deal with the > rest. Ok, makes sense. Exactly, I don't believe its racy as is - but it seems like it would be good to wakeup other threads that may be waiting. That said, this logic was removed as I pointed out. So I'm not sure we need to tie this change to the current one - but it may be a nice addition. > > But what if to make this wakeup explicit if we have more events to process? > (nothing is tested, just a guess) > > @@ -255,6 +255,7 @@ struct ep_pqueue { > struct ep_send_events_data { > int maxevents; > struct epoll_event __user *events; > + bool have_more; > int res; > }; > @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct > eventpoll *ep, struct list_head *head > } > > static int ep_send_events(struct eventpoll *ep, > - struct epoll_event __user *events, int maxevents) > + struct epoll_event __user *events, int maxevents, > + bool *have_more) > { > - struct ep_send_events_data esed; > - > - esed.maxevents = maxevents; > - esed.events = events; > + struct ep_send_events_data esed = { > + .maxevents = maxevents, > + .events = events, > + }; > > ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false); > + *have_more = esed.have_more; > + > return esed.res; > } > > @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct > epoll_event __user *events, > { > int res = 0, eavail, timed_out = 0; > u64 slack = 0; > - bool waiter = false; > + bool waiter = false, have_more; > wait_queue_entry_t wait; > ktime_t expires, *to = NULL; > > @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct > epoll_event __user *events, > * more luck. 
> */
> if (!res && eavail &&
> - !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> + !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> + !timed_out)
> goto fetch_events;
>
> if (waiter) {
> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
> __remove_wait_queue(&ep->wq, &wait);
> spin_unlock_irq(&ep->wq.lock);
> }
> + /*
> + * We were not able to process all the events, so immediately
> + * wakeup other waiter.
> + */
> + if (res > 0 && have_more && waitqueue_active(&ep->wq))
> + wake_up(&ep->wq);
>
> return res;
> }
>
>

Ok, yeah I like making it explicit. Looks like you are missing the changes to ep_scan_ready_list(), but I think the general approach makes sense. Although I really didn't have a test case that motivated this - it was just sort of noting this change in behavior while reviewing the current change.

> PS. So what we decide with the original patch? Remove the whole branch?
>

FWIW, I'm ok removing the whole branch as you proposed. And I think the above change can go in separately (if we decide we want it). I don't think they need to be tied together. I also want to make sure this change gets a full linux-next cycle, so I think it should target 5.5 at this point.

Thanks,

-Jason
On 2019-10-07 20:43, Jason Baron wrote: > On 10/7/19 2:30 PM, Roman Penyaev wrote: >> On 2019-10-07 18:42, Jason Baron wrote: >>> On 10/7/19 6:54 AM, Roman Penyaev wrote: >>>> On 2019-10-03 18:13, Jason Baron wrote: >>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote: >>>>>> On 2019-09-28 04:29, Andrew Morton wrote: >>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote: >>>>>>> >>>>>>>> From: Heiher <r@hev.cc> >>>>>>>> >>>>>>>> Take the case where we have: >>>>>>>> >>>>>>>> t0 >>>>>>>> | (ew) >>>>>>>> e0 >>>>>>>> | (et) >>>>>>>> e1 >>>>>>>> | (lt) >>>>>>>> s0 >>>>>>>> >>>>>>>> t0: thread 0 >>>>>>>> e0: epoll fd 0 >>>>>>>> e1: epoll fd 1 >>>>>>>> s0: socket fd 0 >>>>>>>> ew: epoll_wait >>>>>>>> et: edge-trigger >>>>>>>> lt: level-trigger >>>>>>>> >>>>>>>> We only need to wakeup nested epoll fds if something has been >>>>>>>> queued >>>>>>>> to the >>>>>>>> overflow list, since the ep_poll() traverses the rdllist during >>>>>>>> recursive poll >>>>>>>> and thus events on the overflow list may not be visible yet. >>>>>>>> >>>>>>>> Test code: >>>>>>> >>>>>>> Look sane to me. Do you have any performance testing results >>>>>>> which >>>>>>> show a benefit? >>>>>>> >>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :( >>>>>>> Roman, would you please have time to review this? >>>>>> >>>>>> So here is my observation: current patch does not fix the >>>>>> described >>>>>> problem (double wakeup) for the case, when new event comes exactly >>>>>> to the ->ovflist. According to the patch this is the desired >>>>>> intention: >>>>>> >>>>>> /* >>>>>> * We only need to wakeup nested epoll fds if something has >>>>>> been >>>>>> queued >>>>>> * to the overflow list, since the ep_poll() traverses the >>>>>> rdllist >>>>>> * during recursive poll and thus events on the overflow list >>>>>> may >>>>>> not be >>>>>> * visible yet. >>>>>> */ >>>>>> if (nepi != NULL) >>>>>> pwake++; >>>>>> >>>>>> .... 
>>>>>> >>>>>> if (pwake == 2) >>>>>> ep_poll_safewake(&ep->poll_wait); >>>>>> >>>>>> >>>>>> but this actually means that we repeat the same behavior (double >>>>>> wakeup) >>>>>> but only for the case, when event comes to the ->ovflist. >>>>>> >>>>>> How to reproduce? Can be easily done (ok, not so easy but it is >>>>>> possible >>>>>> to try): to the given userspace test we need to add one more >>>>>> socket >>>>>> and >>>>>> immediately fire the event: >>>>>> >>>>>> e.events = EPOLLIN; >>>>>> if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0) >>>>>> goto out; >>>>>> >>>>>> /* >>>>>> * Signal any fd to let epoll_wait() to call >>>>>> ep_scan_ready_list() >>>>>> * in order to "catch" it there and add new event to >>>>>> ->ovflist. >>>>>> */ >>>>>> if (write(s2fd[1], "w", 1) != 1) >>>>>> goto out; >>>>>> >>>>>> That is done in order to let the following epoll_wait() call to >>>>>> invoke >>>>>> ep_scan_ready_list(), where we can "catch" and insert new event >>>>>> exactly >>>>>> to the ->ovflist. In order to insert event exactly in the correct >>>>>> list >>>>>> I introduce artificial delay. >>>>>> >>>>>> Modified test and kernel patch is below. Here is the output of >>>>>> the >>>>>> testing tool with some debug lines from kernel: >>>>>> >>>>>> # ~/devel/test/edge-bug >>>>>> [ 59.263178] ### sleep 2 >>>>>> >> write to sock >>>>>> [ 61.318243] ### done sleep >>>>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); >>>>>> events_in_rdllist=1, events_in_ovflist=1 >>>>>> [ 61.321204] ### sleep 2 >>>>>> [ 63.398325] ### done sleep >>>>>> error: What?! Again?! >>>>>> >>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events >>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we >>>>>> wanted to achieve, so eventually ep_poll_safewake() is called >>>>>> again >>>>>> which leads to double wakeup. 
>>>>>> >>>>>> In my opinion current patch as it is should be dropped, it does >>>>>> not >>>>>> fix the described problem but just hides it. >>>>>> >>>>>> -- >>>> >>>> Hi Jason, >>>> >>>>> >>>>> Yes, there are 2 wakeups in the test case you describe, but if the >>>>> second event (write to s1fd) gets queued after the first call to >>>>> epoll_wait(), we are going to get 2 wakeups anyways. >>>> >>>> Yes, exactly, for this reason I print out the number of events >>>> observed >>>> on first wait, there should be 1 (rdllist) and 1 (ovflist), >>>> otherwise >>>> this is another case, when second event comes exactly after first >>>> wait, which is legitimate wakeup. >>>> >>>>> So yes, there may >>>>> be a slightly bigger window with this patch for 2 wakeups, but its >>>>> small >>>>> and I tried to be conservative with the patch - I'd rather get an >>>>> occasional 2nd wakeup then miss one. Trying to debug missing >>>>> wakeups >>>>> isn't always fun... >>>>> >>>>> That said, the reason for propagating events that end up on the >>>>> overflow >>>>> list was to prevent the race of the wakee not seeing events because >>>>> they >>>>> were still on the overflow list. In the testcase, imagine if there >>>>> was a >>>>> thread doing epoll_wait() on efd[0], and then a write happends on >>>>> s1fd. >>>>> I thought it was possible then that a 2nd thread doing epoll_wait() >>>>> on >>>>> efd[1], wakes up, checks efd[0] and sees no events because they are >>>>> still potentially on the overflow list. However, I think that case >>>>> is >>>>> not possible because the thread doing epoll_wait() on efd[0] is >>>>> going to >>>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its >>>>> going >>>>> to have to be ordered because its also grabbing the ep->mtx >>>>> associated >>>>> with efd[0]. 
>>>>> >>>>> So I think its safe to do the following if we want to go further >>>>> than >>>>> the proposed patch, which is what you suggested earlier in the >>>>> thread >>>>> (minus keeping the wakeup on ep->wq). >>>> >>>> Then I do not understand why we need to keep ep->wq wakeup? >>>> @wq and @poll_wait are almost the same with only one difference: >>>> @wq is used when you sleep on it inside epoll_wait() and the other >>>> is used when you nest epoll fd inside epoll fd. Either you wake >>>> both up either you don't this at all. >>>> >>>> ep_poll_callback() does wakeup explicitly, ep_insert() and >>>> ep_modify() >>>> do wakeup explicitly, so what are the cases when we need to do >>>> wakeups >>>> from ep_scan_ready_list()? >>> >>> Hi Roman, >>> >>> So the reason I was saying not to drop the ep->wq wakeup was that I >>> was >>> thinking about a usecase where you have multi-threads say thread A >>> and >>> thread B which are doing epoll_wait() on the same epfd. Now, the >>> threads >>> both call epoll_wait() and are added as exclusive to ep->wq. Now a >>> bunch >>> of events happen and thread A is woken up. However, thread A may only >>> process a subset of the events due to its 'maxevents' parameter. In >>> that >>> case, I was thinking that the wakeup on ep->wq might be helpful, >>> because >>> in the absence of subsequent events, thread B can now start >>> processing >>> the rest, instead of waiting for the next event to be queued. >>> >>> However, I was thinking about the state of things before: >>> 86c0517 fs/epoll: deal with wait_queue only once >>> >>> Before that patch, thread A would have been removed from eq->wq >>> before >>> the wakeup call, thus waking up thread B. However, now that thread A >>> stays on the queue during the call to ep_send_events(), I believe the >>> wakeup would only target thread A, which doesn't help since its >>> already >>> checking for events. So given the state of things I think you are >>> right >>> in that its not needed. 
However, I wonder if not removing from the >>> ep->wq affects the multi-threaded case I described. Its been around >>> since 5.0, so probably not, but it would be a more subtle performance >>> difference. >> >> Now I understand what you mean. You want to prevent "idling" of >> events, >> while thread A is busy with the small portion of events (portion is >> equal >> to maxevents). On next iteration thread A will pick up the rest, no >> doubts, >> but would be nice to give a chance to thread B immediately to deal >> with the >> rest. Ok, makes sense. > > Exactly, I don't believe its racy as is - but it seems like it would be > good to wakeup other threads that may be waiting. That said, this logic > was removed as I pointed out. So I'm not sure we need to tie this > change > to the current one - but it may be a nice addition. > >> >> But what if to make this wakeup explicit if we have more events to >> process? >> (nothing is tested, just a guess) >> >> @@ -255,6 +255,7 @@ struct ep_pqueue { >> struct ep_send_events_data { >> int maxevents; >> struct epoll_event __user *events; >> + bool have_more; >> int res; >> }; >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct >> eventpoll *ep, struct list_head *head >> } >> >> static int ep_send_events(struct eventpoll *ep, >> - struct epoll_event __user *events, int >> maxevents) >> + struct epoll_event __user *events, int >> maxevents, >> + bool *have_more) >> { >> - struct ep_send_events_data esed; >> - >> - esed.maxevents = maxevents; >> - esed.events = events; >> + struct ep_send_events_data esed = { >> + .maxevents = maxevents, >> + .events = events, >> + }; >> >> ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false); >> + *have_more = esed.have_more; >> + >> return esed.res; >> } >> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> { >> int res = 0, eavail, timed_out = 0; >> u64 slack = 0; >> - bool waiter = false; >> + bool waiter = 
false, have_more; >> wait_queue_entry_t wait; >> ktime_t expires, *to = NULL; >> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> * more luck. >> */ >> if (!res && eavail && >> - !(res = ep_send_events(ep, events, maxevents)) && >> !timed_out) >> + !(res = ep_send_events(ep, events, maxevents, &have_more)) >> && >> + !timed_out) >> goto fetch_events; >> >> if (waiter) { >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> __remove_wait_queue(&ep->wq, &wait); >> spin_unlock_irq(&ep->wq.lock); >> } >> + /* >> + * We were not able to process all the events, so immediately >> + * wakeup other waiter. >> + */ >> + if (res > 0 && have_more && waitqueue_active(&ep->wq)) >> + wake_up(&ep->wq); >> >> return res; >> } >> >> > > Ok, yeah I like making it explicit. Looks like you are missing the > changes to ep_scan_ready_list(), but I think the general approach makes > sense. Yeah, missed the hunk: @@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head lockdep_assert_held(&ep->mtx); list_for_each_entry_safe(epi, tmp, head, rdllink) { - if (esed->res >= esed->maxevents) + if (esed->res >= esed->maxevents) { + esed->have_more = true; break; + } > Although I really didn't have a test case that motivated this - > its just was sort of noting this change in behavior while reviewing the > current change. > >> PS. So what we decide with the original patch? Remove the whole >> branch? >> > > For fwiw, I'm ok removing the whole branch as you proposed. Then probably Heiher could resend once more. Heiher, can you, please? > And I think the above change can go in separately (if we decide we want > it). I don't > think they need to be tied together. I also want to make sure this > change gets a full linux-next cycle, so I think it should target 5.5 at > this point. 
Sure, this explicit ->wq wakeup is a separate patch, which should be backed by some benchmarks. I can try to cook something up in order to get numbers.

-- Roman
On 2019-10-07 20:43, Jason Baron wrote: [...] >> But what if to make this wakeup explicit if we have more events to >> process? >> (nothing is tested, just a guess) >> >> @@ -255,6 +255,7 @@ struct ep_pqueue { >> struct ep_send_events_data { >> int maxevents; >> struct epoll_event __user *events; >> + bool have_more; >> int res; >> }; >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct >> eventpoll *ep, struct list_head *head >> } >> >> static int ep_send_events(struct eventpoll *ep, >> - struct epoll_event __user *events, int >> maxevents) >> + struct epoll_event __user *events, int >> maxevents, >> + bool *have_more) >> { >> - struct ep_send_events_data esed; >> - >> - esed.maxevents = maxevents; >> - esed.events = events; >> + struct ep_send_events_data esed = { >> + .maxevents = maxevents, >> + .events = events, >> + }; >> >> ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false); >> + *have_more = esed.have_more; >> + >> return esed.res; >> } >> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> { >> int res = 0, eavail, timed_out = 0; >> u64 slack = 0; >> - bool waiter = false; >> + bool waiter = false, have_more; >> wait_queue_entry_t wait; >> ktime_t expires, *to = NULL; >> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> * more luck. >> */ >> if (!res && eavail && >> - !(res = ep_send_events(ep, events, maxevents)) && >> !timed_out) >> + !(res = ep_send_events(ep, events, maxevents, &have_more)) >> && >> + !timed_out) >> goto fetch_events; >> >> if (waiter) { >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct >> epoll_event __user *events, >> __remove_wait_queue(&ep->wq, &wait); >> spin_unlock_irq(&ep->wq.lock); >> } >> + /* >> + * We were not able to process all the events, so immediately >> + * wakeup other waiter. 
>> + */
>> + if (res > 0 && have_more && waitqueue_active(&ep->wq))
>> + wake_up(&ep->wq);
>>
>> return res;
>> }
>>
>>
> [...]
> And I think the above change can go in separately (if we decide we want
> it).

Hi Jason,

I did measurements using Eric's test http://yhbt.net/eponeshotmt.c (8 writers, 8 waiters; 1 writer, 8 waiters) and tested the impact of the extra explicit wakeup: I do not see any difference. Since write events are constantly coming, the next waiter will be woken up anyway by the following write event. To get any performance gain, writes would probably have to come in bursts: produce a bunch of events, sleep, produce a bunch of events, sleep, etc., which could help only if the writer happens to be accidentally synchronized with the waiters. Not a clean way to improve performance.

-- Roman
Hi,

On Tue, Oct 8, 2019 at 3:10 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> On 2019-10-07 20:43, Jason Baron wrote:
> > On 10/7/19 2:30 PM, Roman Penyaev wrote:
> >> On 2019-10-07 18:42, Jason Baron wrote:
> >>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
> >>>> On 2019-10-03 18:13, Jason Baron wrote:
> >>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
> >>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
> >>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote:
> >>>>>>>
> >>>>>>>> From: Heiher <r@hev.cc>
> >>>>>>>>
> >>>>>>>> Take the case where we have:
> >>>>>>>>
> >>>>>>>> t0
> >>>>>>>>  | (ew)
> >>>>>>>> e0
> >>>>>>>>  | (et)
> >>>>>>>> e1
> >>>>>>>>  | (lt)
> >>>>>>>> s0
> >>>>>>>>
> >>>>>>>> t0: thread 0
> >>>>>>>> e0: epoll fd 0
> >>>>>>>> e1: epoll fd 1
> >>>>>>>> s0: socket fd 0
> >>>>>>>> ew: epoll_wait
> >>>>>>>> et: edge-trigger
> >>>>>>>> lt: level-trigger
> >>>>>>>>
> >>>>>>>> We only need to wakeup nested epoll fds if something has been
> >>>>>>>> queued to the overflow list, since the ep_poll() traverses the
> >>>>>>>> rdllist during recursive poll and thus events on the overflow
> >>>>>>>> list may not be visible yet.
> >>>>>>>>
> >>>>>>>> Test code:
> >>>>>>>
> >>>>>>> Look sane to me.  Do you have any performance testing results which
> >>>>>>> show a benefit?
> >>>>>>>
> >>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
> >>>>>>> Roman, would you please have time to review this?
> >>>>>>
> >>>>>> So here is my observation: current patch does not fix the described
> >>>>>> problem (double wakeup) for the case, when new event comes exactly
> >>>>>> to the ->ovflist. According to the patch this is the desired
> >>>>>> intention:
> >>>>>>
> >>>>>>     /*
> >>>>>>      * We only need to wakeup nested epoll fds if something has been
> >>>>>>      * queued to the overflow list, since the ep_poll() traverses the
> >>>>>>      * rdllist during recursive poll and thus events on the overflow
> >>>>>>      * list may not be visible yet.
> >>>>>>      */
> >>>>>>     if (nepi != NULL)
> >>>>>>             pwake++;
> >>>>>>
> >>>>>>     ....
> >>>>>>
> >>>>>>     if (pwake == 2)
> >>>>>>             ep_poll_safewake(&ep->poll_wait);
> >>>>>>
> >>>>>> but this actually means that we repeat the same behavior (double
> >>>>>> wakeup) but only for the case, when event comes to the ->ovflist.
> >>>>>>
> >>>>>> How to reproduce? Can be easily done (ok, not so easy but it is
> >>>>>> possible to try): to the given userspace test we need to add one
> >>>>>> more socket and immediately fire the event:
> >>>>>>
> >>>>>>     e.events = EPOLLIN;
> >>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
> >>>>>>             goto out;
> >>>>>>
> >>>>>>     /*
> >>>>>>      * Signal any fd to let epoll_wait() to call ep_scan_ready_list()
> >>>>>>      * in order to "catch" it there and add new event to ->ovflist.
> >>>>>>      */
> >>>>>>     if (write(s2fd[1], "w", 1) != 1)
> >>>>>>             goto out;
> >>>>>>
> >>>>>> That is done in order to let the following epoll_wait() call to
> >>>>>> invoke ep_scan_ready_list(), where we can "catch" and insert new
> >>>>>> event exactly to the ->ovflist. In order to insert event exactly
> >>>>>> in the correct list I introduce artificial delay.
> >>>>>>
> >>>>>> Modified test and kernel patch is below. Here is the output of the
> >>>>>> testing tool with some debug lines from kernel:
> >>>>>>
> >>>>>>    # ~/devel/test/edge-bug
> >>>>>>    [   59.263178] ### sleep 2
> >>>>>>    >> write to sock
> >>>>>>    [   61.318243] ### done sleep
> >>>>>>    [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
> >>>>>> events_in_rdllist=1, events_in_ovflist=1
> >>>>>>    [   61.321204] ### sleep 2
> >>>>>>    [   63.398325] ### done sleep
> >>>>>>    error: What?! Again?!
> >>>>>>
> >>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
> >>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
> >>>>>> wanted to achieve, so eventually ep_poll_safewake() is called again
> >>>>>> which leads to double wakeup.
> >>>>>>
> >>>>>> In my opinion current patch as it is should be dropped, it does not
> >>>>>> fix the described problem but just hides it.
> >>>>>>
> >>>>>> --
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>>>
> >>>>> Yes, there are 2 wakeups in the test case you describe, but if the
> >>>>> second event (write to s1fd) gets queued after the first call to
> >>>>> epoll_wait(), we are going to get 2 wakeups anyways.
> >>>>
> >>>> Yes, exactly, for this reason I print out the number of events
> >>>> observed on first wait, there should be 1 (rdllist) and 1 (ovflist),
> >>>> otherwise this is another case, when second event comes exactly after
> >>>> first wait, which is legitimate wakeup.
> >>>>
> >>>>> So yes, there may be a slightly bigger window with this patch for 2
> >>>>> wakeups, but its small and I tried to be conservative with the patch
> >>>>> - I'd rather get an occasional 2nd wakeup then miss one. Trying to
> >>>>> debug missing wakeups isn't always fun...
> >>>>>
> >>>>> That said, the reason for propagating events that end up on the
> >>>>> overflow list was to prevent the race of the wakee not seeing events
> >>>>> because they were still on the overflow list. In the testcase,
> >>>>> imagine if there was a thread doing epoll_wait() on efd[0], and then
> >>>>> a write happends on s1fd. I thought it was possible then that a 2nd
> >>>>> thread doing epoll_wait() on efd[1], wakes up, checks efd[0] and
> >>>>> sees no events because they are still potentially on the overflow
> >>>>> list. However, I think that case is not possible because the thread
> >>>>> doing epoll_wait() on efd[0] is going to have the ep->mtx, and thus
> >>>>> when the thread wakes up on efd[1], its going to have to be ordered
> >>>>> because its also grabbing the ep->mtx associated with efd[0].
> >>>>>
> >>>>> So I think its safe to do the following if we want to go further
> >>>>> than the proposed patch, which is what you suggested earlier in the
> >>>>> thread (minus keeping the wakeup on ep->wq).
> >>>>
> >>>> Then I do not understand why we need to keep ep->wq wakeup?
> >>>> @wq and @poll_wait are almost the same with only one difference:
> >>>> @wq is used when you sleep on it inside epoll_wait() and the other
> >>>> is used when you nest epoll fd inside epoll fd. Either you wake
> >>>> both up either you don't this at all.
> >>>>
> >>>> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
> >>>> do wakeup explicitly, so what are the cases when we need to do wakeups
> >>>> from ep_scan_ready_list()?
> >>>
> >>> Hi Roman,
> >>>
> >>> So the reason I was saying not to drop the ep->wq wakeup was that I was
> >>> thinking about a usecase where you have multi-threads say thread A and
> >>> thread B which are doing epoll_wait() on the same epfd. Now, the
> >>> threads both call epoll_wait() and are added as exclusive to ep->wq.
> >>> Now a bunch of events happen and thread A is woken up. However, thread
> >>> A may only process a subset of the events due to its 'maxevents'
> >>> parameter. In that case, I was thinking that the wakeup on ep->wq
> >>> might be helpful, because in the absence of subsequent events, thread
> >>> B can now start processing the rest, instead of waiting for the next
> >>> event to be queued.
> >>>
> >>> However, I was thinking about the state of things before:
> >>> 86c0517 fs/epoll: deal with wait_queue only once
> >>>
> >>> Before that patch, thread A would have been removed from eq->wq before
> >>> the wakeup call, thus waking up thread B. However, now that thread A
> >>> stays on the queue during the call to ep_send_events(), I believe the
> >>> wakeup would only target thread A, which doesn't help since its
> >>> already checking for events. So given the state of things I think you
> >>> are right in that its not needed. However, I wonder if not removing
> >>> from the ep->wq affects the multi-threaded case I described. Its been
> >>> around since 5.0, so probably not, but it would be a more subtle
> >>> performance difference.
> >>
> >> Now I understand what you mean. You want to prevent "idling" of events,
> >> while thread A is busy with the small portion of events (portion is
> >> equal to maxevents). On next iteration thread A will pick up the rest,
> >> no doubts, but would be nice to give a chance to thread B immediately
> >> to deal with the rest. Ok, makes sense.
> >
> > Exactly, I don't believe its racy as is - but it seems like it would be
> > good to wakeup other threads that may be waiting. That said, this logic
> > was removed as I pointed out. So I'm not sure we need to tie this
> > change to the current one - but it may be a nice addition.
> >
> >>
> >> But what if to make this wakeup explicit if we have more events to
> >> process? (nothing is tested, just a guess)
> >>
> >> @@ -255,6 +255,7 @@ struct ep_pqueue {
> >>  struct ep_send_events_data {
> >>          int maxevents;
> >>          struct epoll_event __user *events;
> >> +        bool have_more;
> >>          int res;
> >>  };
> >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
> >>  }
> >>
> >>  static int ep_send_events(struct eventpoll *ep,
> >> -                          struct epoll_event __user *events, int maxevents)
> >> +                          struct epoll_event __user *events, int maxevents,
> >> +                          bool *have_more)
> >>  {
> >> -        struct ep_send_events_data esed;
> >> -
> >> -        esed.maxevents = maxevents;
> >> -        esed.events = events;
> >> +        struct ep_send_events_data esed = {
> >> +                .maxevents = maxevents,
> >> +                .events = events,
> >> +        };
> >>
> >>          ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> >> +        *have_more = esed.have_more;
> >> +
> >>          return esed.res;
> >>  }
> >>
> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>  {
> >>          int res = 0, eavail, timed_out = 0;
> >>          u64 slack = 0;
> >> -        bool waiter = false;
> >> +        bool waiter = false, have_more;
> >>          wait_queue_entry_t wait;
> >>          ktime_t expires, *to = NULL;
> >>
> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>           * more luck.
> >>           */
> >>          if (!res && eavail &&
> >> -            !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >> +            !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> >> +            !timed_out)
> >>                  goto fetch_events;
> >>
> >>          if (waiter) {
> >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>                  __remove_wait_queue(&ep->wq, &wait);
> >>                  spin_unlock_irq(&ep->wq.lock);
> >>          }
> >> +        /*
> >> +         * We were not able to process all the events, so immediately
> >> +         * wakeup other waiter.
> >> +         */
> >> +        if (res > 0 && have_more && waitqueue_active(&ep->wq))
> >> +                wake_up(&ep->wq);
> >>
> >>          return res;
> >>  }
> >>
> >>
> >
> > Ok, yeah I like making it explicit. Looks like you are missing the
> > changes to ep_scan_ready_list(), but I think the general approach makes
> > sense.
>
> Yeah, missed the hunk:
>
> @@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>          lockdep_assert_held(&ep->mtx);
>
>          list_for_each_entry_safe(epi, tmp, head, rdllink) {
> -                if (esed->res >= esed->maxevents)
> +                if (esed->res >= esed->maxevents) {
> +                        esed->have_more = true;
>                          break;
> +                }
>
> > Although I really didn't have a test case that motivated this -
> > its just was sort of noting this change in behavior while reviewing the
> > current change.
> >
> >> PS. So what we decide with the original patch? Remove the whole
> >> branch?
> >
> > For fwiw, I'm ok removing the whole branch as you proposed.
>
> Then probably Heiher could resend once more. Heiher, can you, please?

Sorry for delay. Thank you for your help!

That's ok, I will re-send the patch and add unit-tests to kselftests in
the later patches.

>
> > And I think the above change can go in separately (if we decide we want
> > it). I don't think they need to be tied together. I also want to make
> > sure this change gets a full linux-next cycle, so I think it should
> > target 5.5 at this point.
>
> Sure, this explicit ->wq wakeup is a separate patch, which should be
> covered with some benchmarks. I can try to cook something out in order
> to get numbers.
>
> --
> Roman
>

--
Best regards!
Hev
https://hev.cc
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index c4159bcc05d9..a0c07f6653c6 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -704,12 +704,21 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, res = (*sproc)(ep, &txlist, priv); write_lock_irq(&ep->lock); + nepi = READ_ONCE(ep->ovflist); + /* + * We only need to wakeup nested epoll fds if something has been queued + * to the overflow list, since the ep_poll() traverses the rdllist + * during recursive poll and thus events on the overflow list may not be + * visible yet. + */ + if (nepi != NULL) + pwake++; /* * During the time we spent inside the "sproc" callback, some * other events might have been queued by the poll callback. * We re-insert them inside the main ready-list here. */ - for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL; + for (; (epi = nepi) != NULL; nepi = epi->next, epi->next = EP_UNACTIVE_PTR) { /* * We need to check if the item is already in the list. @@ -755,7 +764,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, mutex_unlock(&ep->mtx); /* We have to call this outside the lock */ - if (pwake) + if (pwake == 2) ep_poll_safewake(&ep->poll_wait); return res;