diff mbox

epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT

Message ID 1454600344-23655-1-git-send-email-jbaron@akamai.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Baron Feb. 4, 2016, 3:39 p.m. UTC
In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1),
if epoll waiters create different POLL* sets and register them as exclusive
against the same target fd, the current implementation will stop waking any
further waiters once it finds the first idle waiter. This means that waiters
could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
So if one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the wakeup
since the current implementation will stop after it finds any intersection
of events with a waiter that is blocked in epoll_wait().

We could potentially address this by requiring all epoll waiters that are
added to p be required to pass the same set of POLL* events. IE the first
EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* flags to
be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I think
it might be somewhat confusing interface as we would have to reference count
the number of users for that set, and so userspace would have to keep track
of that count, or we would need a more involved interface. It also adds some
shared state that we'd have store somewhere. I don't think anybody will want
to bloat __wait_queue_head for this.

I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE such
that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way if
the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the
first idle waiter that specifies the EPOLLIN bit, since any remaining waiters
that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do
the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both
'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one
example of this I saw in fs/pipe.c), then we just wake the entire exclusive
list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any
performance critical path, so I think that's ok (in fs/pipe.c its in
pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default
in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but
is not required to do so.

Since epoll waiters may be interested in other events as well besides EPOLLIN,
EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup' call
on the target fd and adding that as one normally would with EPOLL_CTL_ADD. Since
I think that the POLLIN and POLLOUT events are what we are interest in balancing,
I think that the 'dup' thing could perhaps be added to only one of the waiter
threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should
be sufficient for the majority of use-cases.

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among
multiple epfds, where between 1 and n of the epfds may receive an event, it
does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an
event. Thus, it is not allowed to be specified in conjunction with
EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

Comments

Madars Vitolins Feb. 4, 2016, 9:44 p.m. UTC | #1
Hi Jason,


Just run off the original tests with this patch (eventpoll.c from 
4.5-rc2 + patch bellow). Got the same good results, no regression.

$ time ./bankcl
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 101528948.40 USD

real	0m41.826s
user	0m29.513s
sys	0m6.490s


Test case: 
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

PS,

Original cases 0m24.953s vs 0m41.826s now probably is related with my pc 
setup. As I just now re-run test with original patch, got the same ~41 
sec.

So I am fine with this patch!

Thanks,
Madars


Jason Baron @ 2016-02-04 17:39 rakst?ja:
> In the current implementation of the EPOLLEXCLUSIVE flag (added for 
> 4.5-rc1),
> if epoll waiters create different POLL* sets and register them as 
> exclusive
> against the same target fd, the current implementation will stop waking 
> any
> further waiters once it finds the first idle waiter. This means that 
> waiters
> could miss wakeups in certain cases.
> 
> For example, when we wake up a pipe for reading we do:
> wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
> So if one epoll set or epfd is added to pipe p with POLLIN and a second 
> set
> epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the 
> wakeup
> since the current implementation will stop after it finds any 
> intersection
> of events with a waiter that is blocked in epoll_wait().
> 
> We could potentially address this by requiring all epoll waiters that 
> are
> added to p be required to pass the same set of POLL* events. IE the 
> first
> EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* 
> flags to
> be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I 
> think
> it might be somewhat confusing interface as we would have to reference 
> count
> the number of users for that set, and so userspace would have to keep 
> track
> of that count, or we would need a more involved interface. It also adds 
> some
> shared state that we'd have store somewhere. I don't think anybody will 
> want
> to bloat __wait_queue_head for this.
> 
> I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE 
> such
> that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way 
> if
> the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit 
> the
> first idle waiter that specifies the EPOLLIN bit, since any remaining 
> waiters
> that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we 
> can do
> the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. 
> If both
> 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least 
> one
> example of this I saw in fs/pipe.c), then we just wake the entire 
> exclusive
> list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any
> performance critical path, so I think that's ok (in fs/pipe.c its in
> pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by 
> default
> in any exclusive set. Thus, the user can specify EPOLLERR and/or 
> EPOLLHUP but
> is not required to do so.
> 
> Since epoll waiters may be interested in other events as well besides 
> EPOLLIN,
> EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 
> 'dup' call
> on the target fd and adding that as one normally would with 
> EPOLL_CTL_ADD. Since
> I think that the POLLIN and POLLOUT events are what we are interest in
> balancing,
> I think that the 'dup' thing could perhaps be added to only one of the 
> waiter
> threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP 
> should
> be sufficient for the majority of use-cases.
> 
> Since EPOLLEXCLUSIVE is intended to be used with a target fd shared 
> among
> multiple epfds, where between 1 and n of the epfds may receive an 
> event, it
> does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would 
> get an
> event. Thus, it is not allowed to be specified in conjunction with
> EPOLLEXCLUSIVE.
> 
> EPOLL_CTL_MOD is also not allowed if the fd was previously added as
> EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
> interesting, but this could be relaxed at some further point.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>
> ---
>  fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index ae1dbcf..cde6074 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -94,6 +94,11 @@
>  /* Epoll private bits inside the event mask */
>  #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
> +#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
> +
> +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | 
> \
> +				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
> +
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> 
> @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * wait list.
>  	 */
>  	if (waitqueue_active(&ep->wq)) {
> -		ewake = 1;
> +		if ((epi->event.events & EPOLLEXCLUSIVE) &&
> +					!((unsigned long)key & POLLFREE)) {
> +			switch ((unsigned long)key & EPOLLINOUT_BITS) {
> +			case POLLIN:
> +				if (epi->event.events & POLLIN)
> +					ewake = 1;
> +				break;
> +			case POLLOUT:
> +				if (epi->event.events & POLLOUT)
> +					ewake = 1;
> +				break;
> +			case 0:
> +				ewake = 1;
> +				break;
> +			}
> +		}
>  		wake_up_locked(&ep->wq);
>  	}
>  	if (waitqueue_active(&ep->poll_wait))
> @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
>  	 * Also, we do not currently supported nested exclusive wakeups.
>  	 */
> -	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> -		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> -		goto error_tgt_fput;
> +	if (epds.events & EPOLLEXCLUSIVE) {
> +		if (op == EPOLL_CTL_MOD)
> +			goto error_tgt_fput;
> +		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
> +				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
> +			goto error_tgt_fput;
> +	}
> 
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
> @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  		break;
>  	case EPOLL_CTL_MOD:
>  		if (epi) {
> -			epds.events |= POLLERR | POLLHUP;
> -			error = ep_modify(ep, epi, &epds);
> +			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
> +				epds.events |= POLLERR | POLLHUP;
> +				error = ep_modify(ep, epi, &epds);
> +			}
>  		} else
>  			error = -ENOENT;
>  		break;
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton Feb. 4, 2016, 10:59 p.m. UTC | #2
On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> wrote:

> Hi Jason,
> 
> 
> Just run off the original tests with this patch (eventpoll.c from 
> 4.5-rc2 + patch bellow). Got the same good results, no regression.
> 
> $ time ./bankcl
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 101528948.40 USD
> 
> real	0m41.826s
> user	0m29.513s
> sys	0m6.490s
> 
> 
> Test case: 
> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/
> 
> PS,
> 
> Original cases 0m24.953s vs 0m41.826s now probably is related with my pc 
> setup. As I just now re-run test with original patch, got the same ~41 
> sec.
> 
> So I am fine with this patch!
> 

Thanks, I shall add your Tested-by:

One thing we're sorely missing is an epoll test suite, in
tools/testing/selftests.  If anyone has anything which we can use to
kick things off, please hand it over ;)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Madars Vitolins Feb. 5, 2016, 9:49 a.m. UTC | #3
Andrew Morton @ 2016-02-05 00:59 rakst?ja:
> On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> 
> wrote:
> 
>> Hi Jason,
>> 
>> 
>> Just run off the original tests with this patch (eventpoll.c from
>> 4.5-rc2 + patch bellow). Got the same good results, no regression.
>> 
>> $ time ./bankcl
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 101528948.40 USD
>> 
>> real	0m41.826s
>> user	0m29.513s
>> sys	0m6.490s
>> 
>> 
>> Test case:
>> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/
>> 
>> PS,
>> 
>> Original cases 0m24.953s vs 0m41.826s now probably is related with my 
>> pc
>> setup. As I just now re-run test with original patch, got the same ~41
>> sec.
>> 
>> So I am fine with this patch!
>> 
> 
> Thanks, I shall add your Tested-by:
> 
> One thing we're sorely missing is an epoll test suite, in
> tools/testing/selftests.  If anyone has anything which we can use to
> kick things off, please hand it over ;)

Not bad idea, probably we need a "tools/testing/selftests/eventpoll" 
folder under which we should have test cases for various epoll scenarios 
with common "run.sh". In spare time I can try to build a case for 
EXCLUSIVE flag (with queues & multiple processes :) ).

Thanks,
Madars
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ae1dbcf..cde6074 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -94,6 +94,11 @@ 
 /* Epoll private bits inside the event mask */
 #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
+#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
+
+#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | \
+				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
+
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
 
@@ -1068,7 +1073,22 @@  static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * wait list.
 	 */
 	if (waitqueue_active(&ep->wq)) {
-		ewake = 1;
+		if ((epi->event.events & EPOLLEXCLUSIVE) &&
+					!((unsigned long)key & POLLFREE)) {
+			switch ((unsigned long)key & EPOLLINOUT_BITS) {
+			case POLLIN:
+				if (epi->event.events & POLLIN)
+					ewake = 1;
+				break;
+			case POLLOUT:
+				if (epi->event.events & POLLOUT)
+					ewake = 1;
+				break;
+			case 0:
+				ewake = 1;
+				break;
+			}
+		}
 		wake_up_locked(&ep->wq);
 	}
 	if (waitqueue_active(&ep->poll_wait))
@@ -1875,9 +1895,13 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
 	 * Also, we do not currently supported nested exclusive wakeups.
 	 */
-	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
-		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
-		goto error_tgt_fput;
+	if (epds.events & EPOLLEXCLUSIVE) {
+		if (op == EPOLL_CTL_MOD)
+			goto error_tgt_fput;
+		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
+				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
+			goto error_tgt_fput;
+	}
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1950,8 +1974,10 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_modify(ep, epi, &epds);
+			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
+				epds.events |= POLLERR | POLLHUP;
+				error = ep_modify(ep, epi, &epds);
+			}
 		} else
 			error = -ENOENT;
 		break;