
Add provision to busyloop for events in ep_poll.

Message ID 20240828181011.1591242-1-namangulati@google.com (mailing list archive)
State Changes Requested
Headers show
Series Add provision to busyloop for events in ep_poll. | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Naman Gulati Aug. 28, 2024, 6:10 p.m. UTC
NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
epoll events after every napi poll. Checking just for epoll events in a
tight loop in the kernel context delivers latency gains to applications
that are not interested in napi busypolling with epoll.

This patch adds an option to loop just for new events inside
ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
busypolling.
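
For context, a minimal usage sketch (illustrative only, not part of the
patch) of enabling the new behaviour through the existing EPIOCSPARAMS
ioctl; the helper name is made up, and event_poll_only is the field this
patch adds to struct epoll_params:

    /* Assumes a libc/uapi combination that exposes struct epoll_params
     * and EPIOCSPARAMS (via <sys/epoll.h> or <linux/eventpoll.h>). */
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/ioctl.h>

    static int enable_event_poll_only(int epfd)
    {
            struct epoll_params params;

            memset(&params, 0, sizeof(params));
            params.busy_poll_usecs = 64;   /* spin for up to 64us per epoll_wait */
            params.busy_poll_budget = 0;   /* 0 selects the kernel default budget */
            params.event_poll_only = 1;    /* new: loop only on event availability */

            return ioctl(epfd, EPIOCSPARAMS, &params);
    }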

A comparison with neper tcp_rr shows that busylooping for events in
epoll_wait boosted throughput by ~3-7% and reduced median latency by
~10%.

To demonstrate the latency and throughput improvements, a comparison was
made of neper tcp_rr running with:
    1. (baseline) No busylooping
    2. (epoll busylooping) enabling the epoll busy looping on all epoll
    fd's
    3. (userspace busylooping) looping on epoll_wait in userspace
    with timeout=0

Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
and varying flows:

Type                Flows   Throughput             Latency (μs)
                             (B/s)      P50   P90    P99   P99.9   P99.99
baseline            15	    272145      57.2  71.9   91.4  100.6   111.6
baseline            30	    464952	66.8  78.8   98.1  113.4   122.4
baseline            60	    695920	80.9  118.5  143.4 161.8   174.6
epoll busyloop      15	    301751	44.7  70.6   84.3  95.4    106.5
epoll busyloop      30	    508392	58.9  76.9   96.2  109.3   118.5
epoll busyloop      60	    745731	77.4  106.2  127.5 143.1   155.9
userspace busyloop  15	    279202	55.4  73.1   85.2  98.3    109.6
userspace busyloop  30	    472440	63.7  78.2   96.5  112.2   120.1
userspace busyloop  60	    720779	77.9  113.5  134.9 152.6   165.7

Per the above data, epoll busyloop outperforms baseline and userspace
busylooping in both throughput and latency. As the density of flows per
thread increases, the median latency of all three epoll mechanisms
converges. However, epoll busylooping is better at capturing the tail
latencies at high flow counts.

Signed-off-by: Naman Gulati <namangulati@google.com>
---
 fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
 include/uapi/linux/eventpoll.h |  3 +-
 2 files changed, 43 insertions(+), 13 deletions(-)

Comments

Stanislav Fomichev Aug. 29, 2024, 2:09 a.m. UTC | #1
On 08/28, Naman Gulati wrote:
> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> epoll events after every napi poll. Checking just for epoll events in a
> tight loop in the kernel context delivers latency gains to applications
> that are not interested in napi busypolling with epoll.
> 
> This patch adds an option to loop just for new events inside
> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> busypolling.
> 
> A comparison with neper tcp_rr shows that busylooping for events in
> epoll_wait boosted throughput by ~3-7% and reduced median latency by
> ~10%.
> 
> To demonstrate the latency and throughput improvements, a comparison was
> made of neper tcp_rr running with:
>     1. (baseline) No busylooping
>     2. (epoll busylooping) enabling the epoll busy looping on all epoll
>     fd's
>     3. (userspace busylooping) looping on epoll_wait in userspace
>     with timeout=0
> 
> Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> and varying flows:
> 
> Type                Flows   Throughput             Latency (μs)
>                              (B/s)      P50   P90    P99   P99.9   P99.99
> baseline            15	    272145      57.2  71.9   91.4  100.6   111.6
> baseline            30	    464952	66.8  78.8   98.1  113.4   122.4
> baseline            60	    695920	80.9  118.5  143.4 161.8   174.6
> epoll busyloop      15	    301751	44.7  70.6   84.3  95.4    106.5
> epoll busyloop      30	    508392	58.9  76.9   96.2  109.3   118.5
> epoll busyloop      60	    745731	77.4  106.2  127.5 143.1   155.9
> userspace busyloop  15	    279202	55.4  73.1   85.2  98.3    109.6
> userspace busyloop  30	    472440	63.7  78.2   96.5  112.2   120.1
> userspace busyloop  60	    720779	77.9  113.5  134.9 152.6   165.7
> 
> Per the above data epoll busyloop outperforms baseline and userspace
> busylooping in both throughput and latency. As the density of flows per
> thread increased, the median latency of all three epoll mechanisms
> converges. However epoll busylooping is better at capturing the tail
> latencies at high flow counts.

Any idea why timeout=0 is not performing as well as looping inside the
kernel? Can we cut this overhead out? Or is it pure syscall overhead? (usecs?)
Vadim Fedorenko Aug. 29, 2024, 9:16 a.m. UTC | #2
On 28/08/2024 19:10, Naman Gulati wrote:
> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> epoll events after every napi poll. Checking just for epoll events in a
> tight loop in the kernel context delivers latency gains to applications
> that are not interested in napi busypolling with epoll.
> 
> This patch adds an option to loop just for new events inside
> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> busypolling.
> 
> A comparison with neper tcp_rr shows that busylooping for events in
> epoll_wait boosted throughput by ~3-7% and reduced median latency by
> ~10%.
> 
> To demonstrate the latency and throughput improvements, a comparison was
> made of neper tcp_rr running with:
>      1. (baseline) No busylooping
>      2. (epoll busylooping) enabling the epoll busy looping on all epoll
>      fd's
>      3. (userspace busylooping) looping on epoll_wait in userspace
>      with timeout=0
> 
> Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> and varying flows:
> 
> Type                Flows   Throughput             Latency (μs)
>                               (B/s)      P50   P90    P99   P99.9   P99.99
> baseline            15	    272145      57.2  71.9   91.4  100.6   111.6
> baseline            30	    464952	66.8  78.8   98.1  113.4   122.4
> baseline            60	    695920	80.9  118.5  143.4 161.8   174.6
> epoll busyloop      15	    301751	44.7  70.6   84.3  95.4    106.5
> epoll busyloop      30	    508392	58.9  76.9   96.2  109.3   118.5
> epoll busyloop      60	    745731	77.4  106.2  127.5 143.1   155.9
> userspace busyloop  15	    279202	55.4  73.1   85.2  98.3    109.6
> userspace busyloop  30	    472440	63.7  78.2   96.5  112.2   120.1
> userspace busyloop  60	    720779	77.9  113.5  134.9 152.6   165.7
> 
> Per the above data epoll busyloop outperforms baseline and userspace
> busylooping in both throughput and latency. As the density of flows per
> thread increased, the median latency of all three epoll mechanisms
> converges. However epoll busylooping is better at capturing the tail
> latencies at high flow counts.
> 
> Signed-off-by: Naman Gulati <namangulati@google.com>
> ---
>   fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
>   include/uapi/linux/eventpoll.h |  3 +-
>   2 files changed, 43 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index f53ca4f7fcedd..6cba79261817a 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -232,7 +232,10 @@ struct eventpoll {
>   	u32 busy_poll_usecs;
>   	/* busy poll packet budget */
>   	u16 busy_poll_budget;
> -	bool prefer_busy_poll;
> +	/* prefer to busypoll in napi poll */
> +	bool napi_prefer_busy_poll;
> +	/* avoid napi poll when busy looping and poll only for events */
> +	bool event_poll_only;
>   #endif
>   
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
>   	return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
>   }
>   
> +/**
> + * ep_event_busy_loop - loop until events available or busy poll
> + * times out.
> + *
> + * @ep: Pointer to the eventpoll context.
> + *
> + * Return: true if events available, false otherwise.
> + */
> +static bool ep_event_busy_loop(struct eventpoll *ep)
> +{
> +	unsigned long start_time = busy_loop_current_time();
> +
> +	while (!ep_busy_loop_end(ep, start_time))
> +		cond_resched();
> +
> +	return ep_events_available(ep);
> +}
> +
>   /*
>    * Busy poll if globally on and supporting sockets found && no events,
>    * busy loop will return if need_resched or ep_events_available.
> @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
>   {
>   	unsigned int napi_id = READ_ONCE(ep->napi_id);
>   	u16 budget = READ_ONCE(ep->busy_poll_budget);
> -	bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> +	bool event_poll_only = READ_ONCE(ep->event_poll_only);
>   
>   	if (!budget)
>   		budget = BUSY_POLL_BUDGET;
>   
> -	if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> +	if (!ep_busy_loop_on(ep))
> +		return false;
> +
> +	if (event_poll_only) {
> +		return ep_event_busy_loop(ep);
> +	} else if (napi_id >= MIN_NAPI_ID) {

There is no need to use 'else if' in this place; in case of
event_poll_only == true the program flow will not reach this part.

> +		bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> +
>   		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
> -			       ep, prefer_busy_poll, budget);
> +				ep, napi_prefer_busy_poll, budget);
>   		if (ep_events_available(ep))
>   			return true;
>   		/*
> -		 * Busy poll timed out.  Drop NAPI ID for now, we can add
> -		 * it back in when we have moved a socket with a valid NAPI
> -		 * ID onto the ready list.
> -		 */
> +		* Busy poll timed out.  Drop NAPI ID for now, we can add
> +		* it back in when we have moved a socket with a valid NAPI
> +		* ID onto the ready list.
> +		*/

I believe this change is accidental, right?

>   		ep->napi_id = 0;
> -		return false;
>   	}
>   	return false;
>   }

[...]
Joe Damato Aug. 29, 2024, 10:40 a.m. UTC | #3
On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> epoll events after every napi poll. Checking just for epoll events in a
> tight loop in the kernel context delivers latency gains to applications
> that are not interested in napi busypolling with epoll.
> 
> This patch adds an option to loop just for new events inside
> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> busypolling.

This makes an API change, so I think that linux-api@vger.kernel.org
needs to be CC'd ?
 
> A comparison with neper tcp_rr shows that busylooping for events in
> epoll_wait boosted throughput by ~3-7% and reduced median latency by
> ~10%.
> 
> To demonstrate the latency and throughput improvements, a comparison was
> made of neper tcp_rr running with:
>     1. (baseline) No busylooping

Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
this case? More details, please, on locality. If there is no
NAPI-based flow steering in this case, perhaps the improvements you
are seeing are a result of both syscall overhead avoidance and data
locality?

>     2. (epoll busylooping) enabling the epoll busy looping on all epoll
>     fd's

This is the case you've added, event_poll_only? It seems like in
this case you aren't busy looping exactly; you are essentially
allowing IRQ/softIRQ to drive processing and checking on wakeup that
events are available.

IMHO, I'm not sure if "epoll busylooping" is an appropriate
description.

>     3. (userspace busylooping) looping on epoll_wait in userspace
>     with timeout=0

Same question as Stanislav; timeout=0 should get ep_loop to transfer
events immediately (if there are any) and return without actually
invoking busy poll. So, it would seem that your ioctl change
shouldn't be necessary since the equivalent behavior is already
possible with timeout=0.
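
(For reference, the userspace variant being compared is roughly the
following sketch; the helper name, epfd and the batch size are
illustrative, not taken from the patch or the benchmark.)

    #include <sys/epoll.h>

    /* Spin on a non-blocking epoll_wait() until at least one event is ready. */
    static int epoll_wait_busyloop(int epfd, struct epoll_event *evs, int maxevents)
    {
            int n;

            do {
                    n = epoll_wait(epfd, evs, maxevents, 0); /* timeout=0: never block */
            } while (n == 0);

            return n;   /* > 0: number of events, < 0: error */
    }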

I'd probably investigate both syscall overhead and data locality
before approving this patch because it seems a bit suspicious to me.

> 
> Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> and varying flows:
> 
> Type                Flows   Throughput             Latency (μs)
>                              (B/s)      P50   P90    P99   P99.9   P99.99
> baseline            15	    272145      57.2  71.9   91.4  100.6   111.6
> baseline            30	    464952	66.8  78.8   98.1  113.4   122.4
> baseline            60	    695920	80.9  118.5  143.4 161.8   174.6
> epoll busyloop      15	    301751	44.7  70.6   84.3  95.4    106.5
> epoll busyloop      30	    508392	58.9  76.9   96.2  109.3   118.5
> epoll busyloop      60	    745731	77.4  106.2  127.5 143.1   155.9
> userspace busyloop  15	    279202	55.4  73.1   85.2  98.3    109.6
> userspace busyloop  30	    472440	63.7  78.2   96.5  112.2   120.1
> userspace busyloop  60	    720779	77.9  113.5  134.9 152.6   165.7
> 
> Per the above data epoll busyloop outperforms baseline and userspace
> busylooping in both throughput and latency. As the density of flows per
> thread increased, the median latency of all three epoll mechanisms
> converges. However epoll busylooping is better at capturing the tail
> latencies at high flow counts.
> 
> Signed-off-by: Naman Gulati <namangulati@google.com>
> ---
>  fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
>  include/uapi/linux/eventpoll.h |  3 +-
>  2 files changed, 43 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index f53ca4f7fcedd..6cba79261817a 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -232,7 +232,10 @@ struct eventpoll {
>  	u32 busy_poll_usecs;
>  	/* busy poll packet budget */
>  	u16 busy_poll_budget;
> -	bool prefer_busy_poll;
> +	/* prefer to busypoll in napi poll */
> +	bool napi_prefer_busy_poll;

Adding napi seems slightly redundant to me but I could be convinced either
way, I suppose.

> +	/* avoid napi poll when busy looping and poll only for events */
> +	bool event_poll_only;

I'm not sure about this overall; this isn't exactly what I think of
when I think about the word "polling" but maybe I'm being too
nit-picky.

>  #endif
>  
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
>  	return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
>  }
>  
> +/**
> + * ep_event_busy_loop - loop until events available or busy poll
> + * times out.
> + *
> + * @ep: Pointer to the eventpoll context.
> + *
> + * Return: true if events available, false otherwise.
> + */
> +static bool ep_event_busy_loop(struct eventpoll *ep)
> +{
> +	unsigned long start_time = busy_loop_current_time();
> +
> +	while (!ep_busy_loop_end(ep, start_time))
> +		cond_resched();
> +
> +	return ep_events_available(ep);
> +}
> +
>  /*
>   * Busy poll if globally on and supporting sockets found && no events,
>   * busy loop will return if need_resched or ep_events_available.
> @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
>  {
>  	unsigned int napi_id = READ_ONCE(ep->napi_id);
>  	u16 budget = READ_ONCE(ep->busy_poll_budget);
> -	bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> +	bool event_poll_only = READ_ONCE(ep->event_poll_only);
>  
>  	if (!budget)
>  		budget = BUSY_POLL_BUDGET;
>  
> -	if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> +	if (!ep_busy_loop_on(ep))
> +		return false;
> +
> +	if (event_poll_only) {
> +		return ep_event_busy_loop(ep);
> +	} else if (napi_id >= MIN_NAPI_ID) {
> +		bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> +
>  		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
> -			       ep, prefer_busy_poll, budget);
> +				ep, napi_prefer_busy_poll, budget);
>  		if (ep_events_available(ep))
>  			return true;
>  		/*
> -		 * Busy poll timed out.  Drop NAPI ID for now, we can add
> -		 * it back in when we have moved a socket with a valid NAPI
> -		 * ID onto the ready list.
> -		 */
> +		* Busy poll timed out.  Drop NAPI ID for now, we can add
> +		* it back in when we have moved a socket with a valid NAPI
> +		* ID onto the ready list.
> +		*/
>  		ep->napi_id = 0;
> -		return false;
>  	}
>  	return false;
>  }
> @@ -523,13 +550,15 @@ static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd,
>  
>  		WRITE_ONCE(ep->busy_poll_usecs, epoll_params.busy_poll_usecs);
>  		WRITE_ONCE(ep->busy_poll_budget, epoll_params.busy_poll_budget);
> -		WRITE_ONCE(ep->prefer_busy_poll, epoll_params.prefer_busy_poll);
> +		WRITE_ONCE(ep->napi_prefer_busy_poll, epoll_params.prefer_busy_poll);
> +		WRITE_ONCE(ep->event_poll_only, epoll_params.event_poll_only);
>  		return 0;
>  	case EPIOCGPARAMS:
>  		memset(&epoll_params, 0, sizeof(epoll_params));
>  		epoll_params.busy_poll_usecs = READ_ONCE(ep->busy_poll_usecs);
>  		epoll_params.busy_poll_budget = READ_ONCE(ep->busy_poll_budget);
> -		epoll_params.prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> +		epoll_params.prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> +		epoll_params.event_poll_only = READ_ONCE(ep->event_poll_only);
>  		if (copy_to_user(uarg, &epoll_params, sizeof(epoll_params)))
>  			return -EFAULT;
>  		return 0;
> @@ -2203,7 +2232,7 @@ static int do_epoll_create(int flags)
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>  	ep->busy_poll_usecs = 0;
>  	ep->busy_poll_budget = 0;
> -	ep->prefer_busy_poll = false;
> +	ep->napi_prefer_busy_poll = false;
>  #endif

Just FYI: This is going to conflict with a patch I've sent to VFS
that hasn't quite made its way back to net-next just yet.

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=4eb76c5d9a8851735fd3ec5833ecf412e8921655

>  	ep->file = file;
>  	fd_install(fd, file);
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index 4f4b948ef3811..3bc0f4eed976c 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -89,9 +89,10 @@ struct epoll_params {
>  	__u32 busy_poll_usecs;
>  	__u16 busy_poll_budget;
>  	__u8 prefer_busy_poll;
> +	__u8 event_poll_only:1;
>  
>  	/* pad the struct to a multiple of 64bits */
> -	__u8 __pad;
> +	__u8 __pad:7;
>  };

If the above is accepted then a similar change should make its way
into glibc, uclibc-ng, and musl. It might be easier to add an
entirely new ioctl.

All the above said: I'm not sure I'm convinced yet and having more
clear data, descriptions, and answers on locality/syscall overhead
would be very helpful.

In the future, add 'net-next' to the subject line so that this
targets the right tree:
   [PATCH net-next] subject-line

- Joe
Naman Gulati Sept. 4, 2024, 5:52 a.m. UTC | #4
On Thu, Aug 29, 2024 at 2:16 AM Vadim Fedorenko
<vadim.fedorenko@linux.dev> wrote:
>
> On 28/08/2024 19:10, Naman Gulati wrote:
> > NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> > epoll events after every napi poll. Checking just for epoll events in a
> > tight loop in the kernel context delivers latency gains to applications
> > that are not interested in napi busypolling with epoll.
> >
> > This patch adds an option to loop just for new events inside
> > ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> > busypolling.
> >
> > A comparison with neper tcp_rr shows that busylooping for events in
> > epoll_wait boosted throughput by ~3-7% and reduced median latency by
> > ~10%.
> >
> > To demonstrate the latency and throughput improvements, a comparison was
> > made of neper tcp_rr running with:
> >      1. (baseline) No busylooping
> >      2. (epoll busylooping) enabling the epoll busy looping on all epoll
> >      fd's
> >      3. (userspace busylooping) looping on epoll_wait in userspace
> >      with timeout=0
> >
> > Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> > and varying flows:
> >
> > Type                Flows   Throughput             Latency (μs)
> >                               (B/s)      P50   P90    P99   P99.9   P99.99
> > baseline            15            272145      57.2  71.9   91.4  100.6   111.6
> > baseline            30            464952      66.8  78.8   98.1  113.4   122.4
> > baseline            60            695920      80.9  118.5  143.4 161.8   174.6
> > epoll busyloop      15            301751      44.7  70.6   84.3  95.4    106.5
> > epoll busyloop      30            508392      58.9  76.9   96.2  109.3   118.5
> > epoll busyloop      60            745731      77.4  106.2  127.5 143.1   155.9
> > userspace busyloop  15            279202      55.4  73.1   85.2  98.3    109.6
> > userspace busyloop  30            472440      63.7  78.2   96.5  112.2   120.1
> > userspace busyloop  60            720779      77.9  113.5  134.9 152.6   165.7
> >
> > Per the above data epoll busyloop outperforms baseline and userspace
> > busylooping in both throughput and latency. As the density of flows per
> > thread increased, the median latency of all three epoll mechanisms
> > converges. However epoll busylooping is better at capturing the tail
> > latencies at high flow counts.
> >
> > Signed-off-by: Naman Gulati <namangulati@google.com>
> > ---
> >   fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
> >   include/uapi/linux/eventpoll.h |  3 +-
> >   2 files changed, 43 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index f53ca4f7fcedd..6cba79261817a 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -232,7 +232,10 @@ struct eventpoll {
> >       u32 busy_poll_usecs;
> >       /* busy poll packet budget */
> >       u16 busy_poll_budget;
> > -     bool prefer_busy_poll;
> > +     /* prefer to busypoll in napi poll */
> > +     bool napi_prefer_busy_poll;
> > +     /* avoid napi poll when busy looping and poll only for events */
> > +     bool event_poll_only;
> >   #endif
> >
> >   #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
> >       return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
> >   }
> >
> > +/**
> > + * ep_event_busy_loop - loop until events available or busy poll
> > + * times out.
> > + *
> > + * @ep: Pointer to the eventpoll context.
> > + *
> > + * Return: true if events available, false otherwise.
> > + */
> > +static bool ep_event_busy_loop(struct eventpoll *ep)
> > +{
> > +     unsigned long start_time = busy_loop_current_time();
> > +
> > +     while (!ep_busy_loop_end(ep, start_time))
> > +             cond_resched();
> > +
> > +     return ep_events_available(ep);
> > +}
> > +
> >   /*
> >    * Busy poll if globally on and supporting sockets found && no events,
> >    * busy loop will return if need_resched or ep_events_available.
> > @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
> >   {
> >       unsigned int napi_id = READ_ONCE(ep->napi_id);
> >       u16 budget = READ_ONCE(ep->busy_poll_budget);
> > -     bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > +     bool event_poll_only = READ_ONCE(ep->event_poll_only);
> >
> >       if (!budget)
> >               budget = BUSY_POLL_BUDGET;
> >
> > -     if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> > +     if (!ep_busy_loop_on(ep))
> > +             return false;
> > +
> > +     if (event_poll_only) {
> > +             return ep_event_busy_loop(ep);
> > +     } else if (napi_id >= MIN_NAPI_ID) {
>
> There is no need to use 'else if' in this place, in case of
> event_poll_only == true the program flow will not reach this part.
>

Right, I'll change that.

> > +             bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> > +
> >               napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
> > -                            ep, prefer_busy_poll, budget);
> > +                             ep, napi_prefer_busy_poll, budget);
> >               if (ep_events_available(ep))
> >                       return true;
> >               /*
> > -              * Busy poll timed out.  Drop NAPI ID for now, we can add
> > -              * it back in when we have moved a socket with a valid NAPI
> > -              * ID onto the ready list.
> > -              */
> > +             * Busy poll timed out.  Drop NAPI ID for now, we can add
> > +             * it back in when we have moved a socket with a valid NAPI
> > +             * ID onto the ready list.
> > +             */
>
> I believe this change is accidental, right?
>

Yes, thanks for catching, I'll fix that.

> >               ep->napi_id = 0;
> > -             return false;
> >       }
> >       return false;
> >   }
>
> [...]
>
Naman Gulati Sept. 4, 2024, 5:52 a.m. UTC | #5
Thanks all for the comments and apologies for the delay in replying.
Stan and Joe, I’ve addressed some of the common concerns below.

On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> > NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> > epoll events after every napi poll. Checking just for epoll events in a
> > tight loop in the kernel context delivers latency gains to applications
> > that are not interested in napi busypolling with epoll.
> >
> > This patch adds an option to loop just for new events inside
> > ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> > busypolling.
>
> This makes an API change, so I think that linux-api@vger.kernel.org
> needs to be CC'd ?
>
> > A comparison with neper tcp_rr shows that busylooping for events in
> > epoll_wait boosted throughput by ~3-7% and reduced median latency by
> > ~10%.
> >
> > To demonstrate the latency and throughput improvements, a comparison was
> > made of neper tcp_rr running with:
> >     1. (baseline) No busylooping
>
> Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
> this case? More details, please, on locality. If there is no
> NAPI-based flow steering in this case, perhaps the improvements you
> are seeing are a result of both syscall overhead avoidance and data
> locality?
>

The benchmarks were run with no NAPI steering.

Regarding syscall overhead, I reproduced the above experiment with
mitigations=off and found similar results as above, pointing to the
fact that the above gains come from more than just avoiding syscall
overhead.

By locality, are you referring to NUMA locality?

> >     2. (epoll busylooping) enabling the epoll busy looping on all epoll
> >     fd's
>
> This is the case you've added, event_poll_only ? It seems like in
> this case you aren't busy looping exactly, you are essentially
> allowing IRQ/softIRQ to drive processing and checking on wakeup that
> events are available.
>
> IMHO, I'm not sure if "epoll busylooping" is an appropriate
> description.
>

I see your point, perhaps "spinning" or just “looping” could be a closer word?

> >     3. (userspace busylooping) looping on epoll_wait in userspace
> >     with timeout=0
>
> Same question as Stanislav; timeout=0 should get ep_loop to transfer
> events immediately (if there are any) and return without actually
> invoking busy poll. So, it would seem that your ioctl change
> shouldn't be necessary since the equivalent behavior is already
> possible with timeout=0.
>
> I'd probably investigate both syscall overhead and data locality
> before approving this patch because it seems a bit suspicious to me.
>
> >
> > Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> > and varying flows:
> >
> > Type                Flows   Throughput             Latency (μs)
> >                              (B/s)      P50   P90    P99   P99.9   P99.99
> > baseline            15            272145      57.2  71.9   91.4  100.6   111.6
> > baseline            30            464952      66.8  78.8   98.1  113.4   122.4
> > baseline            60            695920      80.9  118.5  143.4 161.8   174.6
> > epoll busyloop      15            301751      44.7  70.6   84.3  95.4    106.5
> > epoll busyloop      30            508392      58.9  76.9   96.2  109.3   118.5
> > epoll busyloop      60            745731      77.4  106.2  127.5 143.1   155.9
> > userspace busyloop  15            279202      55.4  73.1   85.2  98.3    109.6
> > userspace busyloop  30            472440      63.7  78.2   96.5  112.2   120.1
> > userspace busyloop  60            720779      77.9  113.5  134.9 152.6   165.7
> >
> > Per the above data epoll busyloop outperforms baseline and userspace
> > busylooping in both throughput and latency. As the density of flows per
> > thread increased, the median latency of all three epoll mechanisms
> > converges. However epoll busylooping is better at capturing the tail
> > latencies at high flow counts.
> >
> > Signed-off-by: Naman Gulati <namangulati@google.com>
> > ---
> >  fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
> >  include/uapi/linux/eventpoll.h |  3 +-
> >  2 files changed, 43 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index f53ca4f7fcedd..6cba79261817a 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -232,7 +232,10 @@ struct eventpoll {
> >       u32 busy_poll_usecs;
> >       /* busy poll packet budget */
> >       u16 busy_poll_budget;
> > -     bool prefer_busy_poll;
> > +     /* prefer to busypoll in napi poll */
> > +     bool napi_prefer_busy_poll;
>
> Adding napi seems slightly redundant to me but I could be convinced either
> way, I suppose.
>

With the two different polling booleans in the struct, I felt it's
better to be explicit
about the scope of each.

> > +     /* avoid napi poll when busy looping and poll only for events */
> > +     bool event_poll_only;
>
> I'm not sure about this overall; this isn't exactly what I think of
> when I think about the word "polling" but maybe I'm being too
> nit-picky.
>

I'm not sure how else to categorize the operation; is there some other
phrasing you'd recommend?

> >  #endif
> >
> >  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
> >       return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
> >  }
> >
> > +/**
> > + * ep_event_busy_loop - loop until events available or busy poll
> > + * times out.
> > + *
> > + * @ep: Pointer to the eventpoll context.
> > + *
> > + * Return: true if events available, false otherwise.
> > + */
> > +static bool ep_event_busy_loop(struct eventpoll *ep)
> > +{
> > +     unsigned long start_time = busy_loop_current_time();
> > +
> > +     while (!ep_busy_loop_end(ep, start_time))
> > +             cond_resched();
> > +
> > +     return ep_events_available(ep);
> > +}
> > +
> >  /*
> >   * Busy poll if globally on and supporting sockets found && no events,
> >   * busy loop will return if need_resched or ep_events_available.
> > @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
> >  {
> >       unsigned int napi_id = READ_ONCE(ep->napi_id);
> >       u16 budget = READ_ONCE(ep->busy_poll_budget);
> > -     bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > +     bool event_poll_only = READ_ONCE(ep->event_poll_only);
> >
> >       if (!budget)
> >               budget = BUSY_POLL_BUDGET;
> >
> > -     if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> > +     if (!ep_busy_loop_on(ep))
> > +             return false;
> > +
> > +     if (event_poll_only) {
> > +             return ep_event_busy_loop(ep);
> > +     } else if (napi_id >= MIN_NAPI_ID) {
> > +             bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> > +
> >               napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
> > -                            ep, prefer_busy_poll, budget);
> > +                             ep, napi_prefer_busy_poll, budget);
> >               if (ep_events_available(ep))
> >                       return true;
> >               /*
> > -              * Busy poll timed out.  Drop NAPI ID for now, we can add
> > -              * it back in when we have moved a socket with a valid NAPI
> > -              * ID onto the ready list.
> > -              */
> > +             * Busy poll timed out.  Drop NAPI ID for now, we can add
> > +             * it back in when we have moved a socket with a valid NAPI
> > +             * ID onto the ready list.
> > +             */
> >               ep->napi_id = 0;
> > -             return false;
> >       }
> >       return false;
> >  }
> > @@ -523,13 +550,15 @@ static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd,
> >
> >               WRITE_ONCE(ep->busy_poll_usecs, epoll_params.busy_poll_usecs);
> >               WRITE_ONCE(ep->busy_poll_budget, epoll_params.busy_poll_budget);
> > -             WRITE_ONCE(ep->prefer_busy_poll, epoll_params.prefer_busy_poll);
> > +             WRITE_ONCE(ep->napi_prefer_busy_poll, epoll_params.prefer_busy_poll);
> > +             WRITE_ONCE(ep->event_poll_only, epoll_params.event_poll_only);
> >               return 0;
> >       case EPIOCGPARAMS:
> >               memset(&epoll_params, 0, sizeof(epoll_params));
> >               epoll_params.busy_poll_usecs = READ_ONCE(ep->busy_poll_usecs);
> >               epoll_params.busy_poll_budget = READ_ONCE(ep->busy_poll_budget);
> > -             epoll_params.prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > +             epoll_params.prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> > +             epoll_params.event_poll_only = READ_ONCE(ep->event_poll_only);
> >               if (copy_to_user(uarg, &epoll_params, sizeof(epoll_params)))
> >                       return -EFAULT;
> >               return 0;
> > @@ -2203,7 +2232,7 @@ static int do_epoll_create(int flags)
> >  #ifdef CONFIG_NET_RX_BUSY_POLL
> >       ep->busy_poll_usecs = 0;
> >       ep->busy_poll_budget = 0;
> > -     ep->prefer_busy_poll = false;
> > +     ep->napi_prefer_busy_poll = false;
> >  #endif
>
> Just FYI: This is going to conflict with a patch I've sent to VFS
> that hasn't quite made its way back to net-next just yet.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=4eb76c5d9a8851735fd3ec5833ecf412e8921655
>

Acknowledged.

> >       ep->file = file;
> >       fd_install(fd, file);
> > diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> > index 4f4b948ef3811..3bc0f4eed976c 100644
> > --- a/include/uapi/linux/eventpoll.h
> > +++ b/include/uapi/linux/eventpoll.h
> > @@ -89,9 +89,10 @@ struct epoll_params {
> >       __u32 busy_poll_usecs;
> >       __u16 busy_poll_budget;
> >       __u8 prefer_busy_poll;
> > +     __u8 event_poll_only:1;
> >
> >       /* pad the struct to a multiple of 64bits */
> > -     __u8 __pad;
> > +     __u8 __pad:7;
> >  };
>
> If the above is accepted then a similar change should make its way
> into glibc, uclibc-ng, and musl. It might be easier to add an
> entirely new ioctl.

Adding a new ioctl seems preferable; I can look into reworking the
code accordingly.

>
> All the above said: I'm not sure I'm convinced yet and having more
> clear data, descriptions, and answers on locality/syscall overhead
> would be very helpful.
>
> In the future, add 'net-next' to the subject line so that this
> targets the right tree:
>    [PATCH net-next] subject-line
>

Will do, thank you for the pointer.

> - Joe
Martin Karsten Sept. 4, 2024, 12:46 p.m. UTC | #6
On 2024-09-04 01:52, Naman Gulati wrote:
> Thanks all for the comments and apologies for the delay in replying.
> Stan and Joe I’ve addressed some of the common concerns below.
> 
> On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@fastly.com> wrote:
>>
>> On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
>>> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
>>> epoll events after every napi poll. Checking just for epoll events in a
>>> tight loop in the kernel context delivers latency gains to applications
>>> that are not interested in napi busypolling with epoll.
>>>
>>> This patch adds an option to loop just for new events inside
>>> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
>>> busypolling.
>>
>> This makes an API change, so I think that linux-api@vger.kernel.org
>> needs to be CC'd ?
>>
>>> A comparison with neper tcp_rr shows that busylooping for events in
>>> epoll_wait boosted throughput by ~3-7% and reduced median latency by
>>> ~10%.
>>>
>>> To demonstrate the latency and throughput improvements, a comparison was
>>> made of neper tcp_rr running with:
>>>      1. (baseline) No busylooping
>>
>> Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
>> this case? More details, please, on locality. If there is no
>> NAPI-based flow steering in this case, perhaps the improvements you
>> are seeing are a result of both syscall overhead avoidance and data
>> locality?
>>
> 
> The benchmarks were run with no NAPI steering.
> 
> Regarding syscall overhead, I reproduced the above experiment with
> mitigations=off
> and found similar results as above. Pointing to the fact that the
> above gains are
> materialized from more than just avoiding syscall overhead.

I suppose the natural follow-up questions are:

1) Where do the gains come from? and

2) Would they materialize with a realistic application?

System calls have some overhead even with mitigations=off. In fact I 
understand on modern CPUs security mitigations are not that expensive to 
begin with? In a micro-benchmark that does nothing else but bouncing 
packets back and forth, this overhead might look more significant than 
in a realistic application?

It seems your change does not eliminate any processing from each 
packet's path, but instead eliminates processing in between packet 
arrivals? This might lead to a small latency improvement, which might 
turn into a small throughput improvement in these micro-benchmarks, but 
that might quickly evaporate when an application has actual work to do 
in between packet arrivals.

It would be good to know a little more about your experiments. You are 
referring to 5 threads, but does that mean 5 cores were busy on both 
client and server during the experiment? Which of client or server is 
the bottleneck? In your baseline experiment, are all 5 server cores 
busy? How many RX queues are in play and how is interrupt routing 
configured?

Thanks,
Martin
Joe Damato Sept. 4, 2024, 3:58 p.m. UTC | #7
On Tue, Sep 03, 2024 at 07:18:14PM -0700, Naman Gulati wrote:
> Thanks all for the comments and apologies for the delay in replying.
> Stanislav and Joe I’ve addressed some of the common concerns below.
> 
> On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> > > NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> > > epoll events after every napi poll. Checking just for epoll events in a
> > > tight loop in the kernel context delivers latency gains to applications
> > > that are not interested in napi busypolling with epoll.
> > >
> > > This patch adds an option to loop just for new events inside
> > > ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> > > busypolling.
> >
> > This makes an API change, so I think that linux-api@vger.kernel.org
> > needs to be CC'd ?
> >
> > > A comparison with neper tcp_rr shows that busylooping for events in
> > > epoll_wait boosted throughput by ~3-7% and reduced median latency by
> > > ~10%.
> > >
> > > To demonstrate the latency and throughput improvements, a comparison was
> > > made of neper tcp_rr running with:
> > >     1. (baseline) No busylooping
> >
> > Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
> > this case? More details, please, on locality. If there is no
> > NAPI-based flow steering in this case, perhaps the improvements you
> > are seeing are a result of both syscall overhead avoidance and data
> > locality?
> 
> The benchmarks were run with no NAPI steering.
> 
> Regarding syscall overhead, I reproduced the above experiment with
> mitigations=off
> and found similar results as above. Pointing to the fact that the above
> gains are
> materialized from more than just avoiding syscall overhead.

I think the cover letter needs more benchmark cases, data points,
and documentation about precisely how the benchmark is being run and
configured. The data feels quite sparse, the explanation of the
experiment leaves too much out, and - with respect - I am not
convinced by the results posted.

> By locality are you referring to Numa locality?

CPU cache locality.

> >
> > >     2. (epoll busylooping) enabling the epoll busy looping on all epoll
> > >     fd's
> >
> > This is the case you've added, event_poll_only ? It seems like in
> > this case you aren't busy looping exactly, you are essentially
> > allowing IRQ/softIRQ to drive processing and checking on wakeup that
> > events are available.
> >
> > IMHO, I'm not sure if "epoll busylooping" is an appropriate
> > description.
> 
> I see your point, perhaps "spinning" or just “looping” could be a closer
> word?

I don't know how to describe the patch, to be totally honest. I
think the basis of the mechanism needs to be more thoroughly
understood. See below.

> >
> > >     3. (userspace busylooping) looping on epoll_wait in userspace
> > >     with timeout=0
> >
> > Same question as Stanislav; timeout=0 should get ep_loop to transfer
> > events immediately (if there are any) and return without actually
> > invoking busy poll. So, it would seem that your ioctl change
> > shouldn't be necessary since the equivalent behavior is already
> > possible with timeout=0.
> >
> > I'd probably investigate both syscall overhead and data locality
> > before approving this patch because it seems a bit suspicious to me.
> >
> > >
> > > Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> > > and varying flows:
> > >
> > > Type                Flows   Throughput             Latency (μs)
> > >                              (B/s)      P50   P90    P99   P99.9   P99.99
> > > baseline            15            272145      57.2  71.9   91.4  100.6   111.6
> > > baseline            30            464952      66.8  78.8   98.1  113.4   122.4
> > > baseline            60            695920      80.9  118.5  143.4 161.8   174.6
> > > epoll busyloop      15            301751      44.7  70.6   84.3  95.4    106.5
> > > epoll busyloop      30            508392      58.9  76.9   96.2  109.3   118.5
> > > epoll busyloop      60            745731      77.4  106.2  127.5 143.1   155.9
> > > userspace busyloop  15            279202      55.4  73.1   85.2  98.3    109.6
> > > userspace busyloop  30            472440      63.7  78.2   96.5  112.2   120.1
> > > userspace busyloop  60            720779      77.9  113.5  134.9 152.6   165.7
> > >
> > > Per the above data epoll busyloop outperforms baseline and userspace
> > > busylooping in both throughput and latency. As the density of flows per
> > > thread increased, the median latency of all three epoll mechanisms
> > > converges. However epoll busylooping is better at capturing the tail
> > > latencies at high flow counts.
> > >
> > > Signed-off-by: Naman Gulati <namangulati@google.com>
> > > ---
> > >  fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
> > >  include/uapi/linux/eventpoll.h |  3 +-
> > >  2 files changed, 43 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > index f53ca4f7fcedd..6cba79261817a 100644
> > > --- a/fs/eventpoll.c
> > > +++ b/fs/eventpoll.c
> > > @@ -232,7 +232,10 @@ struct eventpoll {
> > >       u32 busy_poll_usecs;
> > >       /* busy poll packet budget */
> > >       u16 busy_poll_budget;
> > > -     bool prefer_busy_poll;
> > > +     /* prefer to busypoll in napi poll */
> > > +     bool napi_prefer_busy_poll;
> >
> > Adding napi seems slightly redundant to me but I could be convinced either
> > way, I suppose.
> 
> With the two different polling booleans in the struct, I felt it's better
> to be explicit
> about the scope of each.
> 
> >
> > > +     /* avoid napi poll when busy looping and poll only for events */
> > > +     bool event_poll_only;
> >
> > I'm not sure about this overall; this isn't exactly what I think of
> > when I think about the word "polling" but maybe I'm being too
> > nit-picky.
> 
> I'm not sure how else to categorize the operation, is there some other
> phrasing
> you'd recommend?
> 
> >
> > >  #endif
> > >
> > >  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
> > >       return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
> > >  }
> > >
> > > +/**
> > > + * ep_event_busy_loop - loop until events available or busy poll
> > > + * times out.
> > > + *
> > > + * @ep: Pointer to the eventpoll context.
> > > + *
> > > + * Return: true if events available, false otherwise.
> > > + */
> > > +static bool ep_event_busy_loop(struct eventpoll *ep)
> > > +{
> > > +     unsigned long start_time = busy_loop_current_time();
> > > +
> > > +     while (!ep_busy_loop_end(ep, start_time))
> > > +             cond_resched();
> > > +
> > > +     return ep_events_available(ep);
> > > +}
> > > +
> > >  /*
> > >   * Busy poll if globally on and supporting sockets found && no events,
> > >   * busy loop will return if need_resched or ep_events_available.
> > > @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
> > >  {
> > >       unsigned int napi_id = READ_ONCE(ep->napi_id);
> > >       u16 budget = READ_ONCE(ep->busy_poll_budget);
> > > -     bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > > +     bool event_poll_only = READ_ONCE(ep->event_poll_only);
> > >
> > >       if (!budget)
> > >               budget = BUSY_POLL_BUDGET;
> > >
> > > -     if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> > > +     if (!ep_busy_loop_on(ep))
> > > +             return false;
> > > +
> > > +     if (event_poll_only) {
> > > +             return ep_event_busy_loop(ep);
> > > +     } else if (napi_id >= MIN_NAPI_ID) {
> > > +             bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> > > +
> > >               napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
> > > -                            ep, prefer_busy_poll, budget);
> > > +                             ep, napi_prefer_busy_poll, budget);
> > >               if (ep_events_available(ep))
> > >                       return true;
> > >               /*
> > > -              * Busy poll timed out.  Drop NAPI ID for now, we can add
> > > -              * it back in when we have moved a socket with a valid NAPI
> > > -              * ID onto the ready list.
> > > -              */
> > > +             * Busy poll timed out.  Drop NAPI ID for now, we can add
> > > +             * it back in when we have moved a socket with a valid NAPI
> > > +             * ID onto the ready list.
> > > +             */
> > >               ep->napi_id = 0;
> > > -             return false;
> > >       }
> > >       return false;
> > >  }
> > > @@ -523,13 +550,15 @@ static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd,
> > >
> > >               WRITE_ONCE(ep->busy_poll_usecs, epoll_params.busy_poll_usecs);
> > >               WRITE_ONCE(ep->busy_poll_budget, epoll_params.busy_poll_budget);
> > > -             WRITE_ONCE(ep->prefer_busy_poll, epoll_params.prefer_busy_poll);
> > > +             WRITE_ONCE(ep->napi_prefer_busy_poll, epoll_params.prefer_busy_poll);
> > > +             WRITE_ONCE(ep->event_poll_only, epoll_params.event_poll_only);
> > >               return 0;
> > >       case EPIOCGPARAMS:
> > >               memset(&epoll_params, 0, sizeof(epoll_params));
> > >               epoll_params.busy_poll_usecs = READ_ONCE(ep->busy_poll_usecs);
> > >               epoll_params.busy_poll_budget = READ_ONCE(ep->busy_poll_budget);
> > > -             epoll_params.prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > > +             epoll_params.prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
> > > +             epoll_params.event_poll_only = READ_ONCE(ep->event_poll_only);
> > >               if (copy_to_user(uarg, &epoll_params, sizeof(epoll_params)))
> > >                       return -EFAULT;
> > >               return 0;
> > > @@ -2203,7 +2232,7 @@ static int do_epoll_create(int flags)
> > >  #ifdef CONFIG_NET_RX_BUSY_POLL
> > >       ep->busy_poll_usecs = 0;
> > >       ep->busy_poll_budget = 0;
> > > -     ep->prefer_busy_poll = false;
> > > +     ep->napi_prefer_busy_poll = false;
> > >  #endif
> >
> > Just FYI: This is going to conflict with a patch I've sent to VFS
> > that hasn't quite made its way back to net-next just yet.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=4eb76c5d9a8851735fd3ec5833ecf412e8921655
> >
> Acknowledged.
> 
> > >       ep->file = file;
> > >       fd_install(fd, file);
> > > diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> > > index 4f4b948ef3811..3bc0f4eed976c 100644
> > > --- a/include/uapi/linux/eventpoll.h
> > > +++ b/include/uapi/linux/eventpoll.h
> > > @@ -89,9 +89,10 @@ struct epoll_params {
> > >       __u32 busy_poll_usecs;
> > >       __u16 busy_poll_budget;
> > >       __u8 prefer_busy_poll;
> > > +     __u8 event_poll_only:1;
> > >
> > >       /* pad the struct to a multiple of 64bits */
> > > -     __u8 __pad;
> > > +     __u8 __pad:7;
> > >  };
> >
> > If the above is accepted then a similar change should make its way
> > into glibc, uclibc-ng, and musl. It might be easier to add an
> > entirely new ioctl.
> 
> Adding a new ioctl seems preferable, I can look into reworking the code
> accordingly.

My advice, prior to reworking the code: figure out why the results
show an improvement, because I can't really see a clear explanation.

If I understand the patch correctly (and perhaps I've gotten it
wrong) it is proposing to add a new mechanism for packet processing
that is quite strange:

  Let softIRQ do the work (so there's no 'polling'), but avoid
  returning to userland and call cond_resched() to defer execution
  to other things until there's events to be returned.

Couldn't this be approximated with a smaller batch size (maxevents)
and a timeout of 0?

Perhaps consider comparing performance with a more real world
example (i.e. not tcp_rr) and use NAPI-steering in the base case to
eliminate cache locality as a variable? Maybe even compare against
the suspend mechanism Martin and I proposed?

- Joe
Naman Gulati Sept. 10, 2024, 5:41 p.m. UTC | #8
On Wed, Sep 4, 2024 at 5:46 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2024-09-04 01:52, Naman Gulati wrote:
> > Thanks all for the comments and apologies for the delay in replying.
> > Stan and Joe I’ve addressed some of the common concerns below.
> >
> > On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@fastly.com> wrote:
> >>
> >> On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> >>> NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> >>> epoll events after every napi poll. Checking just for epoll events in a
> >>> tight loop in the kernel context delivers latency gains to applications
> >>> that are not interested in napi busypolling with epoll.
> >>>
> >>> This patch adds an option to loop just for new events inside
> >>> ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> >>> busypolling.
> >>
> >> This makes an API change, so I think that linux-api@vger.kernel.org
> >> needs to be CC'd ?
> >>
> >>> A comparison with neper tcp_rr shows that busylooping for events in
> >>> epoll_wait boosted throughput by ~3-7% and reduced median latency by
> >>> ~10%.
> >>>
> >>> To demonstrate the latency and throughput improvements, a comparison was
> >>> made of neper tcp_rr running with:
> >>>      1. (baseline) No busylooping
> >>
> >> Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
> >> this case? More details, please, on locality. If there is no
> >> NAPI-based flow steering in this case, perhaps the improvements you
> >> are seeing are a result of both syscall overhead avoidance and data
> >> locality?
> >>
> >
> > The benchmarks were run with no NAPI steering.
> >
> > Regarding syscall overhead, I reproduced the above experiment with
> > mitigations=off
> > and found similar results as above. Pointing to the fact that the
> > above gains are
> > materialized from more than just avoiding syscall overhead.
>
> I suppose the natural follow-up questions are:
>
> 1) Where do the gains come from? and
>
> 2) Would they materialize with a realistic application?
>
> System calls have some overhead even with mitigations=off. In fact I
> understand on modern CPUs security mitigations are not that expensive to
> begin with? In a micro-benchmark that does nothing else but bouncing
> packets back and forth, this overhead might look more significant than
> in a realistic application?
>
> It seems your change does not eliminate any processing from each
> packet's path, but instead eliminates processing in between packet
> arrivals? This might lead to a small latency improvement, which might
> turn into a small throughput improvement in these micro-benchmarks, but
> that might quickly evaporate when an application has actual work to do
> in between packet arrivals.

This is a good point, and I was able to confirm it. I profiled the
changes in the patch by fixing the number of threads and flows but
scaling the message sizes with tcp_rr, on the notion that creating and
processing larger messages in tcp_rr takes more time. As the message
size increases from 1 B to the MSS (4 KB in my setup), I found that the
difference in latency and throughput between looping inside epoll and
looping on nonblocking epoll_wait in userspace diminishes.

Understandably, as the message sizes increase, the application becomes
the bottleneck and the syscall overhead becomes marginal relative to the
whole cost of the operation.

I also found that looping inside epoll yields latency and throughput
improvements again when message sizes increase past the MSS. I believe
this can be rationalized as follows: the cost of processing the message
in the application is then amortized over the multiple transmitted
segments, so the system call overhead becomes more prominent again.
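
As a rough illustration: with a 4 KB MSS, a 16 KB request/response spans
about four segments, so the per-message application work is amortized
over roughly four times as much transmitted data, and the fixed
per-epoll_wait syscall cost again becomes a larger share of each
operation.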

This is some rough data showing the above.
Setup: 5 threads on both client and server, 30 flows, mitigations=off,
both server and client using the same request/response size.

Looping inside epoll:
Message Size   Throughput   Latency P50   Latency P90   Latency P99   Latency P99.9
1 B            543971       57            76            93            106
250 B          501245       60            77            97            109
500 B          494467       60            77            93            111
1 KB           486412       60            77            97            114
2 KB           385125       77            96            114           123
4 KB           378612       78            97            119           129
8 KB           349214       83            109           125           137
16 KB          379276       156           202           243           274

Looping in userspace:
Message Size   Throughput   Latency P50   Latency P90   Latency P99   Latency P99.9
1 B            496296       59            76            95            109
250 B          468840       67            77            97            111
500 B          476804       61            78            97            110
1 KB           464273       65            79            100           115
2 KB           388334       76            97            114           122
4 KB           377851       79            98            118           124
8 KB           333718       91            115           128           141
16 KB          354708       157           253           307           343

I also examined the perf traces for both looping setups and compared the
overhead delta between the invocation of epoll_wait in glibc and the
invocation of do_epoll_wait in the kernel, to measure just the overhead
of making the system call. With 1 B messages, looping in userspace had a
higher overhead in CPU cycles for invoking the syscall compared to
looping inside epoll; however, the overhead gap also shrinks as the
message sizes increase and the syscall overhead becomes increasingly
marginal.

I believe testing with a benchmark like memcached and using NAPI steering
would confirm the same results, and I recognize now that most regular
workloads won’t benefit from this patch.

>
> It would be good to know a little more about your experiments. You are
> referring to 5 threads, but does that mean 5 cores were busy on both
> client and server during the experiment? Which of client or server is
> the bottleneck? In your baseline experiment, are all 5 server cores
> busy? How many RX queues are in play and how is interrupt routing
> configured?

Apologies, I should have been clearer in the description. The server and
client were both using 5 threads to handle the connections, without any
CPU pinning. I did, however, confirm from scheduling traces that all
threads used distinct cores and there was no contention.
Both hosts had 32 queues with a NAPI instance per queue.

>
> Thanks,
> Martin
>
>

Given the above analysis, it doesn’t make sense to add the extra knobs to
the epoll interface for an optimization that's not widely applicable;
therefore this patch can be considered not needed.
Nonetheless, I appreciate the feedback, Joe and Martin.

Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f53ca4f7fcedd..6cba79261817a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -232,7 +232,10 @@  struct eventpoll {
 	u32 busy_poll_usecs;
 	/* busy poll packet budget */
 	u16 busy_poll_budget;
-	bool prefer_busy_poll;
+	/* prefer to busypoll in napi poll */
+	bool napi_prefer_busy_poll;
+	/* avoid napi poll when busy looping and poll only for events */
+	bool event_poll_only;
 #endif
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -430,6 +433,24 @@  static bool ep_busy_loop_end(void *p, unsigned long start_time)
 	return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep);
 }
 
+/**
+ * ep_event_busy_loop - loop until events available or busy poll
+ * times out.
+ *
+ * @ep: Pointer to the eventpoll context.
+ *
+ * Return: true if events available, false otherwise.
+ */
+static bool ep_event_busy_loop(struct eventpoll *ep)
+{
+	unsigned long start_time = busy_loop_current_time();
+
+	while (!ep_busy_loop_end(ep, start_time))
+		cond_resched();
+
+	return ep_events_available(ep);
+}
+
 /*
  * Busy poll if globally on and supporting sockets found && no events,
  * busy loop will return if need_resched or ep_events_available.
@@ -440,23 +461,29 @@  static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
 {
 	unsigned int napi_id = READ_ONCE(ep->napi_id);
 	u16 budget = READ_ONCE(ep->busy_poll_budget);
-	bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
+	bool event_poll_only = READ_ONCE(ep->event_poll_only);
 
 	if (!budget)
 		budget = BUSY_POLL_BUDGET;
 
-	if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
+	if (!ep_busy_loop_on(ep))
+		return false;
+
+	if (event_poll_only) {
+		return ep_event_busy_loop(ep);
+	} else if (napi_id >= MIN_NAPI_ID) {
+		bool napi_prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
+
 		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
-			       ep, prefer_busy_poll, budget);
+				ep, napi_prefer_busy_poll, budget);
 		if (ep_events_available(ep))
 			return true;
 		/*
-		 * Busy poll timed out.  Drop NAPI ID for now, we can add
-		 * it back in when we have moved a socket with a valid NAPI
-		 * ID onto the ready list.
-		 */
+		* Busy poll timed out.  Drop NAPI ID for now, we can add
+		* it back in when we have moved a socket with a valid NAPI
+		* ID onto the ready list.
+		*/
 		ep->napi_id = 0;
-		return false;
 	}
 	return false;
 }
@@ -523,13 +550,15 @@  static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd,
 
 		WRITE_ONCE(ep->busy_poll_usecs, epoll_params.busy_poll_usecs);
 		WRITE_ONCE(ep->busy_poll_budget, epoll_params.busy_poll_budget);
-		WRITE_ONCE(ep->prefer_busy_poll, epoll_params.prefer_busy_poll);
+		WRITE_ONCE(ep->napi_prefer_busy_poll, epoll_params.prefer_busy_poll);
+		WRITE_ONCE(ep->event_poll_only, epoll_params.event_poll_only);
 		return 0;
 	case EPIOCGPARAMS:
 		memset(&epoll_params, 0, sizeof(epoll_params));
 		epoll_params.busy_poll_usecs = READ_ONCE(ep->busy_poll_usecs);
 		epoll_params.busy_poll_budget = READ_ONCE(ep->busy_poll_budget);
-		epoll_params.prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
+		epoll_params.prefer_busy_poll = READ_ONCE(ep->napi_prefer_busy_poll);
+		epoll_params.event_poll_only = READ_ONCE(ep->event_poll_only);
 		if (copy_to_user(uarg, &epoll_params, sizeof(epoll_params)))
 			return -EFAULT;
 		return 0;
@@ -2203,7 +2232,7 @@  static int do_epoll_create(int flags)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	ep->busy_poll_usecs = 0;
 	ep->busy_poll_budget = 0;
-	ep->prefer_busy_poll = false;
+	ep->napi_prefer_busy_poll = false;
 #endif
 	ep->file = file;
 	fd_install(fd, file);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 4f4b948ef3811..3bc0f4eed976c 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -89,9 +89,10 @@  struct epoll_params {
 	__u32 busy_poll_usecs;
 	__u16 busy_poll_budget;
 	__u8 prefer_busy_poll;
+	__u8 event_poll_only:1;
 
 	/* pad the struct to a multiple of 64bits */
-	__u8 __pad;
+	__u8 __pad:7;
 };
 
 #define EPOLL_IOC_TYPE 0x8A