
[v6,08/15] net: add helper executing custom callback from napi

Message ID 20241016185252.3746190-9-dw@davidwei.uk (mailing list archive)
State New
Series io_uring zero copy rx

Commit Message

David Wei Oct. 16, 2024, 6:52 p.m. UTC
From: Pavel Begunkov <asml.silence@gmail.com>

It's useful to have napi-private bits and pieces, like the page pool's
fast allocation cache, so that the hot allocation path doesn't have to
do any additional synchronisation. In the case of the io_uring memory
provider introduced in the following patches, we keep the consumer end
of io_uring's refill queue private to the napi as it's a hot path.

However, from time to time we need to synchronise with the napi, for
example to add more user memory or to allocate fallback buffers. Add a
helper function, napi_execute(), that runs a custom callback from under
napi context so that the callback can access and modify the
napi-protected parts of io_uring. It works similarly to busy polling
and stops the napi from running in the meantime, so it's meant to be a
slow control path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 include/net/busy_poll.h |  6 ++++
 net/core/dev.c          | 77 ++++++++++++++++++++++++++++++++---------
 2 files changed, 66 insertions(+), 17 deletions(-)

Comments

Paolo Abeni Oct. 21, 2024, 2:25 p.m. UTC | #1
Hi,

On 10/16/24 20:52, David Wei wrote:
> @@ -6503,6 +6511,41 @@ void napi_busy_loop(unsigned int napi_id,
>  }
>  EXPORT_SYMBOL(napi_busy_loop);
>  
> +void napi_execute(unsigned napi_id,
> +		  void (*cb)(void *), void *cb_arg)
> +{
> +	struct napi_struct *napi;
> +	void *have_poll_lock = NULL;

Minor nit: please respect the reverse x-mas tree order.
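
For reference, reverse christmas tree order sorts the declarations
longest line first, i.e. something like:

	void *have_poll_lock = NULL;
	struct napi_struct *napi;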

> +
> +	guard(rcu)();

Since this will land into net core code, please use the explicit RCU
read lock/unlock:

https://elixir.bootlin.com/linux/v6.12-rc3/source/Documentation/process/maintainer-netdev.rst#L387
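
In practice that would mean replacing the scope-based guard with
explicit calls, roughly (just a sketch of the requested change, not the
final code):

	rcu_read_lock();
	napi = napi_by_id(napi_id);
	if (!napi) {
		rcu_read_unlock();
		return;
	}
	...
	/* napi stays RCU-protected across the polling loop */
	rcu_read_unlock();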

> +	napi = napi_by_id(napi_id);
> +	if (!napi)
> +		return;
> +
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		preempt_disable();
> +
> +	for (;;) {
> +		local_bh_disable();
> +
> +		if (napi_state_start_busy_polling(napi, 0)) {
> +			have_poll_lock = netpoll_poll_lock(napi);
> +			cb(cb_arg);
> +			local_bh_enable();
> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
> +			break;
> +		}
> +
> +		local_bh_enable();
> +		if (unlikely(need_resched()))
> +			break;
> +		cpu_relax();

Don't you need a 'loop_end' condition here?
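
For example, a time bound similar to the generic busy_loop_timeout()
check could sit next to the need_resched() test; purely a hypothetical
sketch:

	unsigned long start_time = busy_loop_current_time();
	...
		local_bh_enable();
		if (unlikely(need_resched()) || busy_loop_timeout(start_time))
			break;
		cpu_relax();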

Side note, not specifically related to this patch: I likely got lost in
a previous revision, but it's unclear to me what the merge plan is here;
could you please (re-)word it?

Thanks!

Paolo
Jens Axboe Oct. 21, 2024, 3:49 p.m. UTC | #2
On 10/21/24 8:25 AM, Paolo Abeni wrote:
> Side note, not specifically related to this patch: I likely got lost in
> a previous revision, but it's unclear to me what the merge plan is here;
> could you please (re-)word it?

In the past, when there have been dependencies like this, what we've
done is:

Someone, usually Jakub, sets up a branch with the net bits only. Both
the io_uring tree and netdev-next pull that in.

With that, the io_uring tree can apply the io_uring-specific patches on
top, and either side can send a pull without impacting the other tree.

I like that way of doing it, as it keeps things separate, yet still easy
to deal with for the side that needs further work/patches on top.
Pavel Begunkov Oct. 21, 2024, 4:34 p.m. UTC | #3
On 10/21/24 16:49, Jens Axboe wrote:
> On 10/21/24 8:25 AM, Paolo Abeni wrote:
>> Side note, not specifically related to this patch: I likely got lost in
>> a previous revision, but it's unclear to me what the merge plan is here;
>> could you please (re-)word it?
> 
> In the past, when there have been dependencies like this, what we've
> done is:
> 
> Someone, usually Jakub, sets up a branch with the net bits only. Both
> the io_uring tree and netdev-next pull that in.
> 
> With that, the io_uring tree can apply the io_uring-specific patches on
> top, and either side can send a pull without impacting the other tree.
> 
> I like that way of doing it, as it keeps things separate, yet still easy
> to deal with for the side that needs further work/patches on top.

Yep, I outlined the same thing in one of the comments (I should put it
into the cover letter). We'll get a branch with the net/ patches on a
common base, so that it can be pulled into both the net and io_uring
trees. Then we can deal with the io_uring patches ourselves.
Pavel Begunkov Oct. 21, 2024, 5:16 p.m. UTC | #4
On 10/21/24 15:25, Paolo Abeni wrote:
> Hi,
> 
> On 10/16/24 20:52, David Wei wrote:
>> @@ -6503,6 +6511,41 @@ void napi_busy_loop(unsigned int napi_id,
>>   }
>>   EXPORT_SYMBOL(napi_busy_loop);
>>   
>> +void napi_execute(unsigned napi_id,
>> +		  void (*cb)(void *), void *cb_arg)
>> +{
>> +	struct napi_struct *napi;
>> +	void *have_poll_lock = NULL;
> 
> Minor nit: please respect the reverse x-mas tree order.
> 
>> +
>> +	guard(rcu)();
> 
> Since this will land into net core code, please use the explicit RCU
> read lock/unlock:
> 
> https://elixir.bootlin.com/linux/v6.12-rc3/source/Documentation/process/maintainer-netdev.rst#L387

I missed the doc update, will change it, thanks


>> +	napi = napi_by_id(napi_id);
>> +	if (!napi)
>> +		return;
>> +
>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> +		preempt_disable();
>> +
>> +	for (;;) {
>> +		local_bh_disable();
>> +
>> +		if (napi_state_start_busy_polling(napi, 0)) {
>> +			have_poll_lock = netpoll_poll_lock(napi);
>> +			cb(cb_arg);
>> +			local_bh_enable();
>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>> +			break;
>> +		}
>> +
>> +		local_bh_enable();
>> +		if (unlikely(need_resched()))
>> +			break;
>> +		cpu_relax();
> 
> Don't you need a 'loop_end' condition here?

As you mentioned in 14/15, it can indeed spin for a long time and is
bound only by need_resched(). Do you think it's reasonable to wait
without a time limit if we set NAPI_STATE_PREFER_BUSY_POLL? softirq
should yield the napi after it exhausts its budget, which should limit
it well enough; what do you think?

The only ugly part is that I don't want it to mess with the
NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()

busy_poll_stop() {
	...
	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
	if (flags & NAPI_F_PREFER_BUSY_POLL) {
		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
		if (napi->defer_hard_irqs_count && timeout) {
			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
			skip_schedule = true;
		}
	}
}

Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.

set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
...
busy_poll_stop(napi, flags=0);
Paolo Abeni Oct. 22, 2024, 7:47 a.m. UTC | #5
Hi,

On 10/21/24 19:16, Pavel Begunkov wrote:
> On 10/21/24 15:25, Paolo Abeni wrote:
>> On 10/16/24 20:52, David Wei wrote:

[...]
>>> +	napi = napi_by_id(napi_id);
>>> +	if (!napi)
>>> +		return;
>>> +
>>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>>> +		preempt_disable();
>>> +
>>> +	for (;;) {
>>> +		local_bh_disable();
>>> +
>>> +		if (napi_state_start_busy_polling(napi, 0)) {
>>> +			have_poll_lock = netpoll_poll_lock(napi);
>>> +			cb(cb_arg);
>>> +			local_bh_enable();
>>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>>> +			break;
>>> +		}
>>> +
>>> +		local_bh_enable();
>>> +		if (unlikely(need_resched()))
>>> +			break;
>>> +		cpu_relax();
>>
>> Don't you need a 'loop_end' condition here?
> 
> As you mentioned in 14/15, it can indeed spin for a long time and is
> bound only by need_resched(). Do you think it's reasonable to wait
> without a time limit if we set NAPI_STATE_PREFER_BUSY_POLL? softirq
> should yield the napi after it exhausts its budget, which should limit
> it well enough; what do you think?
> 
> The only ugly part is that I don't want it to mess with the
> NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()
> 
> busy_poll_stop() {
> 	...
> 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
> 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> 		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
> 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
> 		if (napi->defer_hard_irqs_count && timeout) {
> 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> 			skip_schedule = true;
> 		}
> 	}
> }

Why do you want to avoid that branch? It only takes action when user
space explicitly wants to leverage the hrtimer to check for incoming
packets. In that case, I think the kernel should try to respect the
user configuration.

> Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.
> 
> set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
> ...
> busy_poll_stop(napi, flags=0);

My preference is for using NAPI_STATE_PREFER_BUSY_POLL consistently. It
should ensure a reasonably low latency for napi_execute() and consistent
infra behavior. Unless I'm missing some dangerous side effect ;)

Thanks,

Paolo
Pavel Begunkov Oct. 22, 2024, 3:36 p.m. UTC | #6
On 10/22/24 08:47, Paolo Abeni wrote:
> Hi,
> 
> On 10/21/24 19:16, Pavel Begunkov wrote:
>> On 10/21/24 15:25, Paolo Abeni wrote:
>>> On 10/16/24 20:52, David Wei wrote:
> 
> [...]
>>>> +	napi = napi_by_id(napi_id);
>>>> +	if (!napi)
>>>> +		return;
>>>> +
>>>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>>>> +		preempt_disable();
>>>> +
>>>> +	for (;;) {
>>>> +		local_bh_disable();
>>>> +
>>>> +		if (napi_state_start_busy_polling(napi, 0)) {
>>>> +			have_poll_lock = netpoll_poll_lock(napi);
>>>> +			cb(cb_arg);
>>>> +			local_bh_enable();
>>>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		local_bh_enable();
>>>> +		if (unlikely(need_resched()))
>>>> +			break;
>>>> +		cpu_relax();
>>>
>>> Don't you need a 'loop_end' condition here?
>>
>> As you mentioned in 14/15, it can indeed spin for a long time and is
>> bound only by need_resched(). Do you think it's reasonable to wait
>> without a time limit if we set NAPI_STATE_PREFER_BUSY_POLL? softirq
>> should yield the napi after it exhausts its budget, which should limit
>> it well enough; what do you think?
>>
>> The only ugly part is that I don't want it to mess with the
>> NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()
>>
>> busy_poll_stop() {
>> 	...
>> 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
>> 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
>> 		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
>> 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
>> 		if (napi->defer_hard_irqs_count && timeout) {
>> 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
>> 			skip_schedule = true;
>> 		}
>> 	}
>> }
> 
> Why do you want to avoid that branch? It only takes action when user
> space explicitly wants to leverage the hrtimer to check for incoming
> packets. In that case, I think the kernel should try to respect the
> user configuration.

It should be fine to pass the flag here; it just doesn't feel right.
napi_execute() is not interested in polling, but IIRC this chunk delays
the moment when softirq kicks in when there are no napi pollers, i.e.
IMO it ideally shouldn't affect napi polling timings...

>> Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.
>>
>> set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
>> ...
>> busy_poll_stop(napi, flags=0);
> 
> My preference is for using NAPI_STATE_PREFER_BUSY_POLL consistently. It
> should ensure a reasonably low latency for napi_execute() and consistent
> infra behavior. Unless I'm missing some dangerous side effect ;)

... but let's just set it then. It only affects the zerocopy private
queue.
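
For clarity, consistently applying the flag would make the
napi_execute() loop look roughly like this (a sketch of the direction
agreed on above, not the posted code):

	if (napi_state_start_busy_polling(napi, NAPI_F_PREFER_BUSY_POLL)) {
		have_poll_lock = netpoll_poll_lock(napi);
		cb(cb_arg);
		local_bh_enable();
		busy_poll_stop(napi, have_poll_lock, NAPI_F_PREFER_BUSY_POLL, 1);
		break;
	}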

Patch

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index f03040baaefd..3fd9e65731e9 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,7 @@  bool sk_busy_loop_end(void *p, unsigned long start_time);
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(unsigned napi_id, void (*cb)(void *), void *cb_arg);
 
 void napi_busy_loop_rcu(unsigned int napi_id,
 			bool (*loop_end)(void *, unsigned long),
@@ -63,6 +64,11 @@  static inline bool sk_can_busy_loop(struct sock *sk)
 	return false;
 }
 
+static inline void napi_execute(unsigned napi_id,
+				void (*cb)(void *), void *cb_arg)
+{
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..f3bd5fd56286 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6347,6 +6347,30 @@  enum {
 	NAPI_F_END_ON_RESCHED	= 2,
 };
 
+static inline bool napi_state_start_busy_polling(struct napi_struct *napi,
+						 unsigned flags)
+{
+	unsigned long val = READ_ONCE(napi->state);
+
+	/* If multiple threads are competing for this napi,
+	 * we avoid dirtying napi->state as much as we can.
+	 */
+	if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+		   NAPIF_STATE_IN_BUSY_POLL))
+		goto fail;
+
+	if (cmpxchg(&napi->state, val,
+		    val | NAPIF_STATE_IN_BUSY_POLL |
+			  NAPIF_STATE_SCHED) != val)
+		goto fail;
+
+	return true;
+fail:
+	if (flags & NAPI_F_PREFER_BUSY_POLL)
+		set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+	return false;
+}
+
 static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 			   unsigned flags, u16 budget)
 {
@@ -6422,24 +6446,8 @@  static void __napi_busy_loop(unsigned int napi_id,
 		local_bh_disable();
 		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 		if (!napi_poll) {
-			unsigned long val = READ_ONCE(napi->state);
-
-			/* If multiple threads are competing for this napi,
-			 * we avoid dirtying napi->state as much as we can.
-			 */
-			if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
-				   NAPIF_STATE_IN_BUSY_POLL)) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+			if (!napi_state_start_busy_polling(napi, flags))
 				goto count;
-			}
-			if (cmpxchg(&napi->state, val,
-				    val | NAPIF_STATE_IN_BUSY_POLL |
-					  NAPIF_STATE_SCHED) != val) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
-				goto count;
-			}
 			have_poll_lock = netpoll_poll_lock(napi);
 			napi_poll = napi->poll;
 		}
@@ -6503,6 +6511,41 @@  void napi_busy_loop(unsigned int napi_id,
 }
 EXPORT_SYMBOL(napi_busy_loop);
 
+void napi_execute(unsigned napi_id,
+		  void (*cb)(void *), void *cb_arg)
+{
+	struct napi_struct *napi;
+	void *have_poll_lock = NULL;
+
+	guard(rcu)();
+	napi = napi_by_id(napi_id);
+	if (!napi)
+		return;
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
+	for (;;) {
+		local_bh_disable();
+
+		if (napi_state_start_busy_polling(napi, 0)) {
+			have_poll_lock = netpoll_poll_lock(napi);
+			cb(cb_arg);
+			local_bh_enable();
+			busy_poll_stop(napi, have_poll_lock, 0, 1);
+			break;
+		}
+
+		local_bh_enable();
+		if (unlikely(need_resched()))
+			break;
+		cpu_relax();
+	}
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static void __napi_hash_add_with_id(struct napi_struct *napi,