
[RFC,net-next,1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

Message ID 20240213145923.2552753-2-bigeasy@linutronix.de (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Use per-task storage for XDP-redirects on PREEMPT_RT

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/apply fail Patch does not apply to net-next

Commit Message

Sebastian Andrzej Siewior Feb. 13, 2024, 2:58 p.m. UTC
The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without the locking in local_bh_disable() on PREEMPT_RT, this data
structure requires explicit locking to avoid corruption if preemption
occurs.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided and data corruption is also avoided.

Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
Define the variable on the stack and use xdp_storage_set() to set a
pointer to it in the task_struct of the current task. Use the __free()
annotation to automatically reset the pointer once the function returns.
Use a pointer for the __free() annotation so that the callback is not
invoked if the pointer is NULL; this helps the compiler to optimize the
code.
xdp_storage_set() can nest. For instance, local_bh_enable() in
bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/net_rx_action() which
also uses xdp_storage_set(). Therefore only the first invocation
updates the per-task pointer.
Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
per-CPU variable instead. This should also work for !PREEMPT_RT but
isn't needed.
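
A minimal sketch of the helpers described above (an editorial
illustration, assuming the names used in this message and the
task_struct member added by this patch; the actual code may differ):

/* Relies on DEFINE_FREE() from linux/cleanup.h. */
struct bpf_xdp_storage {
        struct bpf_redirect_info ri;
};

/* Only the outermost invocation installs the pointer; nested callers
 * get NULL back so their __free() callback does nothing.
 */
static inline struct bpf_xdp_storage *
xdp_storage_set(struct bpf_xdp_storage *xdp_store)
{
        if (current->bpf_xdp_storage)
                return NULL;
        current->bpf_xdp_storage = xdp_store;
        return xdp_store;
}

static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store)
{
        current->bpf_xdp_storage = NULL;
}

/* The "if (_T)" guard skips the callback for a NULL pointer, which is
 * what lets the compiler optimize the cleanup path.
 */
DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *,
            if (_T) xdp_storage_clear(_T))

A caller then allocates the structure on its stack and installs it:

        struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
        struct bpf_xdp_storage __xdp_store;

        xdp_store = xdp_storage_set(&__xdp_store);

as seen in the sch_handle_ingress()/sch_handle_egress() hunks quoted
later in this thread.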

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 79 ++++++++++++++++++++++++++++++++++++---
 include/linux/sched.h  |  5 +++
 kernel/bpf/cpumap.c    |  2 +
 kernel/fork.c          |  3 ++
 net/bpf/test_run.c     |  9 ++++-
 net/core/dev.c         | 17 +++++++++
 net/core/filter.c      | 85 +++++++++++++++++++++++++++++++-----------
 net/core/lwt_bpf.c     |  3 ++
 8 files changed, 174 insertions(+), 29 deletions(-)

Comments

Jesper Dangaard Brouer Feb. 13, 2024, 8:50 p.m. UTC | #1
On 13/02/2024 15.58, Sebastian Andrzej Siewior wrote:
> The XDP redirect process is two-staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
>    packet and makes decisions. While doing that, the per-CPU variable
>    bpf_redirect_info is used.
> 
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>    and it may also access other per-CPU variables like xskmap_flush_list.
> 
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
> 
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protection reasons.
> Without the locking in local_bh_disable() on PREEMPT_RT, this data
> structure requires explicit locking to avoid corruption if preemption
> occurs.
> 
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided and data corruption is also avoided.
> 
> Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
> Define the variable on the stack and use xdp_storage_set() to set a
> pointer to it in the task_struct of the current task. Use the __free()
> annotation to automatically reset the pointer once the function returns.
> Use a pointer for the __free() annotation so that the callback is not
> invoked if the pointer is NULL; this helps the compiler to optimize the
> code.
> xdp_storage_set() can nest. For instance, local_bh_enable() in
> bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/net_rx_action() which
> also uses xdp_storage_set(). Therefore only the first invocation
> updates the per-task pointer.
> Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
> 
> This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
> per-CPU variable instead. This should also work for !PREEMPT_RT but
> isn't needed.
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I generally like the idea around bpf_xdp_storage.

I only skimmed the code, but noticed some extra if-statements (for
!NULL). I don't think they will make a difference, but I know Toke wants
me to test it...

I'll hopefully have time to look at code closer tomorrow.

--Jesper
Sebastian Andrzej Siewior Feb. 14, 2024, 12:19 p.m. UTC | #2
On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
> I generally like the idea around bpf_xdp_storage.
> 
> I only skimmed the code, but noticed some extra if-statements (for
> !NULL). I don't think they will make a difference, but I know Toke wants
> me to test it...

I've been looking at the assembly for the return value of
bpf_redirect_info() and there is a NULL pointer check. I hoped it was
obviously non-NULL because it is a static struct.

Should this become a problem I could add
"__attribute__((returns_nonnull))" to the declaration of the function
which will optimize the NULL check away.
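
For illustration, a sketch of that annotation on the !PREEMPT_RT
accessor (hypothetical, not part of the posted patch):

/* returns_nonnull (GCC >= 4.9, Clang) promises the compiler that the
 * return value is never NULL, so NULL checks at call sites can be
 * elided.
 */
__attribute__((returns_nonnull))
static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
{
        return this_cpu_ptr(&bpf_redirect_info);
}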

> I'll hopefully have time to look at code closer tomorrow.
> 
> --Jesper

Sebastian
Toke Høiland-Jørgensen Feb. 14, 2024, 1:23 p.m. UTC | #3
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
>> I generally like the idea around bpf_xdp_storage.
>> 
>> I only skimmed the code, but noticed some extra if-statements (for
>> !NULL). I don't think they will make a difference, but I know Toke wants
>> me to test it...
>
> I've been looking at the assembly for the return value of
> bpf_redirect_info() and there is a NULL pointer check. I hoped it was
> obviously non-NULL because it is a static struct.
>
> Should this become a problem I could add
> "__attribute__((returns_nonnull))" to the declaration of the function
> which will optimize the NULL check away.

If we know the function will never return NULL (I was wondering about
that, actually), why have the check in the C code at all? Couldn't we just
omit it entirely instead of relying on the compiler to optimise it out?

-Toke
Sebastian Andrzej Siewior Feb. 14, 2024, 2:28 p.m. UTC | #4
On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote:
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> 
> > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
> >> I generally like the idea around bpf_xdp_storage.
> >> 
> >> I only skimmed the code, but noticed some extra if-statements (for
> >> !NULL). I don't think they will make a difference, but I know Toke wants
> >> me to test it...
> >
> > I've been looking at the assembly for the return value of
> > bpf_redirect_info() and there is a NULL pointer check. I hoped it was
> > obviously non-NULL because it is a static struct.
> >
> > Should this become a problem I could add
> > "__attribute__((returns_nonnull))" to the declaration of the function
> > which will optimize the NULL check away.
> 
> If we know the function will never return NULL (I was wondering about
> that, actually), why have the check in the C code at all? Couldn't we just
> omit it entirely instead of relying on the compiler to optimise it out?

The !RT version does:
| static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
| {
|         return this_cpu_ptr(&bpf_redirect_info);
| }

which is static and can't be NULL (unless by mysterious ways the per-CPU
offset + bpf_redirect_info offset is NULL). Maybe I can put this in
this_cpu_ptr()… Let me think about it.

For RT I have:
| static inline struct bpf_xdp_storage *xdp_storage_get(void)
| {
|         struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
|
|         WARN_ON_ONCE(!xdp_store);
|         return xdp_store;
| }
|
| static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
| {
|         struct bpf_xdp_storage *xdp_store = xdp_storage_get();
|
|         if (!xdp_store)
|                 return NULL;
|         return &xdp_store->ri;
| }

so if current->bpf_xdp_storage is NULL then we get a warning and a NULL
pointer. This *should* not happen due to xdp_storage_set() which
assigns the pointer. However if I missed a spot then there is the check
which aborts further processing.

During testing I forgot a spot in egress and the test module. You could
argue that the warning is enough: since a missed spot is missed every
time, not just by chance, it should pop up in testing rather than in
production (go boom, send a report). I *think* I covered all spots; at
least the test suite didn't point anything out to me.
I was unsure if I need something around net_tx_action() due to
TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by
sch_handle_egress().

> -Toke

Sebastian
Toke Høiland-Jørgensen Feb. 14, 2024, 4:08 p.m. UTC | #5
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote:
>> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>> 
>> > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
>> >> I generally like the idea around bpf_xdp_storage.
>> >> 
>> >> I only skimmed the code, but noticed some extra if-statements (for
>> >> !NULL). I don't think they will make a difference, but I know Toke wants
>> >> me to test it...
>> >
>> > I've been looking at the assembly for the return value of
>> > bpf_redirect_info() and there is a NULL pointer check. I hoped it was
>> > obviously non-NULL because it is a static struct.
>> >
>> > Should this become a problem I could add
>> > "__attribute__((returns_nonnull))" to the declaration of the function
>> > which will optimize the NULL check away.
>> 
>> If we know the function will never return NULL (I was wondering about
>> that, actually), why have the check in the C code at all? Couldn't we just
>> omit it entirely instead of relying on the compiler to optimise it out?
>
> The !RT version does:
> | static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
> | {
> |         return this_cpu_ptr(&bpf_redirect_info);
> | }
>
> which is static and can't be NULL (unless by mysterious ways the per-CPU
> offset + bpf_redirect_info offset is NULL). Maybe I can put this in
> this_cpu_ptr()… Let me think about it.
>
> For RT I have:
> | static inline struct bpf_xdp_storage *xdp_storage_get(void)
> | {
> |         struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
> |
> |         WARN_ON_ONCE(!xdp_store);
> |         return xdp_store;
> | }
> |
> | static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
> | {
> |         struct bpf_xdp_storage *xdp_store = xdp_storage_get();
> |
> |         if (!xdp_store)
> |                 return NULL;
> |         return &xdp_store->ri;
> | }
>
> so if current->bpf_xdp_storage is NULL then we get a warning and a NULL
> pointer. This *should* not happen due to xdp_storage_set() which
> assigns the pointer. However if I missed a spot then there is the check
> which aborts further processing.
>
> During testing I forgot a spot in egress and the test module. You could
> argue that the warning is enough: since a missed spot is missed every
> time, not just by chance, it should pop up in testing rather than in
> production (go boom, send a report). I *think* I covered all spots; at
> least the test suite didn't point anything out to me.

Well, I would prefer if we could make sure we covered everything and not
have this odd failure mode where redirect just mysteriously stops
working. At the very least, if we keep the check we should have a
WARN_ON in there to make it really obvious that something needs to be
fixed.

This brings me to another thing I was going to point out separately, but
may as well mention it here: It would be good if we could keep the
difference between the RT and !RT versions as small as possible to avoid
having subtle bugs that only appear in one configuration.

I agree with Jesper that the concept of a stack-allocated "run context"
for the XDP program makes sense in general (and I have some vague ideas
about other things that may be useful to stick in there). So I'm
wondering if it makes sense to do that even in the !RT case? We can't
stick the pointer to it into 'current' when running in softirq, but we
could change the per-cpu variable to just be a pointer that gets
populated by xdp_storage_set()?

I'm not really sure if this would be performance neutral (it's just
moving around a few bits of memory, but we do gain an extra pointer
deref), but it should be simple enough to benchmark.
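
For concreteness, a sketch of that suggestion (hypothetical, with the
helper names assumed from the patch; this is not code from the series):

static DEFINE_PER_CPU(struct bpf_xdp_storage *, bpf_xdp_storage_ptr);

static inline struct bpf_xdp_storage *
xdp_storage_set(struct bpf_xdp_storage *xdp_store)
{
        /* Same nesting rule as the RT version: only the outermost
         * caller installs the pointer.
         */
        if (this_cpu_read(bpf_xdp_storage_ptr))
                return NULL;
        this_cpu_write(bpf_xdp_storage_ptr, xdp_store);
        return xdp_store;
}

static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
{
        struct bpf_xdp_storage *xdp_store = this_cpu_read(bpf_xdp_storage_ptr);

        return xdp_store ? &xdp_store->ri : NULL;
}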

> I was unsure if I need something around net_tx_action() due to
> TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by
> sch_handle_egress().

Yup, I believe you're correct.

-Toke
Toke Høiland-Jørgensen Feb. 14, 2024, 4:13 p.m. UTC | #6
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two-staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protection reasons.
> Without the locking in local_bh_disable() on PREEMPT_RT, this data
> structure requires explicit locking to avoid corruption if preemption
> occurs.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided and data corruption is also avoided.
>
> Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
> Define the variable on the stack and use xdp_storage_set() to set a
> pointer to it in the task_struct of the current task. Use the __free()
> annotation to automatically reset the pointer once the function returns.
> Use a pointer for the __free() annotation so that the callback is not
> invoked if the pointer is NULL; this helps the compiler to optimize the
> code.
> xdp_storage_set() can nest. For instance, local_bh_enable() in
> bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/net_rx_action() which
> also uses xdp_storage_set(). Therefore only the first invocation
> updates the per-task pointer.
> Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
> per-CPU variable instead. This should also work for !PREEMPT_RT but
> isn't needed.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

[...]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index de362d5f26559..c3f7d2a6b6134 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>  		   struct net_device *orig_dev, bool *another)
>  {
>  	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
> +	struct bpf_xdp_storage __xdp_store;
>  	int sch_ret;
>  
>  	if (!entry)
>  		return skb;
> +
> +	xdp_store = xdp_storage_set(&__xdp_store);
>  	if (*pt_prev) {
>  		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>  		*pt_prev = NULL;
> @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
>  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
>  	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
> +	struct bpf_xdp_storage __xdp_store;
>  	int sch_ret;
>  
>  	if (!entry)
>  		return skb;
>  
> +	xdp_store = xdp_storage_set(&__xdp_store);
> +
>  	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>  	 * already set by the caller.
>  	 */


These, and the LWT code, don't actually have anything to do with XDP,
which indicates that the 'xdp_storage' name is misleading. Maybe
'bpf_net_context' or something along those lines? Or maybe we could just
move the flush lists into bpf_redirect_info itself and just keep that as
the top-level name?

-Toke
Sebastian Andrzej Siewior Feb. 14, 2024, 4:36 p.m. UTC | #7
On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote:
> > During testing I forgot a spot in egress and the test module. You could
> > argue that the warning is enough: since a missed spot is missed every
> > time, not just by chance, it should pop up in testing rather than in
> > production (go boom, send a report). I *think* I covered all spots; at
> > least the test suite didn't point anything out to me.
> 
> Well, I would prefer if we could make sure we covered everything and not
> have this odd failure mode where redirect just mysteriously stops
> working. At the very least, if we keep the check we should have a
> WARN_ON in there to make it really obvious that something needs to be
> fixed.

Agree.

> This brings me to another thing I was going to point out separately, but
> may as well mention it here: It would be good if we could keep the
> difference between the RT and !RT versions as small as possible to avoid
> having subtle bugs that only appear in one configuration.

Yes. I do so, too.

> I agree with Jesper that the concept of a stack-allocated "run context"
> for the XDP program makes sense in general (and I have some vague ideas
> about other things that may be useful to stick in there). So I'm
> wondering if it makes sense to do that even in the !RT case? We can't
> stick the pointer to it into 'current' when running in softirq, but we
> could change the per-cpu variable to just be a pointer that gets
> populated by xdp_storage_set()?

I *think* that it could be added to current. The assignment currently
allows nesting so that is not a problem. Only the outermost set/clear
would do something. If you run in softirq, you would hijack a random
task_struct. If the pointer is already assigned then the lists and so on
must be empty because access is only allowed in BH-disabled sections.

However, using per-CPU for the pointer (instead of task_struct) would
have the advantage that it is always CPU/node local memory while the
random task_struct could have been allocated on a different NUMA node.

> I'm not really sure if this would be performance neutral (it's just
> moving around a few bits of memory, but we do gain an extra pointer
> deref), but it should be simple enough to benchmark.

My guess is that we remain with one per-CPU dereference and an
additional "add offset". That is why I kept the !RT bits as they are
before getting yelled at.

I could prepare something and run a specific test if you have one.

> -Toke

Sebastian
Sebastian Andrzej Siewior Feb. 15, 2024, 9:04 a.m. UTC | #8
On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index de362d5f26559..c3f7d2a6b6134 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
> >  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >  {
> >  	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> > +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
> >  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
> > +	struct bpf_xdp_storage __xdp_store;
> >  	int sch_ret;
> >  
> >  	if (!entry)
> >  		return skb;
> >  
> > +	xdp_store = xdp_storage_set(&__xdp_store);
> > +
> >  	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> >  	 * already set by the caller.
> >  	 */
> 
> 
> These, and the LWT code, don't actually have anything to do with XDP,
> which indicates that the 'xdp_storage' name is misleading. Maybe
> 'bpf_net_context' or something along those lines? Or maybe we could just
> move the flush lists into bpf_redirect_info itself and just keep that as
> the top-level name?

I'm going to rename it as suggested for now. If it is a better fit to
move the lists into bpf_redirect_info then I will update accordingly.

> -Toke

Sebastian
Toke Høiland-Jørgensen Feb. 15, 2024, 12:11 p.m. UTC | #9
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote:
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index de362d5f26559..c3f7d2a6b6134 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
>> >  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>> >  {
>> >  	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
>> > +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>> >  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
>> > +	struct bpf_xdp_storage __xdp_store;
>> >  	int sch_ret;
>> >  
>> >  	if (!entry)
>> >  		return skb;
>> >  
>> > +	xdp_store = xdp_storage_set(&__xdp_store);
>> > +
>> >  	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>> >  	 * already set by the caller.
>> >  	 */
>> 
>> 
>> These, and the LWT code, don't actually have anything to do with XDP,
>> which indicates that the 'xdp_storage' name is misleading. Maybe
>> 'bpf_net_context' or something along those lines? Or maybe we could just
>> move the flush lists into bpf_redirect_info itself and just keep that as
>> the top-level name?
>
> I'm going to rename it as suggested for now. If it is a better fit to
> move the lists into bpf_redirect_info then I will update accordingly.

OK, SGTM.

-Toke
Toke Høiland-Jørgensen Feb. 15, 2024, 8:23 p.m. UTC | #10
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote:
>> > During testing I forgot a spot in egress and the test module. You could
>> > argue that the warning is enough: since a missed spot is missed every
>> > time, not just by chance, it should pop up in testing rather than in
>> > production (go boom, send a report). I *think* I covered all spots; at
>> > least the test suite didn't point anything out to me.
>> 
>> Well, I would prefer if we could make sure we covered everything and not
>> have this odd failure mode where redirect just mysteriously stops
>> working. At the very least, if we keep the check we should have a
>> WARN_ON in there to make it really obvious that something needs to be
>> fixed.
>
> Agree.
>
>> This brings me to another thing I was going to point out separately, but
>> may as well mention it here: It would be good if we could keep the
>> difference between the RT and !RT versions as small as possible to avoid
>> having subtle bugs that only appear in one configuration.
>
> Yes. I do so, too.
>
>> I agree with Jesper that the concept of a stack-allocated "run context"
>> for the XDP program makes sense in general (and I have some vague ideas
>> about other things that may be useful to stick in there). So I'm
>> wondering if it makes sense to do that even in the !RT case? We can't
>> stick the pointer to it into 'current' when running in softirq, but we
>> could change the per-cpu variable to just be a pointer that gets
>> populated by xdp_storage_set()?
>
> I *think* that it could be added to current. The assignment currently
> allows nesting so that is not a problem. Only the outermost set/clear
> would do something. If you run in softirq, you would hijack a random
> task_struct. If the pointer is already assigned then the lists and so on
> must be empty because access is only allowed in BH-disabled sections.
>
> However, using per-CPU for the pointer (instead of task_struct) would
> have the advantage that it is always CPU/node local memory while the
> random task_struct could have been allocated on a different NUMA node.

Ah yes, good point, that's probably desirable :)

>> I'm not really sure if this would be performance neutral (it's just
>> moving around a few bits of memory, but we do gain an extra pointer
>> deref), but it should be simple enough to benchmark.
>
> My guess is that we remain with one per-CPU dereference and an
> additional "add offset". That is why I kept the !RT bits as they are
> before getting yelled at.
>
> I could prepare something and run a specific test if you have one.

The test itself is simple enough: Simply run xdp-bench (from
xdp-tools[0]) in drop mode, serve some traffic to the machine and
observe the difference in PPS before and after the patch.

The tricky part is that the traffic actually has to stress the CPU,
which means that the offered load has to be higher than what the CPU can
handle. Which generally means running on high-speed NICs with small
packets: a modern server CPU has no problem keeping up with a 10G link
even at 64-byte packet size, so a 100G NIC is needed, or the test needs
to be run on a low-powered machine.

As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
be used, or the in-kernel pktgen, or something like T-rex or Moongen.
Generally serving UDP traffic with 64-byte packets on a single port
is enough to make sure the traffic is serviced by a single CPU, although
some configuration may be needed to steer IRQs as well.

-Toke

[0] https://github.com/xdp-project/xdp-tools
Sebastian Andrzej Siewior Feb. 16, 2024, 4:57 p.m. UTC | #11
On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote:
> The tricky part is that the traffic actually has to stress the CPU,
> which means that the offered load has to be higher than what the CPU can
> handle. Which generally means running on high-speed NICs with small
> packets: a modern server CPU has no problem keeping up with a 10G link
> even at 64-byte packet size, so a 100G NIC is needed, or the test needs
> to be run on a low-powered machine.

I have a 10G box. I can tell cpufreq to go down to 1.1 GHz and I could
reduce the queues to one and hope that it is slow enough.

> As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
> be used, or the in-kernel pktgen, or something like T-rex or Moongen.
> Generally serving UDP traffic with 64-byte packets on a single port
> is enough to make sure the traffic is serviced by a single CPU, although
> some configuration may be needed to steer IRQs as well.

I played with xdp-trafficgen:
| # xdp-trafficgen udp eno2  -v
| Current rlimit is infinity or 0. Not raising
| Kernel supports 5-arg xdp_cpumap_kthread tracepoint
| Error in ethtool ioctl: Operation not supported
| Got -95 queues for ifname lo
| Kernel supports 5-arg xdp_cpumap_kthread tracepoint
| Got 94 queues for ifname eno2
| Transmitting on eno2 (ifindex 3)
| lo->eno2                        0 err/s                 0 xmit/s
| lo->eno2                        0 err/s                 0 xmit/s
| lo->eno2                        0 err/s                 0 xmit/s

I even tried to set the MAC address with -M/-m but nothing happens. I do
see something happening on the drop side when I issue a ping command.
Does that ring a bell? Otherwise I'll try pktgen. This is a Debian
kernel (just to ensure I didn't break something or forget a config
switch).

> -Toke

Sebastian
Toke Høiland-Jørgensen Feb. 19, 2024, 7:01 p.m. UTC | #12
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote:
>> The tricky part is that the traffic actually has to stress the CPU,
>> which means that the offered load has to be higher than what the CPU can
>> handle. Which generally means running on high-speed NICs with small
>> packets: a modern server CPU has no problem keeping up with a 10G link
>> even at 64-byte packet size, so a 100G NIC is needed, or the test needs
>> to be run on a low-powered machine.
>
> I have a 10G box. I can tell cpufreq to go down to 1.1 GHz and I could
> reduce the queues to one and hope that it is slow enough.

Yeah, that may work. As long as the baseline performance is below the
~14Mpps that's 10G line rate for small packets.
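
(For reference: a minimum-size frame occupies 84 bytes on the wire,
i.e. 7-byte preamble + 1-byte start-of-frame delimiter + 64-byte frame
+ 12-byte inter-frame gap, so 10^10 bit/s / (84 * 8 bit) ≈ 14.88 Mpps.)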

>> As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
>> be used, or the in-kernel pktgen, or something like T-rex or Moongen.
>> Generally serving UDP traffic with 64-byte packets on a single port
>> is enough to make sure the traffic is serviced by a single CPU, although
>> some configuration may be needed to steer IRQs as well.
>
> I played with xdp-trafficgen:
> | # xdp-trafficgen udp eno2  -v
> | Current rlimit is infinity or 0. Not raising
> | Kernel supports 5-arg xdp_cpumap_kthread tracepoint
> | Error in ethtool ioctl: Operation not supported
> | Got -95 queues for ifname lo
> | Kernel supports 5-arg xdp_cpumap_kthread tracepoint
> | Got 94 queues for ifname eno2
> | Transmitting on eno2 (ifindex 3)
> | lo->eno2                        0 err/s                 0 xmit/s
> | lo->eno2                        0 err/s                 0 xmit/s
> | lo->eno2                        0 err/s                 0 xmit/s
>
> I even tried to set the MAC address with -M/-m but nothing happens. I do
> see something happening on the drop side when I issue a ping command.
> Does that ring a bell? Otherwise I'll try pktgen. This is a Debian
> kernel (just to ensure I didn't break something or forget a config
> switch).

Hmm, how old a kernel? And on what hardware? xdp-trafficgen requires a
relatively new kernel, and the driver needs to support XDP_REDIRECT. It
may be simpler to use pktgen, and at 10G rates that shouldn't become a
bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
samples/pktgen in the kernel source tree is fine for this usage.

-Toke
Jesper Dangaard Brouer Feb. 20, 2024, 9:17 a.m. UTC | #13
On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
> may be simpler to use pktgen, and at 10G rates that shouldn't become a
> bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
> samples/pktgen in the kernel source tree is fine for this usage.

Example of running script:
  ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m ec:0d:9a:db:11:c4 -t 12

Notice the last parameter, which is the number of threads to start.
If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
more CPUs.

--Jesper
Sebastian Andrzej Siewior Feb. 20, 2024, 10:17 a.m. UTC | #14
On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote:
> 
> 
> On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
> > may be simpler to use pktgen, and at 10G rates that shouldn't become a
> > bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
> > samples/pktgen in the kernel source tree is fine for this usage.
> 
> Example of running script:
>  ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m ec:0d:9a:db:11:c4 -t 12
> 
> Notice the last parameter, which is the number of threads to start.
> If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
> more CPUs.

I get 
| Summary                 8,435,690 rx/s                  0 err/s
| Summary                 8,436,294 rx/s                  0 err/s

with "-t 8 -b 64". I started with 2 and then increased until rx/s was
falling again. I have ixgbe on the sending side and i40e on the
receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
| # xdp-bench drop eth1
| Failed to attach XDP program: Cannot allocate memory

This is v6.8-rc5 on both sides. Let me see where this is coming from…

> --Jesper

Sebastian
Jesper Dangaard Brouer Feb. 20, 2024, 10:42 a.m. UTC | #15
On 20/02/2024 11.17, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote:
>>
>>
>> On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
>>> may be simpler to use pktgen, and at 10G rates that shouldn't become a
>>> bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
>>> samples/pktgen in the kernel source tree is fine for this usage.
>>
>> Example of running script:
>>   ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m ec:0d:9a:db:11:c4 -t 12
>>
>> Notice the last parameter, which is the number of threads to start.
>> If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
>> more CPUs.
> 
> I get
> | Summary                 8,435,690 rx/s                  0 err/s

This seems low...
Have you remembered to disable Ethernet flow-control?

  # ethtool -A ixgbe1 rx off tx off
  # ethtool -A i40e2 rx off tx off


> | Summary                 8,436,294 rx/s                  0 err/s

You want to see the "extended" info via cmdline (or Ctrl+\)

  # xdp-bench drop -e eth1


> 
> with "-t 8 -b 64". I started with 2 and then increased until rx/s was
> falling again. I have ixgbe on the sending side and i40e on the

With ixgbe on the sending side, my testlab shows I need -t 2.

With -t 2 :
Summary                14,678,170 rx/s                  0 err/s
   receive total        14,678,170 pkt/s        14,678,170 drop/s         0 error/s
     cpu:1              14,678,170 pkt/s        14,678,170 drop/s         0 error/s
   xdp_exception                 0 hit/s

with -t 4:

Summary                10,255,385 rx/s                  0 err/s
   receive total        10,255,385 pkt/s        10,255,385 drop/s         0 error/s
     cpu:1              10,255,385 pkt/s        10,255,385 drop/s         0 error/s
   xdp_exception                 0 hit/s

> receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
> | # xdp-bench drop eth1
> | Failed to attach XDP program: Cannot allocate memory
> 
> This is v6.8-rc5 on both sides. Let me see where this is coming from…
> 

Another pitfall with ixgbe is that it does a full link reset when
adding/removing an XDP prog on the device.  This can be annoying if connected
back-to-back, because "remote" pktgen will stop on link reset.

--Jesper
Sebastian Andrzej Siewior Feb. 20, 2024, 12:08 p.m. UTC | #16
On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote:
> This seems low...
> Have you remembered to disable Ethernet flow-control?

No, but one side says:
| i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None

but I did this

>  # ethtool -A ixgbe1 rx off tx off
>  # ethtool -A i40e2 rx off tx off

and it didn't change much.

> 
> > | Summary                 8,436,294 rx/s                  0 err/s
> 
> You want to see the "extended" info via cmdline (or Ctrl+\)
> 
>  # xdp-bench drop -e eth1
> 
> 
> > 
> > with "-t 8 -b 64". I started with 2 and then increased until rx/s was
> > falling again. I have ixgbe on the sending side and i40e on the
> 
> With ixgbe on the sending side, my testlab shows I need -t 2.
> 
> With -t 2 :
> Summary                14,678,170 rx/s                  0 err/s
>   receive total        14,678,170 pkt/s        14,678,170 drop/s         0 error/s
>     cpu:1              14,678,170 pkt/s        14,678,170 drop/s         0 error/s
>   xdp_exception                 0 hit/s
> 
> with -t 4:
> 
> Summary                10,255,385 rx/s                  0 err/s
>   receive total        10,255,385 pkt/s        10,255,385 drop/s         0 error/s
>     cpu:1              10,255,385 pkt/s        10,255,385 drop/s         0 error/s
>   xdp_exception                 0 hit/s
> 
> > receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
> > | # xdp-bench drop eth1
> > | Failed to attach XDP program: Cannot allocate memory
> > 
> > This is v6.8-rc5 on both sides. Let me see where this is coming from…
> > 
> 
> Another pitfall with ixgbe is that it does a full link reset when
> adding/removing an XDP prog on the device.  This can be annoying if connected
> back-to-back, because "remote" pktgen will stop on link reset.

so I replaced nr_cpu_ids with 64 and booted maxcpus=64 so that I can run
xdp-bench on the ixgbe.

So: i40e sends, ixgbe receives.

-t 2

| Summary                 2,348,800 rx/s                  0 err/s
|   receive total         2,348,800 pkt/s         2,348,800 drop/s                0 error/s
|     cpu:0               2,348,800 pkt/s         2,348,800 drop/s                0 error/s
|   xdp_exception                 0 hit/s

-t 4
| Summary                 4,158,199 rx/s                  0 err/s
|   receive total         4,158,199 pkt/s         4,158,199 drop/s                0 error/s
|     cpu:0               4,158,199 pkt/s         4,158,199 drop/s                0 error/s
|   xdp_exception                 0 hit/s

-t 8
| Summary                 5,612,861 rx/s                  0 err/s        
|   receive total         5,612,861 pkt/s         5,612,861 drop/s                0 error/s      
|     cpu:0               5,612,861 pkt/s         5,612,861 drop/s                0 error/s      
|   xdp_exception                 0 hit/s        

going higher makes the rate drop. With 8 it floats between 5.5… and 5.7…

Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40 makes it worse,
using the default 512/512 gets the numbers from above, going below 256
makes it worse.

receiving on i40e, sending on ixgbe:

-t 2
|Summary                 3,042,957 rx/s                  0 err/s
|  receive total         3,042,957 pkt/s         3,042,957 drop/s                0 error/s
|    cpu:60              3,042,957 pkt/s         3,042,957 drop/s                0 error/s
|  xdp_exception                 0 hit/s

-t 4
|Summary                 5,442,166 rx/s                  0 err/s
|  receive total         5,442,166 pkt/s         5,442,166 drop/s                0 error/s
|    cpu:60              5,442,166 pkt/s         5,442,166 drop/s                0 error/s
|  xdp_exception                 0 hit/s


-t 6
| Summary                 7,023,406 rx/s                  0 err/s
|   receive total         7,023,406 pkt/s         7,023,406 drop/s                0 error/s
|     cpu:60              7,023,406 pkt/s         7,023,406 drop/s                0 error/s
|   xdp_exception                 0 hit/s


-t 8
| Summary                 7,540,915 rx/s                  0 err/s
|   receive total         7,540,915 pkt/s         7,540,915 drop/s                0 error/s
|     cpu:60              7,540,915 pkt/s         7,540,915 drop/s                0 error/s
|   xdp_exception                 0 hit/s

-t 10
|Summary                 7,699,143 rx/s                  0 err/s
|  receive total         7,699,143 pkt/s         7,699,143 drop/s                0 error/s
|    cpu:60              7,699,143 pkt/s         7,699,143 drop/s                0 error/s
|  xdp_exception                 0 hit/s

-t 18
| Summary                 7,784,946 rx/s                  0 err/s
|   receive total         7,784,946 pkt/s         7,784,946 drop/s                0 error/s
|     cpu:60              7,784,946 pkt/s         7,784,946 drop/s                0 error/s
|   xdp_exception                 0 hit/s

After -t 18 it drops down to 2.…
Now I got worse results than before since -t 8 gives 7.5… and it did 8.4
in the morning. Do you maybe have a .config for me in case I did not
enable the performance switch?

> --Jesper

Sebastian
Dave Taht Feb. 20, 2024, 12:10 p.m. UTC | #17
I am actually rather interested in what granularity Ethernet flow
control takes place at these days, at these speeds. Anyone know?
Jesper Dangaard Brouer Feb. 20, 2024, 12:57 p.m. UTC | #18
On 20/02/2024 13.08, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote:
>> This seems low...
>> Have you remembered to disable Ethernet flow-control?
> 
> No but one side says:
> | i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
> 
> but I did this
> 
>>   # ethtool -A ixgbe1 rx off tx off
>>   # ethtool -A i40e2 rx off tx off
> 
> and it didn't change much.
> 
>>
>>> | Summary                 8,436,294 rx/s                  0 err/s
>>
>> You want to see the "extended" info via cmdline (or Ctrl+\)
>>
>>   # xdp-bench drop -e eth1
>>
>>
>>>
>>> with "-t 8 -b 64". I started with 2 and then increased until rx/s was
>>> falling again. I have ixgbe on the sending side and i40e on the
>>
>> With ixgbe on the sending side, my testlab shows I need -t 2.
>>
>> With -t 2 :
>> Summary                14,678,170 rx/s                  0 err/s
>>    receive total        14,678,170 pkt/s        14,678,170 drop/s         0 error/s
>>      cpu:1              14,678,170 pkt/s        14,678,170 drop/s         0 error/s
>>    xdp_exception                 0 hit/s
>>
>> with -t 4:
>>
>> Summary                10,255,385 rx/s                  0 err/s
>>    receive total        10,255,385 pkt/s        10,255,385 drop/s         0 error/s
>>      cpu:1              10,255,385 pkt/s        10,255,385 drop/s         0 error/s
>>    xdp_exception                 0 hit/s
>>
>>> receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
>>> | # xdp-bench drop eth1
>>> | Failed to attach XDP program: Cannot allocate memory
>>>
>>> This is v6.8-rc5 on both sides. Let me see where this is coming from…
>>>
>>
>> Another pitfall with ixgbe is that it does a full link reset when
>> adding/removing an XDP prog on the device.  This can be annoying if connected
>> back-to-back, because "remote" pktgen will stop on link reset.
> 
> so I replaced nr_cpu_ids with 64 and booted maxcpus=64 so that I can run
> xdp-bench on the ixgbe.
> 

Yes, ixgbe HW has limited TX queues, and XDP tries to allocate a
hardware TX queue for every CPU in the system.  So, I guess you have too
many CPUs in your system - lol.

Other drivers have a fallback to a locked XDP TX path, so this is also
something to look out for in the machine with i40e.

> So: i40e sends, ixgbe receives.
> 
> -t 2
> 
> | Summary                 2,348,800 rx/s                  0 err/s
> |   receive total         2,348,800 pkt/s         2,348,800 drop/s                0 error/s
> |     cpu:0               2,348,800 pkt/s         2,348,800 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
>

This is way too low, with i40e sending.

On my system with only -t 1 my i40e driver can send with approx 15Mpps:

  Ethtool(i40e2) stat:  15028585 (  15,028,585) <= tx-0.packets /sec
  Ethtool(i40e2) stat:  15028589 (  15,028,589) <= tx_packets /sec


> -t 4
> | Summary                 4,158,199 rx/s                  0 err/s
> |   receive total         4,158,199 pkt/s         4,158,199 drop/s                0 error/s
> |     cpu:0               4,158,199 pkt/s         4,158,199 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
> 

Do notice that this is all hitting CPU:0
(this is by design from pktgen_sample03_burst_single_flow.sh)


> -t 8
> | Summary                 5,612,861 rx/s                  0 err/s
> |   receive total         5,612,861 pkt/s         5,612,861 drop/s                0 error/s
> |     cpu:0               5,612,861 pkt/s         5,612,861 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
> 
> going higher makes the rate drop. With 8 it floats between 5.5… and 5.7…
> 

At -t 8 we seem to have hit a limit on the RX side, which also seems too low.

I recommend checking what speed the packet generator is sending at.
I use this tool: ethtool_stats.pl
 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

What we are basically asking: make sure the packet generator is not the
limiting factor in your tests, as we need to make sure the DUT (Device
Under Test) is overloaded 100%.  I often also check (via perf record)
that the DUT doesn't have idle CPU cycles (yes, this can easily
happen... and happens when we hit a limit in hardware, either the NIC or
the PCIe slot).


> Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40 makes it worse,
> using the default 512/512 gets the numbers from above, going below 256
> makes it worse.
> 
> receiving on i40e, sending on ixgbe:
> 
> -t 2
> |Summary                 3,042,957 rx/s                  0 err/s
> |  receive total         3,042,957 pkt/s         3,042,957 drop/s                0 error/s
> |    cpu:60              3,042,957 pkt/s         3,042,957 drop/s                0 error/s
> |  xdp_exception                 0 hit/s
> 
> -t 4
> |Summary                 5,442,166 rx/s                  0 err/s
> |  receive total         5,442,166 pkt/s         5,442,166 drop/s                0 error/s
> |    cpu:60              5,442,166 pkt/s         5,442,166 drop/s                0 error/s
> |  xdp_exception                 0 hit/s
> 
> 
> -t 6
> | Summary                 7,023,406 rx/s                  0 err/s
> |   receive total         7,023,406 pkt/s         7,023,406 drop/s                0 error/s
> |     cpu:60              7,023,406 pkt/s         7,023,406 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
> 
> 
> -t 8
> | Summary                 7,540,915 rx/s                  0 err/s
> |   receive total         7,540,915 pkt/s         7,540,915 drop/s                0 error/s
> |     cpu:60              7,540,915 pkt/s         7,540,915 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
> 
> -t 10
> |Summary                 7,699,143 rx/s                  0 err/s
> |  receive total         7,699,143 pkt/s         7,699,143 drop/s                0 error/s
> |    cpu:60              7,699,143 pkt/s         7,699,143 drop/s                0 error/s
> |  xdp_exception                 0 hit/s
> 

At this level, if you can verify that CPU:60 is 100% loaded, and the
packet generator is sending more than the rx number, then it could work
as a valid experiment.

> -t 18
> | Summary                 7,784,946 rx/s                  0 err/s
> |   receive total         7,784,946 pkt/s         7,784,946 drop/s                0 error/s
> |     cpu:60              7,784,946 pkt/s         7,784,946 drop/s                0 error/s
> |   xdp_exception                 0 hit/s
> 
> After -t 18 it drops down to 2.…
> Now I got worse results than before since -t 8 gives 7.5… and it did 8.4
> in the morning. Do you maybe have a .config for me in case I did not
> enable the performance switch?
> 

I would look for the root cause with perf record +
  perf report --sort cpu,comm,dso,symbol --no-children


--Jesper
Sebastian Andrzej Siewior Feb. 20, 2024, 3:32 p.m. UTC | #19
On 2024-02-20 13:57:24 [+0100], Jesper Dangaard Brouer wrote:
> > so I replaced nr_cpu_ids with 64 and bootet maxcpus=64 so that I can run
> > xdp-bench on the ixgbe.
> > 
> 
> Yes, ixgbe HW has limited TX queues, and XDP tries to allocate a
> hardware TX queue for every CPU in the system.  So, I guess you have too
> many CPUs in your system - lol.
> 
> Other drivers have a fallback to a locked XDP TX path, so this is also
> something to look out for in the machine with i40e.

this locked XDP TX path starts at 64 but xdp_progs are rejected > 64 * 2.

> > so. i40 send, ixgbe receive.
> > 
> > -t 2
> > 
> > | Summary                 2,348,800 rx/s                  0 err/s
> > |   receive total         2,348,800 pkt/s         2,348,800 drop/s                0 error/s
> > |     cpu:0               2,348,800 pkt/s         2,348,800 drop/s                0 error/s
> > |   xdp_exception                 0 hit/s
> > 
> 
> This is way too low, with i40e sending.
> 
> On my system with only -t 1 my i40e driver can send with approx 15Mpps:
> 
>  Ethtool(i40e2) stat:     15028585 (  15,028,585) <= tx-0.packets /sec
>  Ethtool(i40e2) stat:     15028589 (  15,028,589) <= tx_packets /sec

-t1 in ixgbe
Show adapter(s) (eth1) statistics (ONLY that changed!)
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec

-t1 on i40e
Ethtool(eno2np1 ) stat:           90 (             90) <= port.rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_size_127 /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_unicast /sec
Ethtool(eno2np1 ) stat:     79554379 (     79,554,379) <= port.tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_size_64 /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_unicast /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx-32.bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx-32.packets /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_cache_waive /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_packets /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_unicast /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx-0.bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx-0.packets /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx_packets /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= tx_unicast /sec

Mine is slightly slower, but this seems to match what I see on the RX
side.

> At this level, if you can verify that CPU:60 is 100% loaded, and packet
> generator is sending more than rx number, then it could work as a valid
> experiment.

i40e receiving on 8:
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi, 15.2 si,  0.0 st 

ixgbe receiving on CPU 13:
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni, 56.7 id,  0.0 wa,  0.0 hi, 43.3 si,  0.0 st 

looks idle. On the sending side kpktgend_0 is always at 100%.

> > -t 18
> > | Summary                 7,784,946 rx/s                  0 err/s
> > |   receive total         7,784,946 pkt/s         7,784,946 drop/s                0 error/s
> > |     cpu:60              7,784,946 pkt/s         7,784,946 drop/s                0 error/s
> > |   xdp_exception                 0 hit/s
> > 
> > After -t 18 it drops down to 2.…
> > Now I got worse results than before since -t 8 gives 7.5… and it did 8.4
> > in the morning. Do you maybe have a .config for me in case I did not
> > enable the performance switch?
> > 
> 
> I would look for the root cause with perf record +
>  perf report --sort cpu,comm,dso,symbol --no-children

While sending with ixgbe and running perf top on the box:
| Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
| Overhead  CPU  Command          Shared Object             Symbol
|   31.98%  000  kpktgend_0       [kernel]                  [k] xas_find
|    6.72%  000  kpktgend_0       [kernel]                  [k] pfn_to_dma_pte
|    5.63%  000  kpktgend_0       [kernel]                  [k] ixgbe_xmit_frame_ring
|    4.78%  000  kpktgend_0       [kernel]                  [k] dma_pte_clear_level
|    3.16%  000  kpktgend_0       [kernel]                  [k] __iommu_dma_unmap
|    2.30%  000  kpktgend_0       [kernel]                  [k] fq_ring_free_locked
|    1.99%  000  kpktgend_0       [kernel]                  [k] __domain_mapping
|    1.82%  000  kpktgend_0       [kernel]                  [k] iommu_dma_alloc_iova
|    1.80%  000  kpktgend_0       [kernel]                  [k] __iommu_map
|    1.72%  000  kpktgend_0       [kernel]                  [k] iommu_pgsize.isra.0
|    1.70%  000  kpktgend_0       [kernel]                  [k] __iommu_dma_map
|    1.63%  000  kpktgend_0       [kernel]                  [k] alloc_iova_fast
|    1.59%  000  kpktgend_0       [kernel]                  [k] _raw_spin_lock_irqsave
|    1.32%  000  kpktgend_0       [kernel]                  [k] iommu_map
|    1.30%  000  kpktgend_0       [kernel]                  [k] iommu_dma_map_page
|    1.23%  000  kpktgend_0       [kernel]                  [k] intel_iommu_iotlb_sync_map
|    1.21%  000  kpktgend_0       [kernel]                  [k] xa_find_after
|    1.17%  000  kpktgend_0       [kernel]                  [k] ixgbe_poll
|    1.06%  000  kpktgend_0       [kernel]                  [k] __iommu_unmap
|    1.04%  000  kpktgend_0       [kernel]                  [k] intel_iommu_unmap_pages
|    1.01%  000  kpktgend_0       [kernel]                  [k] free_iova_fast
|    0.96%  000  kpktgend_0       [pktgen]                  [k] pktgen_thread_worker

the i40e box while sending:
|Samples: 400K of event 'cycles:P', 4000 Hz, Event count (approx.): 80512443924 lost: 0/0 drop: 0/0
|Overhead  CPU  Command          Shared Object         Symbol
|  24.04%  000  kpktgend_0       [kernel]              [k] i40e_lan_xmit_frame
|  17.20%  019  swapper          [kernel]              [k] i40e_napi_poll
|   4.84%  019  swapper          [kernel]              [k] intel_idle_irq
|   4.20%  019  swapper          [kernel]              [k] napi_consume_skb
|   3.00%  000  kpktgend_0       [pktgen]              [k] pktgen_thread_worker
|   2.76%  008  swapper          [kernel]              [k] i40e_napi_poll
|   2.36%  000  kpktgend_0       [kernel]              [k] dma_map_page_attrs
|   1.93%  019  swapper          [kernel]              [k] dma_unmap_page_attrs
|   1.70%  008  swapper          [kernel]              [k] intel_idle_irq
|   1.44%  008  swapper          [kernel]              [k] __udp4_lib_rcv
|   1.44%  008  swapper          [kernel]              [k] __netif_receive_skb_core.constprop.0
|   1.40%  008  swapper          [kernel]              [k] napi_build_skb
|   1.28%  000  kpktgend_0       [kernel]              [k] kfree_skb_reason
|   1.27%  008  swapper          [kernel]              [k] ip_rcv_core
|   1.19%  008  swapper          [kernel]              [k] inet_gro_receive
|   1.01%  008  swapper          [kernel]              [k] kmem_cache_free.part.0

> --Jesper

Sebastian
Sebastian Andrzej Siewior Feb. 22, 2024, 9:22 a.m. UTC | #20
On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote:
> > 
> >  Ethtool(i40e2) stat:     15028585 (  15,028,585) <= tx-0.packets /sec
> >  Ethtool(i40e2) stat:     15028589 (  15,028,589) <= tx_packets /sec
> 
> -t1 in ixgbe
> Show adapter(s) (eth1) statistics (ONLY that changed!)
> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
> Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
> Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec
> …
> While sending with ixgbe and running perf top on the box:
> | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
> | Overhead  CPU  Command          Shared Object             Symbol
> |   31.98%  000  kpktgend_0       [kernel]                  [k] xas_find
> |    6.72%  000  kpktgend_0       [kernel]                  [k] pfn_to_dma_pte
> |    5.63%  000  kpktgend_0       [kernel]                  [k] ixgbe_xmit_frame_ring
> |    4.78%  000  kpktgend_0       [kernel]                  [k] dma_pte_clear_level
> |    3.16%  000  kpktgend_0       [kernel]                  [k] __iommu_dma_unmap

I disabled the iommu and I get to

Ethtool(eth1    ) stat:     14158562 (     14,158,562) <= tx_packets /sec
Ethtool(eth1    ) stat:     14158685 (     14,158,685) <= tx_pkts_nic /sec

looks like a small improvement… It is not your 15 Mpps but close. -t2 does
improve the situation. There is a warning from DMA mapping code but ;)

> > --Jesper

Sebastian
Jesper Dangaard Brouer Feb. 22, 2024, 10:10 a.m. UTC | #21
On 22/02/2024 10.22, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote:
>>>
>>>   Ethtool(i40e2) stat:     15028585 (  15,028,585) <= tx-0.packets /sec
>>>   Ethtool(i40e2) stat:     15028589 (  15,028,589) <= tx_packets /sec
>>
>> -t1 in ixgbe
>> Show adapter(s) (eth1) statistics (ONLY that changed!)
>> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
>> Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
>> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
>> Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
>> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
>> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec
> …
>> while sending with ixgbe while running perf top on the box:
>> | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
>> | Overhead  CPU  Command          Shared Object             Symbol
>> |   31.98%  000  kpktgend_0       [kernel]                  [k] xas_find
>> |    6.72%  000  kpktgend_0       [kernel]                  [k] pfn_to_dma_pte
>> |    5.63%  000  kpktgend_0       [kernel]                  [k] ixgbe_xmit_frame_ring
>> |    4.78%  000  kpktgend_0       [kernel]                  [k] dma_pte_clear_level
>> |    3.16%  000  kpktgend_0       [kernel]                  [k] __iommu_dma_unmap
> 
> I disabled the IOMMU and I get:

Yes, it is clearly the IOMMU code that causes the performance issue for you.

This driver doesn't use page_pool, so I want to point out (for people
finding this post in the future) that page_pool keeps DMA mappings for
recycled frames, which should address the IOMMU overhead issue here.
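
For illustration, a minimal sketch of the driver-side setup; only
PP_FLAG_DMA_MAP, struct page_pool_params and page_pool_create() are the
real API, the function name and the header path for recent kernels are
assumptions:

#include <net/page_pool/types.h>	/* header location on recent kernels */

/* Sketch: a pool that keeps its pages DMA-mapped across recycling.
 * dma_dev is whatever device performs the DMA. Each page is mapped
 * once; recycled pages reuse that mapping, so the per-frame IOMMU
 * map/unmap visible in the perf output above goes away.
 */
static struct page_pool *rxq_create_pool(struct device *dma_dev)
{
	struct page_pool_params pp = {
		.order		= 0,			/* single pages */
		.flags		= PP_FLAG_DMA_MAP,	/* pool owns the DMA mapping */
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
		.dev		= dma_dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp);
}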

> 
> Ethtool(eth1    ) stat:     14158562 (     14,158,562) <= tx_packets /sec
> Ethtool(eth1    ) stat:     14158685 (     14,158,685) <= tx_pkts_nic /sec
> 
> looks like a small improvement… It is not your 15 Mpps but close. -t2 does
> improve the situation.

You cannot reach 15Mpps on 10Gbit/s as wirespeed for 10G is 14.88Mpps.
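
(That figure is just minimum-size frames against the line rate: a 64 byte
frame occupies 64 + 7 (preamble) + 1 (SFD) + 12 (inter-frame gap) = 84
bytes = 672 bits on the wire, and 10^10 bit/s / 672 bit ≈ 14.88 Mpps.)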

Congratulations, I think this 14.15 Mpps is as close to wirespeed as is
possible on your hardware.

BTW what CPU are you using?

> There is a warning from DMA mapping code but ;)

Is it a warning from the IOMMU code?
It usually means there is a real DMA unmap bug (which we should fix).

--Jesper
Sebastian Andrzej Siewior Feb. 22, 2024, 10:58 a.m. UTC | #22
On 2024-02-22 11:10:44 [+0100], Jesper Dangaard Brouer wrote:
> > Ethtool(eth1    ) stat:     14158562 (     14,158,562) <= tx_packets /sec
> > Ethtool(eth1    ) stat:     14158685 (     14,158,685) <= tx_pkts_nic /sec
> > 
> > looks like a small improvement… It is not your 15 Mpps but close. -t2 does
> > improve the situation.
> 
> You cannot reach 15Mpps on 10Gbit/s as wirespeed for 10G is 14.88Mpps.

Oh, my bad.

> Congratulations, I think this 14.15 Mpps is as close to wirespeed as is
> possible on your hardware.
> 
> BTW what CPU are you using?

"Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz"
The "performance" governor is used, I lowered the number of CPUs and
disabled SMT.

> > There is a warning from DMA mapping code but ;)
> 
> Is it a warning from the IOMMU code?
> It usually means there is a real DMA unmap bug (which we should fix).

Not sure, I don't think so:
| ------------[ cut here ]------------
| ehci-pci 0000:00:1a.0: DMA addr 0x0000000105016ce8+8 overflow (mask ffffffff, bus limit 0).
| WARNING: CPU: 0 PID: 1029 at kernel/dma/direct.h:105 dma_map_page_attrs+0x1e8/0x1f0
| RIP: 0010:dma_map_page_attrs+0x1e8/0x1f0
| Call Trace:
|  <TASK>
|  usb_hcd_map_urb_for_dma+0x1b0/0x4d0
|  usb_hcd_submit_urb+0x342/0x9b0
|  usb_start_wait_urb+0x50/0xc0
|  usb_control_msg+0xc8/0x110
|  get_bMaxPacketSize0+0x5a/0xb0

and USB isn't working. I *think* it is the "memory above 4G" thing: the
buffer at 0x105016ce8 lies above 4 GiB while the warning shows a 32-bit
DMA mask (ffffffff). Not sure where it took the wrong turn.
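
For readers hitting the same splat, the usual driver-side pattern is
below (example_probe and pdev are placeholders). With the mask declared,
buffers above 4 GiB are normally bounced via swiotlb; the warning above
suggests that bounce path was unavailable here ("bus limit 0").

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Sketch: declare that the device can only address 32 bits. The DMA
 * API then bounces higher buffers through swiotlb where possible,
 * rather than mapping them directly and warning as seen above.
 */
static int example_probe(struct pci_dev *pdev)
{
	return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
}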

> --Jesper

Sebastian
diff mbox series

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 68fb6c8142fec..97c9be9cabfd6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -704,8 +704,69 @@  struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
+#ifndef CONFIG_PREEMPT_RT
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+struct bpf_xdp_storage { };
+
+static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
+{
+	return NULL;
+}
+
+static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store) { }
+
+static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
+{
+	return this_cpu_ptr(&bpf_redirect_info);
+}
+
+#else
+
+struct bpf_xdp_storage {
+	struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
+{
+	struct task_struct *tsk;
+
+	tsk = current;
+	if (tsk->bpf_xdp_storage != NULL)
+		return NULL;
+	tsk->bpf_xdp_storage = xdp_store;
+	return xdp_store;
+}
+
+static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store)
+{
+	struct task_struct *tsk;
+
+	tsk = current;
+	if (tsk->bpf_xdp_storage != xdp_store)
+		return;
+	tsk->bpf_xdp_storage = NULL;
+}
+
+static inline struct bpf_xdp_storage *xdp_storage_get(void)
+{
+	struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
+
+	WARN_ON_ONCE(!xdp_store);
+	return xdp_store;
+}
+
+static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
+{
+	struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+	if (!xdp_store)
+		return NULL;
+	return &xdp_store->ri;
+}
+#endif
+DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *, if (_T) xdp_storage_clear(_T));
+
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
 
@@ -974,23 +1035,27 @@  void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
+	if (!ri)
+		return false;
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
@@ -1544,9 +1609,11 @@  static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
+	if (!ri)
+		return XDP_ABORTED;
 	/* Lower bits of the flags are used as return code on lookup failure */
 	if (unlikely(flags & ~(action_mask | flag_mask)))
 		return XDP_ABORTED;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 111f388d65323..179ea10ae1fd1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@  struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_xdp_storage;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1501,6 +1502,10 @@  struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_xdp_storage		*bpf_xdp_storage;
+#endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c40ae831ab1a6 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -261,10 +261,12 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 static int cpu_map_kthread_run(void *data)
 {
+	struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear);
 	struct bpf_cpu_map_entry *rcpu = data;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
+	xdp_storage_set(&xdp_store);
 
 	/* When kthread gives stop order, then rcpu have been disconnected
 	 * from map, thus no new packets can enter. Remaining in-flight
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43ff..0d8eb8d20963e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2462,6 +2462,9 @@  __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	p->bpf_xdp_storage = NULL;
+#endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd9193740178..50902d5254115 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -281,7 +281,7 @@  static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
 	struct xdp_page_head *head;
@@ -293,6 +293,7 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	ri = xdp_storage_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -365,8 +366,10 @@  static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 				 u32 repeat, u32 batch_size, u32 *time)
 
 {
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	struct xdp_test_data xdp = { .batch_size = batch_size };
 	struct bpf_test_timer t = { .mode = NO_MIGRATE };
+	struct bpf_xdp_storage __xdp_store;
 	int ret;
 
 	if (!repeat)
@@ -376,6 +379,7 @@  static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 	if (ret)
 		return ret;
 
+	xdp_store = xdp_storage_set(&__xdp_store);
 	bpf_test_timer_enter(&t);
 	do {
 		xdp.frame_cnt = 0;
@@ -392,7 +396,9 @@  static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	struct bpf_prog_array_item item = {.prog = prog};
+	struct bpf_xdp_storage __xdp_store;
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
 	struct bpf_test_timer t = { NO_MIGRATE };
@@ -412,6 +418,7 @@  static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	if (!repeat)
 		repeat = 1;
 
+	xdp_store = xdp_storage_set(&__xdp_store);
 	bpf_test_timer_enter(&t);
 	old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	do {
diff --git a/net/core/dev.c b/net/core/dev.c
index de362d5f26559..c3f7d2a6b6134 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3988,11 +3988,15 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_xdp_storage __xdp_store;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	xdp_store = xdp_storage_set(&__xdp_store);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4044,12 +4048,16 @@  static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_xdp_storage __xdp_store;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	xdp_store = xdp_storage_set(&__xdp_store);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6240,10 +6248,13 @@  void napi_busy_loop(unsigned int napi_id,
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
 {
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_xdp_storage __xdp_store;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
+	xdp_store = xdp_storage_set(&__xdp_store);
 restart:
 	napi_poll = NULL;
 
@@ -6716,10 +6727,13 @@  static void skb_defer_free_flush(struct softnet_data *sd)
 
 static int napi_threaded_poll(void *data)
 {
+	struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear);
 	struct napi_struct *napi = data;
 	struct softnet_data *sd;
 	void *have;
 
+	xdp_storage_set(&xdp_store);
+
 	while (!napi_thread_wait(napi)) {
 		for (;;) {
 			bool repoll = false;
@@ -6753,13 +6767,16 @@  static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(netdev_budget_usecs));
 	int budget = READ_ONCE(netdev_budget);
+	struct bpf_xdp_storage __xdp_store;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	xdp_store = xdp_storage_set(&__xdp_store);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index eb8d5a0a0ec8f..5721acb15d40f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2473,8 +2473,10 @@  static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+#ifndef CONFIG_PREEMPT_RT
 DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
+#endif
 
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
@@ -2488,11 +2490,15 @@  static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net *net = dev_net(skb->dev);
+	struct bpf_redirect_info *ri;
 	struct net_device *dev;
-	u32 flags = ri->flags;
+	u32 flags;
 
+	ri = xdp_storage_get_ri();
+	if (!ri)
+		goto out_drop;
+	flags = ri->flags;
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2521,9 +2527,9 @@  int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = flags;
@@ -2542,9 +2548,9 @@  static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_PEER;
@@ -2564,9 +2570,9 @@  static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) || flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
@@ -4292,6 +4298,7 @@  void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
+#ifndef CONFIG_PREEMPT_RT
 	struct bpf_redirect_info *ri;
 	int cpu;
 
@@ -4305,6 +4312,19 @@  void bpf_clear_redirect_map(struct bpf_map *map)
 		if (unlikely(READ_ONCE(ri->map) == map))
 			cmpxchg(&ri->map, map, NULL);
 	}
+#else
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within an eBPF
+	 * program during the NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also an RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protected and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * user exists. dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4313,11 +4333,14 @@  EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
+		ri = xdp_storage_get_ri();
+		if (!ri)
+			return XDP_ABORTED;
 		/* The target device is different from the receiving device, so
 		 * redirect it to the new device.
 		 * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
@@ -4419,10 +4442,13 @@  static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri;
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	ri = xdp_storage_get_ri();
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
@@ -4433,10 +4459,13 @@  EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri;
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	ri = xdp_storage_get_ri();
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
@@ -4450,10 +4479,14 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       void *fwd,
 				       enum bpf_map_type map_type, u32 map_id)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 	struct bpf_map *map;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
@@ -4495,12 +4528,20 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
-	void *fwd = ri->tgt_value;
-	u32 map_id = ri->map_id;
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
+	enum bpf_map_type map_type;
+	void *fwd;
+	u32 map_id;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
+	map_type = ri->map_type;
+	fwd = ri->tgt_value;
+	map_id = ri->map_id;
+
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
 	ri->map_type = BPF_MAP_TYPE_UNSPEC;
 
@@ -4529,9 +4570,9 @@  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = xdp_storage_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return XDP_ABORTED;
 
 	/* NB! Map type UNSPEC and map_id == INT_MAX (never generated
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..54690b85f1fe6 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,15 @@  static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
+	struct bpf_xdp_storage __xdp_store;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	xdp_store = xdp_storage_set(&__xdp_store);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);