diff mbox series

[v5,net-next,14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

Message ID 20240607070427.1379327-15-bigeasy@linutronix.de (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series locking: Introduce nested-BH locking. | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 15204 this patch: 15204
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 11 maintainers not CCed: dietmar.eggemann@arm.com juri.lelli@redhat.com mgorman@suse.de yan@cloudflare.com brauner@kernel.org vincent.guittot@linaro.org akpm@linux-foundation.org bristot@redhat.com vschneid@redhat.com bsegall@google.com rostedt@goodmis.org
netdev/build_clang success Errors and warnings before: 2048 this patch: 2048
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 16345 this patch: 16345
netdev/checkpatch warning CHECK: Comparison to NULL could be written "tsk->bpf_net_context" WARNING: line length of 90 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 92 this patch: 92
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-06-09--21-00 (tests: 644)

Commit Message

Sebastian Andrzej Siewior June 7, 2024, 6:53 a.m. UTC
The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked, which accesses bpf_redirect_info
  and may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Since local_bh_disable() does not provide this locking on PREEMPT_RT,
the data structure requires explicit locking there.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on the stack, use bpf_net_ctx_set() to save a pointer
to it and bpf_net_ctx_clear() to remove it again.
bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and
from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

The pointer to bpf_net_context is saved in the task's task_struct.
Always using the bpf_net_context approach has the advantage that there
are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
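
For illustration, a caller is expected to follow roughly this pattern (a
minimal sketch with a made-up function name; the net_rx_action() hunk in
the diff below does exactly this):

	static void napi_style_poll(void)
	{
		struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;

		/* Only the outermost caller on this task gets a non-NULL
		 * pointer back; nested callers receive NULL and their
		 * bpf_net_ctx_clear() below becomes a no-op.
		 */
		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

		/* ... run the NAPI poll / XDP program; helpers such as
		 * xdp_do_redirect() now obtain the state via
		 * bpf_net_ctx_get_ri() instead of the per-CPU variable ...
		 */

		bpf_net_ctx_clear(bpf_net_ctx);
	}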

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 43 ++++++++++++++++++++++++++++++++++-------
 include/linux/sched.h  |  3 +++
 kernel/bpf/cpumap.c    |  3 +++
 kernel/bpf/devmap.c    |  9 ++++++++-
 kernel/fork.c          |  1 +
 net/bpf/test_run.c     | 11 ++++++++++-
 net/core/dev.c         | 26 ++++++++++++++++++++++++-
 net/core/filter.c      | 44 ++++++++++++------------------------------
 net/core/lwt_bpf.c     |  3 +++
 9 files changed, 101 insertions(+), 42 deletions(-)

Comments

Jesper Dangaard Brouer June 7, 2024, 11:51 a.m. UTC | #1
On 07/06/2024 08.53, Sebastian Andrzej Siewior wrote:
[...]
> 
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on the stack, use bpf_net_ctx_set() to save a pointer
> to it and bpf_net_ctx_clear() to remove it again.
> bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and
> from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
> 
> The pointer to bpf_net_context is saved in the task's task_struct.
> Always using the bpf_net_context approach has the advantage that there
> are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
> 
[...]
> ---
>   include/linux/filter.h | 43 ++++++++++++++++++++++++++++++++++-------
>   include/linux/sched.h  |  3 +++
>   kernel/bpf/cpumap.c    |  3 +++
>   kernel/bpf/devmap.c    |  9 ++++++++-
>   kernel/fork.c          |  1 +
>   net/bpf/test_run.c     | 11 ++++++++++-
>   net/core/dev.c         | 26 ++++++++++++++++++++++++-
>   net/core/filter.c      | 44 ++++++++++++------------------------------
>   net/core/lwt_bpf.c     |  3 +++
>   9 files changed, 101 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index b02aea291b7e8..2ff1c394dcf0c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,7 +744,38 @@ struct bpf_redirect_info {
>   	struct bpf_nh_params nh;
>   };
>   
> -DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> +struct bpf_net_context {
> +	struct bpf_redirect_info ri;
> +};
> +
> +static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
> +{
> +	struct task_struct *tsk = current;
> +
> +	if (tsk->bpf_net_context != NULL)
> +		return NULL;
> +	memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));

It annoys me that we have to clear this memory every time.
(This is added in net_rx_action(), which *all* RX packets traverse).

The feature and memory are only/primarily used for XDP and TC redirects,
but we take the overhead of clearing even when these features are not used.

Netstack does bulking in most of the cases this is used, so in our/your
benchmarks this overhead doesn't show.  But we need to be aware that
this is a "paper-cut" for single network packet processing.

Idea: We could postpone clearing until code calls bpf_net_ctx_get()?
See below.

> +	tsk->bpf_net_context = bpf_net_ctx;
> +	return bpf_net_ctx;
> +}
> +
> +static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
> +{
> +	if (bpf_net_ctx)
> +		current->bpf_net_context = NULL;
> +}
> +
> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> +{

> +	return current->bpf_net_context;
> +}
> +
> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> +{
> +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> +

if (bpf_net_ctx->ri.kern_flags & BPF_RI_F_NEEDS_INIT) {
   memset + init_list (intro in patch 15)
}

Maybe even postpone the init_list calls to the "get" helpers introduced 
in patch 15.


> +	return &bpf_net_ctx->ri;
> +}
>   
[...]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2c3f86c8cd176..73965dff1b30f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> @@ -6881,10 +6902,12 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)

The function net_rx_action() is core to the network stack.

>   	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
>   	unsigned long time_limit = jiffies +
>   		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
> +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>   	int budget = READ_ONCE(net_hotdata.netdev_budget);
>   	LIST_HEAD(list);
>   	LIST_HEAD(repoll);
>   
> +	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>   start:
>   	sd->in_net_rx_action = true;
>   	local_irq_disable();
> @@ -6937,7 +6960,8 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
>   		sd->in_net_rx_action = false;
>   
>   	net_rps_action_and_irq_enable(sd);
> -end:;
> +end:
> +	bpf_net_ctx_clear(bpf_net_ctx);
>   }


The memset can be further optimized: it currently clears 64 bytes, but
it only needs to clear 40 bytes, see pahole below.

Replace memset with something like:
  memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));

This is an optimization because with 64 bytes this results in a rep-stos
(repeated string store operation) which on Intel touches CPU flags (to be
IRQ safe) and is slow, while clearing 40 bytes doesn't cause the compiler
to use this instruction, which is faster.  Memset benchmarked with [1]

[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

--Jesper

$ pahole -C bpf_redirect_info vmlinux
struct bpf_redirect_info {
	u64                        tgt_index;            /*     0     8 */
	void *                     tgt_value;            /*     8     8 */
	struct bpf_map *           map;                  /*    16     8 */
	u32                        flags;                /*    24     4 */
	u32                        kern_flags;           /*    28     4 */
	u32                        map_id;               /*    32     4 */
	enum bpf_map_type          map_type;             /*    36     4 */
	struct bpf_nh_params       nh;                   /*    40    20 */

	/* size: 64, cachelines: 1, members: 8 */
	/* padding: 4 */
};
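
As a stand-alone sanity check of the 40 bytes (a userspace mirror of the
layout above; purely illustrative, not kernel code):

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Mirrors the bpf_redirect_info layout shown by pahole (x86-64). */
	struct ri_mirror {
		uint64_t tgt_index;
		void *tgt_value;
		void *map;
		uint32_t flags;
		uint32_t kern_flags;
		uint32_t map_id;
		uint32_t map_type;
		struct { uint32_t nh_family; uint8_t addr[16]; } nh; /* 20 bytes */
	};

	struct ctx_mirror {
		struct ri_mirror ri;
	};

	int main(void)
	{
		struct ctx_mirror c;

		memset(&c, 0xff, sizeof(c));
		/* The suggested bounded clear: stops right before ->nh. */
		memset(&c.ri, 0, offsetof(struct ctx_mirror, ri.nh));

		assert(offsetof(struct ctx_mirror, ri.nh) == 40);
		assert(c.ri.map_type == 0);       /* inside the cleared 40 bytes */
		assert(c.ri.nh.nh_family != 0);   /* ->nh left untouched */
		return 0;
	}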



The full struct:

$ pahole -C bpf_net_context vmlinux
struct bpf_net_context {
	struct bpf_redirect_info   ri;                   /*     0    64 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 1 boundary (64 bytes) --- */
	struct list_head           cpu_map_flush_list;   /*    64    16 */
	struct list_head           dev_map_flush_list;   /*    80    16 */
	struct list_head           xskmap_map_flush_list; /*    96    16 */

	/* size: 112, cachelines: 2, members: 4 */
	/* paddings: 1, sum paddings: 4 */
	/* last cacheline: 48 bytes */
};
Sebastian Andrzej Siewior June 10, 2024, 4:50 p.m. UTC | #2
On 2024-06-07 13:51:25 [+0200], Jesper Dangaard Brouer wrote:
> The memset can be further optimized: it currently clears 64 bytes, but
> it only needs to clear 40 bytes, see pahole below.
> 
> Replace memset with something like:
>  memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
> 
> This is an optimization because with 64 bytes this results in a rep-stos
> (repeated string store operation) which on Intel touches CPU flags (to be
> IRQ safe) and is slow, while clearing 40 bytes doesn't cause the compiler
> to use this instruction, which is faster.  Memset benchmarked with [1]

I've been playing along with this and have to say that "rep stosq" is
roughly 3x slower vs "movq" for 64 bytes on all x86 I've been looking
at.
For gcc the stosq vs movq depends on the CPU settings. The generic uses
movq up to 40 bytes, skylake uses movq even for 64bytes. clang…
This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align

I folded this into the last two patches:

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d2b4260d9d0be..1588d208f1348 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,27 +744,40 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
+enum bpf_ctx_init_type {
+	bpf_ctx_ri_init,
+	bpf_ctx_cpu_map_init,
+	bpf_ctx_dev_map_init,
+	bpf_ctx_xsk_map_init,
+};
+
 struct bpf_net_context {
 	struct bpf_redirect_info ri;
 	struct list_head cpu_map_flush_list;
 	struct list_head dev_map_flush_list;
 	struct list_head xskmap_map_flush_list;
+	unsigned int flags;
 };
 
+static inline bool bpf_net_ctx_need_init(struct bpf_net_context *bpf_net_ctx,
+					 enum bpf_ctx_init_type flag)
+{
+	return !(bpf_net_ctx->flags & (1 << flag));
+}
+
+static inline bool bpf_net_ctx_set_flag(struct bpf_net_context *bpf_net_ctx,
+					enum bpf_ctx_init_type flag)
+{
+	return bpf_net_ctx->flags |= 1 << flag;
+}
+
 static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
 {
 	struct task_struct *tsk = current;
 
 	if (tsk->bpf_net_context != NULL)
 		return NULL;
-	memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
-
-	if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
-		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
-		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
-	}
-	if (IS_ENABLED(CONFIG_XDP_SOCKETS))
-		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+	bpf_net_ctx->flags = 0;
 
 	tsk->bpf_net_context = bpf_net_ctx;
 	return bpf_net_ctx;
@@ -785,6 +798,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_ri_init)) {
+		memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
+		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_ri_init);
+	}
+
 	return &bpf_net_ctx->ri;
 }
 
@@ -792,6 +810,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_cpu_map_init)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_cpu_map_init);
+	}
+
 	return &bpf_net_ctx->cpu_map_flush_list;
 }
 
@@ -799,6 +822,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_dev_map_init)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_dev_map_init);
+	}
+
 	return &bpf_net_ctx->dev_map_flush_list;
 }
 
@@ -806,6 +834,11 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_xsk_map_init)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_xsk_map_init);
+	}
+
 	return &bpf_net_ctx->xskmap_map_flush_list;
 }
 

Sebastian
Jesper Dangaard Brouer June 11, 2024, 7:55 a.m. UTC | #3
On 10/06/2024 18.50, Sebastian Andrzej Siewior wrote:
> On 2024-06-07 13:51:25 [+0200], Jesper Dangaard Brouer wrote:
>> The memset can be further optimized: it currently clears 64 bytes, but
>> it only needs to clear 40 bytes, see pahole below.
>>
>> Replace memset with something like:
>>   memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
>>
>> This is an optimization because with 64 bytes this results in a rep-stos
>> (repeated string store operation) which on Intel touches CPU flags (to be
>> IRQ safe) and is slow, while clearing 40 bytes doesn't cause the compiler
>> to use this instruction, which is faster.  Memset benchmarked with [1]
> 
> I've been playing along with this and have to say that "rep stosq" is
> roughly 3x slower vs "movq" for 64 bytes on all x86 I've been looking
> at.

Thanks for confirming "rep stos" is 3x slower for small sizes.


> For gcc the stosq vs movq depends on the CPU settings. The generic uses
> movq up to 40 bytes, skylake uses movq even for 64bytes. clang…
> This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
> 

Cool I didn't know of this tuning.  Is this a compiler option?
Where do I change this setting, as I would like to experiment with this
for our prod kernels.

My other finding is that this is primarily a kernel compile problem,
because for userspace the compiler chooses to use SSE instructions (e.g.
movaps xmmword ptr[rsp], xmm0).  The kernel compiler options (-mno-sse
-mno-mmx -mno-sse2 -mno-3dnow -mno-avx) disable this, which apparently
changes the tipping point.
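
A stand-alone way to see that tipping point (illustrative only; the file
and exact flags are just an example, the -mno-* set mimics the kernel's):

	/* clear.c: compare the generated code, e.g.
	 *   gcc -O2 -S clear.c
	 *     -> typically SSE stores for both functions
	 *   gcc -O2 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -S clear.c
	 *     -> movq stores vs. rep stosq, depending on -march/-mtune
	 */
	#include <string.h>

	struct blob64 { unsigned long w[8]; };	/* 64 bytes, like bpf_redirect_info */

	void clear_64(struct blob64 *b)
	{
		memset(b, 0, sizeof(*b));	/* full clear */
	}

	void clear_40(struct blob64 *b)
	{
		memset(b, 0, 40);		/* the offsetof(..., ri.nh) size */
	}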


> I folded this into the last two patches:
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d2b4260d9d0be..1588d208f1348 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,27 +744,40 @@ struct bpf_redirect_info {
>   	struct bpf_nh_params nh;
>   };
>   
> +enum bpf_ctx_init_type {
> +	bpf_ctx_ri_init,
> +	bpf_ctx_cpu_map_init,
> +	bpf_ctx_dev_map_init,
> +	bpf_ctx_xsk_map_init,
> +};
> +
>   struct bpf_net_context {
>   	struct bpf_redirect_info ri;
>   	struct list_head cpu_map_flush_list;
>   	struct list_head dev_map_flush_list;
>   	struct list_head xskmap_map_flush_list;
> +	unsigned int flags;

Why have yet another flags variable, when we already have two flags in 
bpf_redirect_info ?

>   };
>   
> +static inline bool bpf_net_ctx_need_init(struct bpf_net_context *bpf_net_ctx,
> +					 enum bpf_ctx_init_type flag)
> +{
> +	return !(bpf_net_ctx->flags & (1 << flag));
> +}
> +
> +static inline bool bpf_net_ctx_set_flag(struct bpf_net_context *bpf_net_ctx,
> +					enum bpf_ctx_init_type flag)
> +{
> +	return bpf_net_ctx->flags |= 1 << flag;
> +}
> +
>   static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
>   {
>   	struct task_struct *tsk = current;
>   
>   	if (tsk->bpf_net_context != NULL)
>   		return NULL;
> -	memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
> -
> -	if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
> -		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
> -		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
> -	}
> -	if (IS_ENABLED(CONFIG_XDP_SOCKETS))
> -		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
> +	bpf_net_ctx->flags = 0;
>   
>   	tsk->bpf_net_context = bpf_net_ctx;
>   	return bpf_net_ctx;
> @@ -785,6 +798,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
>   {
>   	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>   
> +	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_ri_init)) {
> +		memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
> +		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_ri_init);
> +	}
> +
>   	return &bpf_net_ctx->ri;
>   }
>   
> @@ -792,6 +810,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
>   {
>   	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>   
> +	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_cpu_map_init)) {
> +		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
> +		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_cpu_map_init);
> +	}
> +
>   	return &bpf_net_ctx->cpu_map_flush_list;
>   }
>   
> @@ -799,6 +822,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
>   {
>   	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>   
> +	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_dev_map_init)) {
> +		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
> +		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_dev_map_init);
> +	}
> +
>   	return &bpf_net_ctx->dev_map_flush_list;
>   }
>   
> @@ -806,6 +834,11 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
>   {
>   	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>   
> +	if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_xsk_map_init)) {
> +		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
> +		bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_xsk_map_init);
> +	}
> +
>   	return &bpf_net_ctx->xskmap_map_flush_list;
>   }
>   
> 
> Sebastian
Sebastian Andrzej Siewior June 11, 2024, 8:39 a.m. UTC | #4
On 2024-06-11 09:55:11 [+0200], Jesper Dangaard Brouer wrote:
> > For gcc the stosq vs movq depends on the CPU settings. The generic uses
> > movq up to 40 bytes, skylake uses movq even for 64bytes. clang…
> > This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
> > 
> 
> Cool I didn't know of this tuning.  Is this a compiler option?
> Where do I change this setting, as I would like to experiment with this
> for our prod kernels.

This is what I play with right now, I'm not sure it is what I want… For
reference:

---->8-----
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d7122a1883e8..b35b7b21598de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -775,6 +775,9 @@ config SCHED_OMIT_FRAME_POINTER
 
 	  If in doubt, say "Y".
 
+config X86_OPT_MEMSET
+	bool "X86 memset playground"
+
 menuconfig HYPERVISOR_GUEST
 	bool "Linux guest support"
 	help
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 801fd85c3ef69..bab37787fe5cd 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -151,6 +151,15 @@ else
         KBUILD_AFLAGS += -m64
         KBUILD_CFLAGS += -m64
 
+	ifeq ($(CONFIG_X86_OPT_MEMSET),y)
+		#export X86_MEMSET_CFLAGS := -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
+		export X86_MEMSET_CFLAGS := -mmemset-strategy=libcall:-1:align
+	else
+		export X86_MEMSET_CFLAGS :=
+	endif
+
+        KBUILD_CFLAGS += $(X86_MEMSET_CFLAGS)
+
         # Align jump targets to 1 byte, not the default 16 bytes:
         KBUILD_CFLAGS += $(call cc-option,-falign-jumps=1)
 
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 215a1b202a918..d0c9a589885ef 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -121,6 +121,7 @@ KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
 KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(X86_MEMSET_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(RANDSTRUCT_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
 KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS_32))


---->8-----

I dug this up in the gcc source code and initially played on the command
line with it. The snippet compiles the kernel and it boots so…

> My other finding is that this is primarily a kernel compile problem,
> because for userspace the compiler chooses to use SSE instructions (e.g.
> movaps xmmword ptr[rsp], xmm0).  The kernel compiler options (-mno-sse
> -mno-mmx -mno-sse2 -mno-3dnow -mno-avx) disable this, which apparently
> changes the tipping point.

sure.

> 
> > I folded this into the last two patches:
> > 
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index d2b4260d9d0be..1588d208f1348 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -744,27 +744,40 @@ struct bpf_redirect_info {
> >   	struct bpf_nh_params nh;
> >   };
> > +enum bpf_ctx_init_type {
> > +	bpf_ctx_ri_init,
> > +	bpf_ctx_cpu_map_init,
> > +	bpf_ctx_dev_map_init,
> > +	bpf_ctx_xsk_map_init,
> > +};
> > +
> >   struct bpf_net_context {
> >   	struct bpf_redirect_info ri;
> >   	struct list_head cpu_map_flush_list;
> >   	struct list_head dev_map_flush_list;
> >   	struct list_head xskmap_map_flush_list;
> > +	unsigned int flags;
> 
> Why have yet another flags variable, when we already have two flags in
> bpf_redirect_info ?

Ah you want to fold this into the ri member including the status for the
lists? Could try. It is split in order to delay the initialisation of
the lists, too. We would need to be careful to not overwrite the
flags if `ri' is initialized after the lists. That would be the case
with CONFIG_DEBUG_NET=y and not doing redirect (the empty list check
initializes that).

Sebastian
Sebastian Andrzej Siewior June 12, 2024, 10:42 a.m. UTC | #5
On 2024-06-11 10:39:20 [+0200], To Jesper Dangaard Brouer wrote:
> On 2024-06-11 09:55:11 [+0200], Jesper Dangaard Brouer wrote:
> > >   struct bpf_net_context {
> > >   	struct bpf_redirect_info ri;
> > >   	struct list_head cpu_map_flush_list;
> > >   	struct list_head dev_map_flush_list;
> > >   	struct list_head xskmap_map_flush_list;
> > > +	unsigned int flags;
> > 
> > Why have yet another flags variable, when we already have two flags in
> > bpf_redirect_info ?
> 
> Ah you want to fold this into the ri member including the status for the
> lists? Could try. It is split in order to delay the initialisation of
> the lists, too. We would need to be careful to not overwrite the
> flags if `ri' is initialized after the lists. That would be the case
> with CONFIG_DEBUG_NET=y and not doing redirect (the empty list check
> initializes that).

What about this:

------>8----------

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d2b4260d9d0be..c0349522de8fb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -733,15 +733,22 @@ struct bpf_nh_params {
 	};
 };
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
+#define BPF_RI_F_RI_INIT	BIT(1)
+#define BPF_RI_F_CPU_MAP_INIT	BIT(2)
+#define BPF_RI_F_DEV_MAP_INIT	BIT(3)
+#define BPF_RI_F_XSK_MAP_INIT	BIT(4)
+
 struct bpf_redirect_info {
 	u64 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
 	u32 flags;
-	u32 kern_flags;
 	u32 map_id;
 	enum bpf_map_type map_type;
 	struct bpf_nh_params nh;
+	u32 kern_flags;
 };
 
 struct bpf_net_context {
@@ -757,14 +764,7 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 
 	if (tsk->bpf_net_context != NULL)
 		return NULL;
-	memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
-
-	if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
-		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
-		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
-	}
-	if (IS_ENABLED(CONFIG_XDP_SOCKETS))
-		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+	bpf_net_ctx->ri.kern_flags = 0;
 
 	tsk->bpf_net_context = bpf_net_ctx;
 	return bpf_net_ctx;
@@ -785,6 +785,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_RI_INIT)) {
+		memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
+		bpf_net_ctx->ri.kern_flags |= BPF_RI_F_RI_INIT;
+	}
+
 	return &bpf_net_ctx->ri;
 }
 
@@ -792,6 +797,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_CPU_MAP_INIT)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+		bpf_net_ctx->ri.kern_flags |= BPF_RI_F_CPU_MAP_INIT;
+	}
+
 	return &bpf_net_ctx->cpu_map_flush_list;
 }
 
@@ -799,6 +809,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_DEV_MAP_INIT)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+		bpf_net_ctx->ri.kern_flags |= BPF_RI_F_DEV_MAP_INIT;
+	}
+
 	return &bpf_net_ctx->dev_map_flush_list;
 }
 
@@ -806,12 +821,14 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
 {
 	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
 
+	if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_XSK_MAP_INIT)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+		bpf_net_ctx->ri.kern_flags |= BPF_RI_F_XSK_MAP_INIT;
+	}
+
 	return &bpf_net_ctx->xskmap_map_flush_list;
 }
 
-/* flags for bpf_redirect_info kern_flags */
-#define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
-
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)

------>8----------

Moving kern_flags to the end excludes it from the memset() and lets it be
re-used for the delayed initialisation.
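
For illustration, the call order from the CONFIG_DEBUG_NET case mentioned
earlier that this layout has to survive (sketch only; helper names as in
the diffs in this thread):

	struct bpf_net_context __ctx, *ctx;

	ctx = bpf_net_ctx_set(&__ctx);		/* ri.kern_flags = 0 */

	/* The empty-list check (xdp_do_check_flushed()) may touch a flush
	 * list although no redirect happened, initialising it lazily:
	 */
	bpf_net_ctx_get_cpu_map_flush_list();	/* sets BPF_RI_F_CPU_MAP_INIT */

	/* A later lazy init of ri clears only up to ri.nh, i.e. it stops
	 * before kern_flags, so BPF_RI_F_CPU_MAP_INIT survives:
	 */
	bpf_net_ctx_get_ri();

	bpf_net_ctx_clear(ctx);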

Sebastian
diff mbox series

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b02aea291b7e8..2ff1c394dcf0c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,38 @@  struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	if (bpf_net_ctx)
+		current->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	return current->bpf_net_context;
+}
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->ri;
+}
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -1018,25 +1049,23 @@  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
 
-void bpf_clear_redirect_map(struct bpf_map *map);
-
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
@@ -1592,7 +1621,7 @@  static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
 	/* Lower bits of the flags are used as return code on lookup failure */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9b0ca72db55f..dfa1843ab2916 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@  struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1508,6 +1509,8 @@  struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 				int xdp_n, struct xdp_cpumap_stats *stats,
 				struct list_head *list)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
 
 	if (!rcpu->prog)
 		return xdp_n;
 
 	rcu_read_lock_bh();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
 
@@ -255,6 +257,7 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (unlikely(!list_empty(list)))
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 4e2cdbb5629f2..3d9d62c6525d4 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -196,7 +196,14 @@  static void dev_map_free(struct bpf_map *map)
 	list_del_rcu(&dtab->list);
 	spin_unlock(&dev_map_lock);
 
-	bpf_clear_redirect_map(map);
+	/* bpf_redirect_info->map is assigned in __bpf_xdp_redirect_map()
+	 * during NAPI callback and cleared after the XDP redirect. There is no
+	 * explicit RCU read section which protects bpf_redirect_info->map but
+	 * local_bh_disable() also marks the beginning of an RCU section. This
+	 * makes the complete softirq callback RCU protected. Thus after the
+	 * following synchronize_rcu() there is no bpf_redirect_info->map == map
+	 * assignment.
+	 */
 	synchronize_rcu();
 
 	/* Make sure prior __dev_map_entry_free() have completed. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d83..f314bdd7e6108 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,7 @@  __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+	p->bpf_net_context =  NULL;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@  static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
+	struct bpf_redirect_info *ri;
 	struct xdp_page_head *head;
 	struct xdp_frame *frm;
 	bool redirect = false;
@@ -295,6 +296,8 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 	xdp_clear_return_frame_no_direct();
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
 }
@@ -394,6 +398,7 @@  static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@  static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		if (xdp)
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
+
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2c3f86c8cd176..73965dff1b30f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4031,10 +4031,13 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4063,10 +4066,12 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 			break;
 		}
 		*ret = NET_RX_SUCCESS;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	case TC_ACT_SHOT:
 		kfree_skb_reason(skb, drop_reason);
 		*ret = NET_RX_DROP;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	/* used by tc_run */
 	case TC_ACT_STOLEN:
@@ -4076,8 +4081,10 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		fallthrough;
 	case TC_ACT_CONSUMED:
 		*ret = NET_RX_SUCCESS;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	}
+	bpf_net_ctx_clear(bpf_net_ctx);
 
 	return skb;
 }
@@ -4087,11 +4094,14 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -4107,10 +4117,12 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		/* No need to push/pop skb's mac_header here on egress! */
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	case TC_ACT_SHOT:
 		kfree_skb_reason(skb, drop_reason);
 		*ret = NET_XMIT_DROP;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	/* used by tc_run */
 	case TC_ACT_STOLEN:
@@ -4120,8 +4132,10 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		fallthrough;
 	case TC_ACT_CONSUMED:
 		*ret = NET_XMIT_SUCCESS;
+		bpf_net_ctx_clear(bpf_net_ctx);
 		return NULL;
 	}
+	bpf_net_ctx_clear(bpf_net_ctx);
 
 	return skb;
 }
@@ -6358,6 +6372,7 @@  static void __napi_busy_loop(unsigned int napi_id,
 {
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
@@ -6376,6 +6391,7 @@  static void __napi_busy_loop(unsigned int napi_id,
 		int work = 0;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 		if (!napi_poll) {
 			unsigned long val = READ_ONCE(napi->state);
 
@@ -6406,6 +6422,7 @@  static void __napi_busy_loop(unsigned int napi_id,
 			__NET_ADD_STATS(dev_net(napi->dev),
 					LINUX_MIB_BUSYPOLLRXPACKETS, work);
 		skb_defer_free_flush(this_cpu_ptr(&softnet_data));
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!loop_end || loop_end(loop_end_arg, start_time))
@@ -6833,6 +6850,7 @@  static int napi_thread_wait(struct napi_struct *napi)
 
 static void napi_threaded_poll_loop(struct napi_struct *napi)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
 	unsigned long last_qs = jiffies;
 
@@ -6841,6 +6859,8 @@  static void napi_threaded_poll_loop(struct napi_struct *napi)
 		void *have;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		sd = this_cpu_ptr(&softnet_data);
 		sd->in_napi_threaded_poll = true;
 
@@ -6856,6 +6876,7 @@  static void napi_threaded_poll_loop(struct napi_struct *napi)
 			net_rps_action_and_irq_enable(sd);
 		}
 		skb_defer_free_flush(sd);
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!repoll)
@@ -6881,10 +6902,12 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int budget = READ_ONCE(net_hotdata.netdev_budget);
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
@@ -6937,7 +6960,8 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 		sd->in_net_rx_action = false;
 
 	net_rps_action_and_irq_enable(sd);
-end:;
+end:
+	bpf_net_ctx_clear(bpf_net_ctx);
 }
 
 struct netdev_adjacent {
diff --git a/net/core/filter.c b/net/core/filter.c
index fbcfd563dccfd..f40b8393dd58f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2478,9 +2478,6 @@  static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
-
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -2493,7 +2490,7 @@  static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net *net = dev_net(skb->dev);
 	struct net_device *dev;
 	u32 flags = ri->flags;
@@ -2526,7 +2523,7 @@  int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
@@ -2547,7 +2544,7 @@  static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
@@ -2569,7 +2566,7 @@  static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
@@ -4295,30 +4292,13 @@  void xdp_do_check_flushed(struct napi_struct *napi)
 }
 #endif
 
-void bpf_clear_redirect_map(struct bpf_map *map)
-{
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
-}
-
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
 EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
@@ -4390,7 +4370,7 @@  static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 			map = READ_ONCE(ri->map);
 
 			/* The map pointer is cleared when the map is being torn
-			 * down by bpf_clear_redirect_map()
+			 * down by dev_map_free()
 			 */
 			if (unlikely(!map)) {
 				err = -ENOENT;
@@ -4435,7 +4415,7 @@  static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4449,7 +4429,7 @@  EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4466,7 +4446,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id,
 				       u32 flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
@@ -4478,7 +4458,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 			map = READ_ONCE(ri->map);
 
 			/* The map pointer is cleared when the map is being torn
-			 * down by bpf_clear_redirect_map()
+			 * down by dev_map_free()
 			 */
 			if (unlikely(!map)) {
 				err = -ENOENT;
@@ -4520,7 +4500,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
@@ -4556,7 +4536,7 @@  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@  static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +78,7 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 
 	return ret;