From patchwork Thu Feb 29 18:17:55 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sebastian Andrzej Siewior
X-Patchwork-Id: 13577544
X-Patchwork-Delegate: kuba@kernel.org
From: Sebastian Andrzej Siewior
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Björn Töpel, "David S.
Miller" , Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Eric Dumazet , Hao Luo , Jakub Kicinski , Jesper Dangaard Brouer , Jiri Olsa , John Fastabend , Jonathan Lemon , KP Singh , Maciej Fijalkowski , Magnus Karlsson , Martin KaFai Lau , Paolo Abeni , Peter Zijlstra , Song Liu , Stanislav Fomichev , Thomas Gleixner , Yonghong Song , Sebastian Andrzej Siewior Subject: [PATCH RFC v2 net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. Date: Thu, 29 Feb 2024 19:17:55 +0100 Message-ID: <20240229183109.646865-2-bigeasy@linutronix.de> In-Reply-To: <20240229183109.646865-1-bigeasy@linutronix.de> References: <20240229183109.646865-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC The XDP redirect process is two staged: - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the packet and makes decisions. While doing that, the per-CPU variable bpf_redirect_info is used. - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info and it may also access other per-CPU variables like xskmap_flush_list. At the very end of the NAPI callback, xdp_do_flush() is invoked which does not access bpf_redirect_info but will touch the individual per-CPU lists. The per-CPU variables are only used in the NAPI callback hence disabling bottom halves is the only protection mechanism. Users from preemptible context (like cpu_map_kthread_run()) explicitly disable bottom halves for protections reasons. Without locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. PREEMPT_RT has forced-threaded interrupts enabled and every NAPI-callback runs in a thread. If each thread has its own data structure then locking can be avoided. Create a struct bpf_net_context which contains struct bpf_redirect_info. Define the variable on stack, use bpf_net_ctx_set() to save a pointer to it. Use the __free() annotation to automatically reset the pointer once function returns. The bpf_net_ctx_set() may nest. For instance local_bh_enable() in bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which also uses bpf_net_ctx_set(). Therefore only the first invocations updates the per-task pointer. Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct bpf_redirect_info. On PREEMPT_RT the pointer to bpf_net_context is saved task's task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU variable (which is always NODE-local memory). Using always the bpf_net_context approach has the advantage that there is almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds. 
Signed-off-by: Sebastian Andrzej Siewior
---
 include/linux/filter.h | 96 ++++++++++++++++++++++++++++++++++++++---
 include/linux/sched.h  |  5 +++
 kernel/bpf/cpumap.c    |  2 +
 kernel/fork.c          |  3 ++
 net/bpf/test_run.c     |  9 +++-
 net/core/dev.c         | 17 ++++++++
 net/core/filter.c      | 97 ++++++++++++++++++++++++++----------------
 net/core/lwt_bpf.c     |  3 ++
 8 files changed, 187 insertions(+), 45 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 68fb6c8142fec..6d065991f9e9c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -704,7 +704,83 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+#ifndef CONFIG_PREEMPT_RT
+DECLARE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != NULL)
+		return NULL;
+	this_cpu_write(bpf_net_context, bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != bpf_net_ctx)
+		return;
+	this_cpu_write(bpf_net_context, NULL);
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#else
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != bpf_net_ctx)
+		return;
+	tsk->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = current->bpf_net_context;
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#endif
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -974,23 +1050,27 @@ void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
+	if (!ri)
+		return false;
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
@@ -1544,9 +1624,11 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						  u64 flags, const u64 flag_mask,
 						  void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
+	if (!ri)
+		return XDP_ABORTED;
 	/* Lower bits of the flags are used as return code on lookup failure */
 	if (unlikely(flags & ~(action_mask | flag_mask)))
 		return XDP_ABORTED;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 111f388d65323..e3a50de62015e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1501,6 +1502,10 @@ struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
+#endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..6d140f0769d4c 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -261,10 +261,12 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 static int cpu_map_kthread_run(void *data)
 {
+	struct bpf_net_context bpf_net_ctx __cleanup(bpf_net_ctx_clear);
 	struct bpf_cpu_map_entry *rcpu = data;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
+	bpf_net_ctx_set(&bpf_net_ctx);
 
 	/* When kthread gives stop order, then rcpu have been disconnected
 	 * from map, thus no new packets can enter. Remaining in-flight
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43ff..73fe3bf7b8838 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2462,6 +2462,9 @@ __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	p->bpf_net_context = NULL;
+#endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd9193740178..3e7f5404d177f 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -281,7 +281,7 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
 	struct xdp_page_head *head;
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -365,8 +366,10 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 				 u32 repeat, u32 batch_size, u32 *time)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct xdp_test_data xdp = { .batch_size = batch_size };
 	struct bpf_test_timer t = { .mode = NO_MIGRATE };
+	struct bpf_net_context __bpf_net_ctx;
 	int ret;
 
 	if (!repeat)
@@ -376,6 +379,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 	if (ret)
 		return ret;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_test_timer_enter(&t);
 	do {
 		xdp.frame_cnt = 0;
@@ -392,7 +396,9 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_prog_array_item item = {.prog = prog};
+	struct bpf_net_context __bpf_net_ctx;
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
 	struct bpf_test_timer t = { NO_MIGRATE };
@@ -412,6 +418,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	if (!repeat)
 		repeat = 1;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_test_timer_enter(&t);
 	old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	do {
diff --git a/net/core/dev.c b/net/core/dev.c
index b94ce3a394b7c..95c2cd971e03c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4043,13 +4047,17 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6239,11 +6247,14 @@ void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
@@ -6716,10 +6727,13 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 
 static int napi_threaded_poll(void *data)
 {
+	struct bpf_net_context bpf_net_ctx __cleanup(bpf_net_ctx_clear);
 	struct napi_struct *napi = data;
 	struct softnet_data *sd;
 	void *have;
 
+	bpf_net_ctx_set(&bpf_net_ctx);
+
 	while (!napi_thread_wait(napi)) {
 		for (;;) {
 			bool repoll = false;
@@ -6753,13 +6767,16 @@ static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(netdev_budget_usecs));
 	int budget = READ_ONCE(netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index eb8d5a0a0ec8f..3ea58b830caa8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2473,8 +2473,10 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
+#ifndef CONFIG_PREEMPT_RT
+DEFINE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_net_context);
+#endif
 
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
@@ -2488,11 +2490,15 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net *net = dev_net(skb->dev);
+	struct bpf_redirect_info *ri;
 	struct net_device *dev;
-	u32 flags = ri->flags;
+	u32 flags;
 
+	ri = bpf_net_ctx_get_ri();
+	if (!ri)
+		goto out_drop;
+	flags = ri->flags;
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2521,9 +2527,9 @@ int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = flags;
@@ -2542,9 +2548,9 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_PEER;
@@ -2564,9 +2570,9 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) || flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
@@ -4292,19 +4298,17 @@ void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within an
+	 * eBPF program during the NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/__xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protected and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exists. dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4313,11 +4317,14 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
+		ri = bpf_net_ctx_get_ri();
+		if (!ri)
+			return XDP_ABORTED;
 		/* The target device is different from the receiving device, so
 		 * redirect it to the new device.
 		 * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
@@ -4419,10 +4426,12 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
@@ -4433,10 +4442,12 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
@@ -4450,10 +4461,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       void *fwd, enum bpf_map_type map_type,
 				       u32 map_id)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
@@ -4495,12 +4510,20 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
-	void *fwd = ri->tgt_value;
-	u32 map_id = ri->map_id;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
+	enum bpf_map_type map_type;
+	void *fwd;
+	u32 map_id;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
+	map_type = ri->map_type;
+	fwd = ri->tgt_value;
+	map_id = ri->map_id;
+
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
 	ri->map_type = BPF_MAP_TYPE_UNSPEC;
@@ -4529,9 +4552,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return XDP_ABORTED;
 
 	/* NB! Map type UNSPEC and map_id == INT_MAX (never generated
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..09012d0daabe9 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,15 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
+	struct bpf_net_context __bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);

From patchwork Thu Feb 29 18:17:56 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sebastian Andrzej Siewior
X-Patchwork-Id: 13577543
X-Patchwork-Delegate: kuba@kernel.org
From: Sebastian Andrzej Siewior
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Björn Töpel, "David S.
Miller" , Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Eric Dumazet , Hao Luo , Jakub Kicinski , Jesper Dangaard Brouer , Jiri Olsa , John Fastabend , Jonathan Lemon , KP Singh , Maciej Fijalkowski , Magnus Karlsson , Martin KaFai Lau , Paolo Abeni , Peter Zijlstra , Song Liu , Stanislav Fomichev , Thomas Gleixner , Yonghong Song , Sebastian Andrzej Siewior Subject: [PATCH RFC v2 net-next 2/2] net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT. Date: Thu, 29 Feb 2024 19:17:56 +0100 Message-ID: <20240229183109.646865-3-bigeasy@linutronix.de> In-Reply-To: <20240229183109.646865-1-bigeasy@linutronix.de> References: <20240229183109.646865-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC The per-CPU flush lists, which are accessed from within the NAPI callback (xdp_do_flush() for instance), are per-CPU. There are subject to the same problem as struct bpf_redirect_info. Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the access. Signed-off-by: Sebastian Andrzej Siewior --- include/linux/filter.h | 38 ++++++++++++++++++++++++++++++++++++++ kernel/bpf/cpumap.c | 24 ++++++++---------------- kernel/bpf/devmap.c | 16 ++++++++-------- net/xdp/xsk.c | 19 +++++++++++-------- 4 files changed, 65 insertions(+), 32 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 6d065991f9e9c..41c3dbbc4d163 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -706,6 +706,9 @@ struct bpf_redirect_info { struct bpf_net_context { struct bpf_redirect_info ri; + struct list_head cpu_map_flush_list; + struct list_head dev_map_flush_list; + struct list_head xskmap_map_flush_list; }; #ifndef CONFIG_PREEMPT_RT @@ -718,6 +721,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp ctx = this_cpu_read(bpf_net_context); if (ctx != NULL) return NULL; + INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list); + INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list); + INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list); + this_cpu_write(bpf_net_context, bpf_net_ctx); return bpf_net_ctx; } @@ -748,6 +755,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp if (tsk->bpf_net_context != NULL) return NULL; + INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list); + INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list); + INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list); + tsk->bpf_net_context = bpf_net_ctx; return bpf_net_ctx; } @@ -780,6 +791,33 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void) return &bpf_net_ctx->ri; } +static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void) +{ + struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get(); + + if (!bpf_net_ctx) + return NULL; + return &bpf_net_ctx->cpu_map_flush_list; +} + +static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void) +{ + struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get(); + + if (!bpf_net_ctx) + return NULL; + return &bpf_net_ctx->dev_map_flush_list; +} + +static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void) +{ + struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get(); + + if (!bpf_net_ctx) + return NULL; + return &bpf_net_ctx->xskmap_map_flush_list; +} + DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T)); /* flags for 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 6d140f0769d4c..59677cea28e18 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -78,8 +78,6 @@ struct bpf_cpu_map {
 	struct bpf_cpu_map_entry __rcu **cpu_map;
 };
 
-static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
-
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
@@ -703,7 +701,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -755,9 +753,12 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 
 void __cpu_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_flush_to_queue(bq);
 
@@ -769,20 +770,11 @@ void __cpu_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool cpu_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
+
+	if (!flush_list || list_empty(flush_list))
 		return false;
 	__cpu_map_flush();
 	return true;
 }
 #endif
-
-static int __init cpu_map_init(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
-	return 0;
-}
-
-subsys_initcall(cpu_map_init);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a936c704d4e77..463f3c5eaa895 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,6 @@ struct bpf_dtab {
 	u32 n_buckets;
 };
 
-static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
@@ -407,9 +406,12 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
  */
 void __dev_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_xmit_all(bq, XDP_XMIT_FLUSH);
 		bq->dev_rx = NULL;
@@ -421,7 +423,9 @@ void __dev_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool dev_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&dev_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
+
+	if (!flush_list || list_empty(flush_list))
 		return false;
 	__dev_flush();
 	return true;
@@ -452,7 +456,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1155,15 +1159,11 @@ static struct notifier_block dev_map_notifier = {
 
 static int __init dev_map_init(void)
 {
-	int cpu;
-
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
 		     offsetof(struct _bpf_dtab_netdev, dev));
 	register_netdevice_notifier(&dev_map_notifier);
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
 	return 0;
 }
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b78c0e095e221..7a130d0871de1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,6 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
 
-static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -372,9 +370,12 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	int err;
 
+	if (!flush_list)
+		return -EINVAL;
+
 	err = xsk_rcv(xs, xdp);
 	if (err)
 		return err;
@@ -387,9 +388,11 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 void __xsk_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	struct xdp_sock *xs, *tmp;
 
+	if (!flush_list)
+		return;
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
 		xsk_flush(xs);
 		__list_del_clearprev(&xs->flush_node);
@@ -399,7 +402,9 @@ void __xsk_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool xsk_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
+
+	if (!flush_list || list_empty(flush_list))
 		return false;
 	__xsk_map_flush();
 	return true;
@@ -1770,7 +1775,7 @@ static struct pernet_operations xsk_net_ops = {
 
 static int __init xsk_init(void)
 {
-	int err, cpu;
+	int err;
 
 	err = proto_register(&xsk_proto, 0 /* no slab */);
 	if (err)
@@ -1788,8 +1793,6 @@ static int __init xsk_init(void)
 	if (err)
 		goto out_pernet;
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
 	return 0;
 
 out_pernet:
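
[Editor's illustration, not part of the patch: a minimal sketch of why
the flush lists can live in bpf_net_context. Both the enqueue (via
xdp_do_redirect()) and the final xdp_do_flush() happen within one NAPI
poll, i.e. within one bpf_net_ctx_set()/bpf_net_ctx_clear() scope, so
the lists are private to that context. The driver helpers
example_rx_one(), example_run_prog() and example_prog() are
hypothetical.]

/* Hypothetical driver helpers, assumed for illustration only: */
static bool example_rx_one(struct napi_struct *napi, struct xdp_buff *xdp);
static u32 example_run_prog(struct napi_struct *napi, struct xdp_buff *xdp);
static struct bpf_prog *example_prog(struct napi_struct *napi);

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	int done = 0;

	/* net_rx_action()/napi_threaded_poll() installed a bpf_net_context
	 * before invoking this poll function (patch 1/2).
	 */
	while (done < budget) {
		struct xdp_buff xdp;

		if (!example_rx_one(napi, &xdp))
			break;
		/* XDP_REDIRECT enqueues onto the cpu/dev/xsk flush list of
		 * the current bpf_net_context.
		 */
		if (example_run_prog(napi, &xdp) == XDP_REDIRECT)
			xdp_do_redirect(napi->dev, &xdp, example_prog(napi));
		done++;
	}
	/* Drain the flush lists of the same context before it goes away. */
	xdp_do_flush();
	return done;
}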