
[net-next,15/24] net: Use nested-BH locking for XDP redirect.

Message ID 20231215171020.687342-16-bigeasy@linutronix.de (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series: locking: Introduce nested-BH locking.

Checks

Context Check Description
netdev/series_format fail Series longer than 15 patches (and no cover letter)
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 1171 this patch: 1189
netdev/cc_maintainers warning 1 maintainers not CCed: yan@cloudflare.com
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 1198 this patch: 1216
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Sebastian Andrzej Siewior Dec. 15, 2023, 5:07 p.m. UTC
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Since local_bh_disable() does not provide this locking on PREEMPT_RT,
these data structures require explicit locking.

This is a follow-up to the previous change, which introduced
bpf_run_lock.redirect_lock, and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).

Ideally, the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect(), and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
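
As an illustration, the intended per-driver pattern looks roughly like
the sketch below. The function and its parameters are made up; guard(),
local_lock_nested_bh() and bpf_run_lock.redirect_lock are the primitives
introduced earlier in this series.

static u32 drv_run_xdp(struct net_device *dev, struct bpf_prog *prog,
		       struct xdp_buff *xdp)
{
	u32 act;

	/* Lock only around the BPF program and a possible redirect; error
	 * recovery, DMA unmapping and memory alloc/free stay outside.
	 */
	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);

	act = bpf_prog_run_xdp(prog, xdp);
	if (act == XDP_REDIRECT && xdp_do_redirect(dev, xdp, prog) < 0)
		act = XDP_ABORTED;

	return act;	/* guard() drops redirect_lock on scope exit */
}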

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Ronak Doshi <doshir@vmware.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/vmxnet3/vmxnet3_xdp.c |  1 +
 kernel/bpf/cpumap.c               |  2 ++
 net/bpf/test_run.c                | 11 ++++++++---
 net/core/dev.c                    |  3 +++
 net/core/filter.c                 |  1 +
 net/core/lwt_bpf.c                |  2 ++
 net/sched/cls_api.c               |  2 ++
 7 files changed, 19 insertions(+), 3 deletions(-)

Comments

kernel test robot Dec. 16, 2023, 4:12 a.m. UTC | #1
Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-16-bigeasy%40linutronix.de
patch subject: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
config: x86_64-randconfig-122-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161103.iKeGoacH-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   net/sched/cls_api.c:391:22: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __be16 [usertype] protocol @@     got unsigned int [usertype] protocol @@
   net/sched/cls_api.c:391:22: sparse:     expected restricted __be16 [usertype] protocol
   net/sched/cls_api.c:391:22: sparse:     got unsigned int [usertype] protocol
   net/sched/cls_api.c:1862:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1924:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:2010:25: sparse: sparse: restricted __be16 degrades to integer
   net/sched/cls_api.c:2697:50: sparse: sparse: restricted __be16 degrades to integer
>> net/sched/cls_api.c:3933:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/sched/cls_api.c:3933:38: sparse:     expected struct local_lock_t [usertype] *l
   net/sched/cls_api.c:3933:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/sched/cls_api.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
   include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected void const [noderef] __percpu *__vpp_verify @@     got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *

vim +3933 net/sched/cls_api.c

  3921	
  3922	struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
  3923					  struct sk_buff **to_free, int *ret)
  3924	{
  3925		struct tcf_result cl_res;
  3926		struct tcf_proto *fl;
  3927	
  3928		if (!qe->info.block_index)
  3929			return skb;
  3930	
  3931		fl = rcu_dereference_bh(qe->filter_chain);
  3932	
> 3933		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
  3934		switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
  3935		case TC_ACT_SHOT:
  3936			qdisc_qstats_drop(sch);
  3937			__qdisc_drop(skb, to_free);
  3938			*ret = __NET_XMIT_BYPASS;
  3939			return NULL;
  3940		case TC_ACT_STOLEN:
  3941		case TC_ACT_QUEUED:
  3942		case TC_ACT_TRAP:
  3943			__qdisc_drop(skb, to_free);
  3944			*ret = __NET_XMIT_STOLEN;
  3945			return NULL;
  3946		case TC_ACT_REDIRECT:
  3947			skb_do_redirect(skb);
  3948			*ret = __NET_XMIT_STOLEN;
  3949			return NULL;
  3950		}
  3951	
  3952		return skb;
  3953	}
  3954	EXPORT_SYMBOL(tcf_qevent_handle);
  3955
Alexei Starovoitov Dec. 20, 2023, 12:25 a.m. UTC | #2
On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a0f6da7b3ae5..5ba7509e88752 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>                 *pt_prev = NULL;
>         }
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>         tcx_set_ingress(skb, true);
>
> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>         if (!entry)
>                 return skb;
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>          * already set by the caller.
>          */
> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>                 u32 act;
>                 int err;
>
> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>                 if (act != XDP_PASS) {
>                         switch (act) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7c9653734fb60..72a7812f933a1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>   */
>  void xdp_do_flush(void)
>  {
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         __dev_flush();
>         __cpu_map_flush();
>         __xsk_map_flush();
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index a94943681e5aa..74b88e897a7e3 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>          * BPF prog and skb_do_redirect().
>          */
>         local_bh_disable();
> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>         bpf_compute_data_pointers(skb);
>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>
> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>                 break;
>         }
>
> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>         local_bh_enable();
>
>         return ret;
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 1976bd1639863..da61b99bc558f 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -23,6 +23,7 @@
>  #include <linux/jhash.h>
>  #include <linux/rculist.h>
>  #include <linux/rhashtable.h>
> +#include <linux/bpf.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  #include <net/netlink.h>
> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>
>         fl = rcu_dereference_bh(qe->filter_chain);
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>         case TC_ACT_SHOT:
>                 qdisc_qstats_drop(sch);

Here and in all other places this patch adds locks that
will kill performance of XDP, tcx and everything else in networking.

I'm surprised Jesper and other folks are not jumping in with nacks.
We measure performance in nanoseconds here.
Extra lock is no go.
Please find a different way without ruining performance.
Toke Høiland-Jørgensen Jan. 4, 2024, 7:29 p.m. UTC | #3
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 5a0f6da7b3ae5..5ba7509e88752 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>>                 *pt_prev = NULL;
>>         }
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>>         tcx_set_ingress(skb, true);
>>
>> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>>         if (!entry)
>>                 return skb;
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>>          * already set by the caller.
>>          */
>> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>>                 u32 act;
>>                 int err;
>>
>> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>>                 if (act != XDP_PASS) {
>>                         switch (act) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 7c9653734fb60..72a7812f933a1 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>   */
>>  void xdp_do_flush(void)
>>  {
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         __dev_flush();
>>         __cpu_map_flush();
>>         __xsk_map_flush();
>> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
>> index a94943681e5aa..74b88e897a7e3 100644
>> --- a/net/core/lwt_bpf.c
>> +++ b/net/core/lwt_bpf.c
>> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>          * BPF prog and skb_do_redirect().
>>          */
>>         local_bh_disable();
>> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>>         bpf_compute_data_pointers(skb);
>>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>>
>> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>                 break;
>>         }
>>
>> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>>         local_bh_enable();
>>
>>         return ret;
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 1976bd1639863..da61b99bc558f 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/jhash.h>
>>  #include <linux/rculist.h>
>>  #include <linux/rhashtable.h>
>> +#include <linux/bpf.h>
>>  #include <net/net_namespace.h>
>>  #include <net/sock.h>
>>  #include <net/netlink.h>
>> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>>
>>         fl = rcu_dereference_bh(qe->filter_chain);
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>>         case TC_ACT_SHOT:
>>                 qdisc_qstats_drop(sch);
>
> Here and in all other places this patch adds locks that
> will kill performance of XDP, tcx and everything else in networking.
>
> I'm surprised Jesper and other folks are not jumping in with nacks.
> We measure performance in nanoseconds here.
> Extra lock is no go.
> Please find a different way without ruining performance.

I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
believe there are people who are using XDP on PREEMPT_RT kernels and
still expect decent performance. And to achieve that it is absolutely
imperative that we can amortise expensive operations (such as locking)
over multiple packets.

I realise there's a fundamental trade-off between the amount of
amortisation and the latency hit that we take from holding locks for
longer, but tuning the batch size (while still keeping some amount of
batching) may be a way forward? I suppose Jakub's suggestion in the
other part of the thread, of putting the locks around napi->poll(), is a
step towards something like this.

-Toke
Sebastian Andrzej Siewior Jan. 12, 2024, 5:41 p.m. UTC | #4
On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >>
> >>         fl = rcu_dereference_bh(qe->filter_chain);
> >>
> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> >>         case TC_ACT_SHOT:
> >>                 qdisc_qstats_drop(sch);
> >
> > Here and in all other places this patch adds locks that
> > will kill performance of XDP, tcx and everything else in networking.
> >
> > I'm surprised Jesper and other folks are not jumping in with nacks.
> > We measure performance in nanoseconds here.
> > Extra lock is no go.
> > Please find a different way without ruining performance.
> 
> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
> believe there are people who are using XDP on PREEMPT_RT kernels and
> still expect decent performance. And to achieve that it is absolutely
> imperative that we can amortise expensive operations (such as locking)
> over multiple packets.
> 
> I realise there's a fundamental trade-off between the amount of
> amortisation and the latency hit that we take from holding locks for
> longer, but tuning the batch size (while still keeping some amount of
> batching) may be a way forward? I suppose Jakub's suggestion in the
> other part of the thread, of putting the locks around napi->poll(), is a
> step towards something like this.

The RT requirements are usually different. Networking as in CAN might be
important while Ethernet is only used for remote communication and is
therefore "not" important. People complained that they have to wait for
Ethernet to be done before the CAN packet can be injected into the stack.
With that expectation you would like to pause Ethernet immediately and
switch over to the CAN interrupt thread.

But if someone managed to set up XDP then it is likely to be important.
With RT traffic it is usually not the throughput that matters but the
latency. You are likely in the position to receive a packet, say every
1ms, and need to respond immediately. XDP would be used to inspect the
packet and either hand it over to the stack or process it.

I expected the lock operation (under RT) to always succeed and not
cause any delay because it should not be contended. It should only
block if something with higher priority preempted the current interrupt
thread _and_ also happens to use XDP on the same CPU. In that case (XDP
is needed) it would flush the current user out of the locked section
before the higher-prio thread could take over. Doing bulk processing and
allowing the low-priority thread to complete would delay the
high-priority thread. Maybe I am too pessimistic here and having two XDP
programs on one CPU is unlikely to happen.

Adding the lock on a per-NAPI basis would allow batching packets.
Acquiring the lock only if XDP is supported would not block the CAN
drivers since they don't support XDP. But it sounds like a hack.
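A per-NAPI variant could look roughly like the following sketch;
napi_supports_xdp() is invented purely to illustrate the condition, the
lock is the one from this series.

/* Hypothetical sketch only: napi_supports_xdp() does not exist and just
 * stands for "this driver's poll function may run an XDP program".
 */
static int napi_poll_one(struct napi_struct *n, int weight)
{
	int work;

	if (napi_supports_xdp(n))
		local_lock_nested_bh(&bpf_run_lock.redirect_lock);

	work = n->poll(n, weight);

	if (napi_supports_xdp(n))
		local_unlock_nested_bh(&bpf_run_lock.redirect_lock);

	return work;
}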

Daniel said netkit doesn't need this locking because it does not
support this redirect, and it made me think. Would it work to make
the redirect structures part of the bpf_prog structure instead of
per-CPU? My understanding is that an eBPF program's data structures are
part of it and contain locking, allowing one eBPF program to preempt
another one.
Having the redirect structures be part of the program would make the
locking obsolete. Do I miss anything?

> -Toke

Sebastian
Toke Høiland-Jørgensen Jan. 17, 2024, 4:37 p.m. UTC | #5
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >>         fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >>         case TC_ACT_SHOT:
>> >>                 qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>> 
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>> 
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important but Ethernet could only used for remote communication and so
> "not" important. People complained that they need to wait for Ethernet
> to be done until the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over the CAN interrupt thread.
>
> But if someone managed to setup XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.

I am not contesting that latency is important, but it's a pretty
fundamental trade-off and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; and there
are definitely people who want to use XDP on an RT kernel and still
achieve high PPS rates.

(Whether those people really strictly speaking need to be running an RT
kernel is maybe debatable, but it does happen).

> I expected the lock operation (under RT) to always succeeds and not
> cause any delay because it should not be contended.

A lock does cause delay even when it's not contended. Bear in mind that
at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
packet (for 64-byte packets). So just the atomic op to figure out
whether there's any contention (around 10ns on the Intel processors I
usually test on) will blow a huge chunk of the total processing budget.
We can't actually do the full processing needed in those 64 nanoseconds
(not to mention the 6.4 nanoseconds we have available at 100Gbps), which
is why it's essential to amortise as much as we can over multiple
packets.

This is all back-of-the-envelope calculations, of course. Having some
actual numbers to look at would be great; I don't suppose you have a
setup where you can run xdp-bench and see how your patches affect the
throughput?

> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happen to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk and allowing the low-priority thread to complete would
> delay the high-priority thread. Maybe I am too pessimistic here and
> having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on per-NAPI basis would allow to batch packets.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers since they dont't support XDP. But sounds like a hack.

I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing in two stages, each
with their individual batching. So whereas right now we're doing
something like:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      run_netstack(pkt)  // this is the expensive bit
  bh_enable()

We could instead do:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  bh_enable()
  // sched point
  bh_disable()
  for pkt in to_stack_list:
    run_netstack(pkt)
  bh_enable()


This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
being that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So not really
sure if this approach is feasible, TBH.
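
(For concreteness, a driver-side version of this might look something
like the sketch below; drv_next_rx_skb() and drv_run_xdp() are
placeholders, netif_receive_skb_list() is the existing batched entry
point into the stack.)

/* Rough sketch of the two-stage split above, not a real driver conversion. */
static int drv_poll(struct napi_struct *napi, int budget)
{
	LIST_HEAD(to_stack_list);
	struct sk_buff *skb;
	int done = 0;

	local_bh_disable();
	while (done < budget && (skb = drv_next_rx_skb(napi))) {
		done++;
		if (drv_run_xdp(napi, skb) == XDP_PASS)
			list_add_tail(&skb->list, &to_stack_list);
	}
	local_bh_enable();

	/* sched point; then the expensive stack processing as one batch */
	local_bh_disable();
	netif_receive_skb_list(&to_stack_list);
	local_bh_enable();

	return done;
}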

> Daniel said netkit doesn't need this locking because it is not
> supporting this redirect and it made me think. Would it work to make
> the redirect structures part of the bpf_prog-structure instead of
> per-CPU? My understanding is that eBPF's programs data structures are
> part of it and contain locking allowing one eBPF program preempt
> another one.
> Having the redirect structures part of the program would obsolete
> locking. Do I miss anything?

This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why this is all being kept in per-CPU variables today.

-Toke
Jakub Kicinski Jan. 18, 2024, 2:04 a.m. UTC | #6
On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
> I am not contesting that latency is important, but it's a pretty
> fundamental trade-off and we don't want to kill throughput entirely
> either. Especially since this is global to the whole kernel; and there
> are definitely people who want to use XDP on an RT kernel and still
> achieve high PPS rates.
> 
> (Whether those people really strictly speaking need to be running an RT
> kernel is maybe debatable, but it does happen).
> 
> > I expected the lock operation (under RT) to always succeeds and not
> > cause any delay because it should not be contended.  
> 
> A lock does cause delay even when it's not contended. Bear in mind that
> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
> packet (for 64-byte packets). So just the atomic op to figure out
> whether there's any contention (around 10ns on the Intel processors I
> usually test on) will blow a huge chunk of the total processing budget.
> We can't actually do the full processing needed in those 64 nanoseconds
> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
> is why it's essential to amortise as much as we can over multiple
> packets.
> 
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

A potentially stupid idea which I have been turning in my head is 
how we could get away from having the driver handle details of NAPI
budgeting. It's a source of bugs and endless review comments.

All drivers end up maintaining a counter of "how many packets have
I processed" and comparing that against the budget. Would it be crazy
if we put that inside napi_struct? Add a "budget" member inside
napi_struct as well, and:

struct napi_struct {
...
	// poll state
	unsigned int budget;
	unsigned int rx_used;
...
}

static inline bool napi_rx_has_budget(napi)
{
	return napi->budget > napi->rx_used;
}

poll(napi) // no budget
{
	while (napi_rx_has_budget(napi)) {
		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
		// maybe add explicit napi_rx_count() if
		// driver did something funny with the frame.
	}
}

We can also create napi_tx_has_budget() so that people stop being
confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
so that people stop completing XDP in hard irq context (budget==0)...

And we can pass napi into napi_consume_skb(), instead of, presumably
inexplicably to a newcomer, passing in budget.
And napi_complete_done() can lose the work_done argument, too.

Oh, and I'm bringing it up here because CONFIG_RT can throw
"need_resched()" into napi_rx_has_budget(), obviously.
Sebastian Andrzej Siewior Jan. 18, 2024, 7:35 a.m. UTC | #7
On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

No but I probably could set it up.

> I chatted with Jesper about this, and he had an idea not too far from
> this: split up the XDP and regular stack processing in two stages, each
> with their individual batching. So whereas right now we're doing
> something like:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       run_netstack(pkt)  // this is the expensive bit
>   bh_enable()
> 
> We could instead do:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       add_to_list(pkt, to_stack_list)
>   bh_enable()
>   // sched point
>   bh_disable()
>   for pkt in to_stack_list:
>     run_netstack(pkt)
>   bh_enable()
> 
> 
> This would limit the batching that blocks everything to only the XDP
> processing itself, which should limit the maximum time spent in the
> blocking state significantly compared to what we have today. The caveat
> being that rearranging things like this is potentially a pretty major
> refactoring task that needs to touch all the drivers (even if some of
> the logic can be moved into the core code in the process). So not really
> sure if this approach is feasible, TBH.

This does not work because bh_disable() does not disable scheduling.
Scheduling may happen. bh_disable() acquires a lock which is currently
the only synchronisation point between, say, two network drivers doing
NAPI. And this is what I want to get rid of.
Regarding the expensive bit as in XDP_PASS: this doesn't need locking as
per the proposal, just the REDIRECT piece.

> > Daniel said netkit doesn't need this locking because it is not
> > supporting this redirect and it made me think. Would it work to make
> > the redirect structures part of the bpf_prog-structure instead of
> > per-CPU? My understanding is that eBPF's programs data structures are
> > part of it and contain locking allowing one eBPF program preempt
> > another one.
> > Having the redirect structures part of the program would obsolete
> > locking. Do I miss anything?
> 
> This won't work, unfortunately: the same XDP program can be attached to
> multiple interfaces simultaneously, and for hardware with multiple
> receive queues (which is most of the hardware that supports XDP), it can
> even run simultaneously on multiple CPUs on the same interface. This is
> the reason why this is all being kept in per-CPU variables today.

So I started hacking on this yesterday and noticed that you can run
multiple bpf programs. This is how I learned that it won't work.
My plan B is now to move it into task_struct.

> -Toke
Sebastian
Sebastian Andrzej Siewior Jan. 18, 2024, 8:27 a.m. UTC | #8
On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> Oh, and I'm bringing it up here, because CONFIG_RT can throw
> in "need_resched()" into the napi_rx_has_budget(), obviously.

need_resched() does not work on PREEMPT_RT the way you think. This
context (the NAPI poll callback) is preemptible and (by default) runs at
SCHED_FIFO 50 (within a threaded IRQ), so a context switch can be forced
at any time by a task with higher priority.
If threadA gets preempted and owns a lock that threadB, with higher
priority, wants, then threadA will get back on the CPU, inherit the
priority of threadB and continue to run until it releases the lock.

If this is the per-CPU BH lock (which I want to remove) then it will
continue until all softirqs complete.

Sebastian
Toke Høiland-Jørgensen Jan. 18, 2024, 11:51 a.m. UTC | #9
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>> 
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>> 
>> > I expected the lock operation (under RT) to always succeeds and not
>> > cause any delay because it should not be contended.  
>> 
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>> 
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning in my head is 
> how we could get away from having the driver handle details of NAPI
> budgeting. It's an source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> ...
> 	// poll state
> 	unsigned int budget;
> 	unsigned int rx_used;
> ...
> }
>
> static inline bool napi_rx_has_budget(napi)
> {
> 	return napi->budget > napi->rx_used;
> }
>
> poll(napi) // no budget
> {
> 	while (napi_rx_has_budget(napi)) {
> 		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> 		// maybe add explicit napi_rx_count() if
> 		// driver did something funny with the frame.
> 	}
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
> so that people stop completing XDP in hard irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb(), instead of, presumably
> inexplicably to a newcomer, passing in budget.
> And napi_complete_done() can lose the work_done argument, too.

I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll
API?

-Toke
Toke Høiland-Jørgensen Jan. 18, 2024, 11:58 a.m. UTC | #10
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> No but I probably could set it up.

That would be great! Feel free to ping me if you need any pointers to
how we usually do the perf measurements :)

>> I chatted with Jesper about this, and he had an idea not too far from
>> this: split up the XDP and regular stack processing in two stages, each
>> with their individual batching. So whereas right now we're doing
>> something like:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       run_netstack(pkt)  // this is the expensive bit
>>   bh_enable()
>> 
>> We could instead do:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       add_to_list(pkt, to_stack_list)
>>   bh_enable()
>>   // sched point
>>   bh_disable()
>>   for pkt in to_stack_list:
>>     run_netstack(pkt)
>>   bh_enable()
>> 
>> 
>> This would limit the batching that blocks everything to only the XDP
>> processing itself, which should limit the maximum time spent in the
>> blocking state significantly compared to what we have today. The caveat
>> being that rearranging things like this is potentially a pretty major
>> refactoring task that needs to touch all the drivers (even if some of
>> the logic can be moved into the core code in the process). So not really
>> sure if this approach is feasible, TBH.
>
> This does not work because bh_disable() does not disable scheduling.
> Scheduling may happen. bh_disable() acquires a lock which is currently
> the only synchronisation point between two say network driver doing
> NAPI. And this what I want to get rid of.
> Regarding expensive bit as in XDP_PASS: This doesn't need locking as per
> proposal, just the REDIRECT piece.

Right, well s/bh_disable()/lock()/; my main point was splitting up the
processing so that the XDP processing itself and the stack activation on
XDP_PASS are not interleaved. This will make it possible to hold the lock
around the whole XDP batch, not just individual packets, and so retain
the performance we gain from amortising expensive operations over
multiple packets.

-Toke
Jakub Kicinski Jan. 18, 2024, 4:37 p.m. UTC | #11
On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
> I do agree that conceptually it makes a lot of sense to encapsulate the
> budget like this so drivers don't have to do all this state tracking
> themselves. It does appear that drivers are doing different things with
> the budget as it is today, though. For instance, the intel drivers seem
> to divide the budget over all the enabled RX rings(?); so I'm wondering
> if it'll be possible to unify drivers around a more opaque NAPI poll API?

We can come up with APIs which would cater to multi-queue cases.
Bigger question is what is the sensible polling strategy for those,
just dividing the budget seems, hm, crude.
Jakub Kicinski Jan. 18, 2024, 4:38 p.m. UTC | #12
On Thu, 18 Jan 2024 09:27:54 +0100 Sebastian Andrzej Siewior wrote:
> On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> > Oh, and I'm bringing it up here, because CONFIG_RT can throw
> > in "need_resched()" into the napi_rx_has_budget(), obviously.  
> 
> need_resched() does not work on PREEMPT_RT the way you think. This
> context (the NAPI poll callback) is preemptible and (by default) runs at
> SCHED_FIFO 50 (within a threaded IRQ) so a context switch can happen at
> any time by a task with higher priority.
> If threadA gets preempted and owns a lock that threadB, with higher
> priority, wants then threadA will get back on CPU, inherit the priority
> of the threadB and continue to run until it releases the lock.
> 
> If this is the per-CPU BH lock (which I want to remove) then it will
> continue until all softirqs complete.

So there's no way for a process to know on RT that someone with higher
prio is waiting for it to release its locks? :(
Sebastian Andrzej Siewior Jan. 18, 2024, 4:50 p.m. UTC | #13
On 2024-01-18 08:38:12 [-0800], Jakub Kicinski wrote:
> > If this is the per-CPU BH lock (which I want to remove) then it will
> > continue until all softirqs complete.
> 
> So there's no way for a process to know on RT that someone with higher
> prio is waiting for it to release its locks? :(

You could add a function to check if your current priority is inherited
from someone else and, if so, start dropping the locks you think are
responsible for it.
I made a PoC that appears to work for the timer_list timer, which is one
of the softirqs. This made me realise that I need it in more spots and
that I am doing it for the wrong reasons…

Sebastian
Toke Høiland-Jørgensen Jan. 20, 2024, 2:41 p.m. UTC | #14
Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
>> I do agree that conceptually it makes a lot of sense to encapsulate the
>> budget like this so drivers don't have to do all this state tracking
>> themselves. It does appear that drivers are doing different things with
>> the budget as it is today, though. For instance, the intel drivers seem
>> to divide the budget over all the enabled RX rings(?); so I'm wondering
>> if it'll be possible to unify drivers around a more opaque NAPI poll API?
>
> We can come up with APIs which would cater to multi-queue cases.
> Bigger question is what is the sensible polling strategy for those,
> just dividing the budget seems, hm, crude.

Right, agreed, though I don't have a good answer for what else to do off
the top of my head...

-Toke

Patch

diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
index 80ddaff759d47..18bce98fd2e31 100644
--- a/drivers/net/vmxnet3/vmxnet3_xdp.c
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -257,6 +257,7 @@  vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp,
 	u32 act;
 
 	rq->stats.xdp_packets++;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	page = virt_to_page(xdp->data_hard_start);
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c26d49bb78679 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -144,6 +144,7 @@  static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
 	int err;
 
 	list_for_each_entry_safe(skb, tmp, listp, list) {
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
 		switch (act) {
 		case XDP_PASS:
@@ -182,6 +183,7 @@  static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce10..db8f7eb35c6ca 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -293,6 +293,7 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -348,6 +349,9 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 out:
+	xdp_clear_return_frame_no_direct();
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
 	if (redirect)
 		xdp_do_flush();
 	if (nframes) {
@@ -356,7 +360,6 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			err = ret;
 	}
 
-	xdp_clear_return_frame_no_direct();
 	local_bh_enable();
 	return err;
 }
@@ -417,10 +420,12 @@  static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
-		if (xdp)
+		if (xdp) {
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			*retval = bpf_prog_run_xdp(prog, ctx);
-		else
+		} else {
 			*retval = bpf_prog_run(prog, ctx);
+		}
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a0f6da7b3ae5..5ba7509e88752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,7 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		*pt_prev = NULL;
 	}
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
 	tcx_set_ingress(skb, true);
 
@@ -4045,6 +4046,7 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	if (!entry)
 		return skb;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -5008,6 +5010,7 @@  int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c9653734fb60..72a7812f933a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4241,6 +4241,7 @@  static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
  */
 void xdp_do_flush(void)
 {
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	__dev_flush();
 	__cpu_map_flush();
 	__xsk_map_flush();
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..74b88e897a7e3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -44,6 +44,7 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +77,7 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
 	local_bh_enable();
 
 	return ret;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd1639863..da61b99bc558f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -23,6 +23,7 @@ 
 #include <linux/jhash.h>
 #include <linux/rculist.h>
 #include <linux/rhashtable.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -3925,6 +3926,7 @@  struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 
 	fl = rcu_dereference_bh(qe->filter_chain);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
 	case TC_ACT_SHOT:
 		qdisc_qstats_drop(sch);