| Message ID | 20231215171020.687342-16-bigeasy@linutronix.de (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | locking: Introduce nested-BH locking. |
Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-16-bigeasy%40linutronix.de
patch subject: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
config: x86_64-randconfig-122-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161103.iKeGoacH-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   net/sched/cls_api.c:391:22: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __be16 [usertype] protocol @@ got unsigned int [usertype] protocol @@
   net/sched/cls_api.c:391:22: sparse:     expected restricted __be16 [usertype] protocol
   net/sched/cls_api.c:391:22: sparse:     got unsigned int [usertype] protocol
   net/sched/cls_api.c:1862:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1924:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:2010:25: sparse: sparse: restricted __be16 degrades to integer
   net/sched/cls_api.c:2697:50: sparse: sparse: restricted __be16 degrades to integer
>> net/sched/cls_api.c:3933:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct local_lock_t [usertype] *l @@ got struct local_lock_t [noderef] __percpu * @@
   net/sched/cls_api.c:3933:38: sparse:     expected struct local_lock_t [usertype] *l
   net/sched/cls_api.c:3933:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/sched/cls_api.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
   include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *

vim +3933 net/sched/cls_api.c
  3921	
  3922	struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
  3923				  struct sk_buff **to_free, int *ret)
  3924	{
  3925		struct tcf_result cl_res;
  3926		struct tcf_proto *fl;
  3927	
  3928		if (!qe->info.block_index)
  3929			return skb;
  3930	
  3931		fl = rcu_dereference_bh(qe->filter_chain);
  3932	
> 3933		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
  3934		switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
  3935		case TC_ACT_SHOT:
  3936			qdisc_qstats_drop(sch);
  3937			__qdisc_drop(skb, to_free);
  3938			*ret = __NET_XMIT_BYPASS;
  3939			return NULL;
  3940		case TC_ACT_STOLEN:
  3941		case TC_ACT_QUEUED:
  3942		case TC_ACT_TRAP:
  3943			__qdisc_drop(skb, to_free);
  3944			*ret = __NET_XMIT_STOLEN;
  3945			return NULL;
  3946		case TC_ACT_REDIRECT:
  3947			skb_do_redirect(skb);
  3948			*ret = __NET_XMIT_STOLEN;
  3949			return NULL;
  3950		}
  3951	
  3952		return skb;
  3953	}
  3954	EXPORT_SYMBOL(tcf_qevent_handle);
  3955	
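The warning at line 3933 is an address-space mismatch: the guard definition stores a plain local_lock_t pointer, while &bpf_run_lock.redirect_lock is a per-CPU address. One plausible way to silence sparse would be to annotate the guard's type with __percpu via the DEFINE_GUARD() helper from <linux/cleanup.h>; the exact form below is a sketch and an assumption, not necessarily the fix that ended up in the series.

/* Sketch only: declare the guard so that it carries a __percpu pointer,
 * matching what local_lock_nested_bh()/local_unlock_nested_bh() expect.
 */
DEFINE_GUARD(local_lock_nested_bh, local_lock_t __percpu *,
	     local_lock_nested_bh(_T),
	     local_unlock_nested_bh(_T))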
On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > diff --git a/net/core/dev.c b/net/core/dev.c > index 5a0f6da7b3ae5..5ba7509e88752 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, > *pt_prev = NULL; > } > > + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); > qdisc_skb_cb(skb)->pkt_len = skb->len; > tcx_set_ingress(skb, true); > > @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > if (!entry) > return skb; > > + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); > /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was > * already set by the caller. > */ > @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb) > u32 act; > int err; > > + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); > act = netif_receive_generic_xdp(skb, &xdp, xdp_prog); > if (act != XDP_PASS) { > switch (act) { > diff --git a/net/core/filter.c b/net/core/filter.c > index 7c9653734fb60..72a7812f933a1 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = { > */ > void xdp_do_flush(void) > { > + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); > __dev_flush(); > __cpu_map_flush(); > __xsk_map_flush(); > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c > index a94943681e5aa..74b88e897a7e3 100644 > --- a/net/core/lwt_bpf.c > +++ b/net/core/lwt_bpf.c > @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, > * BPF prog and skb_do_redirect(). > */ > local_bh_disable(); > + local_lock_nested_bh(&bpf_run_lock.redirect_lock); > bpf_compute_data_pointers(skb); > ret = bpf_prog_run_save_cb(lwt->prog, skb); > > @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, > break; > } > > + local_unlock_nested_bh(&bpf_run_lock.redirect_lock); > local_bh_enable(); > > return ret; > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c > index 1976bd1639863..da61b99bc558f 100644 > --- a/net/sched/cls_api.c > +++ b/net/sched/cls_api.c > @@ -23,6 +23,7 @@ > #include <linux/jhash.h> > #include <linux/rculist.h> > #include <linux/rhashtable.h> > +#include <linux/bpf.h> > #include <net/net_namespace.h> > #include <net/sock.h> > #include <net/netlink.h> > @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru > > fl = rcu_dereference_bh(qe->filter_chain); > > + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); > switch (tcf_classify(skb, NULL, fl, &cl_res, false)) { > case TC_ACT_SHOT: > qdisc_qstats_drop(sch); Here and in all other places this patch adds locks that will kill performance of XDP, tcx and everything else in networking. I'm surprised Jesper and other folks are not jumping in with nacks. We measure performance in nanoseconds here. Extra lock is no go. Please find a different way without ruining performance.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote: >> diff --git a/net/core/dev.c b/net/core/dev.c >> index 5a0f6da7b3ae5..5ba7509e88752 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, >> *pt_prev = NULL; >> } >> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); >> qdisc_skb_cb(skb)->pkt_len = skb->len; >> tcx_set_ingress(skb, true); >> >> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) >> if (!entry) >> return skb; >> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); >> /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was >> * already set by the caller. >> */ >> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb) >> u32 act; >> int err; >> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); >> act = netif_receive_generic_xdp(skb, &xdp, xdp_prog); >> if (act != XDP_PASS) { >> switch (act) { >> diff --git a/net/core/filter.c b/net/core/filter.c >> index 7c9653734fb60..72a7812f933a1 100644 >> --- a/net/core/filter.c >> +++ b/net/core/filter.c >> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = { >> */ >> void xdp_do_flush(void) >> { >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); >> __dev_flush(); >> __cpu_map_flush(); >> __xsk_map_flush(); >> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c >> index a94943681e5aa..74b88e897a7e3 100644 >> --- a/net/core/lwt_bpf.c >> +++ b/net/core/lwt_bpf.c >> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, >> * BPF prog and skb_do_redirect(). >> */ >> local_bh_disable(); >> + local_lock_nested_bh(&bpf_run_lock.redirect_lock); >> bpf_compute_data_pointers(skb); >> ret = bpf_prog_run_save_cb(lwt->prog, skb); >> >> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, >> break; >> } >> >> + local_unlock_nested_bh(&bpf_run_lock.redirect_lock); >> local_bh_enable(); >> >> return ret; >> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c >> index 1976bd1639863..da61b99bc558f 100644 >> --- a/net/sched/cls_api.c >> +++ b/net/sched/cls_api.c >> @@ -23,6 +23,7 @@ >> #include <linux/jhash.h> >> #include <linux/rculist.h> >> #include <linux/rhashtable.h> >> +#include <linux/bpf.h> >> #include <net/net_namespace.h> >> #include <net/sock.h> >> #include <net/netlink.h> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru >> >> fl = rcu_dereference_bh(qe->filter_chain); >> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock); >> switch (tcf_classify(skb, NULL, fl, &cl_res, false)) { >> case TC_ACT_SHOT: >> qdisc_qstats_drop(sch); > > Here and in all other places this patch adds locks that > will kill performance of XDP, tcx and everything else in networking. > > I'm surprised Jesper and other folks are not jumping in with nacks. > We measure performance in nanoseconds here. > Extra lock is no go. > Please find a different way without ruining performance. I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do believe there are people who are using XDP on PREEMPT_RT kernels and still expect decent performance. And to achieve that it is absolutely imperative that we can amortise expensive operations (such as locking) over multiple packets. 
I realise there's a fundamental trade-off between the amount of
amortisation and the latency hit that we take from holding locks for
longer, but tuning the batch size (while still keeping some amount of
batching) may be a way forward? I suppose Jakub's suggestion in the
other part of the thread, of putting the locks around napi->poll(), is a
step towards something like this.

-Toke
On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >>
> >>  	fl = rcu_dereference_bh(qe->filter_chain);
> >>
> >> +	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >>  	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> >>  	case TC_ACT_SHOT:
> >>  		qdisc_qstats_drop(sch);
> >
> > Here and in all other places this patch adds locks that
> > will kill performance of XDP, tcx and everything else in networking.
> >
> > I'm surprised Jesper and other folks are not jumping in with nacks.
> > We measure performance in nanoseconds here.
> > Extra lock is no go.
> > Please find a different way without ruining performance.
>
> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
> believe there are people who are using XDP on PREEMPT_RT kernels and
> still expect decent performance. And to achieve that it is absolutely
> imperative that we can amortise expensive operations (such as locking)
> over multiple packets.
>
> I realise there's a fundamental trade-off between the amount of
> amortisation and the latency hit that we take from holding locks for
> longer, but tuning the batch size (while still keeping some amount of
> batching) may be a way forward? I suppose Jakub's suggestion in the
> other part of the thread, of putting the locks around napi->poll(), is a
> step towards something like this.

The RT requirements are usually different. Networking as in CAN might be
important, but Ethernet could be used only for remote communication and
so be "not" important. People complained that they need to wait for
Ethernet to be done before the CAN packet can be injected into the stack.
With that expectation you would like to pause Ethernet immediately and
switch over to the CAN interrupt thread.

But if someone managed to set up XDP then it is likely to be important.
With RT traffic it is usually not the throughput that matters but the
latency. You are likely in the position to receive a packet, say every
1ms, and need to respond immediately. XDP would be used to inspect the
packet and either hand it over to the stack or process it.

I expected the lock operation (under RT) to always succeed and not
cause any delay because it should not be contended.
It should only block if something with higher priority preempted the
current interrupt thread _and_ also happens to use XDP on the same CPU.
In that case (XDP is needed) it would flush the current user out of
the locked section before the higher-prio thread could take over.
Doing bulk and allowing the low-priority thread to complete would
delay the high-priority thread. Maybe I am too pessimistic here and
having two XDP programs on one CPU is unlikely to happen.

Adding the lock on a per-NAPI basis would allow packets to be batched.
Acquiring the lock only if XDP is supported would not block the CAN
drivers since they don't support XDP. But that sounds like a hack.

Daniel said netkit doesn't need this locking because it does not support
this redirect, and it made me think. Would it work to make the redirect
structures part of the bpf_prog structure instead of per-CPU? My
understanding is that an eBPF program's data structures are part of it
and contain locking, allowing one eBPF program to preempt another one.
Having the redirect structures be part of the program would obsolete the
locking. Do I miss anything?

> -Toke

Sebastian
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >>  	fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> +	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >>  	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >>  	case TC_ACT_SHOT:
>> >>  		qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>>
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>>
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important, but Ethernet could be used only for remote communication and
> so be "not" important. People complained that they need to wait for
> Ethernet to be done before the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over to the CAN interrupt thread.
>
> But if someone managed to set up XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.

I am not contesting that latency is important, but it's a pretty
fundamental trade-off and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; and there
are definitely people who want to use XDP on an RT kernel and still
achieve high PPS rates.

(Whether those people really strictly speaking need to be running an RT
kernel is maybe debatable, but it does happen).

> I expected the lock operation (under RT) to always succeed and not
> cause any delay because it should not be contended.

A lock does cause delay even when it's not contended. Bear in mind that
at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
packet (for 64-byte packets). So just the atomic op to figure out
whether there's any contention (around 10ns on the Intel processors I
usually test on) will blow a huge chunk of the total processing budget.
We can't actually do the full processing needed in those 64 nanoseconds
(not to mention the 6.4 nanoseconds we have available at 100Gbps), which
is why it's essential to amortise as much as we can over multiple
packets.

This is all back-of-the-envelope calculations, of course.
Having some actual numbers to look at would be great; I don't suppose
you have a setup where you can run xdp-bench and see how your patches
affect the throughput?

> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happens to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk and allowing the low-priority thread to complete would
> delay the high-priority thread. Maybe I am too pessimistic here and
> having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on a per-NAPI basis would allow packets to be batched.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers since they don't support XDP. But that sounds like a hack.

I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing in two stages, each
with their individual batching. So whereas right now we're doing
something like:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      run_netstack(pkt) // this is the expensive bit
  bh_enable()

We could instead do:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  bh_enable()
  // sched point
  bh_disable()
  for pkt in to_stack_list:
    run_netstack(pkt)
  bh_enable()

This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
being that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So not really
sure if this approach is feasible, TBH.

> Daniel said netkit doesn't need this locking because it does not support
> this redirect, and it made me think. Would it work to make the redirect
> structures part of the bpf_prog structure instead of per-CPU? My
> understanding is that an eBPF program's data structures are part of it
> and contain locking, allowing one eBPF program to preempt another one.
> Having the redirect structures be part of the program would obsolete the
> locking. Do I miss anything?

This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why this is all being kept in per-CPU variables today.

-Toke
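For reference, the per-packet budgets quoted above follow directly from line rate and minimum frame size. The small, self-contained userspace sketch below does the arithmetic for 64-byte frames plus the 20 bytes of preamble and inter-frame gap on the wire; it yields roughly 67 ns at 10 Gbps (about 14.88 Mpps) and roughly 6.7 ns at 100 Gbps, which the thread rounds to 64 ns and 6.4 ns.

/* Back-of-the-envelope per-packet budget (illustration only). */
#include <stdio.h>

int main(void)
{
	const double wire_bytes = 64 + 20;	/* min frame + preamble/IFG */
	const double rates_gbps[] = { 10.0, 100.0 };

	for (int i = 0; i < 2; i++) {
		double pps = rates_gbps[i] * 1e9 / (wire_bytes * 8);
		printf("%5.0f Gbps: %6.2f Mpps -> %5.1f ns per packet\n",
		       rates_gbps[i], pps / 1e6, 1e9 / pps);
	}
	return 0;
}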
On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
> I am not contesting that latency is important, but it's a pretty
> fundamental trade-off and we don't want to kill throughput entirely
> either. Especially since this is global to the whole kernel; and there
> are definitely people who want to use XDP on an RT kernel and still
> achieve high PPS rates.
>
> (Whether those people really strictly speaking need to be running an RT
> kernel is maybe debatable, but it does happen).
>
> > I expected the lock operation (under RT) to always succeed and not
> > cause any delay because it should not be contended.
>
> A lock does cause delay even when it's not contended. Bear in mind that
> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
> packet (for 64-byte packets). So just the atomic op to figure out
> whether there's any contention (around 10ns on the Intel processors I
> usually test on) will blow a huge chunk of the total processing budget.
> We can't actually do the full processing needed in those 64 nanoseconds
> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
> is why it's essential to amortise as much as we can over multiple
> packets.
>
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

A potentially stupid idea which I have been turning in my head is
how we could get away from having the driver handle details of NAPI
budgeting. It's a source of bugs and endless review comments.

All drivers end up maintaining a counter of "how many packets have
I processed" and comparing that against the budget. Would it be crazy
if we put that inside napi_struct? Add a "budget" member inside
napi_struct as well, and:

struct napi_struct {
	...
	// poll state
	unsigned int budget;
	unsigned int rx_used;
	...
}

static inline bool napi_rx_has_budget(napi)
{
	return napi->budget > napi->rx_used;
}

poll(napi) // no budget
{
	while (napi_rx_has_budget(napi)) {
		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
		// maybe add explicit napi_rx_count() if
		// driver did something funny with the frame.
	}
}

We can also create napi_tx_has_budget() so that people stop being
confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
so that people stop completing XDP in hard irq context (budget==0)...

And we can pass napi into napi_consume_skb(), instead of, presumably
inexplicably to a newcomer, passing in budget.
And napi_complete_done() can lose the work_done argument, too.

Oh, and I'm bringing it up here, because CONFIG_RT can throw
in "need_resched()" into the napi_rx_has_budget(), obviously.
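A minimal sketch of what folding that RT check into the helper could look like, assuming the hypothetical budget/rx_used members proposed above (these fields do not exist in today's napi_struct); Sebastian's follow-up below explains why need_resched() is not meaningful in this context on PREEMPT_RT.

/* Sketch only: napi->budget and napi->rx_used are the hypothetical
 * members from the proposal above, not current kernel API.
 */
static inline bool napi_rx_has_budget(struct napi_struct *napi)
{
	/* On PREEMPT_RT, bail out of the poll loop early if someone
	 * else wants the CPU, instead of burning the whole budget.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && need_resched())
		return false;

	return napi->budget > napi->rx_used;
}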
On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

No but I probably could set it up.

> I chatted with Jesper about this, and he had an idea not too far from
> this: split up the XDP and regular stack processing in two stages, each
> with their individual batching. So whereas right now we're doing
> something like:
>
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       run_netstack(pkt) // this is the expensive bit
>   bh_enable()
>
> We could instead do:
>
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       add_to_list(pkt, to_stack_list)
>   bh_enable()
>   // sched point
>   bh_disable()
>   for pkt in to_stack_list:
>     run_netstack(pkt)
>   bh_enable()
>
> This would limit the batching that blocks everything to only the XDP
> processing itself, which should limit the maximum time spent in the
> blocking state significantly compared to what we have today. The caveat
> being that rearranging things like this is potentially a pretty major
> refactoring task that needs to touch all the drivers (even if some of
> the logic can be moved into the core code in the process). So not really
> sure if this approach is feasible, TBH.

This does not work because bh_disable() does not disable scheduling.
Scheduling may happen. bh_disable() acquires a lock which is currently
the only synchronisation point between, say, two network drivers doing
NAPI. And this is what I want to get rid of.
Regarding the expensive bit as in XDP_PASS: this doesn't need locking as
per the proposal, just the REDIRECT piece.

> > Daniel said netkit doesn't need this locking because it does not support
> > this redirect, and it made me think. Would it work to make the redirect
> > structures part of the bpf_prog structure instead of per-CPU? My
> > understanding is that an eBPF program's data structures are part of it
> > and contain locking, allowing one eBPF program to preempt another one.
> > Having the redirect structures be part of the program would obsolete the
> > locking. Do I miss anything?
>
> This won't work, unfortunately: the same XDP program can be attached to
> multiple interfaces simultaneously, and for hardware with multiple
> receive queues (which is most of the hardware that supports XDP), it can
> even run simultaneously on multiple CPUs on the same interface. This is
> the reason why this is all being kept in per-CPU variables today.

So I started hacking on this yesterday and noticed that you can run
multiple bpf programs. This is how I learned that it won't work.
My plan B is now to move it into task_struct.

> -Toke

Sebastian
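A rough sketch of the "plan B" mentioned above: keep the redirect state per task instead of per CPU, so that a preempting NAPI thread on the same CPU gets its own copy and no nested-BH lock is needed. All struct, field, and helper names below are illustrative assumptions, not an existing kernel interface.

/* Sketch only: hypothetical per-task redirect state replacing the
 * per-CPU variable; the real bpf_redirect_info layout differs.
 */
struct bpf_xdp_redirect_state {
	u32 flags;
	u32 tgt_index;
	void *tgt_value;
	/* ... */
};

/* Assume task_struct grows a member:
 *	struct bpf_xdp_redirect_state xdp_redirect;
 */
static inline struct bpf_xdp_redirect_state *xdp_redirect_state(void)
{
	/* The state follows the task across preemption and migration,
	 * so other NAPI threads never see it and no lock is required.
	 */
	return &current->xdp_redirect;	/* assumed field, see above */
}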
On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> Oh, and I'm bringing it up here, because CONFIG_RT can throw
> in "need_resched()" into the napi_rx_has_budget(), obviously.

need_resched() does not work on PREEMPT_RT the way you think. This
context (the NAPI poll callback) is preemptible and (by default) runs at
SCHED_FIFO 50 (within a threaded IRQ), so a context switch can be
triggered at any time by a task with higher priority.
If threadA gets preempted and owns a lock that threadB, with higher
priority, wants, then threadA will get back on the CPU, inherit the
priority of threadB and continue to run until it releases the lock.

If this is the per-CPU BH lock (which I want to remove) then it will
continue until all softirqs complete.

Sebastian
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>>
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>>
>> > I expected the lock operation (under RT) to always succeed and not
>> > cause any delay because it should not be contended.
>>
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>>
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning in my head is
> how we could get away from having the driver handle details of NAPI
> budgeting. It's a source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> 	...
> 	// poll state
> 	unsigned int budget;
> 	unsigned int rx_used;
> 	...
> }
>
> static inline bool napi_rx_has_budget(napi)
> {
> 	return napi->budget > napi->rx_used;
> }
>
> poll(napi) // no budget
> {
> 	while (napi_rx_has_budget(napi)) {
> 		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> 		// maybe add explicit napi_rx_count() if
> 		// driver did something funny with the frame.
> 	}
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
> so that people stop completing XDP in hard irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb(), instead of, presumably
> inexplicably to a newcomer, passing in budget.
> And napi_complete_done() can lose the work_done argument, too.

I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll API?

-Toke
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> No but I probably could set it up.

That would be great! Feel free to ping me if you need any pointers to
how we usually do the perf measurements :)

>> I chatted with Jesper about this, and he had an idea not too far from
>> this: split up the XDP and regular stack processing in two stages, each
>> with their individual batching. So whereas right now we're doing
>> something like:
>>
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       run_netstack(pkt) // this is the expensive bit
>>   bh_enable()
>>
>> We could instead do:
>>
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       add_to_list(pkt, to_stack_list)
>>   bh_enable()
>>   // sched point
>>   bh_disable()
>>   for pkt in to_stack_list:
>>     run_netstack(pkt)
>>   bh_enable()
>>
>> This would limit the batching that blocks everything to only the XDP
>> processing itself, which should limit the maximum time spent in the
>> blocking state significantly compared to what we have today. The caveat
>> being that rearranging things like this is potentially a pretty major
>> refactoring task that needs to touch all the drivers (even if some of
>> the logic can be moved into the core code in the process). So not really
>> sure if this approach is feasible, TBH.
>
> This does not work because bh_disable() does not disable scheduling.
> Scheduling may happen. bh_disable() acquires a lock which is currently
> the only synchronisation point between, say, two network drivers doing
> NAPI. And this is what I want to get rid of.
> Regarding the expensive bit as in XDP_PASS: this doesn't need locking as
> per the proposal, just the REDIRECT piece.

Right, well s/bh_disable()/lock()/; my main point was splitting up the
processing so that the XDP processing itself and the stack activation on
XDP_PASS is not interleaved. This will make it possible to hold the lock
around the whole XDP batch, not just individual packets, and so retain
the performance we gain from amortising expensive operations over
multiple packets.

-Toke
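To make the amortisation point concrete, here is a sketch contrasting per-packet locking with taking the lock once around the whole XDP batch. It reuses the illustrative run_xdp()/run_netstack()/add_to_list() helpers and packet array from the pseudocode earlier in the thread; only the lock calls are from the patch itself.

/* Per-packet locking: one lock/unlock pair per packet, paid up to
 * "budget" times per NAPI poll.
 */
for (i = 0; i < budget; i++) {
	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
	act = run_xdp(pkt[i]);
	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
	if (act == XDP_PASS)
		run_netstack(pkt[i]);
}

/* Batched locking: one lock/unlock pair per NAPI poll; XDP_PASS packets
 * are queued and handed to the stack outside the locked section.
 */
local_lock_nested_bh(&bpf_run_lock.redirect_lock);
for (i = 0; i < budget; i++) {
	act = run_xdp(pkt[i]);
	if (act == XDP_PASS)
		add_to_list(pkt[i], &to_stack_list);
}
local_unlock_nested_bh(&bpf_run_lock.redirect_lock);

list_for_each_entry(pkt, &to_stack_list, list)	/* outside the lock */
	run_netstack(pkt);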
On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
> I do agree that conceptually it makes a lot of sense to encapsulate the
> budget like this so drivers don't have to do all this state tracking
> themselves. It does appear that drivers are doing different things with
> the budget as it is today, though. For instance, the intel drivers seem
> to divide the budget over all the enabled RX rings(?); so I'm wondering
> if it'll be possible to unify drivers around a more opaque NAPI poll API?

We can come up with APIs which would cater to multi-queue cases.
Bigger question is what is the sensible polling strategy for those,
just dividing the budget seems, hm, crude.
On Thu, 18 Jan 2024 09:27:54 +0100 Sebastian Andrzej Siewior wrote:
> On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> > Oh, and I'm bringing it up here, because CONFIG_RT can throw
> > in "need_resched()" into the napi_rx_has_budget(), obviously.
>
> need_resched() does not work on PREEMPT_RT the way you think. This
> context (the NAPI poll callback) is preemptible and (by default) runs at
> SCHED_FIFO 50 (within a threaded IRQ), so a context switch can be
> triggered at any time by a task with higher priority.
> If threadA gets preempted and owns a lock that threadB, with higher
> priority, wants, then threadA will get back on the CPU, inherit the
> priority of threadB and continue to run until it releases the lock.
>
> If this is the per-CPU BH lock (which I want to remove) then it will
> continue until all softirqs complete.

So there's no way for a process to know on RT that someone with higher
prio is waiting for it to release its locks? :(
On 2024-01-18 08:38:12 [-0800], Jakub Kicinski wrote:
> > If this is the per-CPU BH lock (which I want to remove) then it will
> > continue until all softirqs complete.
>
> So there's no way for a process to know on RT that someone with higher
> prio is waiting for it to release its locks? :(

You could add a function to check whether your current priority is
inherited from someone else and, if so, start dropping the locks you
think are responsible for it. I made a PoC that appears to work for
timer_list timers, which is one of the softirqs. This made me realise
that I need it in more spots and that I am doing it for the wrong
reasons…

Sebastian
Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
>> I do agree that conceptually it makes a lot of sense to encapsulate the
>> budget like this so drivers don't have to do all this state tracking
>> themselves. It does appear that drivers are doing different things with
>> the budget as it is today, though. For instance, the intel drivers seem
>> to divide the budget over all the enabled RX rings(?); so I'm wondering
>> if it'll be possible to unify drivers around a more opaque NAPI poll API?
>
> We can come up with APIs which would cater to multi-queue cases.
> Bigger question is what is the sensible polling strategy for those,
> just dividing the budget seems, hm, crude.

Right, agreed, though I don't have a good answer for what else to do off
the top of my head...

-Toke
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
index 80ddaff759d47..18bce98fd2e31 100644
--- a/drivers/net/vmxnet3/vmxnet3_xdp.c
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -257,6 +257,7 @@ vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp,
 	u32 act;
 
 	rq->stats.xdp_packets++;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	page = virt_to_page(xdp->data_hard_start);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c26d49bb78679 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -144,6 +144,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
 	int err;
 
 	list_for_each_entry_safe(skb, tmp, listp, list) {
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
 		switch (act) {
 		case XDP_PASS:
@@ -182,6 +183,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce10..db8f7eb35c6ca 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -348,6 +349,9 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 out:
+	xdp_clear_return_frame_no_direct();
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
 	if (redirect)
 		xdp_do_flush();
 	if (nframes) {
@@ -356,7 +360,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			err = ret;
 	}
 
-	xdp_clear_return_frame_no_direct();
 	local_bh_enable();
 	return err;
 }
@@ -417,10 +420,12 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
-		if (xdp)
+		if (xdp) {
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			*retval = bpf_prog_run_xdp(prog, ctx);
-		else
+		} else {
 			*retval = bpf_prog_run(prog, ctx);
+		}
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a0f6da7b3ae5..5ba7509e88752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		*pt_prev = NULL;
 	}
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
 	tcx_set_ingress(skb, true);
 
@@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	if (!entry)
 		return skb;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c9653734fb60..72a7812f933a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
  */
 void xdp_do_flush(void)
 {
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	__dev_flush();
 	__cpu_map_flush();
 	__xsk_map_flush();
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..74b88e897a7e3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
 	local_bh_enable();
 
 	return ret;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd1639863..da61b99bc558f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -23,6 +23,7 @@
 #include <linux/jhash.h>
 #include <linux/rculist.h>
 #include <linux/rhashtable.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 
 	fl = rcu_dereference_bh(qe->filter_chain);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
 	case TC_ACT_SHOT:
 		qdisc_qstats_drop(sch);
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Ronak Doshi <doshir@vmware.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/vmxnet3/vmxnet3_xdp.c |  1 +
 kernel/bpf/cpumap.c               |  2 ++
 net/bpf/test_run.c                | 11 ++++++++---
 net/core/dev.c                    |  3 +++
 net/core/filter.c                 |  1 +
 net/core/lwt_bpf.c                |  2 ++
 net/sched/cls_api.c               |  2 ++
 7 files changed, 19 insertions(+), 3 deletions(-)