Message ID | 20210423020019.2333192-3-liuhangbin@gmail.com (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | xdp: extend xdp_redirect_map with broadcast support
On Fri, 23 Apr 2021 10:00:17 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote:

> This patch adds two flags, BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS, to
> extend xdp_redirect_map for broadcast support.
>
> With BPF_F_BROADCAST the packet is broadcast to all the interfaces in
> the map. With BPF_F_EXCLUDE_INGRESS the ingress interface is excluded
> when broadcasting.
>
> When getting the devices in the dev hash map via dev_map_hash_get_next_key(),
> there is a possibility that we fall back to the first key when a device
> was removed. This would duplicate packets on some interfaces. So just walk
> all the buckets to avoid this issue. For the dev array map, we also walk
> the whole map to find valid interfaces.
>
> Function bpf_clear_redirect_map() was removed in
> commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> Add it back as we need to use ri->map again.
>
> Here is the performance result using a 10Gb i40e NIC, doing XDP_DROP on
> the veth peer, running xdp_redirect_{map, map_multi} in samples/bpf and
> sending pkts via pktgen cmd:
> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

While running:
$ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
Get interfaces 7 7
libbpf: elf: skipping unrecognized data section(23) .eh_frame
libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
Forwarding 10140845 pkt/s
Forwarding 11767042 pkt/s
Forwarding 11783437 pkt/s
Forwarding 11767331 pkt/s

When starting: sudo ./xdp_monitor --stats

System crashed with:

[ 5509.997837] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 5510.004793] #PF: supervisor read access in kernel mode
[ 5510.009929] #PF: error_code(0x0000) - not-present page
[ 5510.015060] PGD 0 P4D 0
[ 5510.017591] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5510.021769] CPU: 3 PID: 29 Comm: ksoftirqd/3 Not tainted 5.12.0-rc7-net-next-hangbin-v10+ #602
[ 5510.030368] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[ 5510.037835] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
[ 5510.043929] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
[ 5510.062668] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
[ 5510.067884] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
[ 5510.075007] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
[ 5510.082133] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
[ 5510.089256] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
[ 5510.096382] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
[ 5510.103505] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
[ 5510.111584] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5510.117319] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
[ 5510.124444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5510.131567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5510.138692] Call Trace:
[ 5510.141141]  xdp_do_redirect+0x16b/0x230
[ 5510.145064]  i40e_clean_rx_irq+0x62e/0x9a0 [i40e]
[ 5510.149779]  i40e_napi_poll+0xf0/0x410 [i40e]
[ 5510.154135]  __napi_poll+0x2a/0x140
[ 5510.157620]  net_rx_action+0x215/0x2d0
[ 5510.161364]  __do_softirq+0xe3/0x2df
[ 5510.164938]  run_ksoftirqd+0x1a/0x20
[ 5510.168514]  smpboot_thread_fn+0xee/0x1e0
[ 5510.172519]  ? sort_range+0x20/0x20
[ 5510.176003]  kthread+0x116/0x150
[ 5510.179237]  ? kthread_park+0x90/0x90
[ 5510.182893]  ret_from_fork+0x22/0x30
[ 5510.186474] Modules linked in: algif_hash af_alg bpf_preload fuse veth nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc rpcrdma sunrpc rdma_ucm ib_umad rdma_cm ib_ipoib coretemp iw_cm kvm_intel ib_cm kvm mlx5_ib i40iw irqbypass ib_uverbs rapl intel_cstate i2c_i801 intel_uncore ib_core pcspkr bfq i2c_smbus acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel mlx5_core mlxfw i40e ixgbe igc psample mdio igb sd_mod ptp t10_pi nfp i2c_algo_bit i2c_core pps_core hid_generic [last unloaded: bpfilter]
[ 5510.231847] CR2: 0000000000000000
[ 5510.235164] ---[ end trace b8e076677c53b5e8 ]---
[ 5510.241762] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
[ 5510.247851] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
[ 5510.266590] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
[ 5510.271804] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
[ 5510.278931] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
[ 5510.286053] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
[ 5510.293178] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
[ 5510.300301] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
[ 5510.307426] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
[ 5510.315503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5510.321241] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
[ 5510.328364] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5510.335490] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5510.342612] Kernel panic - not syncing: Fatal exception in interrupt
[ 5510.348994] Kernel Offset: disabled
[ 5510.354469] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

[net-next]$ ./scripts/faddr2line vmlinux xdp_do_redirect+0x16b
xdp_do_redirect+0x16b/0x230:
trace_xdp_redirect at include/trace/events/xdp.h:136
(inlined by) trace_xdp_redirect at include/trace/events/xdp.h:136
(inlined by) xdp_do_redirect at net/core/filter.c:3996

Decode: perf_trace_xdp_redirect_template+0xba
./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
perf_trace_xdp_redirect_template+0xba/0x130:
perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)

less -N net/core/filter.c
[...]
   3993 		if (unlikely(err))
   3994 			goto err;
   3995
-> 3996 		_trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
   3997 		return 0;
   3998 	err:
   3999 		_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
   4000 		return err;
   4001 	}
   4002 	EXPORT_SYMBOL_GPL(xdp_do_redirect);

> ---
>  include/linux/bpf.h            |  20 ++++
>  include/linux/filter.h         |  18 +++-
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  17 +++-
>  kernel/bpf/cpumap.c            |   3 +-
>  kernel/bpf/devmap.c            | 181 ++++++++++++++++++++++++++++++++-
>  net/core/filter.c              |  37 ++++++-
>  net/core/xdp.c                 |  29 ++++++
>  net/xdp/xskmap.c               |   3 +-
>  tools/include/uapi/linux/bpf.h |  17 +++-
>  10 files changed, 312 insertions(+), 14 deletions(-)
>
> [...]
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index cae56d08a670..05ba5ab4345f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3926,6 +3926,23 @@ void xdp_do_flush(void)
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_flush);
>  
> +void bpf_clear_redirect_map(struct bpf_map *map)
> +{
> +	struct bpf_redirect_info *ri;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> +		/* Avoid polluting remote cacheline due to writes if
> +		 * not needed. Once we pass this test, we need the
> +		 * cmpxchg() to make sure it hasn't been changed in
> +		 * the meantime by remote CPU.
> +		 */
> +		if (unlikely(READ_ONCE(ri->map) == map))
> +			cmpxchg(&ri->map, map, NULL);
> +	}
> +}
> +
>  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  	enum bpf_map_type map_type = ri->map_type;
>  	void *fwd = ri->tgt_value;
>  	u32 map_id = ri->map_id;
> +	struct bpf_map *map;
>  	int err;
>  
>  	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> @@ -3942,7 +3960,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  	case BPF_MAP_TYPE_DEVMAP:
>  		fallthrough;
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		err = dev_map_enqueue(fwd, xdp, dev);
> +		map = READ_ONCE(ri->map);
> +		if (map) {
> +			WRITE_ONCE(ri->map, NULL);
> +			err = dev_map_enqueue_multi(xdp, dev, map,
> +						    ri->flags & BPF_F_EXCLUDE_INGRESS);
> +		} else {
> +			err = dev_map_enqueue(fwd, xdp, dev);
> +		}
>  		break;
>  	case BPF_MAP_TYPE_CPUMAP:
>  		err = cpu_map_enqueue(fwd, xdp, dev);
> @@ -3984,13 +4009,21 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       enum bpf_map_type map_type, u32 map_id)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_map *map;
>  	int err;
>  
>  	switch (map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  		fallthrough;
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> +		map = READ_ONCE(ri->map);
> +		if (map) {
> +			WRITE_ONCE(ri->map, NULL);
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ri->flags & BPF_F_EXCLUDE_INGRESS);
> +		} else {
> +			err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> +		}
>  		if (unlikely(err))
>  			goto err;
>  		break;
On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 23 Apr 2021 10:00:17 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> [...]
>
> While running:
> $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
> Get interfaces 7 7
> libbpf: elf: skipping unrecognized data section(23) .eh_frame
> libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
> Forwarding 10140845 pkt/s
> Forwarding 11767042 pkt/s
> Forwarding 11783437 pkt/s
> Forwarding 11767331 pkt/s
>
> When starting: sudo ./xdp_monitor --stats

That seems to be the same issue I reported previously in our meeting:
https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4

I only saw it 3 times and can't reproduce it easily.
Do you have any idea where the root cause is?

Thanks
Hangbin
On Mon, Apr 26, 2021 at 06:47:17PM +0800, Hangbin Liu wrote:
> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > When starting: sudo ./xdp_monitor --stats
>
> That seems to be the same issue I reported previously in our meeting:
> https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
>
> I only saw it 3 times and can't reproduce it easily.
>
> Do you have any idea where the root cause is?

OK, I just re-did the test and could reproduce it now. Maybe it's because
the code changed that it's easier to reproduce now. I will check this issue.

Thanks
Hangbin
On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> Decode: perf_trace_xdp_redirect_template+0xba
> ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> perf_trace_xdp_redirect_template+0xba/0x130:
> perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
>
> less -N net/core/filter.c
> [...]
>    3993 		if (unlikely(err))
>    3994 			goto err;
>    3995
> -> 3996 		_trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);

Oh, the fwd in the xdp_redirect_map broadcast case is NULL...

I will see how to fix it. Maybe assign the ingress interface to fwd?

Hangbin

>    3997 		return 0;
>    3998 	err:
>    3999 		_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
>    4000 		return err;
>    4001 	}
>    4002 	EXPORT_SYMBOL_GPL(xdp_do_redirect);
On Mon, 26 Apr 2021 18:47:04 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > When starting: sudo ./xdp_monitor --stats
>
> That seems to be the same issue I reported previously in our meeting:
> https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
>
> I only saw it 3 times and can't reproduce it easily.
>
> Do you have any idea where the root cause is?

All the information you need to find the root cause is listed below. I
have even decoded where in the code it happens, and also included the
code with line numbering, pointing to the line the crash happens on. I
don't think it is possible for me to be more specific or to help further.

> > System crashed with:
> >
> > [ 5509.997837] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [...]
> >
> > [net-next]$ ./scripts/faddr2line vmlinux xdp_do_redirect+0x16b
> > xdp_do_redirect+0x16b/0x230:
> > trace_xdp_redirect at include/trace/events/xdp.h:136
> > (inlined by) xdp_do_redirect at net/core/filter.c:3996
> > [...]
On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote:
> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> > Decode: perf_trace_xdp_redirect_template+0xba
> > ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> > perf_trace_xdp_redirect_template+0xba/0x130:
> > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> >
> > less -N net/core/filter.c
> > [...]
> > 3993   if (unlikely(err))
> > 3994       goto err;
> > 3995
> > -> 3996    _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
>
> Oh, the fwd in xdp xdp_redirect_map broadcast is NULL...
>
> I will see how to fix it. Maybe assign the ingress interface to fwd?

Er, sorry for the flood of messages. I just checked the trace point code; fwd
in the xdp trace event means the to_ifindex. So we can't assign the ingress
interface to fwd.

In the xdp_redirect_map broadcast case, there is no specific to_ifindex.
So how about just ignoring it... e.g.

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index fcad3645a70b..1751da079330 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -110,7 +110,8 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 		u32 ifindex = 0, map_index = index;
 
 		if (map_type == BPF_MAP_TYPE_DEVMAP || map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-			ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
+			if (tgt)
+				ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
 		} else if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
 			ifindex = index;
 			map_index = 0;

Hangbin
On Mon, Apr 26, 2021 at 01:41:05PM +0200, Jesper Dangaard Brouer wrote:
> > > While running:
> > >  $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
> > >  Get interfaces 7 7
> > >  libbpf: elf: skipping unrecognized data section(23) .eh_frame
> > >  libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
> > >  Forwarding 10140845 pkt/s
> > >  Forwarding 11767042 pkt/s
> > >  Forwarding 11783437 pkt/s
> > >  Forwarding 11767331 pkt/s
> > >
> > > When starting: sudo ./xdp_monitor --stats
> >
> > That seems the same issue I reported previously in our meeting.
> > https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
> >
> > I only saw it 3 times and can't reproduce it easily.
> >
> > Do you have any idea where is the root cause?
>
> All the information you need to find the root-cause is listed below.
> I have even decoded where in the code it happens, and also include the
> code with line-numbering and pointed to the line the crash happens in,
> I don't think it is possible for me to be more specific and help further.

Thanks, I mixed this issue up with the one I got previously, which I
haven't figured out yet. For this one, I have sent a proposal in another
reply (fixing it in the trace point event). Would you please help review
it?

Hangbin
Hangbin Liu <liuhangbin@gmail.com> writes: > On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote: >> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote: >> > Decode: perf_trace_xdp_redirect_template+0xba >> > ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba >> > perf_trace_xdp_redirect_template+0xba/0x130: >> > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13) >> > >> > less -N net/core/filter.c >> > [...] >> > 3993 if (unlikely(err)) >> > 3994 goto err; >> > 3995 >> > -> 3996 _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index); >> >> Oh, the fwd in xdp xdp_redirect_map broadcast is NULL... >> >> I will see how to fix it. Maybe assign the ingress interface to fwd? > > Er, sorry for the flood message. I just checked the trace point code, fwd > in xdp trace event means to_ifindex. So we can't assign the ingress interface > to fwd. > > In xdp_redirect_map broadcast case, there is no specific to_ifindex. > So how about just ignore it... e.g. Yeah, just leaving the ifindex as 0 when tgt is unset is fine :) -Toke
On Mon, 26 Apr 2021 19:47:42 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote:
> > On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> > > Decode: perf_trace_xdp_redirect_template+0xba
> > > ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> > > perf_trace_xdp_redirect_template+0xba/0x130:
> > > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> > >
> > > less -N net/core/filter.c
> > > [...]
> > > 3993   if (unlikely(err))
> > > 3994       goto err;
> > > 3995
> > > -> 3996    _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
> >
> > Oh, the fwd in xdp xdp_redirect_map broadcast is NULL...
> >
> > I will see how to fix it. Maybe assign the ingress interface to fwd?
>
> Er, sorry for the flood message. I just checked the trace point code, fwd
> in xdp trace event means to_ifindex. So we can't assign the ingress interface
> to fwd.
>
> In xdp_redirect_map broadcast case, there is no specific to_ifindex.
> So how about just ignore it... e.g.

Yes, the code below makes sense, and I can confirm that it solves the
crash (I tested it).

IMHO leaving ifindex=0 is okay, because it is not a valid ifindex,
meaning a caller of the tracepoint can deduce (together with the map
types) that this must be a broadcast.

Thank you, Hangbin, for continuing to work on this patchset. I know it
has been a long, long road. I truly appreciate your perseverance and
patience with this patchset.

With this crash fixed, I actually think we are very close to having
something we can merge. With the unlikely() I'm fine with the code
itself. I think we need to update the patch description, but I've asked
Toke to help with this. The performance measurements in the patch
description are not measuring what I expected, but something else.
To avoid redoing a lot of testing, I think we can just describe what the
'redirect_map-multi i40e->i40e' test is doing: as the broadcast feature
filters out the ingress port, the 'i40e->i40e' test out the same interface
will just drop the xdp_frame (after walking the devmap and finding only
empty ports). Or maybe it is not the same interface(?). In any case this
needs to be made clearer.

I think it would be valuable to show (in the commit message) some tests
that demonstrate the overhead of packet cloning. I expect the overhead
of page-alloc+memcpy to be significant, but Lorenzo has a number of
ideas on how to speed this up.

Maybe you can simply broadcast-redirect into multiple veth devices (with
XDP_DROP in the peer dev) to demonstrate the effect and overhead of the
cloning process.

> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index fcad3645a70b..1751da079330 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -110,7 +110,8 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
>  		u32 ifindex = 0, map_index = index;
> 
>  		if (map_type == BPF_MAP_TYPE_DEVMAP || map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -			ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
> +			if (tgt)
> +				ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
>  		} else if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
>  			ifindex = index;
>  			map_index = 0;
> 
> 
> Hangbin
> 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f8a45f109e96..4243284fff8b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1497,8 +1497,13 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp, struct net_device *dev_rx); int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, struct net_device *dev_rx); +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx, + struct bpf_map *map, bool exclude_ingress); int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, struct bpf_prog *xdp_prog); +int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb, + struct bpf_prog *xdp_prog, struct bpf_map *map, + bool exclude_ingress); bool dev_map_can_have_prog(struct bpf_map *map); void __cpu_map_flush(void); @@ -1666,6 +1671,13 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, return 0; } +static inline +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx, + struct bpf_map *map, bool exclude_ingress) +{ + return 0; +} + struct sk_buff; static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, @@ -1675,6 +1687,14 @@ static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, return 0; } +static inline +int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb, + struct bpf_prog *xdp_prog, struct bpf_map *map, + bool exclude_ingress) +{ + return 0; +} + static inline void __cpu_map_flush(void) { } diff --git a/include/linux/filter.h b/include/linux/filter.h index 9a09547bc7ba..e4885b42d754 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -646,6 +646,7 @@ struct bpf_redirect_info { u32 flags; u32 tgt_index; void *tgt_value; + struct bpf_map *map; u32 map_id; enum bpf_map_type map_type; u32 kern_flags; @@ -1464,17 +1465,18 @@ static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol, } #endif /* IS_ENABLED(CONFIG_IPV6) */ -static __always_inline int 
__bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, u64 flags, +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, + u64 flags, u64 flag_mask, void *lookup_elem(struct bpf_map *map, u32 key)) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); /* Lower bits of the flags are used as return code on lookup failure */ - if (unlikely(flags > XDP_TX)) + if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask))) return XDP_ABORTED; ri->tgt_value = lookup_elem(map, ifindex); - if (unlikely(!ri->tgt_value)) { + if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) { /* If the lookup fails we want to clear out the state in the * redirect_info struct completely, so that if an eBPF program * performs multiple lookups, the last one always takes @@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind */ ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */ ri->map_type = BPF_MAP_TYPE_UNSPEC; - return flags; + return flags & BPF_F_ACTION_MASK; } ri->tgt_index = ifindex; ri->map_id = map->id; ri->map_type = map->map_type; + if (flags & BPF_F_BROADCAST) { + WRITE_ONCE(ri->map, map); + ri->flags = flags; + } else { + WRITE_ONCE(ri->map, NULL); + ri->flags = 0; + } + return XDP_REDIRECT; } diff --git a/include/net/xdp.h b/include/net/xdp.h index a5bc214a49d9..5533f0ab2afc 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -170,6 +170,7 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf, struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf, struct net_device *dev); int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp); +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf); static inline void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ec6d85a81744..c6fe0526811b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ 
-2534,8 +2534,12 @@ union bpf_attr { * The lower two bits of *flags* are used as the return code if * the map lookup fails. This is so that the return value can be * one of the XDP program return codes up to **XDP_TX**, as chosen - * by the caller. Any higher bits in the *flags* argument must be - * unset. + * by the caller. The higher bits of *flags* can be set to + * BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below. + * + * With BPF_F_BROADCAST the packet will be broadcasted to all the + * interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress + * interface will be excluded when do broadcasting. * * See also **bpf_redirect**\ (), which only supports redirecting * to an ifindex, but doesn't require a map to do so. @@ -5080,6 +5084,15 @@ enum { BPF_F_BPRM_SECUREEXEC = (1ULL << 0), }; +/* Flags for bpf_redirect_map helper */ +enum { + BPF_F_BROADCAST = (1ULL << 3), + BPF_F_EXCLUDE_INGRESS = (1ULL << 4), +}; + +#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX) +#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS) + #define __bpf_md_ptr(type, name) \ union { \ type name; \ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 0cf2791d5099..2c33a7a09783 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -601,7 +601,8 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key) static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, __cpu_map_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, 0, + __cpu_map_lookup_elem); } static int cpu_map_btf_id; diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 3980fb3bfb09..2eebb9a4b093 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -198,6 +198,7 @@ static void dev_map_free(struct bpf_map *map) list_del_rcu(&dtab->list); spin_unlock(&dev_map_lock); + bpf_clear_redirect_map(map); synchronize_rcu(); /* Make sure prior 
__dev_map_entry_free() have completed. */ @@ -515,6 +516,99 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog); } +static bool is_valid_dst(struct bpf_dtab_netdev *obj, struct xdp_buff *xdp, + int exclude_ifindex) +{ + if (!obj || obj->dev->ifindex == exclude_ifindex || + !obj->dev->netdev_ops->ndo_xdp_xmit) + return false; + + if (xdp_ok_fwd_dev(obj->dev, xdp->data_end - xdp->data)) + return false; + + return true; +} + +static int dev_map_enqueue_clone(struct bpf_dtab_netdev *obj, + struct net_device *dev_rx, + struct xdp_frame *xdpf) +{ + struct xdp_frame *nxdpf; + + nxdpf = xdpf_clone(xdpf); + if (!nxdpf) + return -ENOMEM; + + bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog); + + return 0; +} + +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx, + struct bpf_map *map, bool exclude_ingress) +{ + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); + int exclude_ifindex = exclude_ingress ? 
dev_rx->ifindex : 0; + struct bpf_dtab_netdev *dst, *last_dst = NULL; + struct hlist_head *head; + struct xdp_frame *xdpf; + unsigned int i; + int err; + + xdpf = xdp_convert_buff_to_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + if (map->map_type == BPF_MAP_TYPE_DEVMAP) { + for (i = 0; i < map->max_entries; i++) { + dst = READ_ONCE(dtab->netdev_map[i]); + if (!is_valid_dst(dst, xdp, exclude_ifindex)) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf); + if (err) + return err; + + last_dst = dst; + } + } else { /* BPF_MAP_TYPE_DEVMAP_HASH */ + for (i = 0; i < dtab->n_buckets; i++) { + head = dev_map_index_hash(dtab, i); + hlist_for_each_entry_rcu(dst, head, index_hlist, + lockdep_is_held(&dtab->index_lock)) { + if (!is_valid_dst(dst, xdp, exclude_ifindex)) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf); + if (err) + return err; + + last_dst = dst; + } + } + } + + /* consume the last copy of the frame */ + if (last_dst) + bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog); + else + xdp_return_frame_rx_napi(xdpf); /* dtab is empty */ + + return 0; +} + int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, struct bpf_prog *xdp_prog) { @@ -529,6 +623,87 @@ int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, return 0; } +static int dev_map_redirect_clone(struct bpf_dtab_netdev *dst, + struct sk_buff *skb, + struct bpf_prog *xdp_prog) +{ + struct sk_buff *nskb; + int err; + + nskb = skb_clone(skb, GFP_ATOMIC); + if (!nskb) + return -ENOMEM; + + err = dev_map_generic_redirect(dst, nskb, xdp_prog); + if (unlikely(err)) { + consume_skb(nskb); + return err; + } + + return 0; +} + +int dev_map_redirect_multi(struct net_device *dev, struct sk_buff 
*skb, + struct bpf_prog *xdp_prog, struct bpf_map *map, + bool exclude_ingress) +{ + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); + int exclude_ifindex = exclude_ingress ? dev->ifindex : 0; + struct bpf_dtab_netdev *dst, *last_dst = NULL; + struct hlist_head *head; + struct hlist_node *next; + unsigned int i; + int err; + + if (map->map_type == BPF_MAP_TYPE_DEVMAP) { + for (i = 0; i < map->max_entries; i++) { + dst = READ_ONCE(dtab->netdev_map[i]); + if (!dst || dst->dev->ifindex == exclude_ifindex) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_redirect_clone(last_dst, skb, xdp_prog); + if (err) + return err; + + last_dst = dst; + } + } else { /* BPF_MAP_TYPE_DEVMAP_HASH */ + for (i = 0; i < dtab->n_buckets; i++) { + head = dev_map_index_hash(dtab, i); + hlist_for_each_entry_safe(dst, next, head, index_hlist) { + if (!dst || dst->dev->ifindex == exclude_ifindex) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_redirect_clone(last_dst, skb, xdp_prog); + if (err) + return err; + + last_dst = dst; + } + } + } + + /* consume the first skb and return */ + if (last_dst) + return dev_map_generic_redirect(last_dst, skb, xdp_prog); + + /* dtab is empty */ + consume_skb(skb); + return 0; +} + static void *dev_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); @@ -755,12 +930,14 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK, + __dev_map_lookup_elem); } static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return 
__bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_hash_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK, + __dev_map_hash_lookup_elem); } static int dev_map_btf_id; diff --git a/net/core/filter.c b/net/core/filter.c index cae56d08a670..05ba5ab4345f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3926,6 +3926,23 @@ void xdp_do_flush(void) } EXPORT_SYMBOL_GPL(xdp_do_flush); +void bpf_clear_redirect_map(struct bpf_map *map) +{ + struct bpf_redirect_info *ri; + int cpu; + + for_each_possible_cpu(cpu) { + ri = per_cpu_ptr(&bpf_redirect_info, cpu); + /* Avoid polluting remote cacheline due to writes if + * not needed. Once we pass this test, we need the + * cmpxchg() to make sure it hasn't been changed in + * the meantime by remote CPU. + */ + if (unlikely(READ_ONCE(ri->map) == map)) + cmpxchg(&ri->map, map, NULL); + } +} + int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, enum bpf_map_type map_type = ri->map_type; void *fwd = ri->tgt_value; u32 map_id = ri->map_id; + struct bpf_map *map; int err; ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */ @@ -3942,7 +3960,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, case BPF_MAP_TYPE_DEVMAP: fallthrough; case BPF_MAP_TYPE_DEVMAP_HASH: - err = dev_map_enqueue(fwd, xdp, dev); + map = READ_ONCE(ri->map); + if (map) { + WRITE_ONCE(ri->map, NULL); + err = dev_map_enqueue_multi(xdp, dev, map, + ri->flags & BPF_F_EXCLUDE_INGRESS); + } else { + err = dev_map_enqueue(fwd, xdp, dev); + } break; case BPF_MAP_TYPE_CPUMAP: err = cpu_map_enqueue(fwd, xdp, dev); @@ -3984,13 +4009,21 @@ static int xdp_do_generic_redirect_map(struct net_device *dev, enum bpf_map_type map_type, u32 map_id) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_map *map; int err; switch (map_type) { case BPF_MAP_TYPE_DEVMAP: 
fallthrough; case BPF_MAP_TYPE_DEVMAP_HASH: - err = dev_map_generic_redirect(fwd, skb, xdp_prog); + map = READ_ONCE(ri->map); + if (map) { + WRITE_ONCE(ri->map, NULL); + err = dev_map_redirect_multi(dev, skb, xdp_prog, map, + ri->flags & BPF_F_EXCLUDE_INGRESS); + } else { + err = dev_map_generic_redirect(fwd, skb, xdp_prog); + } if (unlikely(err)) goto err; break; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..aba84d04642b 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -583,3 +583,32 @@ struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf, return __xdp_build_skb_from_frame(xdpf, skb, dev); } EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame); + +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf) +{ + unsigned int headroom, totalsize; + struct xdp_frame *nxdpf; + struct page *page; + void *addr; + + headroom = xdpf->headroom + sizeof(*xdpf); + totalsize = headroom + xdpf->len; + + if (unlikely(totalsize > PAGE_SIZE)) + return NULL; + page = dev_alloc_page(); + if (!page) + return NULL; + addr = page_to_virt(page); + + memcpy(addr, xdpf, totalsize); + + nxdpf = addr; + nxdpf->data = addr + headroom; + nxdpf->frame_sz = PAGE_SIZE; + nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0; + nxdpf->mem.id = 0; + + return nxdpf; +} +EXPORT_SYMBOL_GPL(xdpf_clone); diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index 67b4ce504852..9df75ea4a567 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -226,7 +226,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key) static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, __xsk_map_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, 0, + __xsk_map_lookup_elem); } void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index ec6d85a81744..c6fe0526811b 100644 --- a/tools/include/uapi/linux/bpf.h +++ 
b/tools/include/uapi/linux/bpf.h @@ -2534,8 +2534,12 @@ union bpf_attr { * The lower two bits of *flags* are used as the return code if * the map lookup fails. This is so that the return value can be * one of the XDP program return codes up to **XDP_TX**, as chosen - * by the caller. Any higher bits in the *flags* argument must be - * unset. + * by the caller. The higher bits of *flags* can be set to + * BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below. + * + * With BPF_F_BROADCAST the packet will be broadcasted to all the + * interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress + * interface will be excluded when do broadcasting. * * See also **bpf_redirect**\ (), which only supports redirecting * to an ifindex, but doesn't require a map to do so. @@ -5080,6 +5084,15 @@ enum { BPF_F_BPRM_SECUREEXEC = (1ULL << 0), }; +/* Flags for bpf_redirect_map helper */ +enum { + BPF_F_BROADCAST = (1ULL << 3), + BPF_F_EXCLUDE_INGRESS = (1ULL << 4), +}; + +#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX) +#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS) + #define __bpf_md_ptr(type, name) \ union { \ type name; \