[PATCHv10,bpf-next,2/4] xdp: extend xdp_redirect_map with broadcast support

Message ID 20210423020019.2333192-3-liuhangbin@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series xdp: extend xdp_redirect_map with broadcast support

Checks

Context                       | Check   | Description
netdev/cover_letter           | success | Link
netdev/fixes_present          | success | Link
netdev/patch_count            | success | Link
netdev/tree_selection         | success | Clearly marked for bpf-next
netdev/subject_prefix         | success | Link
netdev/cc_maintainers         | warning | 12 maintainers not CCed: joe@cilium.io jonathan.lemon@gmail.com yhs@fb.com kpsingh@kernel.org hawk@kernel.org andrii@kernel.org magnus.karlsson@intel.com songliubraving@fb.com bjorn@kernel.org davem@davemloft.net quentin@isovalent.com kuba@kernel.org
netdev/source_inline          | success | Was 0 now: 0
netdev/verify_signedoff       | success | Link
netdev/module_param           | success | Was 0 now: 0
netdev/build_32bit            | success | Errors and warnings before: 11956 this patch: 11956
netdev/kdoc                   | success | Errors and warnings before: 0 this patch: 0
netdev/verify_fixes           | success | Link
netdev/checkpatch             | warning | WARNING: line length of 81 exceeds 80 columns; WARNING: line length of 83 exceeds 80 columns; WARNING: line length of 84 exceeds 80 columns; WARNING: line length of 86 exceeds 80 columns; WARNING: line length of 87 exceeds 80 columns; WARNING: line length of 88 exceeds 80 columns; WARNING: please, no space before tabs
netdev/build_allmodconfig_warn | success | Errors and warnings before: 12443 this patch: 12443
netdev/header_inline          | success | Link

Commit Message

Hangbin Liu April 23, 2021, 2 a.m. UTC
This patch adds two flags, BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS, to
extend xdp_redirect_map with broadcast support.

With BPF_F_BROADCAST the packet will be broadcast to all the interfaces
in the map. With BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when broadcasting.
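
To make the flag semantics concrete, here is a minimal userspace model in
plain C. It is not kernel or BPF code: model_select_targets() and the
MODEL_F_* names and values are made up for illustration, and an int array
with 0 marking an empty slot stands in for a devmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for BPF_F_BROADCAST / BPF_F_EXCLUDE_INGRESS.
 * The bit values are illustrative, not the UAPI values.
 */
#define MODEL_F_BROADCAST       (1u << 0)
#define MODEL_F_EXCLUDE_INGRESS (1u << 1)

/* Pick the egress ifindexes for one packet.
 * map[] models a device array map; a 0 entry is an empty slot.
 * Returns how many ifindexes were written to out[].
 */
size_t model_select_targets(const int *map, size_t map_len, uint32_t key,
			    int ingress_ifindex, unsigned int flags,
			    int *out, size_t out_cap)
{
	size_t n = 0;

	assert(map != NULL && out != NULL);

	if (!(flags & MODEL_F_BROADCAST)) {
		/* Unicast: a single lookup by key. */
		if (key < map_len && map[key] != 0 && out_cap > 0)
			out[n++] = map[key];
		return n;
	}

	/* Broadcast: visit every slot, skipping holes. */
	for (size_t i = 0; i < map_len && n < out_cap; i++) {
		if (map[i] == 0)
			continue;
		if ((flags & MODEL_F_EXCLUDE_INGRESS) &&
		    map[i] == ingress_ifindex)
			continue;
		out[n++] = map[i];
	}
	return n;
}
```

With a map of {3, 0, 5, 7} and ingress ifindex 5, the model picks 3, 5 and 7
for a plain broadcast, and 3 and 7 once the exclude-ingress bit is also set.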

When getting the devices in a dev hash map via dev_map_hash_get_next_key(),
we may fall back to the first key when a device is removed. This would
duplicate packets on some interfaces. So just walk all the buckets to
avoid this issue. For the dev array map, we also walk the whole map to
find valid interfaces.
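
The duplication described above can be reproduced with a small userspace
model. This is only a sketch: struct model_map and the model_* functions are
hypothetical, the "map" is a fixed array of slots, and the delete_after
argument simulates a concurrent removal right after a packet was sent to
that slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4

struct model_map {
	int ifindex[MODEL_SLOTS];
	bool present[MODEL_SLOTS];
};

/* Mimics the get_next_key() fallback: if prev is negative or no longer
 * present, iteration restarts from the first entry.
 * Returns the next slot, or -1 at the end of the map.
 */
static int model_get_next_key(const struct model_map *m, int prev)
{
	int start = 0;

	if (prev >= 0 && prev < MODEL_SLOTS && m->present[prev])
		start = prev + 1;
	for (int i = start; i < MODEL_SLOTS; i++)
		if (m->present[i])
			return i;
	return -1;
}

/* Broadcast by chaining get_next_key() calls; sent[] counts deliveries. */
void model_broadcast_next_key(struct model_map *m, int delete_after,
			      int sent[MODEL_SLOTS])
{
	int budget = 2 * MODEL_SLOTS; /* safety bound for the model */
	int k;

	assert(m != NULL && sent != NULL);
	k = model_get_next_key(m, -1);
	while (k >= 0 && budget-- > 0) {
		sent[k]++;
		if (k == delete_after) {
			m->present[k] = false; /* simulated removal */
			delete_after = -1;
		}
		k = model_get_next_key(m, k);
	}
}

/* Broadcast by walking every slot directly, as the patch does. */
void model_broadcast_walk(struct model_map *m, int delete_after,
			  int sent[MODEL_SLOTS])
{
	assert(m != NULL && sent != NULL);
	for (int i = 0; i < MODEL_SLOTS; i++) {
		if (!m->present[i])
			continue;
		sent[i]++;
		if (i == delete_after)
			m->present[i] = false; /* simulated removal */
	}
}
```

With four populated slots and slot 1 removed right after it was served, the
next-key variant restarts from slot 0 and delivers to it twice, while the
direct walk delivers exactly once per slot.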

Function bpf_clear_redirect_map() was removed in
commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.
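
The "cheap read first, then compare-and-swap" idea behind
bpf_clear_redirect_map() can be sketched in userspace with C11 atomics. This
is an analogy only: model_ri[] stands in for the per-CPU bpf_redirect_info,
MODEL_NR_CPUS and all model_* names are hypothetical, and C11 atomics are not
the kernel's READ_ONCE()/cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

struct model_map {
	int id;
};

struct model_redirect_info {
	_Atomic(struct model_map *) map;
};

/* One slot per (modelled) CPU. */
struct model_redirect_info model_ri[MODEL_NR_CPUS];

/* Clear every per-CPU pointer that still refers to map.
 * Returns the number of slots that were cleared.
 */
int model_clear_redirect_map(struct model_map *map)
{
	int cleared = 0;

	assert(map != NULL);
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		struct model_map *expected = map;

		/* Plain read first, so slots that do not match are
		 * never written to.
		 */
		if (atomic_load_explicit(&model_ri[cpu].map,
					 memory_order_relaxed) != map)
			continue;
		/* The slot may have changed since the read; only clear
		 * it if it still holds map.
		 */
		if (atomic_compare_exchange_strong(&model_ri[cpu].map,
						   &expected,
						   (struct model_map *)NULL))
			cleared++;
	}
	return cleared;
}
```

Slots that point at a different map are left untouched, which is the point of
doing the read before the compare-and-swap.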

Here are the performance results using a 10Gb i40e NIC, doing XDP_DROP on
the veth peer, running xdp_redirect_{map, map_multi} in samples/bpf and
sending packets via the pktgen command:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There is some performance drop because we need to loop over the map and
look up each interface.

Version          | Test                                | Generic | Native
5.12 rc4         | redirect_map        i40e->i40e      |    1.9M |  9.6M
5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.7M
5.12 rc4 + patch | redirect_map        i40e->i40e      |    1.9M |  9.3M
5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi  i40e->i40e      |    1.9M |  8.9M
5.12 rc4 + patch | redirect_map multi  i40e->veth      |    1.7M | 10.9M
5.12 rc4 + patch | redirect_map multi  i40e->mlx4+veth |    1.2M |  3.8M
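
To put the table in relative terms, the drop against the unpatched baseline
can be computed as (baseline - patched) / baseline. The helper below is a
hypothetical convenience, not part of the patch; inputs are the table values
in millions of packets.

```c
#include <assert.h>

/* Relative throughput drop, in percent, of patched vs. baseline. */
double model_drop_percent(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

For the native i40e->i40e rows this gives roughly 3.1% for redirect_map
(9.6M to 9.3M) and roughly 7.3% for redirect_map multi (9.6M to 8.9M). For
native i40e->veth the figures are roughly 2.6% and 6.8%.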

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

---
v10:
As reminded by Jesper: revert xchg() and use READ_ONCE/WRITE_ONCE when
reading/writing the map pointer, since xchg() is an atomic operation and
can be expensive.

v9: no update

v8:
use hlist_for_each_entry_rcu() when looping over the devmap hash objs

v7:
No need to free xdpf in dev_map_enqueue_clone() if xdpf_clone failed.
Also return -EOVERFLOW if xdp_convert_buff_to_frame() failed, the same
as other callers do.

v6:
Fix a skb leak in the error path for generic XDP

v5:
a) use xchg() instead of READ/WRITE_ONCE and no need to clear ri->flags
   in xdp_do_redirect()
b) Do not use get_next_key() as we may restart looping from the first key
   when a dev in the hash map is removed/updated. Just walk the map
   directly to get all the devices and ignore newly added/deleted objects.
c) Loop over the whole array map instead of stopping at the first hole.

v4:
a) add a new argument flag_mask to __bpf_xdp_redirect_map() to filter out
invalid maps.
b) __bpf_xdp_redirect_map() sets the map pointer if the broadcast flag
is set and clears it if the flag isn't set
c) xdp_do_redirect() does the READ_ONCE/WRITE_ONCE on ri->map to check
if we should enqueue multi

v3:
a) Rebase the code on Björn's "bpf, xdp: Restructure redirect actions".
   - Add struct bpf_map *map back to struct bpf_redirect_info as we need
     it for multicast.
   - Add bpf_clear_redirect_map() back for devmap.c
   - Add devmap_lookup_elem() as we need it in general path.
b) remove tmp_key in devmap_get_next_obj()

v2: Fix flag renaming issue in v1
---
 include/linux/bpf.h            |  20 ++++
 include/linux/filter.h         |  18 +++-
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  17 +++-
 kernel/bpf/cpumap.c            |   3 +-
 kernel/bpf/devmap.c            | 181 ++++++++++++++++++++++++++++++++-
 net/core/filter.c              |  37 ++++++-
 net/core/xdp.c                 |  29 ++++++
 net/xdp/xskmap.c               |   3 +-
 tools/include/uapi/linux/bpf.h |  17 +++-
 10 files changed, 312 insertions(+), 14 deletions(-)

Comments

Jesper Dangaard Brouer April 26, 2021, 9:53 a.m. UTC | #1
On Fri, 23 Apr 2021 10:00:17 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
> extend xdp_redirect_map for broadcast support.
> 
> With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
> in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
> excluded when do broadcasting.
> 
> When getting the devices in dev hash map via dev_map_hash_get_next_key(),
> there is a possibility that we fall back to the first key when a device
> was removed. This will duplicate packets on some interfaces. So just walk
> the whole buckets to avoid this issue. For dev array map, we also walk the
> whole map to find valid interfaces.
> 
> Function bpf_clear_redirect_map() was removed in
> commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> Add it back as we need to use ri->map again.
> 
> Here is the performance result by using 10Gb i40e NIC, do XDP_DROP on
> veth peer, run xdp_redirect_{map, map_multi} in sample/bpf and send pkts
> via pktgen cmd:
> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

While running:
 $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
 Get interfaces 7 7
 libbpf: elf: skipping unrecognized data section(23) .eh_frame
 libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
 Forwarding   10140845 pkt/s
 Forwarding   11767042 pkt/s
 Forwarding   11783437 pkt/s
 Forwarding   11767331 pkt/s

When starting:  sudo ./xdp_monitor --stats

System crashed with:

[ 5509.997837] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 5510.004793] #PF: supervisor read access in kernel mode
[ 5510.009929] #PF: error_code(0x0000) - not-present page
[ 5510.015060] PGD 0 P4D 0 
[ 5510.017591] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5510.021769] CPU: 3 PID: 29 Comm: ksoftirqd/3 Not tainted 5.12.0-rc7-net-next-hangbin-v10+ #602
[ 5510.030368] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[ 5510.037835] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
[ 5510.043929] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
[ 5510.062668] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
[ 5510.067884] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
[ 5510.075007] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
[ 5510.082133] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
[ 5510.089256] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
[ 5510.096382] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
[ 5510.103505] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
[ 5510.111584] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5510.117319] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
[ 5510.124444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5510.131567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5510.138692] Call Trace:
[ 5510.141141]  xdp_do_redirect+0x16b/0x230
[ 5510.145064]  i40e_clean_rx_irq+0x62e/0x9a0 [i40e]
[ 5510.149779]  i40e_napi_poll+0xf0/0x410 [i40e]
[ 5510.154135]  __napi_poll+0x2a/0x140
[ 5510.157620]  net_rx_action+0x215/0x2d0
[ 5510.161364]  __do_softirq+0xe3/0x2df
[ 5510.164938]  run_ksoftirqd+0x1a/0x20
[ 5510.168514]  smpboot_thread_fn+0xee/0x1e0
[ 5510.172519]  ? sort_range+0x20/0x20
[ 5510.176003]  kthread+0x116/0x150
[ 5510.179237]  ? kthread_park+0x90/0x90
[ 5510.182893]  ret_from_fork+0x22/0x30
[ 5510.186474] Modules linked in: algif_hash af_alg bpf_preload fuse veth nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc rpcrdma sunrpc rdma_ucm ib_umad rdma_cm ib_ipoib coretemp iw_cm kvm_intel ib_cm kvm mlx5_ib i40iw irqbypass ib_uverbs rapl intel_cstate i2c_i801 intel_uncore ib_core pcspkr bfq i2c_smbus acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel mlx5_core mlxfw i40e ixgbe igc psample mdio igb sd_mod ptp t10_pi nfp i2c_algo_bit i2c_core pps_core hid_generic [last unloaded: bpfilter]
[ 5510.231847] CR2: 0000000000000000
[ 5510.235164] ---[ end trace b8e076677c53b5e8 ]---
[ 5510.241762] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
[ 5510.247851] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
[ 5510.266590] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
[ 5510.271804] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
[ 5510.278931] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
[ 5510.286053] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
[ 5510.293178] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
[ 5510.300301] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
[ 5510.307426] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
[ 5510.315503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5510.321241] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
[ 5510.328364] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5510.335490] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5510.342612] Kernel panic - not syncing: Fatal exception in interrupt
[ 5510.348994] Kernel Offset: disabled
[ 5510.354469] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

[net-next]$ ./scripts/faddr2line vmlinux xdp_do_redirect+0x16b
xdp_do_redirect+0x16b/0x230:
trace_xdp_redirect at include/trace/events/xdp.h:136
(inlined by) trace_xdp_redirect at include/trace/events/xdp.h:136
(inlined by) xdp_do_redirect at net/core/filter.c:3996

Decode: perf_trace_xdp_redirect_template+0xba
 ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
perf_trace_xdp_redirect_template+0xba/0x130:
perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)

less -N net/core/filter.c
 [...]
   3993         if (unlikely(err))
   3994                 goto err;
   3995 
-> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
   3997         return 0;
   3998 err:
   3999         _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
   4000         return err;
   4001 }
   4002 EXPORT_SYMBOL_GPL(xdp_do_redirect);


> ---
>  include/linux/bpf.h            |  20 ++++
>  include/linux/filter.h         |  18 +++-
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  17 +++-
>  kernel/bpf/cpumap.c            |   3 +-
>  kernel/bpf/devmap.c            | 181 ++++++++++++++++++++++++++++++++-
>  net/core/filter.c              |  37 ++++++-
>  net/core/xdp.c                 |  29 ++++++
>  net/xdp/xskmap.c               |   3 +-
>  tools/include/uapi/linux/bpf.h |  17 +++-
>  10 files changed, 312 insertions(+), 14 deletions(-)
> 
[...]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index cae56d08a670..05ba5ab4345f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3926,6 +3926,23 @@ void xdp_do_flush(void)
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_flush);
>  
> +void bpf_clear_redirect_map(struct bpf_map *map)
> +{
> +	struct bpf_redirect_info *ri;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> +		/* Avoid polluting remote cacheline due to writes if
> +		 * not needed. Once we pass this test, we need the
> +		 * cmpxchg() to make sure it hasn't been changed in
> +		 * the meantime by remote CPU.
> +		 */
> +		if (unlikely(READ_ONCE(ri->map) == map))
> +			cmpxchg(&ri->map, map, NULL);
> +	}
> +}
> +
>  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct bpf_prog *xdp_prog)
>  {
> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  	enum bpf_map_type map_type = ri->map_type;
>  	void *fwd = ri->tgt_value;
>  	u32 map_id = ri->map_id;
> +	struct bpf_map *map;
>  	int err;
>  
>  	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> @@ -3942,7 +3960,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  	case BPF_MAP_TYPE_DEVMAP:
>  		fallthrough;
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		err = dev_map_enqueue(fwd, xdp, dev);
> +		map = READ_ONCE(ri->map);
> +		if (map) {
> +			WRITE_ONCE(ri->map, NULL);
> +			err = dev_map_enqueue_multi(xdp, dev, map,
> +						    ri->flags & BPF_F_EXCLUDE_INGRESS);
> +		} else {
> +			err = dev_map_enqueue(fwd, xdp, dev);
> +		}
>  		break;
>  	case BPF_MAP_TYPE_CPUMAP:
>  		err = cpu_map_enqueue(fwd, xdp, dev);
> @@ -3984,13 +4009,21 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>  				       enum bpf_map_type map_type, u32 map_id)
>  {
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_map *map;
>  	int err;
>  
>  	switch (map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  		fallthrough;
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> +		map = READ_ONCE(ri->map);
> +		if (map) {
> +			WRITE_ONCE(ri->map, NULL);
> +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +						     ri->flags & BPF_F_EXCLUDE_INGRESS);
> +		} else {
> +			err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> +		}
>  		if (unlikely(err))
>  			goto err;
>  		break;
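
The dispatch in the hunks above reads ri->map once, clears it if set, and
takes the broadcast path. A minimal userspace sketch of that consume-once
pattern follows. Relaxed C11 atomic load/store are used here as a rough
stand-in for READ_ONCE()/WRITE_ONCE(); the struct, enum and function names
are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum model_path {
	MODEL_PATH_SINGLE, /* models dev_map_enqueue() */
	MODEL_PATH_MULTI,  /* models dev_map_enqueue_multi() */
};

struct model_redirect_info {
	_Atomic(const void *) map;
};

/* Decide which enqueue path to take and consume the map pointer, so a
 * later unicast redirect on the same CPU does not see a stale map.
 */
enum model_path model_pick_path(struct model_redirect_info *ri)
{
	const void *map;

	assert(ri != NULL);
	map = atomic_load_explicit(&ri->map, memory_order_relaxed);
	if (map) {
		/* Plain store instead of an atomic exchange. */
		atomic_store_explicit(&ri->map, NULL, memory_order_relaxed);
		return MODEL_PATH_MULTI;
	}
	return MODEL_PATH_SINGLE;
}
```

A separate load and store is cheaper than an atomic exchange, which is the
trade-off the v10 changelog entry refers to.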
Hangbin Liu April 26, 2021, 10:47 a.m. UTC | #2
On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 23 Apr 2021 10:00:17 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
> 
> > This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
> > extend xdp_redirect_map for broadcast support.
> > 
> > With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
> > in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
> > excluded when do broadcasting.
> > 
> > When getting the devices in dev hash map via dev_map_hash_get_next_key(),
> > there is a possibility that we fall back to the first key when a device
> > was removed. This will duplicate packets on some interfaces. So just walk
> > the whole buckets to avoid this issue. For dev array map, we also walk the
> > whole map to find valid interfaces.
> > 
> > Function bpf_clear_redirect_map() was removed in
> > commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> > Add it back as we need to use ri->map again.
> > 
> > Here is the performance result by using 10Gb i40e NIC, do XDP_DROP on
> > veth peer, run xdp_redirect_{map, map_multi} in sample/bpf and send pkts
> > via pktgen cmd:
> > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> 
> While running:
>  $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
>  Get interfaces 7 7
>  libbpf: elf: skipping unrecognized data section(23) .eh_frame
>  libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
>  Forwarding   10140845 pkt/s
>  Forwarding   11767042 pkt/s
>  Forwarding   11783437 pkt/s
>  Forwarding   11767331 pkt/s
> 
> When starting:  sudo ./xdp_monitor --stats

That seems to be the same issue I reported previously in our meeting.
https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4

I only saw it 3 times and can't reproduce it easily.

Do you have any idea what the root cause is?

Thanks
Hangbin

Hangbin Liu April 26, 2021, 10:54 a.m. UTC | #3
On Mon, Apr 26, 2021 at 06:47:17PM +0800, Hangbin Liu wrote:
> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 23 Apr 2021 10:00:17 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> > 
> > > This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
> > > extend xdp_redirect_map for broadcast support.
> > > 
> > > With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
> > > in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
> > > excluded when do broadcasting.
> > > 
> > > When getting the devices in dev hash map via dev_map_hash_get_next_key(),
> > > there is a possibility that we fall back to the first key when a device
> > > was removed. This will duplicate packets on some interfaces. So just walk
> > > the whole buckets to avoid this issue. For dev array map, we also walk the
> > > whole map to find valid interfaces.
> > > 
> > > Function bpf_clear_redirect_map() was removed in
> > > commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> > > Add it back as we need to use ri->map again.
> > > 
> > > Here is the performance result by using 10Gb i40e NIC, do XDP_DROP on
> > > veth peer, run xdp_redirect_{map, map_multi} in sample/bpf and send pkts
> > > via pktgen cmd:
> > > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> > 
> > While running:
> >  $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
> >  Get interfaces 7 7
> >  libbpf: elf: skipping unrecognized data section(23) .eh_frame
> >  libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
> >  Forwarding   10140845 pkt/s
> >  Forwarding   11767042 pkt/s
> >  Forwarding   11783437 pkt/s
> >  Forwarding   11767331 pkt/s
> > 
> > When starting:  sudo ./xdp_monitor --stats
> 
> That seems the same issue I reported previously in our meeting.
> https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
> 
> I only saw it 3 times and can't reproduce it easily.
> 
> Do you have any idea where is the root cause?

OK, I just re-did the test and can reproduce it now.
Maybe the code changes made it easier to reproduce.

I will check this issue.

Thanks
Hangbin
Hangbin Liu April 26, 2021, 11:40 a.m. UTC | #4
On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> Decode: perf_trace_xdp_redirect_template+0xba
>  ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> perf_trace_xdp_redirect_template+0xba/0x130:
> perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> 
> less -N net/core/filter.c
>  [...]
>    3993         if (unlikely(err))
>    3994                 goto err;
>    3995 
> -> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);

Oh, fwd is NULL in the xdp_redirect_map broadcast case...

I will see how to fix it. Maybe assign the ingress interface to fwd?

Hangbin

>    3997         return 0;
>    3998 err:
>    3999         _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
>    4000         return err;
>    4001 }
>    4002 EXPORT_SYMBOL_GPL(xdp_do_redirect);
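
The oops is consistent with that diagnosis: RSI is 0 and the faulting
instruction (48 8b 16) is a load through RSI. The sketch below shows one
possible way a tracepoint-style helper could tolerate a NULL fwd. It is a
userspace illustration only, with hypothetical model_* types and names, and
is not necessarily the fix that was applied; the thread only floats assigning
the ingress interface to fwd.

```c
#include <assert.h>
#include <stddef.h>

struct model_netdev {
	int ifindex;
};

struct model_devmap_entry {
	struct model_netdev *dev;
};

/* Return the target ifindex to record in a trace event.
 * In the broadcast case there is no single target, so fwd may be NULL;
 * the model reports 0 then instead of dereferencing it.
 */
int model_trace_to_ifindex(const struct model_devmap_entry *fwd)
{
	if (fwd == NULL || fwd->dev == NULL)
		return 0;
	/* The model assumes valid interface indexes are positive. */
	assert(fwd->dev->ifindex > 0);
	return fwd->dev->ifindex;
}
```

Reporting 0 is an arbitrary choice for the model; any value that cannot be
mistaken for a real interface index would serve the same purpose.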
Jesper Dangaard Brouer April 26, 2021, 11:41 a.m. UTC | #5
On Mon, 26 Apr 2021 18:47:04 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 23 Apr 2021 10:00:17 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> >   
> > > This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
> > > extend xdp_redirect_map for broadcast support.
> > > 
> > > With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
> > > in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
> > > excluded when do broadcasting.
> > > 
> > > When getting the devices in dev hash map via dev_map_hash_get_next_key(),
> > > there is a possibility that we fall back to the first key when a device
> > > was removed. This will duplicate packets on some interfaces. So just walk
> > > the whole buckets to avoid this issue. For dev array map, we also walk the
> > > whole map to find valid interfaces.
> > > 
> > > Function bpf_clear_redirect_map() was removed in
> > > commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> > > Add it back as we need to use ri->map again.
> > > 
> > > Here is the performance result by using 10Gb i40e NIC, do XDP_DROP on
> > > veth peer, run xdp_redirect_{map, map_multi} in sample/bpf and send pkts
> > > via pktgen cmd:
> > > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64  
> > 
> > While running:
> >  $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
> >  Get interfaces 7 7
> >  libbpf: elf: skipping unrecognized data section(23) .eh_frame
> >  libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
> >  Forwarding   10140845 pkt/s
> >  Forwarding   11767042 pkt/s
> >  Forwarding   11783437 pkt/s
> >  Forwarding   11767331 pkt/s
> > 
> > When starting:  sudo ./xdp_monitor --stats  
> 
> That seems the same issue I reported previously in our meeting.
> https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
> 
> I only saw it 3 times and can't reproduce it easily.
> 
> Do you have any idea where is the root cause?

All the information you need to find the root cause is listed below.
I have decoded where in the code it happens, and included the code with
line numbers, pointing to the line where the crash happens. I don't think
it is possible for me to be more specific or help further.

 
> > System crashed with:
> > 
> > [ 5509.997837] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 5510.004793] #PF: supervisor read access in kernel mode
> > [ 5510.009929] #PF: error_code(0x0000) - not-present page
> > [ 5510.015060] PGD 0 P4D 0 
> > [ 5510.017591] Oops: 0000 [#1] PREEMPT SMP PTI
> > [ 5510.021769] CPU: 3 PID: 29 Comm: ksoftirqd/3 Not tainted 5.12.0-rc7-net-next-hangbin-v10+ #602
> > [ 5510.030368] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
> > [ 5510.037835] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
> > [ 5510.043929] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
> > [ 5510.062668] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
> > [ 5510.067884] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
> > [ 5510.075007] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
> > [ 5510.082133] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
> > [ 5510.089256] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
> > [ 5510.096382] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
> > [ 5510.103505] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
> > [ 5510.111584] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5510.117319] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
> > [ 5510.124444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 5510.131567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 5510.138692] Call Trace:
> > [ 5510.141141]  xdp_do_redirect+0x16b/0x230
> > [ 5510.145064]  i40e_clean_rx_irq+0x62e/0x9a0 [i40e]
> > [ 5510.149779]  i40e_napi_poll+0xf0/0x410 [i40e]
> > [ 5510.154135]  __napi_poll+0x2a/0x140
> > [ 5510.157620]  net_rx_action+0x215/0x2d0
> > [ 5510.161364]  __do_softirq+0xe3/0x2df
> > [ 5510.164938]  run_ksoftirqd+0x1a/0x20
> > [ 5510.168514]  smpboot_thread_fn+0xee/0x1e0
> > [ 5510.172519]  ? sort_range+0x20/0x20
> > [ 5510.176003]  kthread+0x116/0x150
> > [ 5510.179237]  ? kthread_park+0x90/0x90
> > [ 5510.182893]  ret_from_fork+0x22/0x30
> > [ 5510.186474] Modules linked in: algif_hash af_alg bpf_preload fuse veth nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc rpcrdma sunrpc rdma_ucm ib_umad rdma_cm ib_ipoib coretemp iw_cm kvm_intel ib_cm kvm mlx5_ib i40iw irqbypass ib_uverbs rapl intel_cstate i2c_i801 intel_uncore ib_core pcspkr bfq i2c_smbus acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel mlx5_core mlxfw i40e ixgbe igc psample mdio igb sd_mod ptp t10_pi nfp i2c_algo_bit i2c_core pps_core hid_generic [last unloaded: bpfilter]
> > [ 5510.231847] CR2: 0000000000000000
> > [ 5510.235164] ---[ end trace b8e076677c53b5e8 ]---
> > [ 5510.241762] RIP: 0010:perf_trace_xdp_redirect_template+0xba/0x130
> > [ 5510.247851] Code: 00 00 00 8b 45 18 74 1e 41 83 f9 19 74 18 45 85 c9 0f 85 83 00 00 00 81 7d 10 ff ff ff 7f 75 7a 89 c1 31 c0 eb 0d 48 8b 75 b8 <48> 8b 16 8b 8a d0 00 00 00 49 8b 55 38 41 b8 01 00 00 00 be 24 00
> > [ 5510.266590] RSP: 0018:ffffc9000017fc50 EFLAGS: 00010246
> > [ 5510.271804] RAX: 0000000000000000 RBX: ffffe8ffffccf180 RCX: 0000000000000000
> > [ 5510.278931] RDX: ffffffff817d1a9b RSI: 0000000000000000 RDI: ffffe8ffffcd8000
> > [ 5510.286053] RBP: ffffc9000017fc98 R08: 0000000000000000 R09: 0000000000000019
> > [ 5510.293178] R10: 0000000000000000 R11: ffff88887fd2ab70 R12: 0000000000000000
> > [ 5510.300301] R13: ffffc900000a5000 R14: ffff88810a8f2000 R15: ffffffff82a7c840
> > [ 5510.307426] FS:  0000000000000000(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
> > [ 5510.315503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5510.321241] CR2: 0000000000000000 CR3: 0000000157b1e004 CR4: 00000000003706e0
> > [ 5510.328364] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 5510.335490] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 5510.342612] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 5510.348994] Kernel Offset: disabled
> > [ 5510.354469] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> > 
> > [net-next]$ ./scripts/faddr2line vmlinux xdp_do_redirect+0x16b
> > xdp_do_redirect+0x16b/0x230:
> > trace_xdp_redirect at include/trace/events/xdp.h:136
> > (inlined by) trace_xdp_redirect at include/trace/events/xdp.h:136
> > (inlined by) xdp_do_redirect at net/core/filter.c:3996
> > 
> > Decode: perf_trace_xdp_redirect_template+0xba
> >  ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> > perf_trace_xdp_redirect_template+0xba/0x130:
> > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> > 
> > less -N net/core/filter.c
> >  [...]
> >    3993         if (unlikely(err))
> >    3994                 goto err;
> >    3995   
> > -> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);  
> >    3997         return 0;
> >    3998 err:
> >    3999         _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
> >    4000         return err;
> >    4001 }
> >    4002 EXPORT_SYMBOL_GPL(xdp_do_redirect);
> > 
> >   
> > > ---
> > >  include/linux/bpf.h            |  20 ++++
> > >  include/linux/filter.h         |  18 +++-
> > >  include/net/xdp.h              |   1 +
> > >  include/uapi/linux/bpf.h       |  17 +++-
> > >  kernel/bpf/cpumap.c            |   3 +-
> > >  kernel/bpf/devmap.c            | 181 ++++++++++++++++++++++++++++++++-
> > >  net/core/filter.c              |  37 ++++++-
> > >  net/core/xdp.c                 |  29 ++++++
> > >  net/xdp/xskmap.c               |   3 +-
> > >  tools/include/uapi/linux/bpf.h |  17 +++-
> > >  10 files changed, 312 insertions(+), 14 deletions(-)
> > >   
> > [...]
> >   
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index cae56d08a670..05ba5ab4345f 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -3926,6 +3926,23 @@ void xdp_do_flush(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(xdp_do_flush);
> > >  
> > > +void bpf_clear_redirect_map(struct bpf_map *map)
> > > +{
> > > +	struct bpf_redirect_info *ri;
> > > +	int cpu;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> > > +		/* Avoid polluting remote cacheline due to writes if
> > > +		 * not needed. Once we pass this test, we need the
> > > +		 * cmpxchg() to make sure it hasn't been changed in
> > > +		 * the meantime by remote CPU.
> > > +		 */
> > > +		if (unlikely(READ_ONCE(ri->map) == map))
> > > +			cmpxchg(&ri->map, map, NULL);
> > > +	}
> > > +}
> > > +
> > >  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >  		    struct bpf_prog *xdp_prog)
> > >  {
> > > @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >  	enum bpf_map_type map_type = ri->map_type;
> > >  	void *fwd = ri->tgt_value;
> > >  	u32 map_id = ri->map_id;
> > > +	struct bpf_map *map;
> > >  	int err;
> > >  
> > >  	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> > > @@ -3942,7 +3960,14 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >  	case BPF_MAP_TYPE_DEVMAP:
> > >  		fallthrough;
> > >  	case BPF_MAP_TYPE_DEVMAP_HASH:
> > > -		err = dev_map_enqueue(fwd, xdp, dev);
> > > +		map = READ_ONCE(ri->map);
> > > +		if (map) {
> > > +			WRITE_ONCE(ri->map, NULL);
> > > +			err = dev_map_enqueue_multi(xdp, dev, map,
> > > +						    ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > +		} else {
> > > +			err = dev_map_enqueue(fwd, xdp, dev);
> > > +		}
> > >  		break;
> > >  	case BPF_MAP_TYPE_CPUMAP:
> > >  		err = cpu_map_enqueue(fwd, xdp, dev);
> > > @@ -3984,13 +4009,21 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
> > >  				       enum bpf_map_type map_type, u32 map_id)
> > >  {
> > >  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > > +	struct bpf_map *map;
> > >  	int err;
> > >  
> > >  	switch (map_type) {
> > >  	case BPF_MAP_TYPE_DEVMAP:
> > >  		fallthrough;
> > >  	case BPF_MAP_TYPE_DEVMAP_HASH:
> > > -		err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> > > +		map = READ_ONCE(ri->map);
> > > +		if (map) {
> > > +			WRITE_ONCE(ri->map, NULL);
> > > +			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> > > +						     ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > +		} else {
> > > +			err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> > > +		}
> > >  		if (unlikely(err))
> > >  			goto err;
> > >  		break;  
> >
Hangbin Liu April 26, 2021, 11:47 a.m. UTC | #6
On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote:
> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
> > Decode: perf_trace_xdp_redirect_template+0xba
> >  ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> > perf_trace_xdp_redirect_template+0xba/0x130:
> > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> > 
> > less -N net/core/filter.c
> >  [...]
> >    3993         if (unlikely(err))
> >    3994                 goto err;
> >    3995 
> > -> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
> 
> Oh, the fwd in xdp xdp_redirect_map broadcast is NULL...
> 
> I will see how to fix it. Maybe assign the ingress interface to fwd?

Er, sorry for the flood of messages. I just checked the tracepoint code: fwd
in the xdp trace event stands for to_ifindex. So we can't assign the ingress
interface to fwd.

In the xdp_redirect_map broadcast case, there is no specific to_ifindex.
So how about just ignoring it, e.g.:

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index fcad3645a70b..1751da079330 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -110,7 +110,8 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
                u32 ifindex = 0, map_index = index;

                if (map_type == BPF_MAP_TYPE_DEVMAP || map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-                       ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
+                       if (tgt)
+                               ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
                } else if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
                        ifindex = index;
                        map_index = 0;


Hangbin
Hangbin Liu April 26, 2021, 11:54 a.m. UTC | #7
On Mon, Apr 26, 2021 at 01:41:05PM +0200, Jesper Dangaard Brouer wrote:
> > > While running:
> > >  $ sudo ./xdp_redirect_map_multi -F i40e2 i40e2
> > >  Get interfaces 7 7
> > >  libbpf: elf: skipping unrecognized data section(23) .eh_frame
> > >  libbpf: elf: skipping relo section(24) .rel.eh_frame for section(23) .eh_frame
> > >  Forwarding   10140845 pkt/s
> > >  Forwarding   11767042 pkt/s
> > >  Forwarding   11783437 pkt/s
> > >  Forwarding   11767331 pkt/s
> > > 
> > > When starting:  sudo ./xdp_monitor --stats  
> > 
> > That seems the same issue I reported previously in our meeting.
> > https://bugzilla.redhat.com/show_bug.cgi?id=1906820#c4
> > 
> > I only saw it 3 times and can't reproduce it easily.
> > 
> > Do you have any idea where is the root cause?
> 
> All the information you need to find the root-cause is listed below.
> I have even decoded where in the code it happens, and also include the
> code with line-numbering and pointed to the line the crash happens in,
> I don't think it is possible for me to be more specific and help further.

Thanks, I mixed this issue up with the one I hit previously, which I haven't
figured out yet. For this one, I have sent a proposal in another reply (it
fixes the crash in the tracepoint event). Would you please help review it?

Hangbin
Toke Høiland-Jørgensen April 26, 2021, 1 p.m. UTC | #8
Hangbin Liu <liuhangbin@gmail.com> writes:

> On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote:
>> On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:
>> > Decode: perf_trace_xdp_redirect_template+0xba
>> >  ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
>> > perf_trace_xdp_redirect_template+0xba/0x130:
>> > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
>> > 
>> > less -N net/core/filter.c
>> >  [...]
>> >    3993         if (unlikely(err))
>> >    3994                 goto err;
>> >    3995 
>> > -> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
>> 
>> Oh, the fwd in xdp xdp_redirect_map broadcast is NULL...
>> 
>> I will see how to fix it. Maybe assign the ingress interface to fwd?
>
> Er, sorry for the flood message. I just checked the trace point code, fwd
> in xdp trace event means to_ifindex. So we can't assign the ingress interface
> to fwd.
>
> In xdp_redirect_map broadcast case, there is no specific to_ifindex.
> So how about just ignore it... e.g.

Yeah, just leaving the ifindex as 0 when tgt is unset is fine :)

-Toke
Jesper Dangaard Brouer April 27, 2021, 8:40 a.m. UTC | #9
On Mon, 26 Apr 2021 19:47:42 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Mon, Apr 26, 2021 at 07:40:28PM +0800, Hangbin Liu wrote:
> > On Mon, Apr 26, 2021 at 11:53:50AM +0200, Jesper Dangaard Brouer wrote:  
> > > Decode: perf_trace_xdp_redirect_template+0xba
> > >  ./scripts/faddr2line vmlinux perf_trace_xdp_redirect_template+0xba
> > > perf_trace_xdp_redirect_template+0xba/0x130:
> > > perf_trace_xdp_redirect_template at include/trace/events/xdp.h:89 (discriminator 13)
> > > 
> > > less -N net/core/filter.c
> > >  [...]
> > >    3993         if (unlikely(err))
> > >    3994                 goto err;
> > >    3995   
> > > -> 3996         _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);  
> > 
> > Oh, the fwd in xdp xdp_redirect_map broadcast is NULL...
> > 
> > I will see how to fix it. Maybe assign the ingress interface to fwd?  
> 
> Er, sorry for the flood message. I just checked the trace point code, fwd
> in xdp trace event means to_ifindex. So we can't assign the ingress interface
> to fwd.
> 
> In xdp_redirect_map broadcast case, there is no specific to_ifindex.
> So how about just ignore it... e.g.

Yes, the code below makes sense, and I want to confirm that it solves the
crash (I tested it).  IMHO leaving ifindex=0 is okay, because it is
not a valid ifindex, meaning a consumer of the tracepoint can deduce
(together with the map type) that this must be a broadcast.
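
A minimal sketch of that deduction from the consumer side, assuming a
simplified event that carries only a map type and a to_ifindex. The enum,
struct and function names are hypothetical and do not match the kernel's
actual tracepoint layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's enum bpf_map_type values;
 * only the split between devmap types and everything else matters here.
 */
enum sketch_map_type {
	SKETCH_MAP_UNSPEC,
	SKETCH_MAP_DEVMAP,
	SKETCH_MAP_DEVMAP_HASH,
	SKETCH_MAP_CPUMAP,
};

/* Hypothetical, simplified view of the fields a consumer of the
 * xdp_redirect tracepoint would inspect.
 */
struct sketch_redirect_event {
	enum sketch_map_type map_type;
	uint32_t to_ifindex;	/* 0 is never a valid ifindex */
};

/* A devmap redirect that reports to_ifindex == 0 has no single target
 * device, so it can be read as a broadcast.
 */
static bool sketch_is_broadcast(const struct sketch_redirect_event *ev)
{
	bool is_devmap = ev->map_type == SKETCH_MAP_DEVMAP ||
			 ev->map_type == SKETCH_MAP_DEVMAP_HASH;

	return is_devmap && ev->to_ifindex == 0;
}
```

The map type check is needed because other map types also leave the
ifindex at 0 and report their target through the map index instead.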

Thank you Hangbin for continuing to work on this patchset.  I know it has
been a long, long road.  I truly appreciate your perseverance and
patience with this patchset.  With this crash fixed, I actually think we
are very close to having something we can merge.  With the unlikely()
I'm fine with the code itself.

I think we need to update the patch description, but I've asked Toke to
help with this. The performance measurements in the patch description
are not measuring what I expected, but something else.  To avoid redoing
a lot of testing, I think we can just describe what the test
'redirect_map-multi i40e->i40e' is doing: as the broadcast feature
filters out the ingress port, the 'i40e->i40e' test, which sends out the
same interface, will just drop the xdp_frame (after walking the devmap
for empty ports).  Or maybe it is not the same interface(?).  In any case
this needs to be made clearer.

I think it would be valuable to show (in the commit message) some tests
that demonstrate the overhead of packet cloning.  I expect the
overhead of page-alloc+memcpy to be significant, but Lorenzo has a
number of ideas on how to speed this up.  Maybe you can simply
broadcast-redirect into multiple veth devices (with XDP_DROP on the
peer-dev) to demonstrate the effect and overhead of the cloning
process.
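
To make the expected clone count for such a test concrete, here is a
rough userspace sketch of the "n-1 clones" walk that
dev_map_enqueue_multi() does in this patch.  It is only a model: the
struct and function names are hypothetical, a slot value of 0 stands for
an empty devmap entry, and the ndo_xdp_xmit and MTU checks are left out.

```c
#include <stddef.h>

/* Hypothetical summary of one simulated broadcast. */
struct sketch_bcast_result {
	unsigned int clones;	/* frames that would need page-alloc + memcpy */
	unsigned int sent;	/* total frames enqueued */
	int dropped;		/* 1 if no valid destination was found */
};

/* Walk a devmap-like array of ifindexes (0 = empty slot), skip
 * exclude_ifindex, and count the clones the walk would need.  The
 * original frame goes to the last valid destination; every earlier
 * valid destination gets a clone.
 */
static struct sketch_bcast_result
sketch_broadcast(const int *slots, size_t n_slots, int exclude_ifindex)
{
	struct sketch_bcast_result res = { 0, 0, 0 };
	int have_last = 0;
	size_t i;

	for (i = 0; i < n_slots; i++) {
		if (slots[i] == 0 || slots[i] == exclude_ifindex)
			continue;

		if (have_last) {
			/* the previously remembered destination gets a clone */
			res.clones++;
			res.sent++;
		}
		have_last = 1;
	}

	if (have_last)
		res.sent++;		/* original frame goes to the last one */
	else
		res.dropped = 1;	/* nothing to send to, frame is freed */

	return res;
}
```

With a single slot holding the ingress ifindex and ingress excluded, the
walk finds no destination and the frame is dropped without any clone,
which is the 'i40e->i40e' situation described above.  With three veth
slots next to the ingress port it reports two clones and three frames
sent.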


> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index fcad3645a70b..1751da079330 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -110,7 +110,8 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
>                 u32 ifindex = 0, map_index = index;
> 
>                 if (map_type == BPF_MAP_TYPE_DEVMAP || map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
> -                       ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
> +                       if (tgt)
> +                               ifindex = ((struct _bpf_dtab_netdev *)tgt)->dev->ifindex;
>                 } else if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
>                         ifindex = index;
>                         map_index = 0;
> 
> 
> Hangbin
>
Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f8a45f109e96..4243284fff8b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1497,8 +1497,13 @@  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress);
 bool dev_map_can_have_prog(struct bpf_map *map);
 
 void __cpu_map_flush(void);
@@ -1666,6 +1671,13 @@  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
@@ -1675,6 +1687,14 @@  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
 	return 0;
 }
 
+static inline
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress)
+{
+	return 0;
+}
+
 static inline void __cpu_map_flush(void)
 {
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a09547bc7ba..e4885b42d754 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -646,6 +646,7 @@  struct bpf_redirect_info {
 	u32 flags;
 	u32 tgt_index;
 	void *tgt_value;
+	struct bpf_map *map;
 	u32 map_id;
 	enum bpf_map_type map_type;
 	u32 kern_flags;
@@ -1464,17 +1465,18 @@  static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
 }
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, u64 flags,
+static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
+						  u64 flags, u64 flag_mask,
 						  void *lookup_elem(struct bpf_map *map, u32 key))
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	/* Lower bits of the flags are used as return code on lookup failure */
-	if (unlikely(flags > XDP_TX))
+	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
 		return XDP_ABORTED;
 
 	ri->tgt_value = lookup_elem(map, ifindex);
-	if (unlikely(!ri->tgt_value)) {
+	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
 		/* If the lookup fails we want to clear out the state in the
 		 * redirect_info struct completely, so that if an eBPF program
 		 * performs multiple lookups, the last one always takes
@@ -1482,13 +1484,21 @@  static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 		 */
 		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
 		ri->map_type = BPF_MAP_TYPE_UNSPEC;
-		return flags;
+		return flags & BPF_F_ACTION_MASK;
 	}
 
 	ri->tgt_index = ifindex;
 	ri->map_id = map->id;
 	ri->map_type = map->map_type;
 
+	if (flags & BPF_F_BROADCAST) {
+		WRITE_ONCE(ri->map, map);
+		ri->flags = flags;
+	} else {
+		WRITE_ONCE(ri->map, NULL);
+		ri->flags = 0;
+	}
+
 	return XDP_REDIRECT;
 }
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index a5bc214a49d9..5533f0ab2afc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -170,6 +170,7 @@  struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					 struct net_device *dev);
 int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ec6d85a81744..c6fe0526811b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2534,8 +2534,12 @@  union bpf_attr {
  * 		The lower two bits of *flags* are used as the return code if
  * 		the map lookup fails. This is so that the return value can be
  * 		one of the XDP program return codes up to **XDP_TX**, as chosen
- * 		by the caller. Any higher bits in the *flags* argument must be
- * 		unset.
+ * 		by the caller. The higher bits of *flags* can be set to
+ * 		BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below.
+ *
+ * 		With BPF_F_BROADCAST the packet will be broadcasted to all the
+ * 		interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress
+ * 		interface will be excluded when do broadcasting.
  *
  * 		See also **bpf_redirect**\ (), which only supports redirecting
  * 		to an ifindex, but doesn't require a map to do so.
@@ -5080,6 +5084,15 @@  enum {
 	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
 };
 
+/* Flags for bpf_redirect_map helper */
+enum {
+	BPF_F_BROADCAST		= (1ULL << 3),
+	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
+};
+
+#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
+#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 0cf2791d5099..2c33a7a09783 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -601,7 +601,8 @@  static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __cpu_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+				      __cpu_map_lookup_elem);
 }
 
 static int cpu_map_btf_id;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3980fb3bfb09..2eebb9a4b093 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -198,6 +198,7 @@  static void dev_map_free(struct bpf_map *map)
 	list_del_rcu(&dtab->list);
 	spin_unlock(&dev_map_lock);
 
+	bpf_clear_redirect_map(map);
 	synchronize_rcu();
 
 	/* Make sure prior __dev_map_entry_free() have completed. */
@@ -515,6 +516,99 @@  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);
 }
 
+static bool is_valid_dst(struct bpf_dtab_netdev *obj, struct xdp_buff *xdp,
+			 int exclude_ifindex)
+{
+	if (!obj || obj->dev->ifindex == exclude_ifindex ||
+	    !obj->dev->netdev_ops->ndo_xdp_xmit)
+		return false;
+
+	if (xdp_ok_fwd_dev(obj->dev, xdp->data_end - xdp->data))
+		return false;
+
+	return true;
+}
+
+static int dev_map_enqueue_clone(struct bpf_dtab_netdev *obj,
+				 struct net_device *dev_rx,
+				 struct xdp_frame *xdpf)
+{
+	struct xdp_frame *nxdpf;
+
+	nxdpf = xdpf_clone(xdpf);
+	if (!nxdpf)
+		return -ENOMEM;
+
+	bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog);
+
+	return 0;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
+	struct bpf_dtab_netdev *dst, *last_dst = NULL;
+	struct hlist_head *head;
+	struct xdp_frame *xdpf;
+	unsigned int i;
+	int err;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+		for (i = 0; i < map->max_entries; i++) {
+			dst = READ_ONCE(dtab->netdev_map[i]);
+			if (!is_valid_dst(dst, xdp, exclude_ifindex))
+				continue;
+
+			/* we only need n-1 clones; last_dst enqueued below */
+			if (!last_dst) {
+				last_dst = dst;
+				continue;
+			}
+
+			err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+			if (err)
+				return err;
+
+			last_dst = dst;
+		}
+	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
+		for (i = 0; i < dtab->n_buckets; i++) {
+			head = dev_map_index_hash(dtab, i);
+			hlist_for_each_entry_rcu(dst, head, index_hlist,
+						 lockdep_is_held(&dtab->index_lock)) {
+				if (!is_valid_dst(dst, xdp, exclude_ifindex))
+					continue;
+
+				/* we only need n-1 clones; last_dst enqueued below */
+				if (!last_dst) {
+					last_dst = dst;
+					continue;
+				}
+
+				err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+				if (err)
+					return err;
+
+				last_dst = dst;
+			}
+		}
+	}
+
+	/* consume the last copy of the frame */
+	if (last_dst)
+		bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog);
+	else
+		xdp_return_frame_rx_napi(xdpf); /* dtab is empty */
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
@@ -529,6 +623,87 @@  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 	return 0;
 }
 
+static int dev_map_redirect_clone(struct bpf_dtab_netdev *dst,
+				  struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog)
+{
+	struct sk_buff *nskb;
+	int err;
+
+	nskb = skb_clone(skb, GFP_ATOMIC);
+	if (!nskb)
+		return -ENOMEM;
+
+	err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+	if (unlikely(err)) {
+		consume_skb(nskb);
+		return err;
+	}
+
+	return 0;
+}
+
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int exclude_ifindex = exclude_ingress ? dev->ifindex : 0;
+	struct bpf_dtab_netdev *dst, *last_dst = NULL;
+	struct hlist_head *head;
+	struct hlist_node *next;
+	unsigned int i;
+	int err;
+
+	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+		for (i = 0; i < map->max_entries; i++) {
+			dst = READ_ONCE(dtab->netdev_map[i]);
+			if (!dst || dst->dev->ifindex == exclude_ifindex)
+				continue;
+
+			/* we only need n-1 clones; last_dst enqueued below */
+			if (!last_dst) {
+				last_dst = dst;
+				continue;
+			}
+
+			err = dev_map_redirect_clone(last_dst, skb, xdp_prog);
+			if (err)
+				return err;
+
+			last_dst = dst;
+		}
+	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
+		for (i = 0; i < dtab->n_buckets; i++) {
+			head = dev_map_index_hash(dtab, i);
+			hlist_for_each_entry_safe(dst, next, head, index_hlist) {
+				if (!dst || dst->dev->ifindex == exclude_ifindex)
+					continue;
+
+				/* we only need n-1 clones; last_dst enqueued below */
+				if (!last_dst) {
+					last_dst = dst;
+					continue;
+				}
+
+				err = dev_map_redirect_clone(last_dst, skb, xdp_prog);
+				if (err)
+					return err;
+
+				last_dst = dst;
+			}
+		}
+	}
+
+	/* consume the first skb and return */
+	if (last_dst)
+		return dev_map_generic_redirect(last_dst, skb, xdp_prog);
+
+	/* dtab is empty */
+	consume_skb(skb);
+	return 0;
+}
+
 static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
@@ -755,12 +930,14 @@  static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value,
 
 static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK,
+				      __dev_map_lookup_elem);
 }
 
 static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_hash_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK,
+				      __dev_map_hash_lookup_elem);
 }
 
 static int dev_map_btf_id;
diff --git a/net/core/filter.c b/net/core/filter.c
index cae56d08a670..05ba5ab4345f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3926,6 +3926,23 @@  void xdp_do_flush(void)
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush);
 
+void bpf_clear_redirect_map(struct bpf_map *map)
+{
+	struct bpf_redirect_info *ri;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
+		/* Avoid polluting remote cacheline due to writes if
+		 * not needed. Once we pass this test, we need the
+		 * cmpxchg() to make sure it hasn't been changed in
+		 * the meantime by remote CPU.
+		 */
+		if (unlikely(READ_ONCE(ri->map) == map))
+			cmpxchg(&ri->map, map, NULL);
+	}
+}
+
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
@@ -3933,6 +3950,7 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
+	struct bpf_map *map;
 	int err;
 
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
@@ -3942,7 +3960,14 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		err = dev_map_enqueue(fwd, xdp, dev);
+		map = READ_ONCE(ri->map);
+		if (map) {
+			WRITE_ONCE(ri->map, NULL);
+			err = dev_map_enqueue_multi(xdp, dev, map,
+						    ri->flags & BPF_F_EXCLUDE_INGRESS);
+		} else {
+			err = dev_map_enqueue(fwd, xdp, dev);
+		}
 		break;
 	case BPF_MAP_TYPE_CPUMAP:
 		err = cpu_map_enqueue(fwd, xdp, dev);
@@ -3984,13 +4009,21 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *map;
 	int err;
 
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		err = dev_map_generic_redirect(fwd, skb, xdp_prog);
+		map = READ_ONCE(ri->map);
+		if (map) {
+			WRITE_ONCE(ri->map, NULL);
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ri->flags & BPF_F_EXCLUDE_INGRESS);
+		} else {
+			err = dev_map_generic_redirect(fwd, skb, xdp_prog);
+		}
 		if (unlikely(err))
 			goto err;
 		break;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 05354976c1fc..aba84d04642b 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -583,3 +583,32 @@  struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	return __xdp_build_skb_from_frame(xdpf, skb, dev);
 }
 EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+	nxdpf->frame_sz = PAGE_SIZE;
+	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+	nxdpf->mem.id = 0;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 67b4ce504852..9df75ea4a567 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -226,7 +226,8 @@  static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 
 static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __xsk_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+				      __xsk_map_lookup_elem);
 }
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ec6d85a81744..c6fe0526811b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2534,8 +2534,12 @@  union bpf_attr {
  * 		The lower two bits of *flags* are used as the return code if
  * 		the map lookup fails. This is so that the return value can be
  * 		one of the XDP program return codes up to **XDP_TX**, as chosen
- * 		by the caller. Any higher bits in the *flags* argument must be
- * 		unset.
+ * 		by the caller. The higher bits of *flags* can be set to
+ * 		BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below.
+ *
+ * 		With BPF_F_BROADCAST the packet will be broadcasted to all the
+ * 		interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress
+ * 		interface will be excluded when do broadcasting.
  *
  * 		See also **bpf_redirect**\ (), which only supports redirecting
  * 		to an ifindex, but doesn't require a map to do so.
@@ -5080,6 +5084,15 @@  enum {
 	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
 };
 
+/* Flags for bpf_redirect_map helper */
+enum {
+	BPF_F_BROADCAST		= (1ULL << 3),
+	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
+};
+
+#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
+#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
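
As a reading aid for the flag handling in the patch above, the check in
__bpf_xdp_redirect_map() can be modelled in userspace roughly as follows.
This is a sketch, not kernel code: the SKETCH_ names are hypothetical
copies of the XDP action values and of the flag bits the patch defines,
and the two helper functions exist only for illustration.

```c
#include <stdint.h>

/* XDP action codes; XDP_REDIRECT is 4, so flag bits start at bit 3. */
#define SKETCH_XDP_ABORTED	0
#define SKETCH_XDP_DROP		1
#define SKETCH_XDP_PASS		2
#define SKETCH_XDP_TX		3

/* Flag bits as defined by the patch. */
#define SKETCH_F_BROADCAST		(1ULL << 3)
#define SKETCH_F_EXCLUDE_INGRESS	(1ULL << 4)

#define SKETCH_F_ACTION_MASK \
	(SKETCH_XDP_ABORTED | SKETCH_XDP_DROP | SKETCH_XDP_PASS | SKETCH_XDP_TX)
#define SKETCH_F_REDIR_MASK \
	(SKETCH_F_BROADCAST | SKETCH_F_EXCLUDE_INGRESS)

/* Returns 1 if flags contain only action bits plus bits allowed by
 * flag_mask; 0 corresponds to the helper returning XDP_ABORTED.
 */
static int sketch_flags_ok(uint64_t flags, uint64_t flag_mask)
{
	return !(flags & ~(SKETCH_F_ACTION_MASK | flag_mask));
}

/* Action returned when the map lookup fails and broadcast is not set:
 * only the low action bits survive.
 */
static uint64_t sketch_fallback_action(uint64_t flags)
{
	return flags & SKETCH_F_ACTION_MASK;
}
```

In this model the devmap callers pass SKETCH_F_REDIR_MASK as flag_mask,
so both broadcast flags are accepted, while cpumap and xskmap pass 0 and
reject them.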