Message ID | 20210422071454.2023282-3-liuhangbin@gmail.com (mailing list archive)
---|---
State | Superseded |
Delegated to: | BPF |
Series | xdp: extend xdp_redirect_map with broadcast support |
On Thu, 22 Apr 2021 15:14:52 +0800 Hangbin Liu <liuhangbin@gmail.com> wrote:

> diff --git a/net/core/filter.c b/net/core/filter.c
> index cae56d08a670..afec192c3b21 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
[...]
> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>                     struct bpf_prog *xdp_prog)
> {
> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>         enum bpf_map_type map_type = ri->map_type;
>         void *fwd = ri->tgt_value;
>         u32 map_id = ri->map_id;
> +       struct bpf_map *map;
>         int err;
>
>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>         case BPF_MAP_TYPE_DEVMAP:
>                 fallthrough;
>         case BPF_MAP_TYPE_DEVMAP_HASH:
> -               err = dev_map_enqueue(fwd, xdp, dev);
> +               map = xchg(&ri->map, NULL);

Hmm, this looks dangerous for performance to have on this fast-path.
The xchg call can be expensive, AFAIK this is an atomic operation.

> +               if (map)
> +                       err = dev_map_enqueue_multi(xdp, dev, map,
> +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> +               else
> +                       err = dev_map_enqueue(fwd, xdp, dev);
>                 break;
>         case BPF_MAP_TYPE_CPUMAP:
>                 err = cpu_map_enqueue(fwd, xdp, dev);
> @@ -3984,13 +4007,19 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>                                        enum bpf_map_type map_type, u32 map_id)
> {
>         struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +       struct bpf_map *map;
>         int err;
>
>         switch (map_type) {
>         case BPF_MAP_TYPE_DEVMAP:
>                 fallthrough;
>         case BPF_MAP_TYPE_DEVMAP_HASH:
> -               err = dev_map_generic_redirect(fwd, skb, xdp_prog);
> +               map = xchg(&ri->map, NULL);

Same here!

> +               if (map)
> +                       err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
> +                                                    ri->flags & BPF_F_EXCLUDE_INGRESS);
> +               else
> +                       err = dev_map_generic_redirect(fwd, skb, xdp_prog);
>                 if (unlikely(err))
>                         goto err;
>                 break;
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Thu, 22 Apr 2021 15:14:52 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index cae56d08a670..afec192c3b21 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
> [...]
>> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>>                     struct bpf_prog *xdp_prog)
>> {
>> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>>         enum bpf_map_type map_type = ri->map_type;
>>         void *fwd = ri->tgt_value;
>>         u32 map_id = ri->map_id;
>> +       struct bpf_map *map;
>>         int err;
>>
>>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
>> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>>         case BPF_MAP_TYPE_DEVMAP:
>>                 fallthrough;
>>         case BPF_MAP_TYPE_DEVMAP_HASH:
>> -               err = dev_map_enqueue(fwd, xdp, dev);
>> +               map = xchg(&ri->map, NULL);
>
> Hmm, this looks dangerous for performance to have on this fast-path.
> The xchg call can be expensive, AFAIK this is an atomic operation.

Ugh, you're right. That's my bad, I suggested replacing the
READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
what it's doing, but I failed to consider the performance implications
of the atomic operation. Sorry about that, Hangbin! I guess this should
be changed to:

+               map = READ_ONCE(ri->map);
+               if (map) {
+                       WRITE_ONCE(ri->map, NULL);
+                       err = dev_map_enqueue_multi(xdp, dev, map,
+                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
+               } else {
+                       err = dev_map_enqueue(fwd, xdp, dev);
+               }

(and the same for the generic-XDP path, of course)

-Toke
On Thu, 22 Apr 2021 20:02:18 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>
> > On Thu, 22 Apr 2021 15:14:52 +0800
> > Hangbin Liu <liuhangbin@gmail.com> wrote:
> >
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index cae56d08a670..afec192c3b21 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> > [...]
> >> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >>                     struct bpf_prog *xdp_prog)
> >> {
> >> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >>         enum bpf_map_type map_type = ri->map_type;
> >>         void *fwd = ri->tgt_value;
> >>         u32 map_id = ri->map_id;
> >> +       struct bpf_map *map;
> >>         int err;
> >>
> >>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >>         case BPF_MAP_TYPE_DEVMAP:
> >>                 fallthrough;
> >>         case BPF_MAP_TYPE_DEVMAP_HASH:
> >> -               err = dev_map_enqueue(fwd, xdp, dev);
> >> +               map = xchg(&ri->map, NULL);
> >
> > Hmm, this looks dangerous for performance to have on this fast-path.
> > The xchg call can be expensive, AFAIK this is an atomic operation.
>
> Ugh, you're right. That's my bad, I suggested replacing the
> READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> what it's doing, but I failed to consider the performance implications
> of the atomic operation. Sorry about that, Hangbin! I guess this should
> be changed to:
>
> +               map = READ_ONCE(ri->map);
> +               if (map) {
> +                       WRITE_ONCE(ri->map, NULL);
> +                       err = dev_map_enqueue_multi(xdp, dev, map,
> +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> +               } else {
> +                       err = dev_map_enqueue(fwd, xdp, dev);
> +               }

This is highly sensitive fast-path code, as you saw Bjørn have been
hunting nanosec in this area. The above code implicitly have "map" as
the likely option, which I don't think it is.

> (and the same for the generic-XDP path, of course)
>
> -Toke
>
On Fri, Apr 23, 2021 at 06:54:29PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 22 Apr 2021 20:02:18 +0200
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >
> > > On Thu, 22 Apr 2021 15:14:52 +0800
> > > Hangbin Liu <liuhangbin@gmail.com> wrote:
> > >
> > >> diff --git a/net/core/filter.c b/net/core/filter.c
> > >> index cae56d08a670..afec192c3b21 100644
> > >> --- a/net/core/filter.c
> > >> +++ b/net/core/filter.c
> > > [...]
> > >> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >>                     struct bpf_prog *xdp_prog)
> > >> {
> > >> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >>         enum bpf_map_type map_type = ri->map_type;
> > >>         void *fwd = ri->tgt_value;
> > >>         u32 map_id = ri->map_id;
> > >> +       struct bpf_map *map;
> > >>         int err;
> > >>
> > >>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > >>         case BPF_MAP_TYPE_DEVMAP:
> > >>                 fallthrough;
> > >>         case BPF_MAP_TYPE_DEVMAP_HASH:
> > >> -               err = dev_map_enqueue(fwd, xdp, dev);
> > >> +               map = xchg(&ri->map, NULL);
> > >
> > > Hmm, this looks dangerous for performance to have on this fast-path.
> > > The xchg call can be expensive, AFAIK this is an atomic operation.
> >
> > Ugh, you're right. That's my bad, I suggested replacing the
> > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> > what it's doing, but I failed to consider the performance implications
> > of the atomic operation. Sorry about that, Hangbin! I guess this should
> > be changed to:
> >
> > +               map = READ_ONCE(ri->map);
> > +               if (map) {
> > +                       WRITE_ONCE(ri->map, NULL);
> > +                       err = dev_map_enqueue_multi(xdp, dev, map,
> > +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> > +               } else {
> > +                       err = dev_map_enqueue(fwd, xdp, dev);
> > +               }
>
> This is highly sensitive fast-path code, as you saw Bjørn have been
> hunting nanosec in this area. The above code implicitly have "map" as
> the likely option, which I don't think it is.

Hi Jesper,

From the performance data, there is only a slightly impact. Do we still need
to block the whole patch on this? Or if you have a better solution?

Thanks
Hangbin
On Sat, 24 Apr 2021 09:09:25 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Fri, Apr 23, 2021 at 06:54:29PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 22 Apr 2021 20:02:18 +0200
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > >
> > > > On Thu, 22 Apr 2021 15:14:52 +0800
> > > > Hangbin Liu <liuhangbin@gmail.com> wrote:
> > > >
> > > >> diff --git a/net/core/filter.c b/net/core/filter.c
> > > >> index cae56d08a670..afec192c3b21 100644
> > > >> --- a/net/core/filter.c
> > > >> +++ b/net/core/filter.c
> > > > [...]
> > > >> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > >>                     struct bpf_prog *xdp_prog)
> > > >> {
> > > >> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > >>         enum bpf_map_type map_type = ri->map_type;
> > > >>         void *fwd = ri->tgt_value;
> > > >>         u32 map_id = ri->map_id;
> > > >> +       struct bpf_map *map;
> > > >>         int err;
> > > >>
> > > >>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
> > > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > >>         case BPF_MAP_TYPE_DEVMAP:
> > > >>                 fallthrough;
> > > >>         case BPF_MAP_TYPE_DEVMAP_HASH:
> > > >> -               err = dev_map_enqueue(fwd, xdp, dev);
> > > >> +               map = xchg(&ri->map, NULL);
> > > >
> > > > Hmm, this looks dangerous for performance to have on this fast-path.
> > > > The xchg call can be expensive, AFAIK this is an atomic operation.
> > >
> > > Ugh, you're right. That's my bad, I suggested replacing the
> > > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> > > what it's doing, but I failed to consider the performance implications
> > > of the atomic operation. Sorry about that, Hangbin! I guess this should
> > > be changed to:
> > >
> > > +               map = READ_ONCE(ri->map);
> > > +               if (map) {
> > > +                       WRITE_ONCE(ri->map, NULL);
> > > +                       err = dev_map_enqueue_multi(xdp, dev, map,
> > > +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > +               } else {
> > > +                       err = dev_map_enqueue(fwd, xdp, dev);
> > > +               }
> >
> > This is highly sensitive fast-path code, as you saw Bjørn have been
> > hunting nanosec in this area. The above code implicitly have "map" as
> > the likely option, which I don't think it is.
>
> Hi Jesper,
>
> From the performance data, there is only a slightly impact. Do we still need
> to block the whole patch on this? Or if you have a better solution?

I'm basically just asking you to add an unlikely() annotation:

  map = READ_ONCE(ri->map);
  if (unlikely(map)) {
        WRITE_ONCE(ri->map, NULL);
        err = dev_map_enqueue_multi(xdp, dev, map, [...]

For XDP, performance is the single most important factor! You say your
performance data, there is only a slightly impact, there must be ZERO
impact (when your added features is not in use).

You data:
Version          | Test                    | Generic | Native
5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M

The performance difference 9.6M -> 9.3M is a slowdown of 3.36 nanosec.
Bjørn and others have been working really hard to optimize the code and
remove down to 1.5 nanosec overheads. Thus, introducing 3.36 nanosec
added overhead to the fast-path is significant.
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Sat, 24 Apr 2021 09:09:25 +0800
> Hangbin Liu <liuhangbin@gmail.com> wrote:
>
>> On Fri, Apr 23, 2021 at 06:54:29PM +0200, Jesper Dangaard Brouer wrote:
>> > On Thu, 22 Apr 2021 20:02:18 +0200
>> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> > > Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > >
>> > > > On Thu, 22 Apr 2021 15:14:52 +0800
>> > > > Hangbin Liu <liuhangbin@gmail.com> wrote:
>> > > >
>> > > >> diff --git a/net/core/filter.c b/net/core/filter.c
>> > > >> index cae56d08a670..afec192c3b21 100644
>> > > >> --- a/net/core/filter.c
>> > > >> +++ b/net/core/filter.c
>> > > > [...]
>> > > >> int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> > > >>                     struct bpf_prog *xdp_prog)
>> > > >> {
>> > > >> @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> > > >>         enum bpf_map_type map_type = ri->map_type;
>> > > >>         void *fwd = ri->tgt_value;
>> > > >>         u32 map_id = ri->map_id;
>> > > >> +       struct bpf_map *map;
>> > > >>         int err;
>> > > >>
>> > > >>         ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
>> > > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>> > > >>         case BPF_MAP_TYPE_DEVMAP:
>> > > >>                 fallthrough;
>> > > >>         case BPF_MAP_TYPE_DEVMAP_HASH:
>> > > >> -               err = dev_map_enqueue(fwd, xdp, dev);
>> > > >> +               map = xchg(&ri->map, NULL);
>> > > >
>> > > > Hmm, this looks dangerous for performance to have on this fast-path.
>> > > > The xchg call can be expensive, AFAIK this is an atomic operation.
>> > >
>> > > Ugh, you're right. That's my bad, I suggested replacing the
>> > > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
>> > > what it's doing, but I failed to consider the performance implications
>> > > of the atomic operation. Sorry about that, Hangbin! I guess this should
>> > > be changed to:
>> > >
>> > > +               map = READ_ONCE(ri->map);
>> > > +               if (map) {
>> > > +                       WRITE_ONCE(ri->map, NULL);
>> > > +                       err = dev_map_enqueue_multi(xdp, dev, map,
>> > > +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
>> > > +               } else {
>> > > +                       err = dev_map_enqueue(fwd, xdp, dev);
>> > > +               }
>> >
>> > This is highly sensitive fast-path code, as you saw Bjørn have been
>> > hunting nanosec in this area. The above code implicitly have "map" as
>> > the likely option, which I don't think it is.
>>
>> Hi Jesper,
>>
>> From the performance data, there is only a slightly impact. Do we still need
>> to block the whole patch on this? Or if you have a better solution?
>
> I'm basically just asking you to add an unlikely() annotation:

Maybe the maintainers could add this while applying, though? Or we could
fix it in a follow-up? Hangbin has been respinning this series with very
minor changes for a while now, so I can certainly emphasise with his
reluctance to keep doing this. IMO it's way past time to merge this
already... :/

-Toke
On Sat, Apr 24, 2021 at 11:53:35AM +0200, Toke Høiland-Jørgensen wrote:
> >> Hi Jesper,
> >>
> >> From the performance data, there is only a slightly impact. Do we still need
> >> to block the whole patch on this? Or if you have a better solution?
> >
> > I'm basically just asking you to add an unlikely() annotation:
>
> Maybe the maintainers could add this while applying, though? Or we could
> fix it in a follow-up? Hangbin has been respinning this series with very
> minor changes for a while now, so I can certainly emphasise with his
> reluctance to keep doing this. IMO it's way past time to merge this
> already... :/

Not sure if Daniel will do it. If it's already missed the merge window.
I can post a new version.

Thanks
Hangbin
On Sat, Apr 24, 2021 at 09:01:29AM +0200, Jesper Dangaard Brouer wrote:
> > > > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > > >>         case BPF_MAP_TYPE_DEVMAP:
> > > > >>                 fallthrough;
> > > > >>         case BPF_MAP_TYPE_DEVMAP_HASH:
> > > > >> -               err = dev_map_enqueue(fwd, xdp, dev);
> > > > >> +               map = xchg(&ri->map, NULL);
> > > > >
> > > > > Hmm, this looks dangerous for performance to have on this fast-path.
> > > > > The xchg call can be expensive, AFAIK this is an atomic operation.
> > > >
> > > > Ugh, you're right. That's my bad, I suggested replacing the
> > > > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> > > > what it's doing, but I failed to consider the performance implications
> > > > of the atomic operation. Sorry about that, Hangbin! I guess this should
> > > > be changed to:
> > > >
> > > > +               map = READ_ONCE(ri->map);
> > > > +               if (map) {
> > > > +                       WRITE_ONCE(ri->map, NULL);
> > > > +                       err = dev_map_enqueue_multi(xdp, dev, map,
> > > > +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > > +               } else {
> > > > +                       err = dev_map_enqueue(fwd, xdp, dev);
> > > > +               }
> > >
> > > This is highly sensitive fast-path code, as you saw Bjørn have been
> > > hunting nanosec in this area. The above code implicitly have "map" as
> > > the likely option, which I don't think it is.
> >
> > Hi Jesper,
> >
> > From the performance data, there is only a slightly impact. Do we still need
> > to block the whole patch on this? Or if you have a better solution?
>
> I'm basically just asking you to add an unlikely() annotation:
>
>   map = READ_ONCE(ri->map);
>   if (unlikely(map)) {
>         WRITE_ONCE(ri->map, NULL);
>         err = dev_map_enqueue_multi(xdp, dev, map, [...]
>
> For XDP, performance is the single most important factor! You say your
> performance data, there is only a slightly impact, there must be ZERO
> impact (when your added features is not in use).
>
> You data:
> Version          | Test                    | Generic | Native
> 5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
> 5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M
>
> The performance difference 9.6M -> 9.3M is a slowdown of 3.36 nanosec.
> Bjørn and others have been working really hard to optimize the code and
> remove down to 1.5 nanosec overheads. Thus, introducing 3.36 nanosec
> added overhead to the fast-path is significant.

I re-check the performance data. The data

> Version          | Test                    | Generic | Native
> 5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
> 5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M

is done on version 5.

Today I re-did the test, on version 10, with xchg() changed to
READ_ONCE/WRITE_ONCE. Here is the new data (Generic path data was omitted
as there is no change)

Version          | Test                                | Native
5.12 rc4         | redirect_map i40e->i40e             | 9.7M
5.12 rc4         | redirect_map i40e->veth             | 11.8M

5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth             | 11.6M
5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.5M
5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.5M
5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.9M

And after add unlikely() in the check path, the new data looks like

Version          | Test                                | Native
5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth             | 11.7M
5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.4M
5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.4M
5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.8M

So with unlikely(), the redirect_map is a slightly up, while redirect_map
broadcast has a little drawback. But for the total data it looks this time
there is no much gap compared with no this patch for redirect_map.

Do you think we still need the unlikely() in check path?

Thanks
Hangbin
On Mon, 26 Apr 2021 14:01:17 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> On Sat, Apr 24, 2021 at 09:01:29AM +0200, Jesper Dangaard Brouer wrote:
> > > > > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > > > >>         case BPF_MAP_TYPE_DEVMAP:
> > > > > >>                 fallthrough;
> > > > > >>         case BPF_MAP_TYPE_DEVMAP_HASH:
> > > > > >> -               err = dev_map_enqueue(fwd, xdp, dev);
> > > > > >> +               map = xchg(&ri->map, NULL);
> > > > > >
> > > > > > Hmm, this looks dangerous for performance to have on this fast-path.
> > > > > > The xchg call can be expensive, AFAIK this is an atomic operation.
> > > > >
> > > > > Ugh, you're right. That's my bad, I suggested replacing the
> > > > > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> > > > > what it's doing, but I failed to consider the performance implications
> > > > > of the atomic operation. Sorry about that, Hangbin! I guess this should
> > > > > be changed to:
> > > > >
> > > > > +               map = READ_ONCE(ri->map);
> > > > > +               if (map) {
> > > > > +                       WRITE_ONCE(ri->map, NULL);
> > > > > +                       err = dev_map_enqueue_multi(xdp, dev, map,
> > > > > +                                                   ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > > > +               } else {
> > > > > +                       err = dev_map_enqueue(fwd, xdp, dev);
> > > > > +               }
> > > >
> > > > This is highly sensitive fast-path code, as you saw Bjørn have been
> > > > hunting nanosec in this area. The above code implicitly have "map" as
> > > > the likely option, which I don't think it is.
> > >
> > > Hi Jesper,
> > >
> > > From the performance data, there is only a slightly impact. Do we still need
> > > to block the whole patch on this? Or if you have a better solution?
> >
> > I'm basically just asking you to add an unlikely() annotation:
> >
> >   map = READ_ONCE(ri->map);
> >   if (unlikely(map)) {
> >         WRITE_ONCE(ri->map, NULL);
> >         err = dev_map_enqueue_multi(xdp, dev, map, [...]
> >
> > For XDP, performance is the single most important factor! You say your
> > performance data, there is only a slightly impact, there must be ZERO
> > impact (when your added features is not in use).
> >
> > You data:
> > Version          | Test                    | Generic | Native
> > 5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
> > 5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M
> >
> > The performance difference 9.6M -> 9.3M is a slowdown of 3.36 nanosec.
> > Bjørn and others have been working really hard to optimize the code and
> > remove down to 1.5 nanosec overheads. Thus, introducing 3.36 nanosec
> > added overhead to the fast-path is significant.
>
> I re-check the performance data. The data
>
> > Version          | Test                    | Generic | Native
> > 5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
> > 5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M
>
> is done on version 5.
>
> Today I re-did the test, on version 10, with xchg() changed to
> READ_ONCE/WRITE_ONCE. Here is the new data (Generic path data was omitted
> as there is no change)
>
> Version          | Test                                | Native
> 5.12 rc4         | redirect_map i40e->i40e             | 9.7M
> 5.12 rc4         | redirect_map i40e->veth             | 11.8M
>
> 5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M

Great to see the baseline redirect_map (i40e->i40e) have almost no
impact, only 1.07 ns ((1/9.7-1/9.6)*1000), which is what we want to
see. (It might be zero as measurements can fluctuate when diff is
below 2ns)

> 5.12 rc4 + patch | redirect_map i40e->veth             | 11.6M

What XDP program are you running on the inner veth?

> 5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.5M

I'm very surprised to see redirect_map multi being so fast (9.5M vs.
9.6M normal map-redir). I was expecting to see larger overhead, as the
code dev_map_enqueue_clone() would clone the packet in xdpf_clone() via
allocating a new page (dev_alloc_page) and then doing a memcpy().

Looking closer at this patchset, I realize that the test
'redirect_map-multi' is testing an optimization, and will never call
dev_map_enqueue_clone() + xdpf_clone().

IMHO trying to optimize 'redirect_map-multi' to be just as fast as base
'redirect_map' doesn't make much sense. If the 'broadcast' call only
send a single packet, then there isn't any reason to call the 'multi'
variant.

Does the 'selftests/bpf' make sure to activate the code path that does
cloning?

> 5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.5M
> 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.9M
>
> And after add unlikely() in the check path, the new data looks like
>
> Version          | Test                                | Native
> 5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M
> 5.12 rc4 + patch | redirect_map i40e->veth             | 11.7M
> 5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.4M
> 5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.4M
> 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.8M
>
> So with unlikely(), the redirect_map is a slightly up, while redirect_map
> broadcast has a little drawback. But for the total data it looks this time
> there is no much gap compared with no this patch for redirect_map.
>
> Do you think we still need the unlikely() in check path?

Yes. The call to redirect_map multi is allowed (and expected) to be
slower, because when using it to broadcast packets we expect that
dev_map_enqueue_clone() + xdpf_clone() will get activated, which will
be the dominating overhead.
On Mon, Apr 26, 2021 at 11:23:08AM +0200, Jesper Dangaard Brouer wrote:
> > I re-check the performance data. The data
> > > Version          | Test                    | Generic | Native
> > > 5.12 rc4         | redirect_map i40e->i40e | 1.9M    | 9.6M
> > > 5.12 rc4 + patch | redirect_map i40e->i40e | 1.9M    | 9.3M
> >
> > is done on version 5.
> >
> > Today I re-did the test, on version 10, with xchg() changed to
> > READ_ONCE/WRITE_ONCE. Here is the new data (Generic path data was omitted
> > as there is no change)
> >
> > Version          | Test                                | Native
> > 5.12 rc4         | redirect_map i40e->i40e             | 9.7M
> > 5.12 rc4         | redirect_map i40e->veth             | 11.8M
> >
> > 5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M
>
> Great to see the baseline redirect_map (i40e->i40e) have almost no
> impact, only 1.07 ns ((1/9.7-1/9.6)*1000), which is what we want to
> see. (It might be zero as measurements can fluctuate when diff is
> below 2ns)
>
> > 5.12 rc4 + patch | redirect_map i40e->veth             | 11.6M
>
> What XDP program are you running on the inner veth?

XDP_DROP

> > 5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.5M
>
> I'm very surprised to see redirect_map multi being so fast (9.5M vs.
> 9.6M normal map-redir). I was expecting to see larger overhead, as the

Yes, with only hash map size 4, one port to one port redirect. The impact
are mainly on looping the map. (This info will be updated to new patch
set description)

> code dev_map_enqueue_clone() would clone the packet in xdpf_clone() via
> allocating a new page (dev_alloc_page) and then doing a memcpy().
>
> Looking closer at this patchset, I realize that the test
> 'redirect_map-multi' is testing an optimization, and will never call
> dev_map_enqueue_clone() + xdpf_clone(). IMHO trying to optimize
> 'redirect_map-multi' to be just as fast as base 'redirect_map' doesn't
> make much sense. If the 'broadcast' call only send a single packet,
> then there isn't any reason to call the 'multi' variant.

Yes, that's why there are also i40e->mlx4+veth test.

> Does the 'selftests/bpf' make sure to activate the code path that does
> cloning?

Yes, selftest will redirect packets to 2 or 3 other interfaces.

> > 5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.5M
> > 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.9M
> >
> > And after add unlikely() in the check path, the new data looks like
> >
> > Version          | Test                                | Native
> > 5.12 rc4 + patch | redirect_map i40e->i40e             | 9.6M
> > 5.12 rc4 + patch | redirect_map i40e->veth             | 11.7M
> > 5.12 rc4 + patch | redirect_map multi i40e->i40e       | 9.4M
> > 5.12 rc4 + patch | redirect_map multi i40e->veth       | 11.4M
> > 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  | 3.8M
> >
> > So with unlikely(), the redirect_map is a slightly up, while redirect_map
> > broadcast has a little drawback. But for the total data it looks this time
> > there is no much gap compared with no this patch for redirect_map.
> >
> > Do you think we still need the unlikely() in check path?
>
> Yes. The call to redirect_map multi is allowed (and expected) to be
> slower, because when using it to broadcast packets we expect that
> dev_map_enqueue_clone() + xdpf_clone() will get activated, which will
> be the dominating overhead.

OK, I will.

Hangbin
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f8a45f109e96..4243284fff8b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1497,8 +1497,13 @@ int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress);
 bool dev_map_can_have_prog(struct bpf_map *map);
 void __cpu_map_flush(void);
@@ -1666,6 +1671,13 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
@@ -1675,6 +1687,14 @@ static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
 	return 0;
 }
 
+static inline
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+			   struct bpf_prog *xdp_prog, struct bpf_map *map,
+			   bool exclude_ingress)
+{
+	return 0;
+}
+
 static inline void __cpu_map_flush(void)
 {
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a09547bc7ba..e4885b42d754 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -646,6 +646,7 @@ struct bpf_redirect_info {
 	u32 flags;
 	u32 tgt_index;
 	void *tgt_value;
+	struct bpf_map *map;
 	u32 map_id;
 	enum bpf_map_type map_type;
 	u32 kern_flags;
@@ -1464,17 +1465,18 @@ static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
 }
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, u64 flags,
+static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
+						  u64 flags, u64 flag_mask,
 						  void *lookup_elem(struct bpf_map *map, u32 key))
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	/* Lower bits of the flags are used as return code on lookup failure */
-	if (unlikely(flags > XDP_TX))
+	if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
 		return XDP_ABORTED;
 
 	ri->tgt_value = lookup_elem(map, ifindex);
-	if (unlikely(!ri->tgt_value)) {
+	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
 		/* If the lookup fails we want to clear out the state in the
 		 * redirect_info struct completely, so that if an eBPF program
 		 * performs multiple lookups, the last one always takes
@@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 		 */
 		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
 		ri->map_type = BPF_MAP_TYPE_UNSPEC;
-		return flags;
+		return flags & BPF_F_ACTION_MASK;
 	}
 
 	ri->tgt_index = ifindex;
 	ri->map_id = map->id;
 	ri->map_type = map->map_type;
 
+	if (flags & BPF_F_BROADCAST) {
+		WRITE_ONCE(ri->map, map);
+		ri->flags = flags;
+	} else {
+		WRITE_ONCE(ri->map, NULL);
+		ri->flags = 0;
+	}
+
 	return XDP_REDIRECT;
 }
diff --git a/include/net/xdp.h b/include/net/xdp.h
index a5bc214a49d9..5533f0ab2afc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -170,6 +170,7 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					 struct net_device *dev);
 int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ec6d85a81744..c6fe0526811b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2534,8 +2534,12 @@ union bpf_attr {
 *		The lower two bits of *flags* are used as the return code if
 *		the map lookup fails. This is so that the return value can be
 *		one of the XDP program return codes up to **XDP_TX**, as chosen
-*		by the caller. Any higher bits in the *flags* argument must be
-*		unset.
+*		by the caller. The higher bits of *flags* can be set to
+*		BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below.
+*
+*		With BPF_F_BROADCAST the packet will be broadcasted to all the
+*		interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress
+*		interface will be excluded when do broadcasting.
 *
 *		See also **bpf_redirect**\ (), which only supports redirecting
 *		to an ifindex, but doesn't require a map to do so.
@@ -5080,6 +5084,15 @@ enum {
 	BPF_F_BPRM_SECUREEXEC = (1ULL << 0),
 };
 
+/* Flags for bpf_redirect_map helper */
+enum {
+	BPF_F_BROADCAST = (1ULL << 3),
+	BPF_F_EXCLUDE_INGRESS = (1ULL << 4),
+};
+
+#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
+#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
+
 #define __bpf_md_ptr(type, name) \
 union { \
 	type name; \
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 0cf2791d5099..2c33a7a09783 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -601,7 +601,8 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, __cpu_map_lookup_elem);
+	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+				      __cpu_map_lookup_elem);
 }
 
 static int cpu_map_btf_id;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3980fb3bfb09..2eebb9a4b093 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -198,6 +198,7 @@ static void dev_map_free(struct bpf_map *map)
 	list_del_rcu(&dtab->list);
 	spin_unlock(&dev_map_lock);
 
+	bpf_clear_redirect_map(map);
 	synchronize_rcu();
 
 	/* Make sure prior __dev_map_entry_free() have completed. */
@@ -515,6 +516,99 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);
 }
 
+static bool is_valid_dst(struct bpf_dtab_netdev *obj, struct xdp_buff *xdp,
+			 int exclude_ifindex)
+{
+	if (!obj || obj->dev->ifindex == exclude_ifindex ||
+	    !obj->dev->netdev_ops->ndo_xdp_xmit)
+		return false;
+
+	if (xdp_ok_fwd_dev(obj->dev, xdp->data_end - xdp->data))
+		return false;
+
+	return true;
+}
+
+static int dev_map_enqueue_clone(struct bpf_dtab_netdev *obj,
+				 struct net_device *dev_rx,
+				 struct xdp_frame *xdpf)
+{
+	struct xdp_frame *nxdpf;
+
+	nxdpf = xdpf_clone(xdpf);
+	if (!nxdpf)
+		return -ENOMEM;
+
+	bq_enqueue(obj->dev, nxdpf, dev_rx, obj->xdp_prog);
+
+	return 0;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, bool exclude_ingress)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
+	struct bpf_dtab_netdev *dst, *last_dst = NULL;
+	struct hlist_head *head;
+	struct xdp_frame *xdpf;
+	unsigned int i;
+	int err;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+		for (i = 0; i < map->max_entries; i++) {
+			dst = READ_ONCE(dtab->netdev_map[i]);
+			if (!is_valid_dst(dst, xdp, exclude_ifindex))
+				continue;
+
+			/* we only need n-1 clones; last_dst enqueued below */
+			if (!last_dst) {
+				last_dst = dst;
+				continue;
+			}
+
+			err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+			if (err)
+				return err;
+
+			last_dst = dst;
+		}
+	} else { /* BPF_MAP_TYPE_DEVMAP_HASH */
+		for (i = 0; i < dtab->n_buckets; i++) {
+			head = dev_map_index_hash(dtab, i);
+			hlist_for_each_entry_rcu(dst, head, index_hlist,
+						 lockdep_is_held(&dtab->index_lock)) {
+				if (!is_valid_dst(dst, xdp, exclude_ifindex))
+					continue;
+
+				/* we only need n-1 clones; last_dst enqueued below */
+				if (!last_dst) {
+					last_dst = dst;
+					continue;
+				}
+
+				err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
+				if (err)
+					return err;
+
+				last_dst = dst;
+			}
+		}
+	}
+
+	/* consume the last copy of the frame */
+	if (last_dst)
+		bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog);
+	else
+		xdp_return_frame_rx_napi(xdpf); /* dtab is empty */
+
+	return 0;
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
@@ -529,6 +623,87 @@ int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 	return 0;
 }
 
+static int dev_map_redirect_clone(struct bpf_dtab_netdev *dst,
+				  struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog)
+{
+	struct sk_buff *nskb;
+	int err;
+
+	nskb = skb_clone(skb, GFP_ATOMIC);
+	if (!nskb)
+		return -ENOMEM;
+
+	err = dev_map_generic_redirect(dst, nskb, xdp_prog);
+	if (unlikely(err)) {
+		consume_skb(nskb);
+		return err;
+	}
+
+	return 0;
+}
+
+int dev_map_redirect_multi(struct net_device *dev, struct sk_buff
*skb, + struct bpf_prog *xdp_prog, struct bpf_map *map, + bool exclude_ingress) +{ + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); + int exclude_ifindex = exclude_ingress ? dev->ifindex : 0; + struct bpf_dtab_netdev *dst, *last_dst = NULL; + struct hlist_head *head; + struct hlist_node *next; + unsigned int i; + int err; + + if (map->map_type == BPF_MAP_TYPE_DEVMAP) { + for (i = 0; i < map->max_entries; i++) { + dst = READ_ONCE(dtab->netdev_map[i]); + if (!dst || dst->dev->ifindex == exclude_ifindex) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_redirect_clone(last_dst, skb, xdp_prog); + if (err) + return err; + + last_dst = dst; + } + } else { /* BPF_MAP_TYPE_DEVMAP_HASH */ + for (i = 0; i < dtab->n_buckets; i++) { + head = dev_map_index_hash(dtab, i); + hlist_for_each_entry_safe(dst, next, head, index_hlist) { + if (!dst || dst->dev->ifindex == exclude_ifindex) + continue; + + /* we only need n-1 clones; last_dst enqueued below */ + if (!last_dst) { + last_dst = dst; + continue; + } + + err = dev_map_redirect_clone(last_dst, skb, xdp_prog); + if (err) + return err; + + last_dst = dst; + } + } + } + + /* consume the first skb and return */ + if (last_dst) + return dev_map_generic_redirect(last_dst, skb, xdp_prog); + + /* dtab is empty */ + consume_skb(skb); + return 0; +} + static void *dev_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); @@ -755,12 +930,14 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK, + __dev_map_lookup_elem); } static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return 
__bpf_xdp_redirect_map(map, ifindex, flags, __dev_map_hash_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_REDIR_MASK, + __dev_map_hash_lookup_elem); } static int dev_map_btf_id; diff --git a/net/core/filter.c b/net/core/filter.c index cae56d08a670..afec192c3b21 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3926,6 +3926,23 @@ void xdp_do_flush(void) } EXPORT_SYMBOL_GPL(xdp_do_flush); +void bpf_clear_redirect_map(struct bpf_map *map) +{ + struct bpf_redirect_info *ri; + int cpu; + + for_each_possible_cpu(cpu) { + ri = per_cpu_ptr(&bpf_redirect_info, cpu); + /* Avoid polluting remote cacheline due to writes if + * not needed. Once we pass this test, we need the + * cmpxchg() to make sure it hasn't been changed in + * the meantime by remote CPU. + */ + if (unlikely(READ_ONCE(ri->map) == map)) + cmpxchg(&ri->map, map, NULL); + } +} + int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { @@ -3933,6 +3950,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, enum bpf_map_type map_type = ri->map_type; void *fwd = ri->tgt_value; u32 map_id = ri->map_id; + struct bpf_map *map; int err; ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */ @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, case BPF_MAP_TYPE_DEVMAP: fallthrough; case BPF_MAP_TYPE_DEVMAP_HASH: - err = dev_map_enqueue(fwd, xdp, dev); + map = xchg(&ri->map, NULL); + if (map) + err = dev_map_enqueue_multi(xdp, dev, map, + ri->flags & BPF_F_EXCLUDE_INGRESS); + else + err = dev_map_enqueue(fwd, xdp, dev); break; case BPF_MAP_TYPE_CPUMAP: err = cpu_map_enqueue(fwd, xdp, dev); @@ -3984,13 +4007,19 @@ static int xdp_do_generic_redirect_map(struct net_device *dev, enum bpf_map_type map_type, u32 map_id) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_map *map; int err; switch (map_type) { case BPF_MAP_TYPE_DEVMAP: fallthrough; case 
BPF_MAP_TYPE_DEVMAP_HASH: - err = dev_map_generic_redirect(fwd, skb, xdp_prog); + map = xchg(&ri->map, NULL); + if (map) + err = dev_map_redirect_multi(dev, skb, xdp_prog, map, + ri->flags & BPF_F_EXCLUDE_INGRESS); + else + err = dev_map_generic_redirect(fwd, skb, xdp_prog); if (unlikely(err)) goto err; break; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..aba84d04642b 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -583,3 +583,32 @@ struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf, return __xdp_build_skb_from_frame(xdpf, skb, dev); } EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame); + +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf) +{ + unsigned int headroom, totalsize; + struct xdp_frame *nxdpf; + struct page *page; + void *addr; + + headroom = xdpf->headroom + sizeof(*xdpf); + totalsize = headroom + xdpf->len; + + if (unlikely(totalsize > PAGE_SIZE)) + return NULL; + page = dev_alloc_page(); + if (!page) + return NULL; + addr = page_to_virt(page); + + memcpy(addr, xdpf, totalsize); + + nxdpf = addr; + nxdpf->data = addr + headroom; + nxdpf->frame_sz = PAGE_SIZE; + nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0; + nxdpf->mem.id = 0; + + return nxdpf; +} +EXPORT_SYMBOL_GPL(xdpf_clone); diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index 67b4ce504852..9df75ea4a567 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -226,7 +226,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key) static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, __xsk_map_lookup_elem); + return __bpf_xdp_redirect_map(map, ifindex, flags, 0, + __xsk_map_lookup_elem); } void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index ec6d85a81744..c6fe0526811b 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2534,8 +2534,12 @@ union bpf_attr { * The 
lower two bits of *flags* are used as the return code if * the map lookup fails. This is so that the return value can be * one of the XDP program return codes up to **XDP_TX**, as chosen - * by the caller. Any higher bits in the *flags* argument must be - * unset. + * by the caller. The higher bits of *flags* can be set to + * BPF_F_BROADCAST or BPF_F_EXCLUDE_INGRESS as defined below. + * + * With BPF_F_BROADCAST the packet will be broadcasted to all the + * interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress + * interface will be excluded when do broadcasting. * * See also **bpf_redirect**\ (), which only supports redirecting * to an ifindex, but doesn't require a map to do so. @@ -5080,6 +5084,15 @@ enum { BPF_F_BPRM_SECUREEXEC = (1ULL << 0), }; +/* Flags for bpf_redirect_map helper */ +enum { + BPF_F_BROADCAST = (1ULL << 3), + BPF_F_EXCLUDE_INGRESS = (1ULL << 4), +}; + +#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX) +#define BPF_F_REDIR_MASK (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS) + #define __bpf_md_ptr(type, name) \ union { \ type name; \
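For readers following the flag rework: the new check in __bpf_xdp_redirect_map() accepts the low action bits everywhere, but accepts the broadcast bits only for map types that pass BPF_F_REDIR_MASK as flag_mask (devmap/devmap_hash), while cpumap and xskmap pass 0. A standalone user-space sketch of that check, with the constants mirrored from the patch (check_flags is a hypothetical name, not a kernel function):

```c
#include <stdio.h>

/* Mirrored from the patch / XDP UAPI: actions occupy the low bits,
 * the new redirect flags start at bit 3. */
enum xdp_action { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };
#define BPF_F_BROADCAST       (1ULL << 3)
#define BPF_F_EXCLUDE_INGRESS (1ULL << 4)
#define BPF_F_ACTION_MASK (XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX)
#define BPF_F_REDIR_MASK  (BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)

/* Model of the validity check in __bpf_xdp_redirect_map(): any bit
 * outside the action mask plus the per-map-type flag_mask aborts. */
static int check_flags(unsigned long long flags, unsigned long long flag_mask)
{
	if (flags & ~(BPF_F_ACTION_MASK | flag_mask))
		return XDP_ABORTED;
	return XDP_REDIRECT;
}
```

From an XDP program the broadcast path would then be requested with something like `bpf_redirect_map(&forward_map, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)`; note that with BPF_F_BROADCAST set, a failed key lookup no longer aborts, per the modified `!ri->tgt_value` condition above.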
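The "we only need n-1 clones" pattern in dev_map_enqueue_multi() / dev_map_redirect_multi() is easy to miss on first read: the loop always enqueues the *previous* valid destination with a clone, and the final destination consumes the original frame, so N destinations cost only N-1 copies. A simplified user-space model (array stands in for the devmap, counters stand in for xdpf_clone()/bq_enqueue(); all names hypothetical):

```c
#include <stdio.h>

#define MAX_PORTS 8

static int delivered[MAX_PORTS];	/* frames delivered per "port" */
static int clones_made;			/* stands in for xdpf_clone() calls */

static void enqueue(int port) { delivered[port]++; }	/* bq_enqueue() model */
static void clone_frame(void) { clones_made++; }	/* xdpf_clone() model */

/* Model of the broadcast loop: clone for every valid destination except
 * the last one, which consumes the original frame. */
static void enqueue_multi(const int *valid, int n, int exclude)
{
	int last = -1;

	for (int i = 0; i < n; i++) {
		if (!valid[i] || i == exclude)
			continue;
		if (last >= 0) {	/* we only need n-1 clones */
			clone_frame();
			enqueue(last);
		}
		last = i;
	}
	if (last >= 0)
		enqueue(last);		/* original frame, no clone */
	/* else: map empty, the original frame would simply be freed */
}
```

With three valid ports and the ingress port excluded, two frames go out but only one clone is made, which is the whole point of the ordering.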
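Separately from Jesper's concern about the xchg() on the fast path, the bpf_clear_redirect_map() pattern itself (plain read first, atomic swap only on a match) is a standard trick to avoid dirtying remote per-CPU cachelines on the common path. A user-space C11 sketch of the same idea, using stdatomic in place of the kernel's READ_ONCE()/cmpxchg() (struct and function names hypothetical):

```c
#include <stdatomic.h>
#include <stddef.h>

#define NCPU 4

struct redirect_info {
	_Atomic(void *) map;	/* models ri->map in bpf_redirect_info */
};

/* Model of bpf_clear_redirect_map(): the relaxed load avoids a cacheline
 * write when the slot doesn't hold this map; the compare-exchange closes
 * the race where another CPU changed the slot between load and clear. */
static void clear_redirect_map(struct redirect_info *ri, void *map)
{
	for (int cpu = 0; cpu < NCPU; cpu++) {
		if (atomic_load_explicit(&ri[cpu].map,
					 memory_order_relaxed) == map) {
			void *expected = map;
			atomic_compare_exchange_strong(&ri[cpu].map,
						       &expected, NULL);
		}
	}
}
```

Slots holding a different map are left untouched (and their cachelines unwritten), which is exactly the property the kernel comment in the patch calls out.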