
[bpf-next,07/10] bpf: Add helpers to query conntrack info

Message ID 20211019144655.3483197-8-maximmi@nvidia.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series New BPF helpers to accelerate synproxy

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/cover_letter success Series has a cover letter
netdev/fixes_present success Fixes tag not required for -next series
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 2 maintainers not CCed: llvm@lists.linux.dev liuhangbin@gmail.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 11790 this patch: 11790
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success No Fixes tag
netdev/checkpatch warning WARNING: line length of 124 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 11421 this patch: 11421
netdev/header_inline success No static functions without inline keyword in header files
bpf/vmtest-bpf-next success VM_Test

Commit Message

Maxim Mikityanskiy Oct. 19, 2021, 2:46 p.m. UTC
The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
connection tracking information for TCP and UDP connections based on
source and destination IP address and port. The helpers return a pointer
to struct nf_conn (if the conntrack entry was found), which needs to be
released with bpf_ct_release.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/uapi/linux/bpf.h       |  81 +++++++++++++
 kernel/bpf/verifier.c          |   9 +-
 net/core/filter.c              | 205 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  81 +++++++++++++
 4 files changed, 373 insertions(+), 3 deletions(-)
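
For context, a minimal usage sketch of the proposed API from an XDP
program could look as follows. This is an illustration only, assuming
the series is applied; header parsing and the usual vmlinux.h /
bpf_helpers.h includes are elided, and passing BPF_F_CURRENT_NETNS (-1)
as *netns* is assumed to select the caller's namespace, matching the
(s32)netns_id >= 0 check in the implementation below.

/* Hypothetical usage sketch of the proposed helpers (not part of the
 * patch). Assumes the IPv4 tuple was already parsed from the packet.
 */
SEC("xdp")
int ct_check(struct xdp_md *ctx)
{
	struct bpf_sock_tuple tuple = {};
	__u64 flags_err = 0; /* in: flags, out: error code on NULL */
	struct bpf_nf_conn *ct;

	/* ... parse headers, fill tuple.ipv4.{saddr,daddr,sport,dport} ... */

	ct = bpf_ct_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, &flags_err);
	if (!ct)
		return XDP_PASS; /* flags_err now holds EINVAL/ENOENT/... */

	/* ... act on the conntrack entry ... */
	bpf_ct_release(ct); /* mandatory: the lookup acquired a reference */
	return XDP_PASS;
}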

Comments

Kumar Kartikeya Dwivedi Oct. 20, 2021, 3:56 a.m. UTC | #1
On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
> connection tracking information for TCP and UDP connections based on
> source and destination IP address and port. The helpers return a pointer
> to struct nf_conn (if the conntrack entry was found), which needs to be
> released with bpf_ct_release.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

The last discussion on this [0] suggested that stable BPF helpers for conntrack
were not desired, hence the recent series [1] to extend kfunc support to modules
and base the conntrack work on top of it, which I'm working on now (supporting
both CT lookup and insert).

[0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
[1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com

--
Kartikeya
Florian Westphal Oct. 20, 2021, 9:28 a.m. UTC | #2
Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
> > connection tracking information for TCP and UDP connections based on
> > source and destination IP address and port. The helpers return a pointer
> > to struct nf_conn (if the conntrack entry was found), which needs to be
> > released with bpf_ct_release.
> >
> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> 
> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> were not desired, hence the recent series [1] to extend kfunc support to modules
> and base the conntrack work on top of it, which I'm working on now (supporting
> both CT lookup and insert).

This will sabotage the netfilter pipeline and the way things work more
and more 8-(

If you want to use netfilter with ebpf, please have a look at the RFC I
posted and let's work on adding a netfilter-specific program type that
can run ebpf programs directly from any of the existing netfilter hook
points.

Thanks.
Toke Høiland-Jørgensen Oct. 20, 2021, 9:46 a.m. UTC | #3
> +#if IS_BUILTIN(CONFIG_NF_CONNTRACK)

This makes the helpers all but useless on distro kernels; I don't think
this is the right way to go about it. As Kumar mentioned, he's working
on an approach using kfuncs in modules; maybe you can collaborate on
that?

-Toke
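
To make the distro-kernel point concrete, a sketch of the kconfig
semantics behind it:

/* Semantics of the kconfig test macros (include/linux/kconfig.h):
 *   CONFIG_NF_CONNTRACK=y -> IS_BUILTIN() == 1, IS_ENABLED() == 1
 *   CONFIG_NF_CONNTRACK=m -> IS_BUILTIN() == 0, IS_ENABLED() == 1
 * Distro kernels ship conntrack as =m, so this guard compiles the
 * helpers out exactly where most users would want them.
 */
#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
/* ... helper implementations ... */
#endif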
Toke Høiland-Jørgensen Oct. 20, 2021, 9:48 a.m. UTC | #4
Florian Westphal <fw@strlen.de> writes:

> Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
>> > connection tracking information for TCP and UDP connections based on
>> > source and destination IP address and port. The helpers return a pointer
>> > to struct nf_conn (if the conntrack entry was found), which needs to be
>> > released with bpf_ct_release.
>> >
>> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> 
>> The last discussion on this [0] suggested that stable BPF helpers for conntrack
>> were not desired, hence the recent series [1] to extend kfunc support to modules
>> and base the conntrack work on top of it, which I'm working on now (supporting
>> both CT lookup and insert).
>
> This will sabotage the netfilter pipeline and the way things work more
> and more 8-(

Why?

> If you want to use netfilter with ebpf, please have a look at the RFC
> I posted and let's work on adding a netfilter-specific program type
> that can run ebpf programs directly from any of the existing netfilter
> hook points.

Accelerating netfilter using BPF is a worthy goal in itself, but I also
think having the ability to lookup into conntrack from XDP is useful for
cases where someone wants to bypass the stack entirely (for accelerating
packet forwarding, say). I don't think these goals are in conflict
either, what makes you say they are?

-Toke
Florian Westphal Oct. 20, 2021, 9:58 a.m. UTC | #5
Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> 
> > Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> >> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
> >> > connection tracking information for TCP and UDP connections based on
> >> > source and destination IP address and port. The helpers return a pointer
> >> > to struct nf_conn (if the conntrack entry was found), which needs to be
> >> > released with bpf_ct_release.
> >> >
> >> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> >> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> >> 
> >> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> >> were not desired, hence the recent series [1] to extend kfunc support to modules
> >> and base the conntrack work on top of it, which I'm working on now (supporting
> >> both CT lookup and insert).
> >
> > This will sabotage the netfilter pipeline and the way things work more
> > and more 8-(
> 
> Why?

Lookups should be fine.  Insertions are the problem.

NAT hooks are expected to execute before the insertion into the
conntrack table.

If you insert before, NAT hooks won't execute, i.e.
rules that use dnat/redirect/masquerade have no effect.

> > If you want to use netfilter with ebpf, please have a look at the RFC
> > I posted and let's work on adding a netfilter-specific program type
> > that can run ebpf programs directly from any of the existing netfilter
> > hook points.
> 
> Accelerating netfilter using BPF is a worthy goal in itself, but I also
> think having the ability to lookup into conntrack from XDP is useful for
> cases where someone wants to bypass the stack entirely (for accelerating
> packet forwarding, say). I don't think these goals are in conflict
> either, what makes you say they are?

Lookup is fine; I don't see fundamental issues with XDP-based bypass.
There are flowtables that also bypass the classic forward path via the
netfilter ingress hook (the first packet needs to go via the classic
path to pass through all filter + nat rules and is offloaded to HW or
SW via the 'flow add' statement in nftables).

I don't think there is anything that stands in the way of replicating
this via XDP.
Toke Høiland-Jørgensen Oct. 20, 2021, 12:21 p.m. UTC | #6
Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> Florian Westphal <fw@strlen.de> writes:
>> 
>> > Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>> >> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> >> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
>> >> > connection tracking information for TCP and UDP connections based on
>> >> > source and destination IP address and port. The helpers return a pointer
>> >> > to struct nf_conn (if the conntrack entry was found), which needs to be
>> >> > released with bpf_ct_release.
>> >> >
>> >> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> >> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> >> 
>> >> The last discussion on this [0] suggested that stable BPF helpers for conntrack
>> >> were not desired, hence the recent series [1] to extend kfunc support to modules
>> >> and base the conntrack work on top of it, which I'm working on now (supporting
>> >> both CT lookup and insert).
>> >
>> > This will sabotage the netfilter pipeline and the way things work more
>> > and more 8-(
>> 
>> Why?
>
> Lookups should be fine.  Insertions are the problem.
>
> NAT hooks are expected to execute before the insertion into the
> conntrack table.
>
> If you insert before, NAT hooks won't execute, i.e.
> rules that use dnat/redirect/masquerade have no effect.

Well yes, if you insert the wrong state into the conntrack table, you're
going to get wrong behaviour. That's sorta expected, there are lots of
things XDP can do to disrupt the packet flow (like just dropping the
packets :)).

>> > If you want to use netfilter with ebpf, please have a look at the RFC
>> > I posted and let's work on adding a netfilter-specific program type
>> > that can run ebpf programs directly from any of the existing netfilter
>> > hook points.
>> 
>> Accelerating netfilter using BPF is a worthy goal in itself, but I also
>> think having the ability to lookup into conntrack from XDP is useful for
>> cases where someone wants to bypass the stack entirely (for accelerating
>> packet forwarding, say). I don't think these goals are in conflict
>> either, what makes you say they are?
>
> Lookup is fine; I don't see fundamental issues with XDP-based bypass.
> There are flowtables that also bypass the classic forward path via the
> netfilter ingress hook (the first packet needs to go via the classic
> path to pass through all filter + nat rules and is offloaded to HW or
> SW via the 'flow add' statement in nftables).
>
> I don't think there is anything that stands in the way of replicating
> this via XDP.

What I want to be able to do is write an XDP program that does the following:

1. Parse the packet header and determine if it's a packet type we know
   how to handle. If not, just return XDP_PASS and let the stack deal
   with corner cases.

2. If we know how to handle the packet (say, it's TCP or UDP), do a
   lookup into conntrack to figure out if there's state for it and we
   need to do things like NAT.

3. If we need to NAT, rewrite the packet based on the information we got
   back from conntrack.

4. Update the conntrack state to be consistent with the packet, and then
   redirect it out the destination interface.

I.e., in the common case the packet doesn't go through the stack at all;
but we need to make conntrack aware that we processed the packet so the
entry doesn't expire (and any state related to the flow gets updated).
Ideally we should also be able to create new state for a flow we haven't
seen before.

This requires updating of state, but I see no reason why this shouldn't
be possible?

-Toke
Florian Westphal Oct. 20, 2021, 12:44 p.m. UTC | #7
Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > Lookups should be fine.  Insertions are the problem.
> >
> > NAT hooks are expected to execute before the insertion into the
> > conntrack table.
> >
> > If you insert before, NAT hooks won't execute, i.e.
> > rules that use dnat/redirect/masquerade have no effect.
> 
> Well yes, if you insert the wrong state into the conntrack table, you're
> going to get wrong behaviour. That's sorta expected, there are lots of
> things XDP can do to disrupt the packet flow (like just dropping the
> packets :)).

Sure, but I'm not sure I understand the use case.

Insertion at the XDP layer turns off netfilter's NAT capability, so it's
incompatible with the classic forwarding path.

If that's fine, why do you need to insert into the conntrack table to
begin with?  The entire infrastructure it's designed for is disabled...

> > I don't think there is anything that stands in the way of replicating
> > this via XDP.
> 
> What I want to be able to do is write an XDP program that does the following:
> 
> 1. Parse the packet header and determine if it's a packet type we know
>    how to handle. If not, just return XDP_PASS and let the stack deal
>    with corner cases.
> 
> 2. If we know how to handle the packet (say, it's TCP or UDP), do a
>    lookup into conntrack to figure out if there's state for it and we
>    need to do things like NAT.
> 
> 3. If we need to NAT, rewrite the packet based on the information we got
>    back from conntrack.

You could already do that by storing that info in bpf maps.
The ctnetlink event generated on conntrack insertion contains the NAT
mapping information, so you could have a userspace daemon that
intercepts those events to update the map.
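
As an illustration of this map-based alternative, a sketch of what the
daemon-maintained map could look like (the layout and names are
assumptions for illustration, not an existing API):

/* A userspace daemon subscribes to ctnetlink NEW/DESTROY events and
 * mirrors each NAT mapping into a BPF hash map keyed on the
 * original-direction tuple; the XDP program consults this map.
 */
struct nat_key {			/* original-direction 5-tuple */
	__be32 saddr, daddr;
	__be16 sport, dport;
	__u8 proto;
};

struct nat_val {			/* rewrite to apply */
	__be32 new_saddr, new_daddr;
	__be16 new_sport, new_dport;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct nat_key);
	__type(value, struct nat_val);
} nat_map SEC(".maps");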

> 4. Update the conntrack state to be consistent with the packet, and then
>    redirect it out the destination interface.
> 
> I.e., in the common case the packet doesn't go through the stack at all;
> but we need to make conntrack aware that we processed the packet so the
> entry doesn't expire (and any state related to the flow gets updated).

In the HW offload case, conntrack is bypassed completely. There is an
IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.

> Ideally we should also be able to create new state for a flow we haven't
> seen before.

The way HW offload was intended to work is to allow users to express
what flows should be offloaded via the 'flow add' expression in nftables,
so they can e.g. use byte counters or rate estimators etc. to make such
a decision.  So the initial packet always passes via the normal stack.

This is also needed to consider e.g. XFRM -- nft_flow_offload.c won't
offload if the packet has a secpath attached (i.e., will get encrypted
later).

I suspect we'd want a way to notify/call an ebpf program instead, so we
can avoid the ctnetlink -> userspace -> update dance and do the XDP
'flow bypass information update' from inside the kernel, plus an
ebpf/XDP reimplementation of the nf flow table (it uses the netfilter
ingress hook on the configured devices; everything it does should be
doable from XDP).

> This requires updating of state, but I see no reason why this shouldn't
> be possible?

Updating ct->status is problematic; there would have to be extra checks
that prevent non-atomic writes and toggling of special bits such as
CONFIRMED, TEMPLATE or DYING.  Adding a helper to toggle something
specific, e.g. the offload state bit, should be okay.
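
A kernel-side sketch of such a constrained helper (hypothetical, not
part of this series) could limit itself to a single atomic bit
operation, keeping the special bits out of reach:

/* Hypothetical helper along the lines suggested above: ct->status is
 * only touched via an atomic bit op, and CONFIRMED/TEMPLATE/DYING
 * remain inaccessible to the program.
 */
BPF_CALL_1(bpf_ct_set_offload, struct nf_conn *, ct)
{
	set_bit(IPS_OFFLOAD_BIT, &ct->status);
	return 0;
}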
Maxim Mikityanskiy Oct. 20, 2021, 1:18 p.m. UTC | #8
On 2021-10-20 06:56, Kumar Kartikeya Dwivedi wrote:
> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
>> connection tracking information for TCP and UDP connections based on
>> source and destination IP address and port. The helpers return a pointer
>> to struct nf_conn (if the conntrack entry was found), which needs to be
>> released with bpf_ct_release.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> 
> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> were not desired, hence the recent series [1] to extend kfunc support to modules
> and base the conntrack work on top of it, which I'm working on now (supporting
> both CT lookup and insert).

If you have conntrack lookup, I can base my solution on top of yours. As 
it supports modules, it's even better. What is the current status of 
your work? When do you plan to submit a series? Please add me to Cc when 
you do.

Thanks for reviewing!

> [0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
> [1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com
> 
> --
> Kartikeya
>
Kumar Kartikeya Dwivedi Oct. 20, 2021, 7:17 p.m. UTC | #9
On Wed, Oct 20, 2021 at 06:48:25PM IST, Maxim Mikityanskiy wrote:
> On 2021-10-20 06:56, Kumar Kartikeya Dwivedi wrote:
> > On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> > > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
> > > connection tracking information for TCP and UDP connections based on
> > > source and destination IP address and port. The helpers return a pointer
> > > to struct nf_conn (if the conntrack entry was found), which needs to be
> > > released with bpf_ct_release.
> > >
> > > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> > > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> >
> > The last discussion on this [0] suggested that stable BPF helpers for conntrack
> > were not desired, hence the recent series [1] to extend kfunc support to modules
> > and base the conntrack work on top of it, which I'm working on now (supporting
> > both CT lookup and insert).
>
> If you have conntrack lookup, I can base my solution on top of yours. As it
> supports modules, it's even better. What is the current status of your work?
> When do you plan to submit a series? Please add me to Cc when you do.
>

Great, I'll post the lookup stuff separately next week, and Cc you.

Thanks!

> Thanks for reviewing!
>
> > [0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
> > [1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com
> >
> > --
> > Kartikeya
> >
>

--
Kartikeya
Toke Høiland-Jørgensen Oct. 20, 2021, 8:54 p.m. UTC | #10
Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> > Lookups should be fine.  Insertions are the problem.
>> >
>> > NAT hooks are expected to execute before the insertion into the
>> > conntrack table.
>> >
>> > If you insert before, NAT hooks won't execute, i.e.
>> > rules that use dnat/redirect/masquerade have no effect.
>> 
>> Well yes, if you insert the wrong state into the conntrack table, you're
>> going to get wrong behaviour. That's sorta expected, there are lots of
>> things XDP can do to disrupt the packet flow (like just dropping the
>> packets :)).
>
> Sure, but I'm not sure I understand the use case.
>
> Insertion at the XDP layer turns off netfilter's NAT capability, so it's
> incompatible with the classic forwarding path.
>
> If that's fine, why do you need to insert into the conntrack table to
> begin with?  The entire infrastructure it's designed for is disabled...

One of the major selling points of XDP is that you can reuse the
existing kernel infrastructure instead of having to roll your own. So
sure, one could implement their own conntrack using BPF maps (as indeed,
e.g., Cilium has done), but why do that when you can take advantage of
the existing one in the kernel? Same reason we have the bpf_fib_lookup()
helper...

>> > I don't think there is anything that stands in the way of replicating
>> > this via XDP.
>> 
>> What I want to be able to do is write an XDP program that does the following:
>> 
>> 1. Parse the packet header and determine if it's a packet type we know
>>    how to handle. If not, just return XDP_PASS and let the stack deal
>>    with corner cases.
>> 
>> 2. If we know how to handle the packet (say, it's TCP or UDP), do a
>>    lookup into conntrack to figure out if there's state for it and we
>>    need to do things like NAT.
>> 
>> 3. If we need to NAT, rewrite the packet based on the information we got
>>    back from conntrack.
>
> You could already do that by storing that info in bpf maps. The
> ctnetlink event generated on conntrack insertion contains the NAT
> mapping information, so you could have a userspace daemon that
> intercepts those events to update the map.

Sure, but see above.

>> 4. Update the conntrack state to be consistent with the packet, and then
>>    redirect it out the destination interface.
>> 
>> I.e., in the common case the packet doesn't go through the stack at all;
>> but we need to make conntrack aware that we processed the packet so the
>> entry doesn't expire (and any state related to the flow gets updated).
>
> In the HW offload case, conntrack is bypassed completely. There is an
> IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.

That's comparable in execution semantics (stack is bypassed entirely),
but not in control plane semantics (we look up state from XDP instead
of pushing flows down to an offload).

>> Ideally we should also be able to create new state for a flow we haven't
>> seen before.
>
> The way HW offload was intended to work is to allow users to express
> what flows should be offloaded via the 'flow add' expression in nftables,
> so they can e.g. use byte counters or rate estimators etc. to make such
> a decision.  So the initial packet always passes via the normal stack.
>
> This is also needed to consider e.g. XFRM -- nft_flow_offload.c won't
> offload if the packet has a secpath attached (i.e., will get encrypted
> later).
>
> I suspect we'd want a way to notify/call an ebpf program instead, so we
> can avoid the ctnetlink -> userspace -> update dance and do the XDP
> 'flow bypass information update' from inside the kernel, plus an
> ebpf/XDP reimplementation of the nf flow table (it uses the netfilter
> ingress hook on the configured devices; everything it does should be
> doable from XDP).

But the point is exactly that we don't have to duplicate the state into
BPF; we can make XDP look it up directly.

>> This requires updating of state, but I see no reason why this shouldn't
>> be possible?
>
> Updating ct->status is problematic; there would have to be extra checks
> that prevent non-atomic writes and toggling of special bits such as
> CONFIRMED, TEMPLATE or DYING.  Adding a helper to toggle something
> specific, e.g. the offload state bit, should be okay.

We can certainly constrain the update so it's not possible to get into
an unsafe state. The primary use case is accelerating the common case;
punting to the stack is fine for corner cases.

-Toke
David Ahern Oct. 20, 2021, 10:55 p.m. UTC | #11
On 10/20/21 2:54 PM, Toke Høiland-Jørgensen wrote:
>> Sure, but I'm not sure I understand the use case.
>>
>> Insertion at the XDP layer turns off netfilter's NAT capability, so it's
>> incompatible with the classic forwarding path.
>>
>> If that's fine, why do you need to insert into the conntrack table to
>> begin with?  The entire infrastructure it's designed for is disabled...
> One of the major selling points of XDP is that you can reuse the
> existing kernel infrastructure instead of having to roll your own. So
> sure, one could implement their own conntrack using BPF maps (as indeed,
> e.g., Cilium has done), but why do that when you can take advantage of
> the existing one in the kernel? Same reason we have the bpf_fib_lookup()
> helper...
> 

Exactly, and a key point is that it allows consistency between the XDP
fast path and the full-stack slow path, e.g. when the BPF program is
removed or defers a flow to the full stack for some reason.
Florian Westphal Oct. 21, 2021, 7:36 a.m. UTC | #12
Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> 
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> > Lookups should be fine.  Insertions are the problem.
> >> >
> >> > NAT hooks are expected to execute before the insertion into the
> >> > conntrack table.
> >> >
> >> > If you insert before, NAT hooks won't execute, i.e.
> >> > rules that use dnat/redirect/masquerade have no effect.
> >> 
> >> Well yes, if you insert the wrong state into the conntrack table, you're
> >> going to get wrong behaviour. That's sorta expected, there are lots of
> >> things XDP can do to disrupt the packet flow (like just dropping the
> >> packets :)).
> >
> > Sure, but I'm not sure I understand the use case.
> >
> > Insertion at the XDP layer turns off netfilter's NAT capability, so it's
> > incompatible with the classic forwarding path.
> > 
> > If that's fine, why do you need to insert into the conntrack table to
> > begin with?  The entire infrastructure it's designed for is disabled...
> 
> One of the major selling points of XDP is that you can reuse the
> existing kernel infrastructure instead of having to roll your own. So
> sure, one could implement their own conntrack using BPF maps (as indeed,
> e.g., Cilium has done), but why do that when you can take advantage of
> the existing one in the kernel? Same reason we have the bpf_fib_lookup()
> helper...

Insertion into conntrack via ebpf seems bad to me precisely because it
bypasses the existing infra.

In the bypass scenario you're envisioning, who is responsible for the
fastpath-or-not decision?

> > In the HW offload case, conntrack is bypassed completely. There is an
> > IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.
> 
> That's comparable in execution semantics (stack is bypassed entirely),
> but not in control plane semantics (we look up state from XDP instead
> of pushing flows down to an offload).

I'm not following.  As soon as you do insertion via XDP, the existing
control plane (*tables ruleset, xfrm and so on) becomes irrelevant.

Say, e.g., a user has an iptables ruleset that disables conntrack for udp
dport 53 to avoid conntrack overhead for a local resolver cache.

No longer relevant: ebpf overrides it, or whatever generates the ebpf
prog needs to emulate the existing config.

> > I suspect we'd want a way to notify/call an ebpf program instead so we
> > can avoid the ctnetlink -> userspace -> update dance and do the XDP
> > 'flow bypass information update' from inside the kernel and ebpf/XDP
> > reimplementation of the nf flow table (it uses the netfilter ingress
> > hook on the configured devices; everyhing it does should be doable
> > from XDP).
> 
> But the point is exactly that we don't have to duplicate the state into
> BPF; we can make XDP look it up directly.

Normally for fast bypass I'd expect that the bypass infra would want to
access all info in one lookup, but conntrack only gives you the NAT
transformation, so you'll also need an sk lookup and possibly a FIB
lookup later to get the route.
Maybe an xfrm lookup as well, if your bypass infra needs to support
ipsec.

So I understand neither the need for conntrack lookup (*for the fast
bypass use case*) nor the need for insert, IFF the control plane we
have is to be respected.
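
To make the multi-lookup cost concrete, a bypass path chaining the
proposed conntrack lookup with the existing bpf_fib_lookup() helper
might look roughly like this (a sketch under the same assumptions as
the earlier example; xfrm has no equivalent XDP lookup helper):

	struct bpf_fib_lookup fib = {};
	__u64 flags_err = 0;
	struct bpf_nf_conn *ct;

	/* 1. conntrack: is there state (and a NAT mapping) for this flow? */
	ct = bpf_ct_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, &flags_err);
	if (!ct)
		return XDP_PASS;
	bpf_ct_release(ct);

	/* 2. FIB: where does the (possibly rewritten) packet go next? */
	fib.family = AF_INET;
	fib.ifindex = ctx->ingress_ifindex;
	/* ... fill fib.ipv4_src/ipv4_dst from the rewritten header ... */
	if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) !=
	    BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;

	return bpf_redirect(fib.ifindex, 0);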

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a10a44c4f79b..883de3f1bb8b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4925,6 +4925,79 @@  union bpf_attr {
  *	Return
  *		The number of bytes written to the buffer, or a negative error
  *		in case of failure.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a TCP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct nf_conn** for the found entry, or
+ *		**NULL** in case of failure or if there is no conntrack entry
+ *		for this tuple.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a UDP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct nf_conn** for the found entry, or
+ *		**NULL** in case of failure or if there is no conntrack entry
+ *		for this tuple.
+ *
+ * long bpf_ct_release(void *ct)
+ *	Description
+ *		Release the reference held by *ct*. *ct* must be a non-**NULL**
+ *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5105,6 +5178,9 @@  union bpf_attr {
 	FN(task_pt_regs),		\
 	FN(get_branch_snapshot),	\
 	FN(trace_vprintk),		\
+	FN(ct_lookup_tcp),		\
+	FN(ct_lookup_udp),		\
+	FN(ct_release),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5288,6 +5364,11 @@  enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_ct_lookup_{tcp,udp} helpers. */
+enum {
+	BPF_F_CT_DIR_REPLY	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6eafef35e247..23e2a23ca9c4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -506,7 +506,8 @@  static bool is_release_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_sk_release ||
 	       func_id == BPF_FUNC_ringbuf_submit ||
-	       func_id == BPF_FUNC_ringbuf_discard;
+	       func_id == BPF_FUNC_ringbuf_discard ||
+	       func_id == BPF_FUNC_ct_release;
 }
 
 static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -515,7 +516,8 @@  static bool may_be_acquire_function(enum bpf_func_id func_id)
 		func_id == BPF_FUNC_sk_lookup_udp ||
 		func_id == BPF_FUNC_skc_lookup_tcp ||
 		func_id == BPF_FUNC_map_lookup_elem ||
-	        func_id == BPF_FUNC_ringbuf_reserve;
+		func_id == BPF_FUNC_ringbuf_reserve ||
+		func_id == BPF_FUNC_ct_lookup_tcp;
 }
 
 static bool is_acquire_function(enum bpf_func_id func_id,
@@ -526,7 +528,8 @@  static bool is_acquire_function(enum bpf_func_id func_id,
 	if (func_id == BPF_FUNC_sk_lookup_tcp ||
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
-	    func_id == BPF_FUNC_ringbuf_reserve)
+	    func_id == BPF_FUNC_ringbuf_reserve ||
+	    func_id == BPF_FUNC_ct_lookup_tcp)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
diff --git a/net/core/filter.c b/net/core/filter.c
index d2d07ccae599..f913851c97f7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -79,6 +79,7 @@ 
 #include <net/tls.h>
 #include <net/xdp.h>
 #include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
 
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
@@ -7096,6 +7097,194 @@  static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+static struct nf_conn *bpf_ct_lookup(struct net *caller_net,
+				     struct bpf_sock_tuple *tuple,
+				     u32 tuple_len,
+				     u8 protonum,
+				     u64 netns_id,
+				     u64 flags)
+{
+	struct nf_conntrack_tuple ct_tuple = {};
+	struct nf_conntrack_tuple_hash *found;
+	struct net *net;
+	u8 direction;
+
+	direction = flags & BPF_F_CT_DIR_REPLY ? IP_CT_DIR_REPLY :
+						 IP_CT_DIR_ORIGINAL;
+
+	if (flags & ~BPF_F_CT_DIR_REPLY)
+		return ERR_PTR(-EINVAL);
+
+	if (tuple_len == sizeof(tuple->ipv4)) {
+		ct_tuple.src.l3num = AF_INET;
+		ct_tuple.src.u3.ip = tuple->ipv4.saddr;
+		ct_tuple.src.u.tcp.port = tuple->ipv4.sport;
+		ct_tuple.dst.u3.ip = tuple->ipv4.daddr;
+		ct_tuple.dst.u.tcp.port = tuple->ipv4.dport;
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (tuple_len == sizeof(tuple->ipv6)) {
+		ct_tuple.src.l3num = AF_INET6;
+		memcpy(ct_tuple.src.u3.ip6, tuple->ipv6.saddr,
+		       sizeof(tuple->ipv6.saddr));
+		ct_tuple.src.u.tcp.port = tuple->ipv6.sport;
+		memcpy(ct_tuple.dst.u3.ip6, tuple->ipv6.daddr,
+		       sizeof(tuple->ipv6.daddr));
+		ct_tuple.dst.u.tcp.port = tuple->ipv6.dport;
+#endif
+	} else {
+		return ERR_PTR(-EAFNOSUPPORT);
+	}
+
+	ct_tuple.dst.protonum = protonum;
+	ct_tuple.dst.dir = direction;
+
+	net = caller_net;
+	if ((s32)netns_id >= 0) {
+		if (unlikely(netns_id > S32_MAX))
+			return ERR_PTR(-EINVAL);
+		net = get_net_ns_by_id(net, netns_id);
+		if (!net)
+			return ERR_PTR(-EINVAL);
+	}
+
+	found = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &ct_tuple);
+
+	if ((s32)netns_id >= 0)
+		put_net(net);
+
+	if (!found)
+		return ERR_PTR(-ENOENT);
+	return nf_ct_tuplehash_to_ctrack(found);
+}
+
+BPF_CALL_5(bpf_xdp_ct_lookup_tcp, struct xdp_buff *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct nf_conn *ct;
+
+	ct = bpf_ct_lookup(dev_net(ctx->rxq->dev), tuple, tuple_len,
+			   IPPROTO_TCP, netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_xdp_ct_lookup_tcp_proto = {
+	.func		= bpf_xdp_ct_lookup_tcp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_xdp_ct_lookup_udp, struct xdp_buff *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct nf_conn *ct;
+
+	ct = bpf_ct_lookup(dev_net(ctx->rxq->dev), tuple, tuple_len,
+			   IPPROTO_UDP, netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_xdp_ct_lookup_udp_proto = {
+	.func		= bpf_xdp_ct_lookup_udp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_skb_ct_lookup_tcp, struct sk_buff *, skb,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct net *caller_net;
+	struct nf_conn *ct;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	ct = bpf_ct_lookup(caller_net, tuple, tuple_len, IPPROTO_TCP,
+			   netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_skb_ct_lookup_tcp_proto = {
+	.func		= bpf_skb_ct_lookup_tcp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_skb_ct_lookup_udp, struct sk_buff *, skb,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct net *caller_net;
+	struct nf_conn *ct;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	ct = bpf_ct_lookup(caller_net, tuple, tuple_len, IPPROTO_UDP,
+			   netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_skb_ct_lookup_udp_proto = {
+	.func		= bpf_skb_ct_lookup_udp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_1(bpf_ct_release, struct nf_conn *, ct)
+{
+	nf_ct_put(ct);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_ct_release_proto = {
+	.func		= bpf_ct_release,
+	.gpl_only	= false,
+	.pkt_access	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_NF_CONN,
+};
+#endif
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -7455,6 +7644,14 @@  tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_gen_syncookie_proto;
 	case BPF_FUNC_sk_assign:
 		return &bpf_sk_assign_proto;
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+	case BPF_FUNC_ct_lookup_tcp:
+		return &bpf_skb_ct_lookup_tcp_proto;
+	case BPF_FUNC_ct_lookup_udp:
+		return &bpf_skb_ct_lookup_udp_proto;
+	case BPF_FUNC_ct_release:
+		return &bpf_ct_release_proto;
+#endif
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
@@ -7498,6 +7695,14 @@  xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_check_syncookie_proto;
 	case BPF_FUNC_tcp_gen_syncookie:
 		return &bpf_tcp_gen_syncookie_proto;
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+	case BPF_FUNC_ct_lookup_tcp:
+		return &bpf_xdp_ct_lookup_tcp_proto;
+	case BPF_FUNC_ct_lookup_udp:
+		return &bpf_xdp_ct_lookup_udp_proto;
+	case BPF_FUNC_ct_release:
+		return &bpf_ct_release_proto;
+#endif
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a10a44c4f79b..883de3f1bb8b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4925,6 +4925,79 @@  union bpf_attr {
  *	Return
  *		The number of bytes written to the buffer, or a negative error
  *		in case of failure.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a TCP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct nf_conn** for the found entry, or
+ *		**NULL** in case of failure or if there is no conntrack entry
+ *		for this tuple.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a UDP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct nf_conn** for the found entry, or
+ *		**NULL** in case of failure or if there is no conntrack entry
+ *		for this tuple.
+ *
+ * long bpf_ct_release(void *ct)
+ *	Description
+ *		Release the reference held by *ct*. *ct* must be a non-**NULL**
+ *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5105,6 +5178,9 @@  union bpf_attr {
 	FN(task_pt_regs),		\
 	FN(get_branch_snapshot),	\
 	FN(trace_vprintk),		\
+	FN(ct_lookup_tcp),		\
+	FN(ct_lookup_udp),		\
+	FN(ct_release),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5288,6 +5364,11 @@  enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_ct_lookup_{tcp,udp} helpers. */
+enum {
+	BPF_F_CT_DIR_REPLY	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\