
[v10,net-next,15/15] p4tc: add P4 classifier

Message ID 20240122194801.152658-16-jhs@mojatatu.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series Introducing P4TC

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5152 this patch: 5152
netdev/build_tools success Errors and warnings before: 2 this patch: 0
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 1278 this patch: 1278
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5456 this patch: 5456
netdev/checkpatch warning WARNING: It's generally not useful to have the filename in the file WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-01-24--18-00 (tests: 520)

Commit Message

Jamal Hadi Salim Jan. 22, 2024, 7:48 p.m. UTC
Introduce the P4 tc classifier. The main task of this classifier is to manage
the lifetime of pipeline instances across one or more netdev ports.
Note that a pipeline may be instantiated multiple times across one or more tc
chains and at different priorities.

Note that part or all of the P4 pipeline could reside in tc, XDP or even
hardware, depending on how the P4 program was compiled.
To use the P4 classifier you must specify a pipeline name that will be
associated with the filter instance, a s/w parser (an eBPF program) and a
datapath P4 control block (also an eBPF program). Although this patchset does
not deal with offloads, it is also possible to load the h/w part using this
filter. We will illustrate a few examples further below to clarify. Please
treat the illustrated split as an example - there are probably more pragmatic
approaches to splitting the pipeline; regardless of where the different pieces
of the pipeline are placed (tc, XDP, HW) and which part of the pipeline each
layer implements, these examples merely show what is possible.

The pipeline is assumed to have already been created via a template.

For example, if we were to add a filter to the ingress of a group of netdevs
(tc block 22) and associate it with the P4 pipeline simple_l3, we could issue
the following command:

tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    action bpf obj $PARSER.o ... \
    action bpf obj $PROGNAME.o section prog/tc-ingress

The above uses the classical tc action mechanism in which the first action
runs the P4 parser and, if that succeeds, the P4 control block is executed.
Note, although not shown above, one could also append other traditional tc
actions to the command line.

In these patches, we also support two ways of loading the pipeline programs
and differentiate between what gets loaded at, say, tc vs XDP by using syntax
that specifies the location as either "prog type tc obj" or
"prog type xdp obj". There is an ongoing discussion in the P4TC community
biweekly meetings which is likely to have us add another location
definition, "prog type hw", which will specify the hardware object file name
and other related attributes.

An example using tc:

tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    prog type tc obj $PARSER.o ... \
    action bpf obj $PROGNAME.o section prog/tc-ingress

To illustrate an XDP example:

tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
    prog type xdp obj $PARSER.o section parser/xdp \
    pinned_link /sys/fs/bpf/mylink \
    action bpf obj $PROGNAME.o section prog/tc-ingress

In this case, the parser will be executed in the XDP layer and the rest of
the P4 control block as a tc action.

For illustration's sake, the hw variant looks as follows (please note there
are still a lot of discussions going on in the meetings - the example is here
merely to illustrate the tc filter functionality):

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "mypnameprog.o" ... \
   prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
   action bpf obj $PROGNAME.o section prog/tc-ingress

The theory of operation is as follows:

================================1. PARSING================================

The packet first encounters the parser.
The parser is implemented in eBPF and resides at either the TC or XDP level.
The parsed header values are stored in a shared eBPF map.
When the parser runs at the XDP level, we load it into XDP using the tc filter
command and pin it to a file.
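
To make this concrete, below is a minimal, hand-written sketch (not part of
this patch) of what a compiler-generated XDP parser could look like. The map
name hdrs_map and struct parsed_hdrs are hypothetical placeholders for
whatever layout the P4 compiler actually emits:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct parsed_hdrs {
	__u32 saddr;
	__u32 daddr;
	__u8  protocol;
};

/* Shared with the control block program that runs later at tc */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct parsed_hdrs);
} hdrs_map SEC(".maps");

SEC("parser/xdp")
int p4_parser(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	struct parsed_hdrs *h;
	__u32 key = 0;

	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_DROP;

	h = bpf_map_lookup_elem(&hdrs_map, &key);
	if (!h)
		return XDP_ABORTED;

	/* Store the parsed header values for the control block */
	h->saddr = iph->saddr;
	h->daddr = iph->daddr;
	h->protocol = iph->protocol;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";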

=============================2. ACTIONS=============================

In the above example, the P4 program (minus the parser) is encoded in an
action ($PROGNAME.o). It should be noted that classical tc actions continue
to work; IOW, someone could decide to add a mirred action to mirror all
packets before or after the eBPF action.

tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    prog type tc obj $PARSER.o section parser/tc-ingress \
    action bpf obj $PROGNAME.o section prog/tc-ingress \
    action mirred egress mirror index 1 dev $P1 \
    action bpf obj $ANOTHERPROG.o section mysect/section-1

It should also be noted that it is feasible to split some of the ingress
datapath into XDP first and more into TC later (as was shown above, for
example, where the parser runs at the XDP level). YMMV.
Regardless of which scheme is chosen, none of these choices affect the UAPI;
it all depends on whether you generate code to load on XDP vs tc, etc.
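
Along the same lines, here is a minimal sketch (again hypothetical, not part
of this patch) of a control block program loaded via "action bpf obj
$PROGNAME.o section prog/tc-ingress"; it reads the values the parser stored
in the shared map and hands the verdict back to the tc action chain:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Must match the layout used by the parser program; in a real setup the map
 * would be pinned so both objects reference the same instance - it is
 * re-declared here only to keep the sketch self-contained.
 */
struct parsed_hdrs {
	__u32 saddr;
	__u32 daddr;
	__u8  protocol;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct parsed_hdrs);
} hdrs_map SEC(".maps");

SEC("prog/tc-ingress")
int p4_control(struct __sk_buff *skb)
{
	struct parsed_hdrs *h;
	__u32 key = 0;

	h = bpf_map_lookup_elem(&hdrs_map, &key);
	if (!h)
		return TC_ACT_SHOT;

	/* The generated match-action table lookups would go here;
	 * TC_ACT_PIPE lets any subsequent tc actions (e.g. mirred) run.
	 */
	return TC_ACT_PIPE;
}

char _license[] SEC("license") = "GPL";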

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/uapi/linux/pkt_cls.h |  18 ++
 net/sched/Kconfig            |  12 +
 net/sched/Makefile           |   1 +
 net/sched/cls_p4.c           | 450 +++++++++++++++++++++++++++++++++++
 net/sched/p4tc/Makefile      |   4 +-
 net/sched/p4tc/trace.c       |  10 +
 net/sched/p4tc/trace.h       |  44 ++++
 7 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

Comments

Daniel Borkmann Jan. 24, 2024, 1:59 p.m. UTC | #1
On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> Introduce P4 tc classifier. The main task of this classifier is to manage
> the lifetime of pipeline instances across one or more netdev ports.
> Note a pipeline may be instantiated multiple times across one or more tc chains
> and different priorities.
> 
> Note that part or whole of the P4 pipeline could reside in tc, XDP or even
> hardware depending on how the P4 program was compiled.
> To use the P4 classifier you must specify a pipeline name that will be
> associated to the filter instance, a s/w parser (eBPF) and datapath P4
> control block program (eBPF) program. Although this patchset does not deal
> with offloads, it is also possible to load the h/w part using this filter.
> We will illustrate a few examples further below to clarify. Please treat
> the illustrated split as an example - there are probably more pragmatic
> approaches to splitting the pipeline; however, regardless of where the different
> pieces of the pipeline are placed (tc, XDP, HW) and what each layer will
> implement (what part of the pipeline) - these examples are merely showing
> what is possible.
> 
> The pipeline is assumed to have already been created via a template.
> 
> For example, if we were to add a filter to ingress of a group of netdevs
> (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
> following command:
> 
> tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      action bpf obj $PARSER.o ... \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> The above uses the classical tc action mechanism in which the first action
> runs the P4 parser and if that goes well then the P4 control block is
> executed. Note, although not shown above, one could also append the command
> line with other traditional tc actions.
> 
> In these patches, we also support two types of loadings of the pipeline
> programs and differentiate between what gets loaded at say tc vs xdp by using
> syntax which specifies location as either "prog type tc obj" or
> "prog type xdp obj". There is an ongoing discussion in the P4TC community
> biweekly meetings which is likely going to have us add another location
> definition "prog type hw" which will specify the hardware object file name
> and other related attributes.
> 
> An example using tc:
> 
> tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      prog type tc obj $PARSER.o ... \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> For XDP, to illustrate an example:
> 
> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
>      prog type xdp obj $PARSER.o section parser/xdp \
>      pinned_link /sys/fs/bpf/mylink \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> In this case, the parser will be executed in the XDP layer and the rest of
> P4 control block as a tc action.
> 
> For illustration sake, the hw one looks as follows (please note there's
> still a lot of discussions going on in the meetings - the example is here
> merely to illustrate the tc filter functionality):
> 
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>     prog type hw filename "mypnameprog.o" ... \
>     prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
>     action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> The theory of operations is as follows:
> 
> ================================1. PARSING================================
> 
> The packet first encounters the parser.
> The parser is implemented in ebpf residing either at the TC or XDP
> level. The parsed header values are stored in a shared eBPF map.
> When the parser runs at XDP level, we load it into XDP using tc filter
> command and pin it to a file.
> 
> =============================2. ACTIONS=============================
> 
> In the above example, the P4 program (minus the parser) is encoded in an
> action($PROGNAME.o). It should be noted that classical tc actions
> continue to work:
> IOW, someone could decide to add a mirred action to mirror all packets
> after or before the ebpf action.
> 
> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      prog type tc obj $PARSER.o section parser/tc-ingress \
>      action bpf obj $PROGNAME.o section prog/tc-ingress \
>      action mirred egress mirror index 1 dev $P1 \
>      action bpf obj $ANOTHERPROG.o section mysect/section-1
> 
> It should also be noted that it is feasible to split some of the ingress
> datapath into XDP first and more into TC later (as was shown above for
> example where the parser runs at XDP level). YMMV.
> Regardless of choice of which scheme to use, none of these will affect
> UAPI. It will all depend on whether you generate code to load on XDP vs
> tc, etc.
> 
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

My objections from the last iterations still stand, and I also added a nak,
so please do not just drop it with new revisions. As you wrote, in v10 you
added further code, but despite the various community feedback the design
still stands as before, therefore:

Nacked-by: Daniel Borkmann <daniel@iogearbox.net>

[...]
> +static int cls_p4_prog_from_efd(struct nlattr **tb,
> +				struct p4tc_bpf_prog *prog, u32 flags,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct bpf_prog *fp;
> +	u32 prog_type;
> +	char *name;
> +	u32 bpf_fd;
> +
> +	bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
> +	prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
> +
> +	if (prog_type != BPF_PROG_TYPE_XDP &&
> +	    prog_type != BPF_PROG_TYPE_SCHED_ACT) {

Also, as mentioned earlier, I don't think tc should hold references on
XDP programs in here. It doesn't make any sense, aside from the fact
that cls_p4 is also not doing anything with it. This is something
that a user space control plane should be doing, i.e. managing an XDP
link on the target device.

> +		NL_SET_ERR_MSG(extack,
> +			       "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
> +		return -EINVAL;
> +	}
> +
> +	fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
> +	if (IS_ERR(fp))
> +		return PTR_ERR(fp);
> +
> +	name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
> +	if (!name) {
> +		bpf_prog_put(fp);
> +		return -ENOMEM;
> +	}
> +
> +	prog->p4_prog_name = name;
> +	prog->p4_prog = fp;
> +
> +	return 0;
> +}
> +
Jamal Hadi Salim Jan. 24, 2024, 2:40 p.m. UTC | #2
On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> > Introduce P4 tc classifier. The main task of this classifier is to manage
> > the lifetime of pipeline instances across one or more netdev ports.
> > Note a pipeline may be instantiated multiple times across one or more tc chains
> > and different priorities.
> >
> > Note that part or whole of the P4 pipeline could reside in tc, XDP or even
> > hardware depending on how the P4 program was compiled.
> > To use the P4 classifier you must specify a pipeline name that will be
> > associated to the filter instance, a s/w parser (eBPF) and datapath P4
> > control block program (eBPF) program. Although this patchset does not deal
> > with offloads, it is also possible to load the h/w part using this filter.
> > We will illustrate a few examples further below to clarify. Please treat
> > the illustrated split as an example - there are probably more pragmatic
> > approaches to splitting the pipeline; however, regardless of where the different
> > pieces of the pipeline are placed (tc, XDP, HW) and what each layer will
> > implement (what part of the pipeline) - these examples are merely showing
> > what is possible.
> >
> > The pipeline is assumed to have already been created via a template.
> >
> > For example, if we were to add a filter to ingress of a group of netdevs
> > (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
> > following command:
> >
> > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      action bpf obj $PARSER.o ... \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > The above uses the classical tc action mechanism in which the first action
> > runs the P4 parser and if that goes well then the P4 control block is
> > executed. Note, although not shown above, one could also append the command
> > line with other traditional tc actions.
> >
> > In these patches, we also support two types of loadings of the pipeline
> > programs and differentiate between what gets loaded at say tc vs xdp by using
> > syntax which specifies location as either "prog type tc obj" or
> > "prog type xdp obj". There is an ongoing discussion in the P4TC community
> > biweekly meetings which is likely going to have us add another location
> > definition "prog type hw" which will specify the hardware object file name
> > and other related attributes.
> >
> > An example using tc:
> >
> > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      prog type tc obj $PARSER.o ... \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > For XDP, to illustrate an example:
> >
> > tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
> >      prog type xdp obj $PARSER.o section parser/xdp \
> >      pinned_link /sys/fs/bpf/mylink \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > In this case, the parser will be executed in the XDP layer and the rest of
> > P4 control block as a tc action.
> >
> > For illustration sake, the hw one looks as follows (please note there's
> > still a lot of discussions going on in the meetings - the example is here
> > merely to illustrate the tc filter functionality):
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
> >     prog type hw filename "mypnameprog.o" ... \
> >     prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
> >     action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > The theory of operations is as follows:
> >
> > ================================1. PARSING================================
> >
> > The packet first encounters the parser.
> > The parser is implemented in ebpf residing either at the TC or XDP
> > level. The parsed header values are stored in a shared eBPF map.
> > When the parser runs at XDP level, we load it into XDP using tc filter
> > command and pin it to a file.
> >
> > =============================2. ACTIONS=============================
> >
> > In the above example, the P4 program (minus the parser) is encoded in an
> > action($PROGNAME.o). It should be noted that classical tc actions
> > continue to work:
> > IOW, someone could decide to add a mirred action to mirror all packets
> > after or before the ebpf action.
> >
> > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      prog type tc obj $PARSER.o section parser/tc-ingress \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress \
> >      action mirred egress mirror index 1 dev $P1 \
> >      action bpf obj $ANOTHERPROG.o section mysect/section-1
> >
> > It should also be noted that it is feasible to split some of the ingress
> > datapath into XDP first and more into TC later (as was shown above for
> > example where the parser runs at XDP level). YMMV.
> > Regardless of choice of which scheme to use, none of these will affect
> > UAPI. It will all depend on whether you generate code to load on XDP vs
> > tc, etc.
> >
> > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>
> My objections from last iterations still stand, and I also added a nak,
> so please do not just drop it with new revisions.. from the v10 as you
> wrote you added further code but despite the various community feedback
> the design still stands as before, therefore:
>
> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
>

We didn't make code changes - but did you read the cover letter and the
extended commentary in this patch's commit log? We should have
mentioned it in the changes log. It did respond to your comments.
There's text that says "the filter manages the lifetime of the
pipeline" - which in the future could include not only tc and XDP but
also the hardware path (in the form of a file that gets loaded). I am
not sure if that message is clear. Your angle is that this is a layer
violation. In the last discussion I asked you for suggestions and we
went the tcx route, which didn't make sense, and then you didn't
respond.

> [...]
> > +static int cls_p4_prog_from_efd(struct nlattr **tb,
> > +                             struct p4tc_bpf_prog *prog, u32 flags,
> > +                             struct netlink_ext_ack *extack)
> > +{
> > +     struct bpf_prog *fp;
> > +     u32 prog_type;
> > +     char *name;
> > +     u32 bpf_fd;
> > +
> > +     bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
> > +     prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
> > +
> > +     if (prog_type != BPF_PROG_TYPE_XDP &&
> > +         prog_type != BPF_PROG_TYPE_SCHED_ACT) {
>
> Also as mentioned earlier I don't think tc should hold references on
> XDP programs in here. It doesn't make any sense aside from the fact
> that the cls_p4 is also not doing anything with it. This is something
> that a user space control plane should be doing i.e. managing a XDP
> link on the target device.

This is the same argument about layer violation that you made earlier.
The filter manages the P4 pipeline - i.e. it's not just about the eBPF
blob(s); for example, in the future (discussions are still ongoing
with vendors who have P4 NICs) a filter could be loaded to also
specify the location of the hardware blob.
I would be happy with a suggestion that gets us moving forward with
that context in mind.

cheers,
jamal

> > +             NL_SET_ERR_MSG(extack,
> > +                            "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
> > +             return -EINVAL;
> > +     }
> > +
> > +     fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
> > +     if (IS_ERR(fp))
> > +             return PTR_ERR(fp);
> > +
> > +     name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
> > +     if (!name) {
> > +             bpf_prog_put(fp);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     prog->p4_prog_name = name;
> > +     prog->p4_prog = fp;
> > +
> > +     return 0;
> > +}
> > +
Daniel Borkmann Jan. 25, 2024, 3:47 p.m. UTC | #3
On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
[...]
>>>
>>> It should also be noted that it is feasible to split some of the ingress
>>> datapath into XDP first and more into TC later (as was shown above for
>>> example where the parser runs at XDP level). YMMV.
>>> Regardless of choice of which scheme to use, none of these will affect
>>> UAPI. It will all depend on whether you generate code to load on XDP vs
>>> tc, etc.
>>>
>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>>
>> My objections from last iterations still stand, and I also added a nak,
>> so please do not just drop it with new revisions.. from the v10 as you
>> wrote you added further code but despite the various community feedback
>> the design still stands as before, therefore:
>>
>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> We didnt make code changes - but did you read the cover letter and the
> extended commentary in this patch's commit log? We should have
> mentioned it in the changes log. It did respond to your comments.
> There's text that says "the filter manages the lifetime of the
> pipeline" - which in the future could include not only tc but XDP but
> also the hardware path (in the form of a file that gets loaded). I am
> not sure if that message is clear. Your angle being this is layer
> violation. In the last discussion i asked you for suggestions and we
> went the tcx route, which didnt make sense, and  then you didnt
> respond.
[...]

>> Also as mentioned earlier I don't think tc should hold references on
>> XDP programs in here. It doesn't make any sense aside from the fact
>> that the cls_p4 is also not doing anything with it. This is something
>> that a user space control plane should be doing i.e. managing a XDP
>> link on the target device.
> 
> This is the same argument about layer violation that you made earlier.
> The filter manages the p4 pipeline - i.e it's not just about the ebpf
> blob(s) but for example in the future (discussions are still ongoing
> with vendors who have P4 NICs) a filter could be loaded to also
> specify the location of the hardware blob.

Ah, so there is a plan to eventually add HW offload support for cls_p4?
Or is this only specifying the location of a blob through some opaque
cookie value from user space?

> I would be happy with a suggestion that gets us moving forward with
> that context in mind.

My question on the above is mainly: what does holding a reference on the
XDP program bring you? There is no guarantee that something else will get
loaded onto XDP, and then eventually cls_p4 is the only entity holding the
reference but w/o 'purpose'. We do have BPF links, and the user space
component orchestrating all this needs to create and pin the BPF link in
the BPF fs, for example. An artificial reference on an XDP prog feels
similar to holding a reference on an inode out of tc. Again, that should
be delegated to the control plane you have running, which interacts with
the compiler and then manages and loads its artifacts. What if you also
needed to set up some netfilter rules for the SW pipeline - would you then
embed those too?
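
For reference, managing that link purely from user space amounts to a few
stock libbpf calls; a rough sketch (the object path, program name and pin
path are made up, and error handling assumes libbpf 1.0 conventions):

#include <errno.h>
#include <bpf/libbpf.h>

int attach_and_pin_parser(const char *obj_path, int ifindex,
			  const char *pin_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;
	int err;

	obj = bpf_object__open_file(obj_path, NULL);
	if (!obj)
		return -errno;

	err = bpf_object__load(obj);
	if (err)
		return err;

	prog = bpf_object__find_program_by_name(obj, "p4_parser");
	if (!prog)
		return -ENOENT;

	/* Create the XDP link and pin it, e.g. at /sys/fs/bpf/mylink */
	link = bpf_program__attach_xdp(prog, ifindex);
	if (!link)
		return -errno;

	return bpf_link__pin(link, pin_path);
}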

Thanks,
Daniel
Jamal Hadi Salim Jan. 25, 2024, 5:59 p.m. UTC | #4
On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> [...]
> >>>
> >>> It should also be noted that it is feasible to split some of the ingress
> >>> datapath into XDP first and more into TC later (as was shown above for
> >>> example where the parser runs at XDP level). YMMV.
> >>> Regardless of choice of which scheme to use, none of these will affect
> >>> UAPI. It will all depend on whether you generate code to load on XDP vs
> >>> tc, etc.
> >>>
> >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> >>
> >> My objections from last iterations still stand, and I also added a nak,
> >> so please do not just drop it with new revisions.. from the v10 as you
> >> wrote you added further code but despite the various community feedback
> >> the design still stands as before, therefore:
> >>
> >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> >
> > We didnt make code changes - but did you read the cover letter and the
> > extended commentary in this patch's commit log? We should have
> > mentioned it in the changes log. It did respond to your comments.
> > There's text that says "the filter manages the lifetime of the
> > pipeline" - which in the future could include not only tc but XDP but
> > also the hardware path (in the form of a file that gets loaded). I am
> > not sure if that message is clear. Your angle being this is layer
> > violation. In the last discussion i asked you for suggestions and we
> > went the tcx route, which didnt make sense, and  then you didnt
> > respond.
> [...]
>
> >> Also as mentioned earlier I don't think tc should hold references on
> >> XDP programs in here. It doesn't make any sense aside from the fact
> >> that the cls_p4 is also not doing anything with it. This is something
> >> that a user space control plane should be doing i.e. managing a XDP
> >> link on the target device.
> >
> > This is the same argument about layer violation that you made earlier.
> > The filter manages the p4 pipeline - i.e it's not just about the ebpf
> > blob(s) but for example in the future (discussions are still ongoing
> > with vendors who have P4 NICs) a filter could be loaded to also
> > specify the location of the hardware blob.
>
> Ah, so there is a plan to eventually add HW offload support for cls_p4?
> Or is this only specifiying a location of a blob through some opaque
> cookie value from user space?

Our current thought process is that it will be something along these lines
(the commit provides more details):

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "mypnameprog.o" ... \
   prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
   action bpf obj $PROGNAME.o section prog/tc-ingress

These discussions are still ongoing - but that is the current
consensus. Note: we are not pushing any code for that, but hope it
paints the bigger picture...
The idea is that cls_p4 owns the lifetime of the pipeline. Installing
the filter instantiates the P4 pipeline "simple_l3" and triggers a lot
of the refcounts that make sure the pipeline and its components stay
alive.
There could be multiple such filters - when someone deletes the last
filter, it is safe to delete the pipeline.
Essentially, the filter manages the lifetime of the pipeline.

> > I would be happy with a suggestion that gets us moving forward with
> > that context in mind.
>
> My question on the above is mainly what does it bring you to hold a
> reference on the XDP program? There is no guarantee that something else
> will get loaded onto XDP, and then eventually the cls_p4 is the only
> entity holding the reference but w/o 'purpose'. We do have BPF links
> and the user space component orchestrating all this needs to create
> and pin the BPF link in BPF fs, for example. An artificial reference
> on XDP prog feels similar as if you'd hold a reference on an inode
> out of tc.. Again, that should be delegated to the control plane you
> have running interacting with the compiler which then manages and
> loads its artifacts. What if you would also need to set up some
> netfilter rules for the SW pipeline, would you then embed this too?

Sorry, a slight tangent first:
P4 is self-contained; there are a handful of objects defined
by the spec (externs, actions, tables, etc.) and we model them in the
patchset, so that part is self-contained. For the extra richness such
as the netfilter example you quoted - based on my many years of
experience deploying SDN - using daemons (sorry if I am reading too
much into what I think you are implying) for control is not the best
option, i.e. you need all kinds of coordination - for example, where do
you store state, what happens when the daemon dies, how do you do
graceful restarts, etc. Based on that, if I can put things in the
kernel (which is essentially a "perpetual daemon", unless the kernel
crashes) it's a lot simpler to manage as a source of truth, especially
when there is not that much info. There is a limit when there are
multiple pieces (to use your netfilter example) because you need
another layer to coordinate things.

Re: the XDP part - our key reason is mostly managerial, in that the
filter is the lifetime manager of the pipeline, and that if I dump
that filter I can see all the details in regards to the pipeline (tc,
XDP and, in future, hw, etc.) in one spot. You are right, the link
pinning is our protection from someone replacing the XDP prog (this
was a tip from Toke in the early days), and the comparison to tc
holding an inode is apropos.
There's some history: in the early days we were also using metadata
which came from the XDP program at the tc layer if more processing
was to be done (and there was extra metadata which told us which XDP
prog produced it, which we would vet before trusting the metadata).
Given all the above, we should still be able to hold this info without
necessarily holding the extra refcount and be able to see this detail.
So we can remove the refcounting.

cheers,
jamal

> Thanks,
> Daniel
Jamal Hadi Salim Feb. 16, 2024, 9:18 p.m. UTC | #5
On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> > > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> > [...]
> > >>>
> > >>> It should also be noted that it is feasible to split some of the ingress
> > >>> datapath into XDP first and more into TC later (as was shown above for
> > >>> example where the parser runs at XDP level). YMMV.
> > >>> Regardless of choice of which scheme to use, none of these will affect
> > >>> UAPI. It will all depend on whether you generate code to load on XDP vs
> > >>> tc, etc.
> > >>>
> > >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > >>
> > >> My objections from last iterations still stand, and I also added a nak,
> > >> so please do not just drop it with new revisions.. from the v10 as you
> > >> wrote you added further code but despite the various community feedback
> > >> the design still stands as before, therefore:
> > >>
> > >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> > >
> > > We didnt make code changes - but did you read the cover letter and the
> > > extended commentary in this patch's commit log? We should have
> > > mentioned it in the changes log. It did respond to your comments.
> > > There's text that says "the filter manages the lifetime of the
> > > pipeline" - which in the future could include not only tc but XDP but
> > > also the hardware path (in the form of a file that gets loaded). I am
> > > not sure if that message is clear. Your angle being this is layer
> > > violation. In the last discussion i asked you for suggestions and we
> > > went the tcx route, which didnt make sense, and  then you didnt
> > > respond.
> > [...]
> >
> > >> Also as mentioned earlier I don't think tc should hold references on
> > >> XDP programs in here. It doesn't make any sense aside from the fact
> > >> that the cls_p4 is also not doing anything with it. This is something
> > >> that a user space control plane should be doing i.e. managing a XDP
> > >> link on the target device.
> > >
> > > This is the same argument about layer violation that you made earlier.
> > > The filter manages the p4 pipeline - i.e it's not just about the ebpf
> > > blob(s) but for example in the future (discussions are still ongoing
> > > with vendors who have P4 NICs) a filter could be loaded to also
> > > specify the location of the hardware blob.
> >
> > Ah, so there is a plan to eventually add HW offload support for cls_p4?
> > Or is this only specifiying a location of a blob through some opaque
> > cookie value from user space?
>
> Current thought process is it will be something along these lines (the
> commit provides more details):
>
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>    prog type hw filename "mypnameprog.o" ... \
>    prog type xdp obj $PARSER.o section parser/xdp pinned_link
> /sys/fs/bpf/mylink \
>    action bpf obj $PROGNAME.o section prog/tc-ingress
>
> These discussions are still ongoing - but that is the current
> consensus. Note: we are not pushing any code for that, but hope it
> paints the bigger picture....
> The idea is the cls p4 owns the lifetime of the pipeline. Installing
> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
> of the refcounts to make sure the pipeline and its components stays
> alive.
> There could be multiple such filters - when someone deletes the last
> filter, then it is safe to delete the pipeline.
> Essentially the filter manages the lifetime of the pipeline.
>
> > > I would be happy with a suggestion that gets us moving forward with
> > > that context in mind.
> >
> > My question on the above is mainly what does it bring you to hold a
> > reference on the XDP program? There is no guarantee that something else
> > will get loaded onto XDP, and then eventually the cls_p4 is the only
> > entity holding the reference but w/o 'purpose'. We do have BPF links
> > and the user space component orchestrating all this needs to create
> > and pin the BPF link in BPF fs, for example. An artificial reference
> > on XDP prog feels similar as if you'd hold a reference on an inode
> > out of tc.. Again, that should be delegated to the control plane you
> > have running interacting with the compiler which then manages and
> > loads its artifacts. What if you would also need to set up some
> > netfilter rules for the SW pipeline, would you then embed this too?
>
> Sorry, a slight tangent first:
> P4 is self-contained, there are a handful of objects that are defined
> by the spec (externs, actions, tables, etc) and we model them in the
> patchset, so that part is self-contained. For the extra richness such
> as the netfilter example you quoted - based on my many years of
> experience deploying SDN - using daemons(sorry if i am reading too
> much in what I think you are implying) for control is not the best
> option i.e you need all kinds of coordination - for example where do
> you store state, what happens when the daemon dies, how do you
> graceful restarts etc. Based on that, if i can put things in the
> kernel (which is essentially a "perpetual daemon", unless the kernel
> crashes) it's a lot simpler to manage as a source of truth especially
> when there is not that much info. There is a limit when there are
> multiple pieces (to use your netfilter example) because you need
> another layer to coordinate things.
>
> Re: the XDP part - our key reason is mostly managerial, in that the
> filter is the lifetime manager of the pipeline; and that if i dump
> that filter i can see all the details in regards to the pipeline(tc,
> XDP and in future hw, etc) in one spot. You are right, the link
> pinning is our protection from someone replacing the XDP prog (this
> was a tip from Toke in the early days) and the comparison of tc
> holding inode is apropos.
> There's some history: in the early days we were also using metadata
> which comes from the XDP program at the tc layer if more processing
> was to be done (and there was extra metadata which told us which XDP
> prog produced it which we would vet before trusting the metadata).
> Given all the above, we should still be able to hold this info without
> necessarily holding the extra refcount and be able to see this detail.
> So we can remove the refcounting.
>

Daniel?

cheers,
jamal


> cheers,
> jamal
>
> > Thanks,
> > Daniel
Daniel Borkmann Feb. 20, 2024, 3:31 p.m. UTC | #6
On 2/16/24 10:18 PM, Jamal Hadi Salim wrote:
> On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
>>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
>>> [...]
>>>>>>
>>>>>> It should also be noted that it is feasible to split some of the ingress
>>>>>> datapath into XDP first and more into TC later (as was shown above for
>>>>>> example where the parser runs at XDP level). YMMV.
>>>>>> Regardless of choice of which scheme to use, none of these will affect
>>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs
>>>>>> tc, etc.
>>>>>>
>>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
>>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
>>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
>>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>>>>>
>>>>> My objections from last iterations still stand, and I also added a nak,
>>>>> so please do not just drop it with new revisions.. from the v10 as you
>>>>> wrote you added further code but despite the various community feedback
>>>>> the design still stands as before, therefore:
>>>>>
>>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>
>>>> We didnt make code changes - but did you read the cover letter and the
>>>> extended commentary in this patch's commit log? We should have
>>>> mentioned it in the changes log. It did respond to your comments.
>>>> There's text that says "the filter manages the lifetime of the
>>>> pipeline" - which in the future could include not only tc but XDP but
>>>> also the hardware path (in the form of a file that gets loaded). I am
>>>> not sure if that message is clear. Your angle being this is layer
>>>> violation. In the last discussion i asked you for suggestions and we
>>>> went the tcx route, which didnt make sense, and  then you didnt
>>>> respond.
>>> [...]
>>>
>>>>> Also as mentioned earlier I don't think tc should hold references on
>>>>> XDP programs in here. It doesn't make any sense aside from the fact
>>>>> that the cls_p4 is also not doing anything with it. This is something
>>>>> that a user space control plane should be doing i.e. managing a XDP
>>>>> link on the target device.
>>>>
>>>> This is the same argument about layer violation that you made earlier.
>>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf
>>>> blob(s) but for example in the future (discussions are still ongoing
>>>> with vendors who have P4 NICs) a filter could be loaded to also
>>>> specify the location of the hardware blob.
>>>
>>> Ah, so there is a plan to eventually add HW offload support for cls_p4?
>>> Or is this only specifiying a location of a blob through some opaque
>>> cookie value from user space?
>>
>> Current thought process is it will be something along these lines (the
>> commit provides more details):
>>
>> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>>     prog type hw filename "mypnameprog.o" ... \
>>     prog type xdp obj $PARSER.o section parser/xdp pinned_link
>> /sys/fs/bpf/mylink \
>>     action bpf obj $PROGNAME.o section prog/tc-ingress
>>
>> These discussions are still ongoing - but that is the current
>> consensus. Note: we are not pushing any code for that, but hope it
>> paints the bigger picture....
>> The idea is the cls p4 owns the lifetime of the pipeline. Installing
>> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
>> of the refcounts to make sure the pipeline and its components stays
>> alive.
>> There could be multiple such filters - when someone deletes the last
>> filter, then it is safe to delete the pipeline.
>> Essentially the filter manages the lifetime of the pipeline.
>>
>>>> I would be happy with a suggestion that gets us moving forward with
>>>> that context in mind.
>>>
>>> My question on the above is mainly what does it bring you to hold a
>>> reference on the XDP program? There is no guarantee that something else
>>> will get loaded onto XDP, and then eventually the cls_p4 is the only
>>> entity holding the reference but w/o 'purpose'. We do have BPF links
>>> and the user space component orchestrating all this needs to create
>>> and pin the BPF link in BPF fs, for example. An artificial reference
>>> on XDP prog feels similar as if you'd hold a reference on an inode
>>> out of tc.. Again, that should be delegated to the control plane you
>>> have running interacting with the compiler which then manages and
>>> loads its artifacts. What if you would also need to set up some
>>> netfilter rules for the SW pipeline, would you then embed this too?
>>
>> Sorry, a slight tangent first:
>> P4 is self-contained, there are a handful of objects that are defined
>> by the spec (externs, actions, tables, etc) and we model them in the
>> patchset, so that part is self-contained. For the extra richness such
>> as the netfilter example you quoted - based on my many years of
>> experience deploying SDN - using daemons(sorry if i am reading too
>> much in what I think you are implying) for control is not the best
>> option i.e you need all kinds of coordination - for example where do
>> you store state, what happens when the daemon dies, how do you
>> graceful restarts etc. Based on that, if i can put things in the
>> kernel (which is essentially a "perpetual daemon", unless the kernel
>> crashes) it's a lot simpler to manage as a source of truth especially
>> when there is not that much info. There is a limit when there are
>> multiple pieces (to use your netfilter example) because you need
>> another layer to coordinate things.

'source of truth' for the various attach points or BPF links, yes, but in
this case it is not, since the source of truth on what is attached
is not in cls_p4 but rather on the XDP link. How do you handle the case
when cls_p4 says something different from what is /actually/ attached? Why
is it not enough to establish some convention in user space to pin the
link and retrieve/update from there when needed, like everyone else does -
even if you consider iproute2 your "control plane" (which I have the
feeling you do)?

>> Re: the XDP part - our key reason is mostly managerial, in that the
>> filter is the lifetime manager of the pipeline; and that if i dump

This is imho the problematic part, which feels like a square peg in a round
hole, trying to fit this whole lifetime manager of the pipeline into
the cls_p4 filter. We agree to disagree here. Instead of reusing
individual building blocks from user space, this tries to cram control
plane parts into the kernel, which is not a great fit with what is
built here as-is.

>> that filter i can see all the details in regards to the pipeline(tc,
>> XDP and in future hw, etc) in one spot. You are right, the link
>> pinning is our protection from someone replacing the XDP prog (this
>> was a tip from Toke in the early days) and the comparison of tc
>> holding inode is apropos.
>> There's some history: in the early days we were also using metadata
>> which comes from the XDP program at the tc layer if more processing
>> was to be done (and there was extra metadata which told us which XDP
>> prog produced it which we would vet before trusting the metadata).
>> Given all the above, we should still be able to hold this info without
>> necessarily holding the extra refcount and be able to see this detail.
>> So we can remove the refcounting.
> 
> Daniel?

The refcount should definitely be removed, but then again, see the point
above in that it is inconsistent information. Why can't this be done in
user space with some convention in your user space control plane - if you
take iproute2, why can it not pin the link in a BPF fs instance and
retrieve it from there?
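
As a concrete sketch of that last point (the pin path and function name are
made up): once the control plane has pinned the link in the BPF fs,
retrieving it later and swapping the program behind it is just:

#include <bpf/bpf.h>

int swap_parser(const char *pin_path, int new_prog_fd)
{
	int link_fd;

	/* Open the link previously pinned in the BPF fs */
	link_fd = bpf_obj_get(pin_path);
	if (link_fd < 0)
		return link_fd;

	/* Atomically replace the XDP program behind the link */
	return bpf_link_update(link_fd, new_prog_fd, NULL);
}
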
Jamal Hadi Salim Feb. 21, 2024, 2:51 p.m. UTC | #7
On Tue, Feb 20, 2024 at 10:49 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 2/16/24 10:18 PM, Jamal Hadi Salim wrote:
> > On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> >>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> >>> [...]
> >>>>>>
> >>>>>> It should also be noted that it is feasible to split some of the ingress
> >>>>>> datapath into XDP first and more into TC later (as was shown above for
> >>>>>> example where the parser runs at XDP level). YMMV.
> >>>>>> Regardless of choice of which scheme to use, none of these will affect
> >>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs
> >>>>>> tc, etc.
> >>>>>>
> >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> >>>>>
> >>>>> My objections from last iterations still stand, and I also added a nak,
> >>>>> so please do not just drop it with new revisions.. from the v10 as you
> >>>>> wrote you added further code but despite the various community feedback
> >>>>> the design still stands as before, therefore:
> >>>>>
> >>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>>>
> >>>> We didnt make code changes - but did you read the cover letter and the
> >>>> extended commentary in this patch's commit log? We should have
> >>>> mentioned it in the changes log. It did respond to your comments.
> >>>> There's text that says "the filter manages the lifetime of the
> >>>> pipeline" - which in the future could include not only tc but XDP but
> >>>> also the hardware path (in the form of a file that gets loaded). I am
> >>>> not sure if that message is clear. Your angle being this is layer
> >>>> violation. In the last discussion i asked you for suggestions and we
> >>>> went the tcx route, which didnt make sense, and  then you didnt
> >>>> respond.
> >>> [...]
> >>>
> >>>>> Also as mentioned earlier I don't think tc should hold references on
> >>>>> XDP programs in here. It doesn't make any sense aside from the fact
> >>>>> that the cls_p4 is also not doing anything with it. This is something
> >>>>> that a user space control plane should be doing i.e. managing a XDP
> >>>>> link on the target device.
> >>>>
> >>>> This is the same argument about layer violation that you made earlier.
> >>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf
> >>>> blob(s) but for example in the future (discussions are still ongoing
> >>>> with vendors who have P4 NICs) a filter could be loaded to also
> >>>> specify the location of the hardware blob.
> >>>
> >>> Ah, so there is a plan to eventually add HW offload support for cls_p4?
> >>> Or is this only specifiying a location of a blob through some opaque
> >>> cookie value from user space?
> >>
> >> Current thought process is it will be something along these lines (the
> >> commit provides more details):
> >>
> >> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
> >>     prog type hw filename "mypnameprog.o" ... \
> >>     prog type xdp obj $PARSER.o section parser/xdp pinned_link
> >> /sys/fs/bpf/mylink \
> >>     action bpf obj $PROGNAME.o section prog/tc-ingress
> >>
> >> These discussions are still ongoing - but that is the current
> >> consensus. Note: we are not pushing any code for that, but hope it
> >> paints the bigger picture....
> >> The idea is the cls p4 owns the lifetime of the pipeline. Installing
> >> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
> >> of the refcounts to make sure the pipeline and its components stays
> >> alive.
> >> There could be multiple such filters - when someone deletes the last
> >> filter, then it is safe to delete the pipeline.
> >> Essentially the filter manages the lifetime of the pipeline.
> >>
> >>>> I would be happy with a suggestion that gets us moving forward with
> >>>> that context in mind.
> >>>
> >>> My question on the above is mainly what does it bring you to hold a
> >>> reference on the XDP program? There is no guarantee that something else
> >>> will get loaded onto XDP, and then eventually the cls_p4 is the only
> >>> entity holding the reference but w/o 'purpose'. We do have BPF links
> >>> and the user space component orchestrating all this needs to create
> >>> and pin the BPF link in BPF fs, for example. An artificial reference
> >>> on XDP prog feels similar as if you'd hold a reference on an inode
> >>> out of tc.. Again, that should be delegated to the control plane you
> >>> have running interacting with the compiler which then manages and
> >>> loads its artifacts. What if you would also need to set up some
> >>> netfilter rules for the SW pipeline, would you then embed this too?
> >>
> >> Sorry, a slight tangent first:
> >> P4 is self-contained, there are a handful of objects that are defined
> >> by the spec (externs, actions, tables, etc) and we model them in the
> >> patchset, so that part is self-contained. For the extra richness such
> >> as the netfilter example you quoted - based on my many years of
> >> experience deploying SDN - using daemons(sorry if i am reading too
> >> much in what I think you are implying) for control is not the best
> >> option i.e you need all kinds of coordination - for example where do
> >> you store state, what happens when the daemon dies, how do you
> >> graceful restarts etc. Based on that, if i can put things in the
> >> kernel (which is essentially a "perpetual daemon", unless the kernel
> >> crashes) it's a lot simpler to manage as a source of truth especially
> >> when there is not that much info. There is a limit when there are
> >> multiple pieces (to use your netfilter example) because you need
> >> another layer to coordinate things.
>
> 'source of truth' for the various attach points or BPF links, yes, but in
> this case here it is not, since the source of truth on what is attached
> is not in cls_p4 but rather on the XDP link. How do you handle the case
> when cls_p4 says something different to what is /actually/ attached? Why
> is it not enough to establish some convention in user space, to pin the
> link and retrieve/update from there when needed? Like everyone else does.
> ... even if you consider iproute2 your "control plane" (which I have the
> feeling you do)?
>
> >> Re: the XDP part - our key reason is mostly managerial, in that the
> >> filter is the lifetime manager of the pipeline; and that if i dump
>
> This is imho the problematic part which feels like square peg in round
> hole, trying to fit this whole lifetime manager of the pipeline into
> the cls_p4 filter. We agree to disagree here. Instead of reusing
> individual building blocks from user space, this tries to cramp control
> plane parts into the kernel for which its not a great fit with what is
> build here as-is.
>
> >> that filter i can see all the details in regards to the pipeline(tc,
> >> XDP and in future hw, etc) in one spot. You are right, the link
> >> pinning is our protection from someone replacing the XDP prog (this
> >> was a tip from Toke in the early days) and the comparison of tc
> >> holding inode is apropos.
> >> There's some history: in the early days we were also using metadata
> >> which comes from the XDP program at the tc layer if more processing
> >> was to be done (and there was extra metadata which told us which XDP
> >> prog produced it which we would vet before trusting the metadata).
> >> Given all the above, we should still be able to hold this info without
> >> necessarily holding the extra refcount and be able to see this detail.
> >> So we can remove the refcounting.
> >
> > Daniel?
>
> The refcount should definitely be removed, but then again, see the point
> above in that it is inconsistent information. Why can't this be done in
> user space with some convention in your user space control plane - if you
> take iproute2, then why it cannot pin the link in a bpf fs instance and
> retrieve it from there?

Ok, Daniel - let's do this so we can move forward. I am getting
exhausted; we've been going at this for a year now. As a compromise, I
will remove the support for XDP altogether from the filter. We will
still reference the XDP program in the CLI and in fact load and pin it
that way, but the filter will not be adding a refcount in the kernel as
in the posted patch.

cheers,
jamal

Patch

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index dd313a727..e4d89bc98 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -692,6 +692,24 @@  enum {
 
 #define TCA_MATCHALL_MAX (__TCA_MATCHALL_MAX - 1)
 
+/* P4 classifier */
+
+enum {
+	TCA_P4_UNSPEC,
+	TCA_P4_CLASSID,
+	TCA_P4_ACT,
+	TCA_P4_PNAME,
+	TCA_P4_PIPEID,
+	TCA_P4_PROG_FD,
+	TCA_P4_PROG_NAME,
+	TCA_P4_PROG_TYPE,
+	TCA_P4_PROG_ID,
+	TCA_P4_PAD,
+	__TCA_P4_MAX,
+};
+
+#define TCA_P4_MAX (__TCA_P4_MAX - 1)
+
 /* Extended Matches */
 
 struct tcf_ematch_tree_hdr {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index b1798a5e1..b33d790e1 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -565,6 +565,18 @@  config NET_CLS_MATCHALL
 	  To compile this code as a module, choose M here: the module will
 	  be called cls_matchall.
 
+config NET_CLS_P4
+	tristate "P4 classifier"
+	select NET_CLS
+	select NET_P4TC
+	help
+	  If you say Y here, you will be able to bind a P4 pipeline
+	  program. To use this feature, you will first need to successfully
+	  install a P4 template representing the program.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called cls_p4.
+
 config NET_EMATCH
 	bool "Extended Matches"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 581f9dd69..b4f9ef48d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -72,6 +72,7 @@  obj-$(CONFIG_NET_CLS_CGROUP)	+= cls_cgroup.o
 obj-$(CONFIG_NET_CLS_BPF)	+= cls_bpf.o
 obj-$(CONFIG_NET_CLS_FLOWER)	+= cls_flower.o
 obj-$(CONFIG_NET_CLS_MATCHALL)	+= cls_matchall.o
+obj-$(CONFIG_NET_CLS_P4)	+= cls_p4.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)	+= em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
diff --git a/net/sched/cls_p4.c b/net/sched/cls_p4.c
new file mode 100644
index 000000000..bd83d4e42
--- /dev/null
+++ b/net/sched/cls_p4.c
@@ -0,0 +1,450 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/cls_p4.c - P4 Classifier
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+
+#include <net/p4tc.h>
+
+#include "p4tc/trace.h"
+
+#define CLS_P4_PROG_NAME_LEN	256
+
+struct p4tc_bpf_prog {
+	struct bpf_prog *p4_prog;
+	const char *p4_prog_name;
+};
+
+struct cls_p4_head {
+	struct tcf_exts exts;
+	struct tcf_result res;
+	struct rcu_work rwork;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_bpf_prog *prog;
+	u32 handle;
+};
+
+static int p4_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+		       struct tcf_result *res)
+{
+	struct cls_p4_head *head = rcu_dereference_bh(tp->root);
+	bool at_ingress = skb_at_tc_ingress(skb);
+
+	if (unlikely(!head)) {
+		pr_err("P4 classifier not found\n");
+		return -1;
+	}
+
+	/* head->prog is the eBPF program the data plane will execute first.
+	 * It may or may not exist. In addition to head->prog, another eBPF
+	 * program may execute after this one in the form of a filter
+	 * action (head->exts).
+	 * head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT means the
+	 * program executes here in the TC P4 filter.
+	 * head->prog->p4_prog->type == BPF_PROG_TYPE_XDP means the
+	 * program was loaded at the XDP level.
+	 */
+	if (head->prog) {
+		int rc = TC_ACT_PIPE;
+
+		/* If eBPF program is loaded into TC */
+		if (head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT) {
+			if (at_ingress) {
+				/* It is safe to push/pull even if
+				 * skb_shared()
+				 */
+				__skb_push(skb, skb->mac_len);
+				bpf_compute_data_pointers(skb);
+				rc = bpf_prog_run(head->prog->p4_prog,
+						  skb);
+				__skb_pull(skb, skb->mac_len);
+			} else {
+				bpf_compute_data_pointers(skb);
+				rc = bpf_prog_run(head->prog->p4_prog,
+						  skb);
+			}
+		}
+
+		if (rc != TC_ACT_PIPE)
+			return rc;
+	}
+
+	trace_p4_classify(skb, head->pipeline);
+
+	*res = head->res;
+
+	return tcf_exts_exec(skb, &head->exts, res);
+}
+
+static int p4_init(struct tcf_proto *tp)
+{
+	return 0;
+}
+
+static void p4_bpf_prog_destroy(struct p4tc_bpf_prog *prog)
+{
+	bpf_prog_put(prog->p4_prog);
+	kfree(prog->p4_prog_name);
+	kfree(prog);
+}
+
+static void __p4_destroy(struct cls_p4_head *head)
+{
+	tcf_exts_destroy(&head->exts);
+	tcf_exts_put_net(&head->exts);
+	if (head->prog)
+		p4_bpf_prog_destroy(head->prog);
+	p4tc_pipeline_put(head->pipeline);
+	kfree(head);
+}
+
+static void p4_destroy_work(struct work_struct *work)
+{
+	struct cls_p4_head *head =
+		container_of(to_rcu_work(work), struct cls_p4_head, rwork);
+
+	rtnl_lock();
+	__p4_destroy(head);
+	rtnl_unlock();
+}
+
+static void p4_destroy(struct tcf_proto *tp, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (!head)
+		return;
+
+	tcf_unbind_filter(tp, &head->res);
+
+	if (tcf_exts_get_net(&head->exts))
+		tcf_queue_work(&head->rwork, p4_destroy_work);
+	else
+		__p4_destroy(head);
+}
+
+static void *p4_get(struct tcf_proto *tp, u32 handle)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (head && head->handle == handle)
+		return head;
+
+	return NULL;
+}
+
+static const struct nla_policy p4_policy[TCA_P4_MAX + 1] = {
+	[TCA_P4_UNSPEC] = { .type = NLA_UNSPEC },
+	[TCA_P4_CLASSID] = { .type = NLA_U32 },
+	[TCA_P4_ACT] = { .type = NLA_NESTED },
+	[TCA_P4_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ },
+	[TCA_P4_PIPEID] = { .type = NLA_U32 },
+	[TCA_P4_PROG_FD] = { .type = NLA_U32 },
+	[TCA_P4_PROG_NAME] = { .type = NLA_STRING,
+			       .len = CLS_P4_PROG_NAME_LEN },
+	[TCA_P4_PROG_TYPE] = { .type = NLA_U32 },
+};
+
+static int cls_p4_prog_from_efd(struct nlattr **tb,
+				struct p4tc_bpf_prog *prog, u32 flags,
+				struct netlink_ext_ack *extack)
+{
+	struct bpf_prog *fp;
+	u32 prog_type;
+	char *name;
+	u32 bpf_fd;
+
+	bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
+	prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
+
+	if (prog_type != BPF_PROG_TYPE_XDP &&
+	    prog_type != BPF_PROG_TYPE_SCHED_ACT) {
+		NL_SET_ERR_MSG(extack,
+			       "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
+		return -EINVAL;
+	}
+
+	fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
+	if (!name) {
+		bpf_prog_put(fp);
+		return -ENOMEM;
+	}
+
+	prog->p4_prog_name = name;
+	prog->p4_prog = fp;
+
+	return 0;
+}
+
+static int p4_set_parms(struct net *net, struct tcf_proto *tp,
+			struct cls_p4_head *head, unsigned long base,
+			struct nlattr **tb, struct nlattr *est, u32 flags,
+			struct netlink_ext_ack *extack)
+{
+	bool load_bpf_prog = tb[TCA_P4_PROG_NAME] && tb[TCA_P4_PROG_FD] &&
+			     tb[TCA_P4_PROG_TYPE];
+	struct p4tc_bpf_prog *prog = NULL;
+	int err;
+
+	err = tcf_exts_validate_ex(net, tp, tb, est, &head->exts, flags, 0,
+				   extack);
+	if (err < 0)
+		return err;
+
+	if (load_bpf_prog) {
+		prog = kzalloc(sizeof(*prog), GFP_KERNEL);
+		if (!prog) {
+			err = -ENOMEM;
+			goto exts_destroy;
+		}
+
+		err = cls_p4_prog_from_efd(tb, prog, flags, extack);
+		if (err < 0) {
+			kfree(prog);
+			goto exts_destroy;
+		}
+	}
+
+	if (tb[TCA_P4_CLASSID]) {
+		head->res.classid = nla_get_u32(tb[TCA_P4_CLASSID]);
+		tcf_bind_filter(tp, &head->res, base);
+	}
+
+	if (load_bpf_prog) {
+		if (head->prog) {
+			pr_notice("cls_p4: Substituting old BPF program with id %u with new one with id %u\n",
+				  head->prog->p4_prog->aux->id,
+				  prog->p4_prog->aux->id);
+			p4_bpf_prog_destroy(head->prog);
+		}
+		head->prog = prog;
+	}
+
+	return 0;
+
+exts_destroy:
+	tcf_exts_destroy(&head->exts);
+	return err;
+}
+
+static int p4_change(struct net *net, struct sk_buff *in_skb,
+		     struct tcf_proto *tp, unsigned long base, u32 handle,
+		     struct nlattr **tca, void **arg, u32 flags,
+		     struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+	struct p4tc_pipeline *pipeline = NULL;
+	struct nlattr *tb[TCA_P4_MAX + 1];
+	struct cls_p4_head *new_cls;
+	char *pname = NULL;
+	u32 pipeid = 0;
+	int err;
+
+	if (!tca[TCA_OPTIONS]) {
+		NL_SET_ERR_MSG(extack, "Must provide pipeline options");
+		return -EINVAL;
+	}
+
+	if (head)
+		return -EEXIST;
+
+	err = nla_parse_nested(tb, TCA_P4_MAX, tca[TCA_OPTIONS], p4_policy,
+			       extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_P4_PNAME])
+		pname = nla_data(tb[TCA_P4_PNAME]);
+
+	if (tb[TCA_P4_PIPEID])
+		pipeid = nla_get_u32(tb[TCA_P4_PIPEID]);
+
+	pipeline = p4tc_pipeline_find_get(net, pname, pipeid, extack);
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	if (!p4tc_pipeline_sealed(pipeline)) {
+		err = -EINVAL;
+		NL_SET_ERR_MSG(extack, "Pipeline must be sealed before use");
+		goto pipeline_put;
+	}
+
+	new_cls = kzalloc(sizeof(*new_cls), GFP_KERNEL);
+	if (!new_cls) {
+		err = -ENOMEM;
+		goto pipeline_put;
+	}
+
+	err = tcf_exts_init(&new_cls->exts, net, TCA_P4_ACT, 0);
+	if (err)
+		goto err_exts_init;
+
+	if (!handle)
+		handle = 1;
+
+	new_cls->handle = handle;
+
+	err = p4_set_parms(net, tp, new_cls, base, tb, tca[TCA_RATE], flags,
+			   extack);
+	if (err)
+		goto err_set_parms;
+
+	new_cls->pipeline = pipeline;
+	*arg = head;
+	rcu_assign_pointer(tp->root, new_cls);
+	return 0;
+
+err_set_parms:
+	tcf_exts_destroy(&new_cls->exts);
+err_exts_init:
+	kfree(new_cls);
+pipeline_put:
+	p4tc_pipeline_put(pipeline);
+	return err;
+}
+
+static int p4_delete(struct tcf_proto *tp, void *arg, bool *last,
+		     bool rtnl_held, struct netlink_ext_ack *extack)
+{
+	*last = true;
+	return 0;
+}
+
+static void p4_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		    bool rtnl_held)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (arg->count < arg->skip)
+		goto skip;
+
+	if (!head)
+		return;
+	if (arg->fn(tp, head, arg) < 0)
+		arg->stop = 1;
+skip:
+	arg->count++;
+}
+
+static int p4_prog_dump(struct sk_buff *skb, struct p4tc_bpf_prog *prog)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+
+	if (nla_put_u32(skb, TCA_P4_PROG_ID, prog->p4_prog->aux->id))
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, TCA_P4_PROG_NAME, prog->p4_prog_name))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_P4_PROG_TYPE, prog->p4_prog->type))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4_dump(struct net *net, struct tcf_proto *tp, void *fh,
+		   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
+{
+	struct cls_p4_head *head = fh;
+	struct nlattr *nest;
+
+	if (!head)
+		return skb->len;
+
+	t->tcm_handle = head->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, TCA_P4_PNAME, head->pipeline->common.name))
+		goto nla_put_failure;
+
+	if (head->res.classid &&
+	    nla_put_u32(skb, TCA_P4_CLASSID, head->res.classid))
+		goto nla_put_failure;
+
+	if (head->prog && p4_prog_dump(skb, head->prog))
+		goto nla_put_failure;
+
+	if (tcf_exts_dump(skb, &head->exts))
+		goto nla_put_failure;
+
+	nla_nest_end(skb, nest);
+
+	if (tcf_exts_dump_stats(skb, &head->exts) < 0)
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static void p4_bind_class(void *fh, u32 classid, unsigned long cl, void *q,
+			  unsigned long base)
+{
+	struct cls_p4_head *head = fh;
+
+	if (head && head->res.classid == classid) {
+		if (cl)
+			__tcf_bind_filter(q, &head->res, base);
+		else
+			__tcf_unbind_filter(q, &head->res);
+	}
+}
+
+static struct tcf_proto_ops cls_p4_ops __read_mostly = {
+	.kind		= "p4",
+	.classify	= p4_classify,
+	.init		= p4_init,
+	.destroy	= p4_destroy,
+	.get		= p4_get,
+	.change		= p4_change,
+	.delete		= p4_delete,
+	.walk		= p4_walk,
+	.dump		= p4_dump,
+	.bind_class	= p4_bind_class,
+	.owner		= THIS_MODULE,
+};
+
+static int __init cls_p4_init(void)
+{
+	return register_tcf_proto_ops(&cls_p4_ops);
+}
+
+static void __exit cls_p4_exit(void)
+{
+	unregister_tcf_proto_ops(&cls_p4_ops);
+}
+
+module_init(cls_p4_init);
+module_exit(cls_p4_exit);
+
+MODULE_AUTHOR("Mojatatu Networks");
+MODULE_DESCRIPTION("P4 Classifier");
+MODULE_LICENSE("GPL");
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 73ccb53c4..04302a3ac 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,6 +1,8 @@ 
 # SPDX-License-Identifier: GPL-2.0
 
+CFLAGS_trace.o := -I$(src)
+
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
 	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
-	p4tc_filter.o p4tc_runtime_api.o
+	p4tc_filter.o p4tc_runtime_api.o trace.o
 obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/trace.c b/net/sched/p4tc/trace.c
new file mode 100644
index 000000000..683313407
--- /dev/null
+++ b/net/sched/p4tc/trace.c
@@ -0,0 +1,10 @@ 
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+#include <net/p4tc.h>
+
+#ifndef __CHECKER__
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+EXPORT_TRACEPOINT_SYMBOL_GPL(p4_classify);
+#endif
diff --git a/net/sched/p4tc/trace.h b/net/sched/p4tc/trace.h
new file mode 100644
index 000000000..80abec13b
--- /dev/null
+++ b/net/sched/p4tc/trace.h
@@ -0,0 +1,44 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM p4tc
+
+#if !defined(__P4TC_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define __P4TC_TRACE_H_
+
+#include <linux/tracepoint.h>
+
+struct p4tc_pipeline;
+
+TRACE_EVENT(p4_classify,
+	    TP_PROTO(struct sk_buff *skb, struct p4tc_pipeline *pipeline),
+
+	    TP_ARGS(skb, pipeline),
+
+	    TP_STRUCT__entry(__string(pname, pipeline->common.name)
+			     __field(u32,  p_id)
+			     __field(u32,  ifindex)
+			     __field(u32,  ingress)
+			    ),
+
+	    TP_fast_assign(__assign_str(pname, pipeline->common.name);
+			   __entry->p_id = pipeline->common.p_id;
+			   __entry->ifindex = skb->dev->ifindex;
+			   __entry->ingress = skb_at_tc_ingress(skb);
+			  ),
+
+	    TP_printk("dev=%u dir=%s pipeline=%s p_id=%u",
+		      __entry->ifindex,
+		      __entry->ingress ? "ingress" : "egress",
+		      __get_str(pname),
+		      __entry->p_id
+		     )
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+#include <trace/define_trace.h>