[v10,net-next,15/15] p4tc: add P4 classifier

Message ID	20240122194801.152658-16-jhs@mojatatu.com (mailing list archive)
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	show Received: from mail-qk1-f170.google.com (mail-qk1-f170.google.com [209.85.222.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4A11146547 for <bpf@vger.kernel.org>; Mon, 22 Jan 2024 19:48:32 +0000 (UTC) From: Jamal Hadi Salim <jhs@mojatatu.com> To: netdev@vger.kernel.org Cc: deb.chatterjee@intel.com, anjali.singhai@intel.com, namrata.limaye@intel.com, tom@sipanda.io, mleitner@redhat.com, Mahesh.Shirshyad@amd.com, tomasz.osinski@intel.com, jiri@resnulli.us, xiyou.wangcong@gmail.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, vladbu@nvidia.com, horms@kernel.org, khalidm@nvidia.com, toke@redhat.com, mattyk@nvidia.com, daniel@iogearbox.net, bpf@vger.kernel.org Subject: [PATCH v10 net-next 15/15] p4tc: add P4 classifier Date: Mon, 22 Jan 2024 14:48:01 -0500 Message-Id: <20240122194801.152658-16-jhs@mojatatu.com> In-Reply-To: <20240122194801.152658-1-jhs@mojatatu.com> References: <20240122194801.152658-1-jhs@mojatatu.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Introducing P4TC \| expand [v10,net-next,00/15] Introducing P4TC [v10,net-next,01/15] net: sched: act_api: Introduce P4 actions list [v10,net-next,02/15] net/sched: act_api: increase action kind string length [v10,net-next,03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions [v10,net-next,04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callba… [v10,net-next,05/15] net: sched: act_api: Add support for preallocated P4 action instances [v10,net-next,06/15] p4tc: add P4 data types [v10,net-next,07/15] p4tc: add template API [v10,net-next,08/15] p4tc: add template pipeline create, get, update, delete [v10,net-next,09/15] p4tc: add template action create, update, delete, get, flush and dump [v10,net-next,10/15] p4tc: add runtime action support [v10,net-next,11/15] p4tc: add template table create, update, delete, get, flush and dump [v10,net-next,12/15] p4tc: add runtime table entry create and update [v10,net-next,13/15] p4tc: add runtime table entry get, delete, flush and dump [v10,net-next,14/15] p4tc: add set of P4TC table kfuncs [v10,net-next,15/15] p4tc: add P4 classifier

Context	Check	Description
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for net-next, async
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 5152 this patch: 5152
netdev/build_tools	success	Errors and warnings before: 2 this patch: 0
netdev/cc_maintainers	success	CCed 0 of 0 maintainers
netdev/build_clang	success	Errors and warnings before: 1278 this patch: 1278
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 5456 this patch: 5456
netdev/checkpatch	warning	WARNING: It's generally not useful to have the filename in the file WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
netdev/contest	success	net-next-2024-01-24--18-00 (tests: 520)

Jamal Hadi Salim Jan. 22, 2024, 7:48 p.m. UTC

Introduce P4 tc classifier. The main task of this classifier is to manage
the lifetime of pipeline instances across one or more netdev ports.
Note a pipeline may be instantiated multiple times across one or more tc chains
and different priorities.

Note that part or whole of the P4 pipeline could reside in tc, XDP or even
hardware depending on how the P4 program was compiled.
To use the P4 classifier you must specify a pipeline name that will be
associated to the filter instance, a s/w parser (eBPF) and datapath P4
control block program (eBPF) program. Although this patchset does not deal
with offloads, it is also possible to load the h/w part using this filter.
We will illustrate a few examples further below to clarify. Please treat
the illustrated split as an example - there are probably more pragmatic
approaches to splitting the pipeline; however, regardless of where the different
pieces of the pipeline are placed (tc, XDP, HW) and what each layer will
implement (what part of the pipeline) - these examples are merely showing
what is possible.

The pipeline is assumed to have already been created via a template.

For example, if we were to add a filter to ingress of a group of netdevs
(tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
following command:

tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    action bpf obj $PARSER.o ... \
    action bpf obj $PROGNAME.o section prog/tc-ingress

The above uses the classical tc action mechanism in which the first action
runs the P4 parser and if that goes well then the P4 control block is
executed. Note, although not shown above, one could also append the command
line with other traditional tc actions.

In these patches, we also support two types of loadings of the pipeline
programs and differentiate between what gets loaded at say tc vs xdp by using
syntax which specifies location as either "prog type tc obj" or
"prog type xdp obj". There is an ongoing discussion in the P4TC community
biweekly meetings which is likely going to have us add another location
definition "prog type hw" which will specify the hardware object file name
and other related attributes.

An example using tc:

tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    prog type tc obj $PARSER.o ... \
    action bpf obj $PROGNAME.o section prog/tc-ingress

For XDP, to illustrate an example:

tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
    prog type xdp obj $PARSER.o section parser/xdp \
    pinned_link /sys/fs/bpf/mylink \
    action bpf obj $PROGNAME.o section prog/tc-ingress

In this case, the parser will be executed in the XDP layer and the rest of
P4 control block as a tc action.

For illustration sake, the hw one looks as follows (please note there's
still a lot of discussions going on in the meetings - the example is here
merely to illustrate the tc filter functionality):

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "mypnameprog.o" ... \
   prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
   action bpf obj $PROGNAME.o section prog/tc-ingress

The theory of operations is as follows:

================================1. PARSING================================

The packet first encounters the parser.
The parser is implemented in ebpf residing either at the TC or XDP
level. The parsed header values are stored in a shared eBPF map.
When the parser runs at XDP level, we load it into XDP using tc filter
command and pin it to a file.

=============================2. ACTIONS=============================

In the above example, the P4 program (minus the parser) is encoded in an
action($PROGNAME.o). It should be noted that classical tc actions
continue to work:
IOW, someone could decide to add a mirred action to mirror all packets
after or before the ebpf action.

tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    prog type tc obj $PARSER.o section parser/tc-ingress \
    action bpf obj $PROGNAME.o section prog/tc-ingress \
    action mirred egress mirror index 1 dev $P1 \
    action bpf obj $ANOTHERPROG.o section mysect/section-1

It should also be noted that it is feasible to split some of the ingress
datapath into XDP first and more into TC later (as was shown above for
example where the parser runs at XDP level). YMMV.
Regardless of choice of which scheme to use, none of these will affect
UAPI. It will all depend on whether you generate code to load on XDP vs
tc, etc.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/uapi/linux/pkt_cls.h |  18 ++
 net/sched/Kconfig            |  12 +
 net/sched/Makefile           |   1 +
 net/sched/cls_p4.c           | 450 +++++++++++++++++++++++++++++++++++
 net/sched/p4tc/Makefile      |   4 +-
 net/sched/p4tc/trace.c       |  10 +
 net/sched/p4tc/trace.h       |  44 ++++
 7 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

Daniel Borkmann Jan. 24, 2024, 1:59 p.m. UTC | #1

On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> Introduce P4 tc classifier. The main task of this classifier is to manage
> the lifetime of pipeline instances across one or more netdev ports.
> Note a pipeline may be instantiated multiple times across one or more tc chains
> and different priorities.
> 
> Note that part or whole of the P4 pipeline could reside in tc, XDP or even
> hardware depending on how the P4 program was compiled.
> To use the P4 classifier you must specify a pipeline name that will be
> associated to the filter instance, a s/w parser (eBPF) and datapath P4
> control block program (eBPF) program. Although this patchset does not deal
> with offloads, it is also possible to load the h/w part using this filter.
> We will illustrate a few examples further below to clarify. Please treat
> the illustrated split as an example - there are probably more pragmatic
> approaches to splitting the pipeline; however, regardless of where the different
> pieces of the pipeline are placed (tc, XDP, HW) and what each layer will
> implement (what part of the pipeline) - these examples are merely showing
> what is possible.
> 
> The pipeline is assumed to have already been created via a template.
> 
> For example, if we were to add a filter to ingress of a group of netdevs
> (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
> following command:
> 
> tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      action bpf obj $PARSER.o ... \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> The above uses the classical tc action mechanism in which the first action
> runs the P4 parser and if that goes well then the P4 control block is
> executed. Note, although not shown above, one could also append the command
> line with other traditional tc actions.
> 
> In these patches, we also support two types of loadings of the pipeline
> programs and differentiate between what gets loaded at say tc vs xdp by using
> syntax which specifies location as either "prog type tc obj" or
> "prog type xdp obj". There is an ongoing discussion in the P4TC community
> biweekly meetings which is likely going to have us add another location
> definition "prog type hw" which will specify the hardware object file name
> and other related attributes.
> 
> An example using tc:
> 
> tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      prog type tc obj $PARSER.o ... \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> For XDP, to illustrate an example:
> 
> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
>      prog type xdp obj $PARSER.o section parser/xdp \
>      pinned_link /sys/fs/bpf/mylink \
>      action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> In this case, the parser will be executed in the XDP layer and the rest of
> P4 control block as a tc action.
> 
> For illustration sake, the hw one looks as follows (please note there's
> still a lot of discussions going on in the meetings - the example is here
> merely to illustrate the tc filter functionality):
> 
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>     prog type hw filename "mypnameprog.o" ... \
>     prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
>     action bpf obj $PROGNAME.o section prog/tc-ingress
> 
> The theory of operations is as follows:
> 
> ================================1. PARSING================================
> 
> The packet first encounters the parser.
> The parser is implemented in ebpf residing either at the TC or XDP
> level. The parsed header values are stored in a shared eBPF map.
> When the parser runs at XDP level, we load it into XDP using tc filter
> command and pin it to a file.
> 
> =============================2. ACTIONS=============================
> 
> In the above example, the P4 program (minus the parser) is encoded in an
> action($PROGNAME.o). It should be noted that classical tc actions
> continue to work:
> IOW, someone could decide to add a mirred action to mirror all packets
> after or before the ebpf action.
> 
> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
>      prog type tc obj $PARSER.o section parser/tc-ingress \
>      action bpf obj $PROGNAME.o section prog/tc-ingress \
>      action mirred egress mirror index 1 dev $P1 \
>      action bpf obj $ANOTHERPROG.o section mysect/section-1
> 
> It should also be noted that it is feasible to split some of the ingress
> datapath into XDP first and more into TC later (as was shown above for
> example where the parser runs at XDP level). YMMV.
> Regardless of choice of which scheme to use, none of these will affect
> UAPI. It will all depend on whether you generate code to load on XDP vs
> tc, etc.
> 
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

My objections from last iterations still stand, and I also added a nak,
so please do not just drop it with new revisions.. from the v10 as you
wrote you added further code but despite the various community feedback
the design still stands as before, therefore:

Nacked-by: Daniel Borkmann <daniel@iogearbox.net>

[...]
> +static int cls_p4_prog_from_efd(struct nlattr **tb,
> +				struct p4tc_bpf_prog *prog, u32 flags,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct bpf_prog *fp;
> +	u32 prog_type;
> +	char *name;
> +	u32 bpf_fd;
> +
> +	bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
> +	prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
> +
> +	if (prog_type != BPF_PROG_TYPE_XDP &&
> +	    prog_type != BPF_PROG_TYPE_SCHED_ACT) {

Also as mentioned earlier I don't think tc should hold references on
XDP programs in here. It doesn't make any sense aside from the fact
that the cls_p4 is also not doing anything with it. This is something
that a user space control plane should be doing i.e. managing a XDP
link on the target device.

> +		NL_SET_ERR_MSG(extack,
> +			       "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
> +		return -EINVAL;
> +	}
> +
> +	fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
> +	if (IS_ERR(fp))
> +		return PTR_ERR(fp);
> +
> +	name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
> +	if (!name) {
> +		bpf_prog_put(fp);
> +		return -ENOMEM;
> +	}
> +
> +	prog->p4_prog_name = name;
> +	prog->p4_prog = fp;
> +
> +	return 0;
> +}
> +

Jamal Hadi Salim Jan. 24, 2024, 2:40 p.m. UTC | #2

On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> > Introduce P4 tc classifier. The main task of this classifier is to manage
> > the lifetime of pipeline instances across one or more netdev ports.
> > Note a pipeline may be instantiated multiple times across one or more tc chains
> > and different priorities.
> >
> > Note that part or whole of the P4 pipeline could reside in tc, XDP or even
> > hardware depending on how the P4 program was compiled.
> > To use the P4 classifier you must specify a pipeline name that will be
> > associated to the filter instance, a s/w parser (eBPF) and datapath P4
> > control block program (eBPF) program. Although this patchset does not deal
> > with offloads, it is also possible to load the h/w part using this filter.
> > We will illustrate a few examples further below to clarify. Please treat
> > the illustrated split as an example - there are probably more pragmatic
> > approaches to splitting the pipeline; however, regardless of where the different
> > pieces of the pipeline are placed (tc, XDP, HW) and what each layer will
> > implement (what part of the pipeline) - these examples are merely showing
> > what is possible.
> >
> > The pipeline is assumed to have already been created via a template.
> >
> > For example, if we were to add a filter to ingress of a group of netdevs
> > (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
> > following command:
> >
> > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      action bpf obj $PARSER.o ... \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > The above uses the classical tc action mechanism in which the first action
> > runs the P4 parser and if that goes well then the P4 control block is
> > executed. Note, although not shown above, one could also append the command
> > line with other traditional tc actions.
> >
> > In these patches, we also support two types of loadings of the pipeline
> > programs and differentiate between what gets loaded at say tc vs xdp by using
> > syntax which specifies location as either "prog type tc obj" or
> > "prog type xdp obj". There is an ongoing discussion in the P4TC community
> > biweekly meetings which is likely going to have us add another location
> > definition "prog type hw" which will specify the hardware object file name
> > and other related attributes.
> >
> > An example using tc:
> >
> > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      prog type tc obj $PARSER.o ... \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > For XDP, to illustrate an example:
> >
> > tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
> >      prog type xdp obj $PARSER.o section parser/xdp \
> >      pinned_link /sys/fs/bpf/mylink \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > In this case, the parser will be executed in the XDP layer and the rest of
> > P4 control block as a tc action.
> >
> > For illustration sake, the hw one looks as follows (please note there's
> > still a lot of discussions going on in the meetings - the example is here
> > merely to illustrate the tc filter functionality):
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
> >     prog type hw filename "mypnameprog.o" ... \
> >     prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \
> >     action bpf obj $PROGNAME.o section prog/tc-ingress
> >
> > The theory of operations is as follows:
> >
> > ================================1. PARSING================================
> >
> > The packet first encounters the parser.
> > The parser is implemented in ebpf residing either at the TC or XDP
> > level. The parsed header values are stored in a shared eBPF map.
> > When the parser runs at XDP level, we load it into XDP using tc filter
> > command and pin it to a file.
> >
> > =============================2. ACTIONS=============================
> >
> > In the above example, the P4 program (minus the parser) is encoded in an
> > action($PROGNAME.o). It should be noted that classical tc actions
> > continue to work:
> > IOW, someone could decide to add a mirred action to mirror all packets
> > after or before the ebpf action.
> >
> > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >      prog type tc obj $PARSER.o section parser/tc-ingress \
> >      action bpf obj $PROGNAME.o section prog/tc-ingress \
> >      action mirred egress mirror index 1 dev $P1 \
> >      action bpf obj $ANOTHERPROG.o section mysect/section-1
> >
> > It should also be noted that it is feasible to split some of the ingress
> > datapath into XDP first and more into TC later (as was shown above for
> > example where the parser runs at XDP level). YMMV.
> > Regardless of choice of which scheme to use, none of these will affect
> > UAPI. It will all depend on whether you generate code to load on XDP vs
> > tc, etc.
> >
> > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>
> My objections from last iterations still stand, and I also added a nak,
> so please do not just drop it with new revisions.. from the v10 as you
> wrote you added further code but despite the various community feedback
> the design still stands as before, therefore:
>
> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
>

We didnt make code changes - but did you read the cover letter and the
extended commentary in this patch's commit log? We should have
mentioned it in the changes log. It did respond to your comments.
There's text that says "the filter manages the lifetime of the
pipeline" - which in the future could include not only tc but XDP but
also the hardware path (in the form of a file that gets loaded). I am
not sure if that message is clear. Your angle being this is layer
violation. In the last discussion i asked you for suggestions and we
went the tcx route, which didnt make sense, and  then you didnt
respond.

> [...]
> > +static int cls_p4_prog_from_efd(struct nlattr **tb,
> > +                             struct p4tc_bpf_prog *prog, u32 flags,
> > +                             struct netlink_ext_ack *extack)
> > +{
> > +     struct bpf_prog *fp;
> > +     u32 prog_type;
> > +     char *name;
> > +     u32 bpf_fd;
> > +
> > +     bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
> > +     prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
> > +
> > +     if (prog_type != BPF_PROG_TYPE_XDP &&
> > +         prog_type != BPF_PROG_TYPE_SCHED_ACT) {
>
> Also as mentioned earlier I don't think tc should hold references on
> XDP programs in here. It doesn't make any sense aside from the fact
> that the cls_p4 is also not doing anything with it. This is something
> that a user space control plane should be doing i.e. managing a XDP
> link on the target device.

This is the same argument about layer violation that you made earlier.
The filter manages the p4 pipeline - i.e it's not just about the ebpf
blob(s) but for example in the future (discussions are still ongoing
with vendors who have P4 NICs) a filter could be loaded to also
specify the location of the hardware blob.
I would be happy with a suggestion that gets us moving forward with
that context in mind.

cheers,
jamal

> > +             NL_SET_ERR_MSG(extack,
> > +                            "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
> > +             return -EINVAL;
> > +     }
> > +
> > +     fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
> > +     if (IS_ERR(fp))
> > +             return PTR_ERR(fp);
> > +
> > +     name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
> > +     if (!name) {
> > +             bpf_prog_put(fp);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     prog->p4_prog_name = name;
> > +     prog->p4_prog = fp;
> > +
> > +     return 0;
> > +}
> > +

Daniel Borkmann Jan. 25, 2024, 3:47 p.m. UTC | #3

On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
[...]
>>>
>>> It should also be noted that it is feasible to split some of the ingress
>>> datapath into XDP first and more into TC later (as was shown above for
>>> example where the parser runs at XDP level). YMMV.
>>> Regardless of choice of which scheme to use, none of these will affect
>>> UAPI. It will all depend on whether you generate code to load on XDP vs
>>> tc, etc.
>>>
>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>>
>> My objections from last iterations still stand, and I also added a nak,
>> so please do not just drop it with new revisions.. from the v10 as you
>> wrote you added further code but despite the various community feedback
>> the design still stands as before, therefore:
>>
>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> We didnt make code changes - but did you read the cover letter and the
> extended commentary in this patch's commit log? We should have
> mentioned it in the changes log. It did respond to your comments.
> There's text that says "the filter manages the lifetime of the
> pipeline" - which in the future could include not only tc but XDP but
> also the hardware path (in the form of a file that gets loaded). I am
> not sure if that message is clear. Your angle being this is layer
> violation. In the last discussion i asked you for suggestions and we
> went the tcx route, which didnt make sense, and  then you didnt
> respond.
[...]

>> Also as mentioned earlier I don't think tc should hold references on
>> XDP programs in here. It doesn't make any sense aside from the fact
>> that the cls_p4 is also not doing anything with it. This is something
>> that a user space control plane should be doing i.e. managing a XDP
>> link on the target device.
> 
> This is the same argument about layer violation that you made earlier.
> The filter manages the p4 pipeline - i.e it's not just about the ebpf
> blob(s) but for example in the future (discussions are still ongoing
> with vendors who have P4 NICs) a filter could be loaded to also
> specify the location of the hardware blob.

Ah, so there is a plan to eventually add HW offload support for cls_p4?
Or is this only specifiying a location of a blob through some opaque
cookie value from user space?

> I would be happy with a suggestion that gets us moving forward with
> that context in mind.

My question on the above is mainly what does it bring you to hold a
reference on the XDP program? There is no guarantee that something else
will get loaded onto XDP, and then eventually the cls_p4 is the only
entity holding the reference but w/o 'purpose'. We do have BPF links
and the user space component orchestrating all this needs to create
and pin the BPF link in BPF fs, for example. An artificial reference
on XDP prog feels similar as if you'd hold a reference on an inode
out of tc.. Again, that should be delegated to the control plane you
have running interacting with the compiler which then manages and
loads its artifacts. What if you would also need to set up some
netfilter rules for the SW pipeline, would you then embed this too?

Thanks,
Daniel

Jamal Hadi Salim Jan. 25, 2024, 5:59 p.m. UTC | #4

On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> [...]
> >>>
> >>> It should also be noted that it is feasible to split some of the ingress
> >>> datapath into XDP first and more into TC later (as was shown above for
> >>> example where the parser runs at XDP level). YMMV.
> >>> Regardless of choice of which scheme to use, none of these will affect
> >>> UAPI. It will all depend on whether you generate code to load on XDP vs
> >>> tc, etc.
> >>>
> >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> >>
> >> My objections from last iterations still stand, and I also added a nak,
> >> so please do not just drop it with new revisions.. from the v10 as you
> >> wrote you added further code but despite the various community feedback
> >> the design still stands as before, therefore:
> >>
> >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> >
> > We didnt make code changes - but did you read the cover letter and the
> > extended commentary in this patch's commit log? We should have
> > mentioned it in the changes log. It did respond to your comments.
> > There's text that says "the filter manages the lifetime of the
> > pipeline" - which in the future could include not only tc but XDP but
> > also the hardware path (in the form of a file that gets loaded). I am
> > not sure if that message is clear. Your angle being this is layer
> > violation. In the last discussion i asked you for suggestions and we
> > went the tcx route, which didnt make sense, and  then you didnt
> > respond.
> [...]
>
> >> Also as mentioned earlier I don't think tc should hold references on
> >> XDP programs in here. It doesn't make any sense aside from the fact
> >> that the cls_p4 is also not doing anything with it. This is something
> >> that a user space control plane should be doing i.e. managing a XDP
> >> link on the target device.
> >
> > This is the same argument about layer violation that you made earlier.
> > The filter manages the p4 pipeline - i.e it's not just about the ebpf
> > blob(s) but for example in the future (discussions are still ongoing
> > with vendors who have P4 NICs) a filter could be loaded to also
> > specify the location of the hardware blob.
>
> Ah, so there is a plan to eventually add HW offload support for cls_p4?
> Or is this only specifiying a location of a blob through some opaque
> cookie value from user space?

Current thought process is it will be something along these lines (the
commit provides more details):

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "mypnameprog.o" ... \
   prog type xdp obj $PARSER.o section parser/xdp pinned_link
/sys/fs/bpf/mylink \
   action bpf obj $PROGNAME.o section prog/tc-ingress

These discussions are still ongoing - but that is the current
consensus. Note: we are not pushing any code for that, but hope it
paints the bigger picture....
The idea is the cls p4 owns the lifetime of the pipeline. Installing
the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
of the refcounts to make sure the pipeline and its components stays
alive.
There could be multiple such filters - when someone deletes the last
filter, then it is safe to delete the pipeline.
Essentially the filter manages the lifetime of the pipeline.

> > I would be happy with a suggestion that gets us moving forward with
> > that context in mind.
>
> My question on the above is mainly what does it bring you to hold a
> reference on the XDP program? There is no guarantee that something else
> will get loaded onto XDP, and then eventually the cls_p4 is the only
> entity holding the reference but w/o 'purpose'. We do have BPF links
> and the user space component orchestrating all this needs to create
> and pin the BPF link in BPF fs, for example. An artificial reference
> on XDP prog feels similar as if you'd hold a reference on an inode
> out of tc.. Again, that should be delegated to the control plane you
> have running interacting with the compiler which then manages and
> loads its artifacts. What if you would also need to set up some
> netfilter rules for the SW pipeline, would you then embed this too?

Sorry, a slight tangent first:
P4 is self-contained, there are a handful of objects that are defined
by the spec (externs, actions, tables, etc) and we model them in the
patchset, so that part is self-contained. For the extra richness such
as the netfilter example you quoted - based on my many years of
experience deploying SDN - using daemons(sorry if i am reading too
much in what I think you are implying) for control is not the best
option i.e you need all kinds of coordination - for example where do
you store state, what happens when the daemon dies, how do you
graceful restarts etc. Based on that, if i can put things in the
kernel (which is essentially a "perpetual daemon", unless the kernel
crashes) it's a lot simpler to manage as a source of truth especially
when there is not that much info. There is a limit when there are
multiple pieces (to use your netfilter example) because you need
another layer to coordinate things.

Re: the XDP part - our key reason is mostly managerial, in that the
filter is the lifetime manager of the pipeline; and that if i dump
that filter i can see all the details in regards to the pipeline(tc,
XDP and in future hw, etc) in one spot. You are right, the link
pinning is our protection from someone replacing the XDP prog (this
was a tip from Toke in the early days) and the comparison of tc
holding inode is apropos.
There's some history: in the early days we were also using metadata
which comes from the XDP program at the tc layer if more processing
was to be done (and there was extra metadata which told us which XDP
prog produced it which we would vet before trusting the metadata).
Given all the above, we should still be able to hold this info without
necessarily holding the extra refcount and be able to see this detail.
So we can remove the refcounting.

cheers,
jamal

> Thanks,
> Daniel

Jamal Hadi Salim Feb. 16, 2024, 9:18 p.m. UTC | #5

On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> > > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> > [...]
> > >>>
> > >>> It should also be noted that it is feasible to split some of the ingress
> > >>> datapath into XDP first and more into TC later (as was shown above for
> > >>> example where the parser runs at XDP level). YMMV.
> > >>> Regardless of choice of which scheme to use, none of these will affect
> > >>> UAPI. It will all depend on whether you generate code to load on XDP vs
> > >>> tc, etc.
> > >>>
> > >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > >>
> > >> My objections from last iterations still stand, and I also added a nak,
> > >> so please do not just drop it with new revisions.. from the v10 as you
> > >> wrote you added further code but despite the various community feedback
> > >> the design still stands as before, therefore:
> > >>
> > >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> > >
> > > We didnt make code changes - but did you read the cover letter and the
> > > extended commentary in this patch's commit log? We should have
> > > mentioned it in the changes log. It did respond to your comments.
> > > There's text that says "the filter manages the lifetime of the
> > > pipeline" - which in the future could include not only tc but XDP but
> > > also the hardware path (in the form of a file that gets loaded). I am
> > > not sure if that message is clear. Your angle being this is layer
> > > violation. In the last discussion i asked you for suggestions and we
> > > went the tcx route, which didnt make sense, and  then you didnt
> > > respond.
> > [...]
> >
> > >> Also as mentioned earlier I don't think tc should hold references on
> > >> XDP programs in here. It doesn't make any sense aside from the fact
> > >> that the cls_p4 is also not doing anything with it. This is something
> > >> that a user space control plane should be doing i.e. managing a XDP
> > >> link on the target device.
> > >
> > > This is the same argument about layer violation that you made earlier.
> > > The filter manages the p4 pipeline - i.e it's not just about the ebpf
> > > blob(s) but for example in the future (discussions are still ongoing
> > > with vendors who have P4 NICs) a filter could be loaded to also
> > > specify the location of the hardware blob.
> >
> > Ah, so there is a plan to eventually add HW offload support for cls_p4?
> > Or is this only specifiying a location of a blob through some opaque
> > cookie value from user space?
>
> Current thought process is it will be something along these lines (the
> commit provides more details):
>
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>    prog type hw filename "mypnameprog.o" ... \
>    prog type xdp obj $PARSER.o section parser/xdp pinned_link
> /sys/fs/bpf/mylink \
>    action bpf obj $PROGNAME.o section prog/tc-ingress
>
> These discussions are still ongoing - but that is the current
> consensus. Note: we are not pushing any code for that, but hope it
> paints the bigger picture....
> The idea is the cls p4 owns the lifetime of the pipeline. Installing
> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
> of the refcounts to make sure the pipeline and its components stays
> alive.
> There could be multiple such filters - when someone deletes the last
> filter, then it is safe to delete the pipeline.
> Essentially the filter manages the lifetime of the pipeline.
>
> > > I would be happy with a suggestion that gets us moving forward with
> > > that context in mind.
> >
> > My question on the above is mainly what does it bring you to hold a
> > reference on the XDP program? There is no guarantee that something else
> > will get loaded onto XDP, and then eventually the cls_p4 is the only
> > entity holding the reference but w/o 'purpose'. We do have BPF links
> > and the user space component orchestrating all this needs to create
> > and pin the BPF link in BPF fs, for example. An artificial reference
> > on XDP prog feels similar as if you'd hold a reference on an inode
> > out of tc.. Again, that should be delegated to the control plane you
> > have running interacting with the compiler which then manages and
> > loads its artifacts. What if you would also need to set up some
> > netfilter rules for the SW pipeline, would you then embed this too?
>
> Sorry, a slight tangent first:
> P4 is self-contained, there are a handful of objects that are defined
> by the spec (externs, actions, tables, etc) and we model them in the
> patchset, so that part is self-contained. For the extra richness such
> as the netfilter example you quoted - based on my many years of
> experience deploying SDN - using daemons(sorry if i am reading too
> much in what I think you are implying) for control is not the best
> option i.e you need all kinds of coordination - for example where do
> you store state, what happens when the daemon dies, how do you
> graceful restarts etc. Based on that, if i can put things in the
> kernel (which is essentially a "perpetual daemon", unless the kernel
> crashes) it's a lot simpler to manage as a source of truth especially
> when there is not that much info. There is a limit when there are
> multiple pieces (to use your netfilter example) because you need
> another layer to coordinate things.
>
> Re: the XDP part - our key reason is mostly managerial, in that the
> filter is the lifetime manager of the pipeline; and that if i dump
> that filter i can see all the details in regards to the pipeline(tc,
> XDP and in future hw, etc) in one spot. You are right, the link
> pinning is our protection from someone replacing the XDP prog (this
> was a tip from Toke in the early days) and the comparison of tc
> holding inode is apropos.
> There's some history: in the early days we were also using metadata
> which comes from the XDP program at the tc layer if more processing
> was to be done (and there was extra metadata which told us which XDP
> prog produced it which we would vet before trusting the metadata).
> Given all the above, we should still be able to hold this info without
> necessarily holding the extra refcount and be able to see this detail.
> So we can remove the refcounting.
>

Daniel?

cheers,
jamal


> cheers,
> jamal
>
> > Thanks,
> > Daniel

Daniel Borkmann Feb. 20, 2024, 3:31 p.m. UTC | #6

On 2/16/24 10:18 PM, Jamal Hadi Salim wrote:
> On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
>>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
>>> [...]
>>>>>>
>>>>>> It should also be noted that it is feasible to split some of the ingress
>>>>>> datapath into XDP first and more into TC later (as was shown above for
>>>>>> example where the parser runs at XDP level). YMMV.
>>>>>> Regardless of choice of which scheme to use, none of these will affect
>>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs
>>>>>> tc, etc.
>>>>>>
>>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
>>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
>>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
>>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
>>>>>
>>>>> My objections from last iterations still stand, and I also added a nak,
>>>>> so please do not just drop it with new revisions.. from the v10 as you
>>>>> wrote you added further code but despite the various community feedback
>>>>> the design still stands as before, therefore:
>>>>>
>>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>
>>>> We didnt make code changes - but did you read the cover letter and the
>>>> extended commentary in this patch's commit log? We should have
>>>> mentioned it in the changes log. It did respond to your comments.
>>>> There's text that says "the filter manages the lifetime of the
>>>> pipeline" - which in the future could include not only tc but XDP but
>>>> also the hardware path (in the form of a file that gets loaded). I am
>>>> not sure if that message is clear. Your angle being this is layer
>>>> violation. In the last discussion i asked you for suggestions and we
>>>> went the tcx route, which didnt make sense, and  then you didnt
>>>> respond.
>>> [...]
>>>
>>>>> Also as mentioned earlier I don't think tc should hold references on
>>>>> XDP programs in here. It doesn't make any sense aside from the fact
>>>>> that the cls_p4 is also not doing anything with it. This is something
>>>>> that a user space control plane should be doing i.e. managing a XDP
>>>>> link on the target device.
>>>>
>>>> This is the same argument about layer violation that you made earlier.
>>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf
>>>> blob(s) but for example in the future (discussions are still ongoing
>>>> with vendors who have P4 NICs) a filter could be loaded to also
>>>> specify the location of the hardware blob.
>>>
>>> Ah, so there is a plan to eventually add HW offload support for cls_p4?
>>> Or is this only specifiying a location of a blob through some opaque
>>> cookie value from user space?
>>
>> Current thought process is it will be something along these lines (the
>> commit provides more details):
>>
>> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>>     prog type hw filename "mypnameprog.o" ... \
>>     prog type xdp obj $PARSER.o section parser/xdp pinned_link
>> /sys/fs/bpf/mylink \
>>     action bpf obj $PROGNAME.o section prog/tc-ingress
>>
>> These discussions are still ongoing - but that is the current
>> consensus. Note: we are not pushing any code for that, but hope it
>> paints the bigger picture....
>> The idea is the cls p4 owns the lifetime of the pipeline. Installing
>> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
>> of the refcounts to make sure the pipeline and its components stays
>> alive.
>> There could be multiple such filters - when someone deletes the last
>> filter, then it is safe to delete the pipeline.
>> Essentially the filter manages the lifetime of the pipeline.
>>
>>>> I would be happy with a suggestion that gets us moving forward with
>>>> that context in mind.
>>>
>>> My question on the above is mainly what does it bring you to hold a
>>> reference on the XDP program? There is no guarantee that something else
>>> will get loaded onto XDP, and then eventually the cls_p4 is the only
>>> entity holding the reference but w/o 'purpose'. We do have BPF links
>>> and the user space component orchestrating all this needs to create
>>> and pin the BPF link in BPF fs, for example. An artificial reference
>>> on XDP prog feels similar as if you'd hold a reference on an inode
>>> out of tc.. Again, that should be delegated to the control plane you
>>> have running interacting with the compiler which then manages and
>>> loads its artifacts. What if you would also need to set up some
>>> netfilter rules for the SW pipeline, would you then embed this too?
>>
>> Sorry, a slight tangent first:
>> P4 is self-contained, there are a handful of objects that are defined
>> by the spec (externs, actions, tables, etc) and we model them in the
>> patchset, so that part is self-contained. For the extra richness such
>> as the netfilter example you quoted - based on my many years of
>> experience deploying SDN - using daemons(sorry if i am reading too
>> much in what I think you are implying) for control is not the best
>> option i.e you need all kinds of coordination - for example where do
>> you store state, what happens when the daemon dies, how do you
>> graceful restarts etc. Based on that, if i can put things in the
>> kernel (which is essentially a "perpetual daemon", unless the kernel
>> crashes) it's a lot simpler to manage as a source of truth especially
>> when there is not that much info. There is a limit when there are
>> multiple pieces (to use your netfilter example) because you need
>> another layer to coordinate things.

'source of truth' for the various attach points or BPF links, yes, but in
this case here it is not, since the source of truth on what is attached
is not in cls_p4 but rather on the XDP link. How do you handle the case
when cls_p4 says something different to what is /actually/ attached? Why
is it not enough to establish some convention in user space, to pin the
link and retrieve/update from there when needed? Like everyone else does.
... even if you consider iproute2 your "control plane" (which I have the
feeling you do)?

>> Re: the XDP part - our key reason is mostly managerial, in that the
>> filter is the lifetime manager of the pipeline; and that if i dump

This is imho the problematic part which feels like square peg in round
hole, trying to fit this whole lifetime manager of the pipeline into
the cls_p4 filter. We agree to disagree here. Instead of reusing
individual building blocks from user space, this tries to cramp control
plane parts into the kernel for which its not a great fit with what is
build here as-is.

>> that filter i can see all the details in regards to the pipeline(tc,
>> XDP and in future hw, etc) in one spot. You are right, the link
>> pinning is our protection from someone replacing the XDP prog (this
>> was a tip from Toke in the early days) and the comparison of tc
>> holding inode is apropos.
>> There's some history: in the early days we were also using metadata
>> which comes from the XDP program at the tc layer if more processing
>> was to be done (and there was extra metadata which told us which XDP
>> prog produced it which we would vet before trusting the metadata).
>> Given all the above, we should still be able to hold this info without
>> necessarily holding the extra refcount and be able to see this detail.
>> So we can remove the refcounting.
> 
> Daniel?

The refcount should definitely be removed, but then again, see the point
above in that it is inconsistent information. Why can't this be done in
user space with some convention in your user space control plane - if you
take iproute2, then why it cannot pin the link in a bpf fs instance and
retrieve it from there?

Jamal Hadi Salim Feb. 21, 2024, 2:51 p.m. UTC | #7

On Tue, Feb 20, 2024 at 10:49 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 2/16/24 10:18 PM, Jamal Hadi Salim wrote:
> > On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote:
> >>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote:
> >>> [...]
> >>>>>>
> >>>>>> It should also be noted that it is feasible to split some of the ingress
> >>>>>> datapath into XDP first and more into TC later (as was shown above for
> >>>>>> example where the parser runs at XDP level). YMMV.
> >>>>>> Regardless of choice of which scheme to use, none of these will affect
> >>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs
> >>>>>> tc, etc.
> >>>>>>
> >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> >>>>>
> >>>>> My objections from last iterations still stand, and I also added a nak,
> >>>>> so please do not just drop it with new revisions.. from the v10 as you
> >>>>> wrote you added further code but despite the various community feedback
> >>>>> the design still stands as before, therefore:
> >>>>>
> >>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>>>
> >>>> We didnt make code changes - but did you read the cover letter and the
> >>>> extended commentary in this patch's commit log? We should have
> >>>> mentioned it in the changes log. It did respond to your comments.
> >>>> There's text that says "the filter manages the lifetime of the
> >>>> pipeline" - which in the future could include not only tc but XDP but
> >>>> also the hardware path (in the form of a file that gets loaded). I am
> >>>> not sure if that message is clear. Your angle being this is layer
> >>>> violation. In the last discussion i asked you for suggestions and we
> >>>> went the tcx route, which didnt make sense, and  then you didnt
> >>>> respond.
> >>> [...]
> >>>
> >>>>> Also as mentioned earlier I don't think tc should hold references on
> >>>>> XDP programs in here. It doesn't make any sense aside from the fact
> >>>>> that the cls_p4 is also not doing anything with it. This is something
> >>>>> that a user space control plane should be doing i.e. managing a XDP
> >>>>> link on the target device.
> >>>>
> >>>> This is the same argument about layer violation that you made earlier.
> >>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf
> >>>> blob(s) but for example in the future (discussions are still ongoing
> >>>> with vendors who have P4 NICs) a filter could be loaded to also
> >>>> specify the location of the hardware blob.
> >>>
> >>> Ah, so there is a plan to eventually add HW offload support for cls_p4?
> >>> Or is this only specifiying a location of a blob through some opaque
> >>> cookie value from user space?
> >>
> >> Current thought process is it will be something along these lines (the
> >> commit provides more details):
> >>
> >> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
> >>     prog type hw filename "mypnameprog.o" ... \
> >>     prog type xdp obj $PARSER.o section parser/xdp pinned_link
> >> /sys/fs/bpf/mylink \
> >>     action bpf obj $PROGNAME.o section prog/tc-ingress
> >>
> >> These discussions are still ongoing - but that is the current
> >> consensus. Note: we are not pushing any code for that, but hope it
> >> paints the bigger picture....
> >> The idea is the cls p4 owns the lifetime of the pipeline. Installing
> >> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot
> >> of the refcounts to make sure the pipeline and its components stays
> >> alive.
> >> There could be multiple such filters - when someone deletes the last
> >> filter, then it is safe to delete the pipeline.
> >> Essentially the filter manages the lifetime of the pipeline.
> >>
> >>>> I would be happy with a suggestion that gets us moving forward with
> >>>> that context in mind.
> >>>
> >>> My question on the above is mainly what does it bring you to hold a
> >>> reference on the XDP program? There is no guarantee that something else
> >>> will get loaded onto XDP, and then eventually the cls_p4 is the only
> >>> entity holding the reference but w/o 'purpose'. We do have BPF links
> >>> and the user space component orchestrating all this needs to create
> >>> and pin the BPF link in BPF fs, for example. An artificial reference
> >>> on XDP prog feels similar as if you'd hold a reference on an inode
> >>> out of tc.. Again, that should be delegated to the control plane you
> >>> have running interacting with the compiler which then manages and
> >>> loads its artifacts. What if you would also need to set up some
> >>> netfilter rules for the SW pipeline, would you then embed this too?
> >>
> >> Sorry, a slight tangent first:
> >> P4 is self-contained, there are a handful of objects that are defined
> >> by the spec (externs, actions, tables, etc) and we model them in the
> >> patchset, so that part is self-contained. For the extra richness such
> >> as the netfilter example you quoted - based on my many years of
> >> experience deploying SDN - using daemons(sorry if i am reading too
> >> much in what I think you are implying) for control is not the best
> >> option i.e you need all kinds of coordination - for example where do
> >> you store state, what happens when the daemon dies, how do you
> >> graceful restarts etc. Based on that, if i can put things in the
> >> kernel (which is essentially a "perpetual daemon", unless the kernel
> >> crashes) it's a lot simpler to manage as a source of truth especially
> >> when there is not that much info. There is a limit when there are
> >> multiple pieces (to use your netfilter example) because you need
> >> another layer to coordinate things.
>
> 'source of truth' for the various attach points or BPF links, yes, but in
> this case here it is not, since the source of truth on what is attached
> is not in cls_p4 but rather on the XDP link. How do you handle the case
> when cls_p4 says something different to what is /actually/ attached? Why
> is it not enough to establish some convention in user space, to pin the
> link and retrieve/update from there when needed? Like everyone else does.
> ... even if you consider iproute2 your "control plane" (which I have the
> feeling you do)?
>
> >> Re: the XDP part - our key reason is mostly managerial, in that the
> >> filter is the lifetime manager of the pipeline; and that if i dump
>
> This is imho the problematic part which feels like square peg in round
> hole, trying to fit this whole lifetime manager of the pipeline into
> the cls_p4 filter. We agree to disagree here. Instead of reusing
> individual building blocks from user space, this tries to cramp control
> plane parts into the kernel for which its not a great fit with what is
> build here as-is.
>
> >> that filter i can see all the details in regards to the pipeline(tc,
> >> XDP and in future hw, etc) in one spot. You are right, the link
> >> pinning is our protection from someone replacing the XDP prog (this
> >> was a tip from Toke in the early days) and the comparison of tc
> >> holding inode is apropos.
> >> There's some history: in the early days we were also using metadata
> >> which comes from the XDP program at the tc layer if more processing
> >> was to be done (and there was extra metadata which told us which XDP
> >> prog produced it which we would vet before trusting the metadata).
> >> Given all the above, we should still be able to hold this info without
> >> necessarily holding the extra refcount and be able to see this detail.
> >> So we can remove the refcounting.
> >
> > Daniel?
>
> The refcount should definitely be removed, but then again, see the point
> above in that it is inconsistent information. Why can't this be done in
> user space with some convention in your user space control plane - if you
> take iproute2, then why it cannot pin the link in a bpf fs instance and
> retrieve it from there?

Ok, Daniel - let's do this so we can move forward. I am getting
exhausted, we've been going at this for a year now. As a compromise: I
will remove the support for XDP altogether from the filter. We will
still reference the XDP program in the CLI and infact load and pin it
that way but the filter will not be adding a refcount in the kernel as
in the posted patch.

cheers,
jamal

[v10,net-next,15/15] p4tc: add P4 classifier

Checks

Commit Message

Comments

Patch