Message ID | 20240122194801.152658-16-jhs@mojatatu.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Introducing P4TC | expand |
On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: > Introduce P4 tc classifier. The main task of this classifier is to manage > the lifetime of pipeline instances across one or more netdev ports. > Note a pipeline may be instantiated multiple times across one or more tc chains > and different priorities. > > Note that part or whole of the P4 pipeline could reside in tc, XDP or even > hardware depending on how the P4 program was compiled. > To use the P4 classifier you must specify a pipeline name that will be > associated to the filter instance, a s/w parser (eBPF) and datapath P4 > control block program (eBPF) program. Although this patchset does not deal > with offloads, it is also possible to load the h/w part using this filter. > We will illustrate a few examples further below to clarify. Please treat > the illustrated split as an example - there are probably more pragmatic > approaches to splitting the pipeline; however, regardless of where the different > pieces of the pipeline are placed (tc, XDP, HW) and what each layer will > implement (what part of the pipeline) - these examples are merely showing > what is possible. > > The pipeline is assumed to have already been created via a template. > > For example, if we were to add a filter to ingress of a group of netdevs > (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the > following command: > > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > action bpf obj $PARSER.o ... \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > The above uses the classical tc action mechanism in which the first action > runs the P4 parser and if that goes well then the P4 control block is > executed. Note, although not shown above, one could also append the command > line with other traditional tc actions. > > In these patches, we also support two types of loadings of the pipeline > programs and differentiate between what gets loaded at say tc vs xdp by using > syntax which specifies location as either "prog type tc obj" or > "prog type xdp obj". There is an ongoing discussion in the P4TC community > biweekly meetings which is likely going to have us add another location > definition "prog type hw" which will specify the hardware object file name > and other related attributes. > > An example using tc: > > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > prog type tc obj $PARSER.o ... \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > For XDP, to illustrate an example: > > tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > prog type xdp obj $PARSER.o section parser/xdp \ > pinned_link /sys/fs/bpf/mylink \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > In this case, the parser will be executed in the XDP layer and the rest of > P4 control block as a tc action. > > For illustration sake, the hw one looks as follows (please note there's > still a lot of discussions going on in the meetings - the example is here > merely to illustrate the tc filter functionality): > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ > prog type hw filename "mypnameprog.o" ... \ > prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > The theory of operations is as follows: > > ================================1. PARSING================================ > > The packet first encounters the parser. > The parser is implemented in ebpf residing either at the TC or XDP > level. The parsed header values are stored in a shared eBPF map. > When the parser runs at XDP level, we load it into XDP using tc filter > command and pin it to a file. > > =============================2. ACTIONS============================= > > In the above example, the P4 program (minus the parser) is encoded in an > action($PROGNAME.o). It should be noted that classical tc actions > continue to work: > IOW, someone could decide to add a mirred action to mirror all packets > after or before the ebpf action. > > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > prog type tc obj $PARSER.o section parser/tc-ingress \ > action bpf obj $PROGNAME.o section prog/tc-ingress \ > action mirred egress mirror index 1 dev $P1 \ > action bpf obj $ANOTHERPROG.o section mysect/section-1 > > It should also be noted that it is feasible to split some of the ingress > datapath into XDP first and more into TC later (as was shown above for > example where the parser runs at XDP level). YMMV. > Regardless of choice of which scheme to use, none of these will affect > UAPI. It will all depend on whether you generate code to load on XDP vs > tc, etc. > > Co-developed-by: Victor Nogueira <victor@mojatatu.com> > Signed-off-by: Victor Nogueira <victor@mojatatu.com> > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> My objections from last iterations still stand, and I also added a nak, so please do not just drop it with new revisions.. from the v10 as you wrote you added further code but despite the various community feedback the design still stands as before, therefore: Nacked-by: Daniel Borkmann <daniel@iogearbox.net> [...] > +static int cls_p4_prog_from_efd(struct nlattr **tb, > + struct p4tc_bpf_prog *prog, u32 flags, > + struct netlink_ext_ack *extack) > +{ > + struct bpf_prog *fp; > + u32 prog_type; > + char *name; > + u32 bpf_fd; > + > + bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]); > + prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]); > + > + if (prog_type != BPF_PROG_TYPE_XDP && > + prog_type != BPF_PROG_TYPE_SCHED_ACT) { Also as mentioned earlier I don't think tc should hold references on XDP programs in here. It doesn't make any sense aside from the fact that the cls_p4 is also not doing anything with it. This is something that a user space control plane should be doing i.e. managing a XDP link on the target device. > + NL_SET_ERR_MSG(extack, > + "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP"); > + return -EINVAL; > + } > + > + fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false); > + if (IS_ERR(fp)) > + return PTR_ERR(fp); > + > + name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL); > + if (!name) { > + bpf_prog_put(fp); > + return -ENOMEM; > + } > + > + prog->p4_prog_name = name; > + prog->p4_prog = fp; > + > + return 0; > +} > +
On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: > > Introduce P4 tc classifier. The main task of this classifier is to manage > > the lifetime of pipeline instances across one or more netdev ports. > > Note a pipeline may be instantiated multiple times across one or more tc chains > > and different priorities. > > > > Note that part or whole of the P4 pipeline could reside in tc, XDP or even > > hardware depending on how the P4 program was compiled. > > To use the P4 classifier you must specify a pipeline name that will be > > associated to the filter instance, a s/w parser (eBPF) and datapath P4 > > control block program (eBPF) program. Although this patchset does not deal > > with offloads, it is also possible to load the h/w part using this filter. > > We will illustrate a few examples further below to clarify. Please treat > > the illustrated split as an example - there are probably more pragmatic > > approaches to splitting the pipeline; however, regardless of where the different > > pieces of the pipeline are placed (tc, XDP, HW) and what each layer will > > implement (what part of the pipeline) - these examples are merely showing > > what is possible. > > > > The pipeline is assumed to have already been created via a template. > > > > For example, if we were to add a filter to ingress of a group of netdevs > > (tc block 22) and associate it to P4 pipeline simple_l3 we could issue the > > following command: > > > > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > > action bpf obj $PARSER.o ... \ > > action bpf obj $PROGNAME.o section prog/tc-ingress > > > > The above uses the classical tc action mechanism in which the first action > > runs the P4 parser and if that goes well then the P4 control block is > > executed. Note, although not shown above, one could also append the command > > line with other traditional tc actions. > > > > In these patches, we also support two types of loadings of the pipeline > > programs and differentiate between what gets loaded at say tc vs xdp by using > > syntax which specifies location as either "prog type tc obj" or > > "prog type xdp obj". There is an ongoing discussion in the P4TC community > > biweekly meetings which is likely going to have us add another location > > definition "prog type hw" which will specify the hardware object file name > > and other related attributes. > > > > An example using tc: > > > > tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > > prog type tc obj $PARSER.o ... \ > > action bpf obj $PROGNAME.o section prog/tc-ingress > > > > For XDP, to illustrate an example: > > > > tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > > prog type xdp obj $PARSER.o section parser/xdp \ > > pinned_link /sys/fs/bpf/mylink \ > > action bpf obj $PROGNAME.o section prog/tc-ingress > > > > In this case, the parser will be executed in the XDP layer and the rest of > > P4 control block as a tc action. > > > > For illustration sake, the hw one looks as follows (please note there's > > still a lot of discussions going on in the meetings - the example is here > > merely to illustrate the tc filter functionality): > > > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ > > prog type hw filename "mypnameprog.o" ... \ > > prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \ > > action bpf obj $PROGNAME.o section prog/tc-ingress > > > > The theory of operations is as follows: > > > > ================================1. PARSING================================ > > > > The packet first encounters the parser. > > The parser is implemented in ebpf residing either at the TC or XDP > > level. The parsed header values are stored in a shared eBPF map. > > When the parser runs at XDP level, we load it into XDP using tc filter > > command and pin it to a file. > > > > =============================2. ACTIONS============================= > > > > In the above example, the P4 program (minus the parser) is encoded in an > > action($PROGNAME.o). It should be noted that classical tc actions > > continue to work: > > IOW, someone could decide to add a mirred action to mirror all packets > > after or before the ebpf action. > > > > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > > prog type tc obj $PARSER.o section parser/tc-ingress \ > > action bpf obj $PROGNAME.o section prog/tc-ingress \ > > action mirred egress mirror index 1 dev $P1 \ > > action bpf obj $ANOTHERPROG.o section mysect/section-1 > > > > It should also be noted that it is feasible to split some of the ingress > > datapath into XDP first and more into TC later (as was shown above for > > example where the parser runs at XDP level). YMMV. > > Regardless of choice of which scheme to use, none of these will affect > > UAPI. It will all depend on whether you generate code to load on XDP vs > > tc, etc. > > > > Co-developed-by: Victor Nogueira <victor@mojatatu.com> > > Signed-off-by: Victor Nogueira <victor@mojatatu.com> > > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > > My objections from last iterations still stand, and I also added a nak, > so please do not just drop it with new revisions.. from the v10 as you > wrote you added further code but despite the various community feedback > the design still stands as before, therefore: > > Nacked-by: Daniel Borkmann <daniel@iogearbox.net> > We didnt make code changes - but did you read the cover letter and the extended commentary in this patch's commit log? We should have mentioned it in the changes log. It did respond to your comments. There's text that says "the filter manages the lifetime of the pipeline" - which in the future could include not only tc but XDP but also the hardware path (in the form of a file that gets loaded). I am not sure if that message is clear. Your angle being this is layer violation. In the last discussion i asked you for suggestions and we went the tcx route, which didnt make sense, and then you didnt respond. > [...] > > +static int cls_p4_prog_from_efd(struct nlattr **tb, > > + struct p4tc_bpf_prog *prog, u32 flags, > > + struct netlink_ext_ack *extack) > > +{ > > + struct bpf_prog *fp; > > + u32 prog_type; > > + char *name; > > + u32 bpf_fd; > > + > > + bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]); > > + prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]); > > + > > + if (prog_type != BPF_PROG_TYPE_XDP && > > + prog_type != BPF_PROG_TYPE_SCHED_ACT) { > > Also as mentioned earlier I don't think tc should hold references on > XDP programs in here. It doesn't make any sense aside from the fact > that the cls_p4 is also not doing anything with it. This is something > that a user space control plane should be doing i.e. managing a XDP > link on the target device. This is the same argument about layer violation that you made earlier. The filter manages the p4 pipeline - i.e it's not just about the ebpf blob(s) but for example in the future (discussions are still ongoing with vendors who have P4 NICs) a filter could be loaded to also specify the location of the hardware blob. I would be happy with a suggestion that gets us moving forward with that context in mind. cheers, jamal > > + NL_SET_ERR_MSG(extack, > > + "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP"); > > + return -EINVAL; > > + } > > + > > + fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false); > > + if (IS_ERR(fp)) > > + return PTR_ERR(fp); > > + > > + name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL); > > + if (!name) { > > + bpf_prog_put(fp); > > + return -ENOMEM; > > + } > > + > > + prog->p4_prog_name = name; > > + prog->p4_prog = fp; > > + > > + return 0; > > +} > > +
On 1/24/24 3:40 PM, Jamal Hadi Salim wrote: > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: [...] >>> >>> It should also be noted that it is feasible to split some of the ingress >>> datapath into XDP first and more into TC later (as was shown above for >>> example where the parser runs at XDP level). YMMV. >>> Regardless of choice of which scheme to use, none of these will affect >>> UAPI. It will all depend on whether you generate code to load on XDP vs >>> tc, etc. >>> >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> >> >> My objections from last iterations still stand, and I also added a nak, >> so please do not just drop it with new revisions.. from the v10 as you >> wrote you added further code but despite the various community feedback >> the design still stands as before, therefore: >> >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> > > We didnt make code changes - but did you read the cover letter and the > extended commentary in this patch's commit log? We should have > mentioned it in the changes log. It did respond to your comments. > There's text that says "the filter manages the lifetime of the > pipeline" - which in the future could include not only tc but XDP but > also the hardware path (in the form of a file that gets loaded). I am > not sure if that message is clear. Your angle being this is layer > violation. In the last discussion i asked you for suggestions and we > went the tcx route, which didnt make sense, and then you didnt > respond. [...] >> Also as mentioned earlier I don't think tc should hold references on >> XDP programs in here. It doesn't make any sense aside from the fact >> that the cls_p4 is also not doing anything with it. This is something >> that a user space control plane should be doing i.e. managing a XDP >> link on the target device. > > This is the same argument about layer violation that you made earlier. > The filter manages the p4 pipeline - i.e it's not just about the ebpf > blob(s) but for example in the future (discussions are still ongoing > with vendors who have P4 NICs) a filter could be loaded to also > specify the location of the hardware blob. Ah, so there is a plan to eventually add HW offload support for cls_p4? Or is this only specifiying a location of a blob through some opaque cookie value from user space? > I would be happy with a suggestion that gets us moving forward with > that context in mind. My question on the above is mainly what does it bring you to hold a reference on the XDP program? There is no guarantee that something else will get loaded onto XDP, and then eventually the cls_p4 is the only entity holding the reference but w/o 'purpose'. We do have BPF links and the user space component orchestrating all this needs to create and pin the BPF link in BPF fs, for example. An artificial reference on XDP prog feels similar as if you'd hold a reference on an inode out of tc.. Again, that should be delegated to the control plane you have running interacting with the compiler which then manages and loads its artifacts. What if you would also need to set up some netfilter rules for the SW pipeline, would you then embed this too? Thanks, Daniel
On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 1/24/24 3:40 PM, Jamal Hadi Salim wrote: > > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: > [...] > >>> > >>> It should also be noted that it is feasible to split some of the ingress > >>> datapath into XDP first and more into TC later (as was shown above for > >>> example where the parser runs at XDP level). YMMV. > >>> Regardless of choice of which scheme to use, none of these will affect > >>> UAPI. It will all depend on whether you generate code to load on XDP vs > >>> tc, etc. > >>> > >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > >> > >> My objections from last iterations still stand, and I also added a nak, > >> so please do not just drop it with new revisions.. from the v10 as you > >> wrote you added further code but despite the various community feedback > >> the design still stands as before, therefore: > >> > >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> > > > > We didnt make code changes - but did you read the cover letter and the > > extended commentary in this patch's commit log? We should have > > mentioned it in the changes log. It did respond to your comments. > > There's text that says "the filter manages the lifetime of the > > pipeline" - which in the future could include not only tc but XDP but > > also the hardware path (in the form of a file that gets loaded). I am > > not sure if that message is clear. Your angle being this is layer > > violation. In the last discussion i asked you for suggestions and we > > went the tcx route, which didnt make sense, and then you didnt > > respond. > [...] > > >> Also as mentioned earlier I don't think tc should hold references on > >> XDP programs in here. It doesn't make any sense aside from the fact > >> that the cls_p4 is also not doing anything with it. This is something > >> that a user space control plane should be doing i.e. managing a XDP > >> link on the target device. > > > > This is the same argument about layer violation that you made earlier. > > The filter manages the p4 pipeline - i.e it's not just about the ebpf > > blob(s) but for example in the future (discussions are still ongoing > > with vendors who have P4 NICs) a filter could be loaded to also > > specify the location of the hardware blob. > > Ah, so there is a plan to eventually add HW offload support for cls_p4? > Or is this only specifiying a location of a blob through some opaque > cookie value from user space? Current thought process is it will be something along these lines (the commit provides more details): tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ prog type hw filename "mypnameprog.o" ... \ prog type xdp obj $PARSER.o section parser/xdp pinned_link /sys/fs/bpf/mylink \ action bpf obj $PROGNAME.o section prog/tc-ingress These discussions are still ongoing - but that is the current consensus. Note: we are not pushing any code for that, but hope it paints the bigger picture.... The idea is the cls p4 owns the lifetime of the pipeline. Installing the filter instantiates the p4 pipeline "simple_l3" and triggers a lot of the refcounts to make sure the pipeline and its components stays alive. There could be multiple such filters - when someone deletes the last filter, then it is safe to delete the pipeline. Essentially the filter manages the lifetime of the pipeline. > > I would be happy with a suggestion that gets us moving forward with > > that context in mind. > > My question on the above is mainly what does it bring you to hold a > reference on the XDP program? There is no guarantee that something else > will get loaded onto XDP, and then eventually the cls_p4 is the only > entity holding the reference but w/o 'purpose'. We do have BPF links > and the user space component orchestrating all this needs to create > and pin the BPF link in BPF fs, for example. An artificial reference > on XDP prog feels similar as if you'd hold a reference on an inode > out of tc.. Again, that should be delegated to the control plane you > have running interacting with the compiler which then manages and > loads its artifacts. What if you would also need to set up some > netfilter rules for the SW pipeline, would you then embed this too? Sorry, a slight tangent first: P4 is self-contained, there are a handful of objects that are defined by the spec (externs, actions, tables, etc) and we model them in the patchset, so that part is self-contained. For the extra richness such as the netfilter example you quoted - based on my many years of experience deploying SDN - using daemons(sorry if i am reading too much in what I think you are implying) for control is not the best option i.e you need all kinds of coordination - for example where do you store state, what happens when the daemon dies, how do you graceful restarts etc. Based on that, if i can put things in the kernel (which is essentially a "perpetual daemon", unless the kernel crashes) it's a lot simpler to manage as a source of truth especially when there is not that much info. There is a limit when there are multiple pieces (to use your netfilter example) because you need another layer to coordinate things. Re: the XDP part - our key reason is mostly managerial, in that the filter is the lifetime manager of the pipeline; and that if i dump that filter i can see all the details in regards to the pipeline(tc, XDP and in future hw, etc) in one spot. You are right, the link pinning is our protection from someone replacing the XDP prog (this was a tip from Toke in the early days) and the comparison of tc holding inode is apropos. There's some history: in the early days we were also using metadata which comes from the XDP program at the tc layer if more processing was to be done (and there was extra metadata which told us which XDP prog produced it which we would vet before trusting the metadata). Given all the above, we should still be able to hold this info without necessarily holding the extra refcount and be able to see this detail. So we can remove the refcounting. cheers, jamal > Thanks, > Daniel
On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > On 1/24/24 3:40 PM, Jamal Hadi Salim wrote: > > > On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > >> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: > > [...] > > >>> > > >>> It should also be noted that it is feasible to split some of the ingress > > >>> datapath into XDP first and more into TC later (as was shown above for > > >>> example where the parser runs at XDP level). YMMV. > > >>> Regardless of choice of which scheme to use, none of these will affect > > >>> UAPI. It will all depend on whether you generate code to load on XDP vs > > >>> tc, etc. > > >>> > > >>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > > >>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > > >>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > > >>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > > >>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > > >> > > >> My objections from last iterations still stand, and I also added a nak, > > >> so please do not just drop it with new revisions.. from the v10 as you > > >> wrote you added further code but despite the various community feedback > > >> the design still stands as before, therefore: > > >> > > >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> > > > > > > We didnt make code changes - but did you read the cover letter and the > > > extended commentary in this patch's commit log? We should have > > > mentioned it in the changes log. It did respond to your comments. > > > There's text that says "the filter manages the lifetime of the > > > pipeline" - which in the future could include not only tc but XDP but > > > also the hardware path (in the form of a file that gets loaded). I am > > > not sure if that message is clear. Your angle being this is layer > > > violation. In the last discussion i asked you for suggestions and we > > > went the tcx route, which didnt make sense, and then you didnt > > > respond. > > [...] > > > > >> Also as mentioned earlier I don't think tc should hold references on > > >> XDP programs in here. It doesn't make any sense aside from the fact > > >> that the cls_p4 is also not doing anything with it. This is something > > >> that a user space control plane should be doing i.e. managing a XDP > > >> link on the target device. > > > > > > This is the same argument about layer violation that you made earlier. > > > The filter manages the p4 pipeline - i.e it's not just about the ebpf > > > blob(s) but for example in the future (discussions are still ongoing > > > with vendors who have P4 NICs) a filter could be loaded to also > > > specify the location of the hardware blob. > > > > Ah, so there is a plan to eventually add HW offload support for cls_p4? > > Or is this only specifiying a location of a blob through some opaque > > cookie value from user space? > > Current thought process is it will be something along these lines (the > commit provides more details): > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ > prog type hw filename "mypnameprog.o" ... \ > prog type xdp obj $PARSER.o section parser/xdp pinned_link > /sys/fs/bpf/mylink \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > These discussions are still ongoing - but that is the current > consensus. Note: we are not pushing any code for that, but hope it > paints the bigger picture.... > The idea is the cls p4 owns the lifetime of the pipeline. Installing > the filter instantiates the p4 pipeline "simple_l3" and triggers a lot > of the refcounts to make sure the pipeline and its components stays > alive. > There could be multiple such filters - when someone deletes the last > filter, then it is safe to delete the pipeline. > Essentially the filter manages the lifetime of the pipeline. > > > > I would be happy with a suggestion that gets us moving forward with > > > that context in mind. > > > > My question on the above is mainly what does it bring you to hold a > > reference on the XDP program? There is no guarantee that something else > > will get loaded onto XDP, and then eventually the cls_p4 is the only > > entity holding the reference but w/o 'purpose'. We do have BPF links > > and the user space component orchestrating all this needs to create > > and pin the BPF link in BPF fs, for example. An artificial reference > > on XDP prog feels similar as if you'd hold a reference on an inode > > out of tc.. Again, that should be delegated to the control plane you > > have running interacting with the compiler which then manages and > > loads its artifacts. What if you would also need to set up some > > netfilter rules for the SW pipeline, would you then embed this too? > > Sorry, a slight tangent first: > P4 is self-contained, there are a handful of objects that are defined > by the spec (externs, actions, tables, etc) and we model them in the > patchset, so that part is self-contained. For the extra richness such > as the netfilter example you quoted - based on my many years of > experience deploying SDN - using daemons(sorry if i am reading too > much in what I think you are implying) for control is not the best > option i.e you need all kinds of coordination - for example where do > you store state, what happens when the daemon dies, how do you > graceful restarts etc. Based on that, if i can put things in the > kernel (which is essentially a "perpetual daemon", unless the kernel > crashes) it's a lot simpler to manage as a source of truth especially > when there is not that much info. There is a limit when there are > multiple pieces (to use your netfilter example) because you need > another layer to coordinate things. > > Re: the XDP part - our key reason is mostly managerial, in that the > filter is the lifetime manager of the pipeline; and that if i dump > that filter i can see all the details in regards to the pipeline(tc, > XDP and in future hw, etc) in one spot. You are right, the link > pinning is our protection from someone replacing the XDP prog (this > was a tip from Toke in the early days) and the comparison of tc > holding inode is apropos. > There's some history: in the early days we were also using metadata > which comes from the XDP program at the tc layer if more processing > was to be done (and there was extra metadata which told us which XDP > prog produced it which we would vet before trusting the metadata). > Given all the above, we should still be able to hold this info without > necessarily holding the extra refcount and be able to see this detail. > So we can remove the refcounting. > Daniel? cheers, jamal > cheers, > jamal > > > Thanks, > > Daniel
On 2/16/24 10:18 PM, Jamal Hadi Salim wrote: > On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote: >>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: >>> [...] >>>>>> >>>>>> It should also be noted that it is feasible to split some of the ingress >>>>>> datapath into XDP first and more into TC later (as was shown above for >>>>>> example where the parser runs at XDP level). YMMV. >>>>>> Regardless of choice of which scheme to use, none of these will affect >>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs >>>>>> tc, etc. >>>>>> >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> >>>>> >>>>> My objections from last iterations still stand, and I also added a nak, >>>>> so please do not just drop it with new revisions.. from the v10 as you >>>>> wrote you added further code but despite the various community feedback >>>>> the design still stands as before, therefore: >>>>> >>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> >>>> >>>> We didnt make code changes - but did you read the cover letter and the >>>> extended commentary in this patch's commit log? We should have >>>> mentioned it in the changes log. It did respond to your comments. >>>> There's text that says "the filter manages the lifetime of the >>>> pipeline" - which in the future could include not only tc but XDP but >>>> also the hardware path (in the form of a file that gets loaded). I am >>>> not sure if that message is clear. Your angle being this is layer >>>> violation. In the last discussion i asked you for suggestions and we >>>> went the tcx route, which didnt make sense, and then you didnt >>>> respond. >>> [...] >>> >>>>> Also as mentioned earlier I don't think tc should hold references on >>>>> XDP programs in here. It doesn't make any sense aside from the fact >>>>> that the cls_p4 is also not doing anything with it. This is something >>>>> that a user space control plane should be doing i.e. managing a XDP >>>>> link on the target device. >>>> >>>> This is the same argument about layer violation that you made earlier. >>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf >>>> blob(s) but for example in the future (discussions are still ongoing >>>> with vendors who have P4 NICs) a filter could be loaded to also >>>> specify the location of the hardware blob. >>> >>> Ah, so there is a plan to eventually add HW offload support for cls_p4? >>> Or is this only specifiying a location of a blob through some opaque >>> cookie value from user space? >> >> Current thought process is it will be something along these lines (the >> commit provides more details): >> >> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ >> prog type hw filename "mypnameprog.o" ... \ >> prog type xdp obj $PARSER.o section parser/xdp pinned_link >> /sys/fs/bpf/mylink \ >> action bpf obj $PROGNAME.o section prog/tc-ingress >> >> These discussions are still ongoing - but that is the current >> consensus. Note: we are not pushing any code for that, but hope it >> paints the bigger picture.... >> The idea is the cls p4 owns the lifetime of the pipeline. Installing >> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot >> of the refcounts to make sure the pipeline and its components stays >> alive. >> There could be multiple such filters - when someone deletes the last >> filter, then it is safe to delete the pipeline. >> Essentially the filter manages the lifetime of the pipeline. >> >>>> I would be happy with a suggestion that gets us moving forward with >>>> that context in mind. >>> >>> My question on the above is mainly what does it bring you to hold a >>> reference on the XDP program? There is no guarantee that something else >>> will get loaded onto XDP, and then eventually the cls_p4 is the only >>> entity holding the reference but w/o 'purpose'. We do have BPF links >>> and the user space component orchestrating all this needs to create >>> and pin the BPF link in BPF fs, for example. An artificial reference >>> on XDP prog feels similar as if you'd hold a reference on an inode >>> out of tc.. Again, that should be delegated to the control plane you >>> have running interacting with the compiler which then manages and >>> loads its artifacts. What if you would also need to set up some >>> netfilter rules for the SW pipeline, would you then embed this too? >> >> Sorry, a slight tangent first: >> P4 is self-contained, there are a handful of objects that are defined >> by the spec (externs, actions, tables, etc) and we model them in the >> patchset, so that part is self-contained. For the extra richness such >> as the netfilter example you quoted - based on my many years of >> experience deploying SDN - using daemons(sorry if i am reading too >> much in what I think you are implying) for control is not the best >> option i.e you need all kinds of coordination - for example where do >> you store state, what happens when the daemon dies, how do you >> graceful restarts etc. Based on that, if i can put things in the >> kernel (which is essentially a "perpetual daemon", unless the kernel >> crashes) it's a lot simpler to manage as a source of truth especially >> when there is not that much info. There is a limit when there are >> multiple pieces (to use your netfilter example) because you need >> another layer to coordinate things. 'source of truth' for the various attach points or BPF links, yes, but in this case here it is not, since the source of truth on what is attached is not in cls_p4 but rather on the XDP link. How do you handle the case when cls_p4 says something different to what is /actually/ attached? Why is it not enough to establish some convention in user space, to pin the link and retrieve/update from there when needed? Like everyone else does. ... even if you consider iproute2 your "control plane" (which I have the feeling you do)? >> Re: the XDP part - our key reason is mostly managerial, in that the >> filter is the lifetime manager of the pipeline; and that if i dump This is imho the problematic part which feels like square peg in round hole, trying to fit this whole lifetime manager of the pipeline into the cls_p4 filter. We agree to disagree here. Instead of reusing individual building blocks from user space, this tries to cramp control plane parts into the kernel for which its not a great fit with what is build here as-is. >> that filter i can see all the details in regards to the pipeline(tc, >> XDP and in future hw, etc) in one spot. You are right, the link >> pinning is our protection from someone replacing the XDP prog (this >> was a tip from Toke in the early days) and the comparison of tc >> holding inode is apropos. >> There's some history: in the early days we were also using metadata >> which comes from the XDP program at the tc layer if more processing >> was to be done (and there was extra metadata which told us which XDP >> prog produced it which we would vet before trusting the metadata). >> Given all the above, we should still be able to hold this info without >> necessarily holding the extra refcount and be able to see this detail. >> So we can remove the refcounting. > > Daniel? The refcount should definitely be removed, but then again, see the point above in that it is inconsistent information. Why can't this be done in user space with some convention in your user space control plane - if you take iproute2, then why it cannot pin the link in a bpf fs instance and retrieve it from there?
On Tue, Feb 20, 2024 at 10:49 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 2/16/24 10:18 PM, Jamal Hadi Salim wrote: > > On Thu, Jan 25, 2024 at 12:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> On Thu, Jan 25, 2024 at 10:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >>> On 1/24/24 3:40 PM, Jamal Hadi Salim wrote: > >>>> On Wed, Jan 24, 2024 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >>>>> On 1/22/24 8:48 PM, Jamal Hadi Salim wrote: > >>> [...] > >>>>>> > >>>>>> It should also be noted that it is feasible to split some of the ingress > >>>>>> datapath into XDP first and more into TC later (as was shown above for > >>>>>> example where the parser runs at XDP level). YMMV. > >>>>>> Regardless of choice of which scheme to use, none of these will affect > >>>>>> UAPI. It will all depend on whether you generate code to load on XDP vs > >>>>>> tc, etc. > >>>>>> > >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > >>>>> > >>>>> My objections from last iterations still stand, and I also added a nak, > >>>>> so please do not just drop it with new revisions.. from the v10 as you > >>>>> wrote you added further code but despite the various community feedback > >>>>> the design still stands as before, therefore: > >>>>> > >>>>> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> > >>>> > >>>> We didnt make code changes - but did you read the cover letter and the > >>>> extended commentary in this patch's commit log? We should have > >>>> mentioned it in the changes log. It did respond to your comments. > >>>> There's text that says "the filter manages the lifetime of the > >>>> pipeline" - which in the future could include not only tc but XDP but > >>>> also the hardware path (in the form of a file that gets loaded). I am > >>>> not sure if that message is clear. Your angle being this is layer > >>>> violation. In the last discussion i asked you for suggestions and we > >>>> went the tcx route, which didnt make sense, and then you didnt > >>>> respond. > >>> [...] > >>> > >>>>> Also as mentioned earlier I don't think tc should hold references on > >>>>> XDP programs in here. It doesn't make any sense aside from the fact > >>>>> that the cls_p4 is also not doing anything with it. This is something > >>>>> that a user space control plane should be doing i.e. managing a XDP > >>>>> link on the target device. > >>>> > >>>> This is the same argument about layer violation that you made earlier. > >>>> The filter manages the p4 pipeline - i.e it's not just about the ebpf > >>>> blob(s) but for example in the future (discussions are still ongoing > >>>> with vendors who have P4 NICs) a filter could be loaded to also > >>>> specify the location of the hardware blob. > >>> > >>> Ah, so there is a plan to eventually add HW offload support for cls_p4? > >>> Or is this only specifiying a location of a blob through some opaque > >>> cookie value from user space? > >> > >> Current thought process is it will be something along these lines (the > >> commit provides more details): > >> > >> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ > >> prog type hw filename "mypnameprog.o" ... \ > >> prog type xdp obj $PARSER.o section parser/xdp pinned_link > >> /sys/fs/bpf/mylink \ > >> action bpf obj $PROGNAME.o section prog/tc-ingress > >> > >> These discussions are still ongoing - but that is the current > >> consensus. Note: we are not pushing any code for that, but hope it > >> paints the bigger picture.... > >> The idea is the cls p4 owns the lifetime of the pipeline. Installing > >> the filter instantiates the p4 pipeline "simple_l3" and triggers a lot > >> of the refcounts to make sure the pipeline and its components stays > >> alive. > >> There could be multiple such filters - when someone deletes the last > >> filter, then it is safe to delete the pipeline. > >> Essentially the filter manages the lifetime of the pipeline. > >> > >>>> I would be happy with a suggestion that gets us moving forward with > >>>> that context in mind. > >>> > >>> My question on the above is mainly what does it bring you to hold a > >>> reference on the XDP program? There is no guarantee that something else > >>> will get loaded onto XDP, and then eventually the cls_p4 is the only > >>> entity holding the reference but w/o 'purpose'. We do have BPF links > >>> and the user space component orchestrating all this needs to create > >>> and pin the BPF link in BPF fs, for example. An artificial reference > >>> on XDP prog feels similar as if you'd hold a reference on an inode > >>> out of tc.. Again, that should be delegated to the control plane you > >>> have running interacting with the compiler which then manages and > >>> loads its artifacts. What if you would also need to set up some > >>> netfilter rules for the SW pipeline, would you then embed this too? > >> > >> Sorry, a slight tangent first: > >> P4 is self-contained, there are a handful of objects that are defined > >> by the spec (externs, actions, tables, etc) and we model them in the > >> patchset, so that part is self-contained. For the extra richness such > >> as the netfilter example you quoted - based on my many years of > >> experience deploying SDN - using daemons(sorry if i am reading too > >> much in what I think you are implying) for control is not the best > >> option i.e you need all kinds of coordination - for example where do > >> you store state, what happens when the daemon dies, how do you > >> graceful restarts etc. Based on that, if i can put things in the > >> kernel (which is essentially a "perpetual daemon", unless the kernel > >> crashes) it's a lot simpler to manage as a source of truth especially > >> when there is not that much info. There is a limit when there are > >> multiple pieces (to use your netfilter example) because you need > >> another layer to coordinate things. > > 'source of truth' for the various attach points or BPF links, yes, but in > this case here it is not, since the source of truth on what is attached > is not in cls_p4 but rather on the XDP link. How do you handle the case > when cls_p4 says something different to what is /actually/ attached? Why > is it not enough to establish some convention in user space, to pin the > link and retrieve/update from there when needed? Like everyone else does. > ... even if you consider iproute2 your "control plane" (which I have the > feeling you do)? > > >> Re: the XDP part - our key reason is mostly managerial, in that the > >> filter is the lifetime manager of the pipeline; and that if i dump > > This is imho the problematic part which feels like square peg in round > hole, trying to fit this whole lifetime manager of the pipeline into > the cls_p4 filter. We agree to disagree here. Instead of reusing > individual building blocks from user space, this tries to cramp control > plane parts into the kernel for which its not a great fit with what is > build here as-is. > > >> that filter i can see all the details in regards to the pipeline(tc, > >> XDP and in future hw, etc) in one spot. You are right, the link > >> pinning is our protection from someone replacing the XDP prog (this > >> was a tip from Toke in the early days) and the comparison of tc > >> holding inode is apropos. > >> There's some history: in the early days we were also using metadata > >> which comes from the XDP program at the tc layer if more processing > >> was to be done (and there was extra metadata which told us which XDP > >> prog produced it which we would vet before trusting the metadata). > >> Given all the above, we should still be able to hold this info without > >> necessarily holding the extra refcount and be able to see this detail. > >> So we can remove the refcounting. > > > > Daniel? > > The refcount should definitely be removed, but then again, see the point > above in that it is inconsistent information. Why can't this be done in > user space with some convention in your user space control plane - if you > take iproute2, then why it cannot pin the link in a bpf fs instance and > retrieve it from there? Ok, Daniel - let's do this so we can move forward. I am getting exhausted, we've been going at this for a year now. As a compromise: I will remove the support for XDP altogether from the filter. We will still reference the XDP program in the CLI and infact load and pin it that way but the filter will not be adding a refcount in the kernel as in the posted patch. cheers, jamal
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index dd313a727..e4d89bc98 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -692,6 +692,24 @@ enum { #define TCA_MATCHALL_MAX (__TCA_MATCHALL_MAX - 1) +/* P4 classifier */ + +enum { + TCA_P4_UNSPEC, + TCA_P4_CLASSID, + TCA_P4_ACT, + TCA_P4_PNAME, + TCA_P4_PIPEID, + TCA_P4_PROG_FD, + TCA_P4_PROG_NAME, + TCA_P4_PROG_TYPE, + TCA_P4_PROG_ID, + TCA_P4_PAD, + __TCA_P4_MAX, +}; + +#define TCA_P4_MAX (__TCA_P4_MAX - 1) + /* Extended Matches */ struct tcf_ematch_tree_hdr { diff --git a/net/sched/Kconfig b/net/sched/Kconfig index b1798a5e1..b33d790e1 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -565,6 +565,18 @@ config NET_CLS_MATCHALL To compile this code as a module, choose M here: the module will be called cls_matchall. +config NET_CLS_P4 + tristate "P4 classifier" + select NET_CLS + select NET_P4TC + help + If you say Y here, you will be able to bind a P4 pipeline + program. You will need to install a P4 template representing the + program successfully to use this feature. + + To compile this code as a module, choose M here: the module will + be called cls_p4. + config NET_EMATCH bool "Extended Matches" select NET_CLS diff --git a/net/sched/Makefile b/net/sched/Makefile index 581f9dd69..b4f9ef48d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -72,6 +72,7 @@ obj-$(CONFIG_NET_CLS_CGROUP) += cls_cgroup.o obj-$(CONFIG_NET_CLS_BPF) += cls_bpf.o obj-$(CONFIG_NET_CLS_FLOWER) += cls_flower.o obj-$(CONFIG_NET_CLS_MATCHALL) += cls_matchall.o +obj-$(CONFIG_NET_CLS_P4) += cls_p4.o obj-$(CONFIG_NET_EMATCH) += ematch.o obj-$(CONFIG_NET_EMATCH_CMP) += em_cmp.o obj-$(CONFIG_NET_EMATCH_NBYTE) += em_nbyte.o diff --git a/net/sched/cls_p4.c b/net/sched/cls_p4.c new file mode 100644 index 000000000..bd83d4e42 --- /dev/null +++ b/net/sched/cls_p4.c @@ -0,0 +1,450 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * net/sched/cls_p4.c - P4 Classifier + * Copyright (c) 2022-2024, Mojatatu Networks + * Copyright (c) 2022-2024, Intel Corporation. + * Authors: Jamal Hadi Salim <jhs@mojatatu.com> + * Victor Nogueira <victor@mojatatu.com> + * Pedro Tammela <pctammela@mojatatu.com> + */ + +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/module.h> +#include <linux/percpu.h> +#include <linux/bpf.h> +#include <linux/filter.h> + +#include <net/sch_generic.h> +#include <net/pkt_cls.h> + +#include <net/p4tc.h> + +#include "p4tc/trace.h" + +#define CLS_P4_PROG_NAME_LEN 256 + +struct p4tc_bpf_prog { + struct bpf_prog *p4_prog; + const char *p4_prog_name; +}; + +struct cls_p4_head { + struct tcf_exts exts; + struct tcf_result res; + struct rcu_work rwork; + struct p4tc_pipeline *pipeline; + struct p4tc_bpf_prog *prog; + u32 handle; +}; + +static int p4_classify(struct sk_buff *skb, const struct tcf_proto *tp, + struct tcf_result *res) +{ + struct cls_p4_head *head = rcu_dereference_bh(tp->root); + bool at_ingress = skb_at_tc_ingress(skb); + + if (unlikely(!head)) { + pr_err("P4 classifier not found\n"); + return -1; + } + + /* head->prog represents the eBPF program that will be first executed by + * the data plane. It may or may not exist. In addition to head->prog, + * we'll have another eBPF program that will execute after this one in + * the form of a filter action (head->exts). + * head->prog->p4_prog_type == BPf_PROG_TYPE_SCHED_ACT means this + * program executes in TC P4 filter. + * head->prog->p4_prog_type == BPf_PROG_TYPE_SCHED_XDP means this + * program was loaded in XDP. + */ + if (head->prog) { + int rc = TC_ACT_PIPE; + + /* If eBPF program is loaded into TC */ + if (head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT) { + if (at_ingress) { + /* It is safe to push/pull even if + * skb_shared() + */ + __skb_push(skb, skb->mac_len); + bpf_compute_data_pointers(skb); + rc = bpf_prog_run(head->prog->p4_prog, + skb); + __skb_pull(skb, skb->mac_len); + } else { + bpf_compute_data_pointers(skb); + rc = bpf_prog_run(head->prog->p4_prog, + skb); + } + } + + if (rc != TC_ACT_PIPE) + return rc; + } + + trace_p4_classify(skb, head->pipeline); + + *res = head->res; + + return tcf_exts_exec(skb, &head->exts, res); +} + +static int p4_init(struct tcf_proto *tp) +{ + return 0; +} + +static void p4_bpf_prog_destroy(struct p4tc_bpf_prog *prog) +{ + bpf_prog_put(prog->p4_prog); + kfree(prog->p4_prog_name); + kfree(prog); +} + +static void __p4_destroy(struct cls_p4_head *head) +{ + tcf_exts_destroy(&head->exts); + tcf_exts_put_net(&head->exts); + if (head->prog) + p4_bpf_prog_destroy(head->prog); + p4tc_pipeline_put(head->pipeline); + kfree(head); +} + +static void p4_destroy_work(struct work_struct *work) +{ + struct cls_p4_head *head = + container_of(to_rcu_work(work), struct cls_p4_head, rwork); + + rtnl_lock(); + __p4_destroy(head); + rtnl_unlock(); +} + +static void p4_destroy(struct tcf_proto *tp, bool rtnl_held, + struct netlink_ext_ack *extack) +{ + struct cls_p4_head *head = rtnl_dereference(tp->root); + + if (!head) + return; + + tcf_unbind_filter(tp, &head->res); + + if (tcf_exts_get_net(&head->exts)) + tcf_queue_work(&head->rwork, p4_destroy_work); + else + __p4_destroy(head); +} + +static void *p4_get(struct tcf_proto *tp, u32 handle) +{ + struct cls_p4_head *head = rtnl_dereference(tp->root); + + if (head && head->handle == handle) + return head; + + return NULL; +} + +static const struct nla_policy p4_policy[TCA_P4_MAX + 1] = { + [TCA_P4_UNSPEC] = { .type = NLA_UNSPEC }, + [TCA_P4_CLASSID] = { .type = NLA_U32 }, + [TCA_P4_ACT] = { .type = NLA_NESTED }, + [TCA_P4_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ }, + [TCA_P4_PIPEID] = { .type = NLA_U32 }, + [TCA_P4_PROG_FD] = { .type = NLA_U32 }, + [TCA_P4_PROG_NAME] = { .type = NLA_STRING, + .len = CLS_P4_PROG_NAME_LEN }, + [TCA_P4_PROG_TYPE] = { .type = NLA_U32 }, +}; + +static int cls_p4_prog_from_efd(struct nlattr **tb, + struct p4tc_bpf_prog *prog, u32 flags, + struct netlink_ext_ack *extack) +{ + struct bpf_prog *fp; + u32 prog_type; + char *name; + u32 bpf_fd; + + bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]); + prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]); + + if (prog_type != BPF_PROG_TYPE_XDP && + prog_type != BPF_PROG_TYPE_SCHED_ACT) { + NL_SET_ERR_MSG(extack, + "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP"); + return -EINVAL; + } + + fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false); + if (IS_ERR(fp)) + return PTR_ERR(fp); + + name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL); + if (!name) { + bpf_prog_put(fp); + return -ENOMEM; + } + + prog->p4_prog_name = name; + prog->p4_prog = fp; + + return 0; +} + +static int p4_set_parms(struct net *net, struct tcf_proto *tp, + struct cls_p4_head *head, unsigned long base, + struct nlattr **tb, struct nlattr *est, u32 flags, + struct netlink_ext_ack *extack) +{ + bool load_bpf_prog = tb[TCA_P4_PROG_NAME] && tb[TCA_P4_PROG_FD] && + tb[TCA_P4_PROG_TYPE]; + struct p4tc_bpf_prog *prog = NULL; + int err; + + err = tcf_exts_validate_ex(net, tp, tb, est, &head->exts, flags, 0, + extack); + if (err < 0) + return err; + + if (load_bpf_prog) { + prog = kzalloc(sizeof(*prog), GFP_KERNEL); + if (!prog) { + err = -ENOMEM; + goto exts_destroy; + } + + err = cls_p4_prog_from_efd(tb, prog, flags, extack); + if (err < 0) { + kfree(prog); + goto exts_destroy; + } + } + + if (tb[TCA_P4_CLASSID]) { + head->res.classid = nla_get_u32(tb[TCA_P4_CLASSID]); + tcf_bind_filter(tp, &head->res, base); + } + + if (load_bpf_prog) { + if (head->prog) { + pr_notice("cls_p4: Substituting old BPF program with id %u with new one with id %u\n", + head->prog->p4_prog->aux->id, + prog->p4_prog->aux->id); + p4_bpf_prog_destroy(head->prog); + } + head->prog = prog; + } + + return 0; + +exts_destroy: + tcf_exts_destroy(&head->exts); + return err; +} + +static int p4_change(struct net *net, struct sk_buff *in_skb, + struct tcf_proto *tp, unsigned long base, u32 handle, + struct nlattr **tca, void **arg, u32 flags, + struct netlink_ext_ack *extack) +{ + struct cls_p4_head *head = rtnl_dereference(tp->root); + struct p4tc_pipeline *pipeline = NULL; + struct nlattr *tb[TCA_P4_MAX + 1]; + struct cls_p4_head *new_cls; + char *pname = NULL; + u32 pipeid = 0; + int err; + + if (!tca[TCA_OPTIONS]) { + NL_SET_ERR_MSG(extack, "Must provide pipeline options"); + return -EINVAL; + } + + if (head) + return -EEXIST; + + err = nla_parse_nested(tb, TCA_P4_MAX, tca[TCA_OPTIONS], p4_policy, + extack); + if (err < 0) + return err; + + if (tb[TCA_P4_PNAME]) + pname = nla_data(tb[TCA_P4_PNAME]); + + if (tb[TCA_P4_PIPEID]) + pipeid = nla_get_u32(tb[TCA_P4_PIPEID]); + + pipeline = p4tc_pipeline_find_get(net, pname, pipeid, extack); + if (IS_ERR(pipeline)) + return PTR_ERR(pipeline); + + if (!p4tc_pipeline_sealed(pipeline)) { + err = -EINVAL; + NL_SET_ERR_MSG(extack, "Pipeline must be sealed before use"); + goto pipeline_put; + } + + new_cls = kzalloc(sizeof(*new_cls), GFP_KERNEL); + if (!new_cls) { + err = -ENOMEM; + goto pipeline_put; + } + + err = tcf_exts_init(&new_cls->exts, net, TCA_P4_ACT, 0); + if (err) + goto err_exts_init; + + if (!handle) + handle = 1; + + new_cls->handle = handle; + + err = p4_set_parms(net, tp, new_cls, base, tb, tca[TCA_RATE], flags, + extack); + if (err) + goto err_set_parms; + + new_cls->pipeline = pipeline; + *arg = head; + rcu_assign_pointer(tp->root, new_cls); + return 0; + +err_set_parms: + tcf_exts_destroy(&new_cls->exts); +err_exts_init: + kfree(new_cls); +pipeline_put: + p4tc_pipeline_put(pipeline); + return err; +} + +static int p4_delete(struct tcf_proto *tp, void *arg, bool *last, + bool rtnl_held, struct netlink_ext_ack *extack) +{ + *last = true; + return 0; +} + +static void p4_walk(struct tcf_proto *tp, struct tcf_walker *arg, + bool rtnl_held) +{ + struct cls_p4_head *head = rtnl_dereference(tp->root); + + if (arg->count < arg->skip) + goto skip; + + if (!head) + return; + if (arg->fn(tp, head, arg) < 0) + arg->stop = 1; +skip: + arg->count++; +} + +static int p4_prog_dump(struct sk_buff *skb, struct p4tc_bpf_prog *prog) +{ + unsigned char *b = nlmsg_get_pos(skb); + + if (nla_put_u32(skb, TCA_P4_PROG_ID, prog->p4_prog->aux->id)) + goto nla_put_failure; + + if (nla_put_string(skb, TCA_P4_PROG_NAME, prog->p4_prog_name)) + goto nla_put_failure; + + if (nla_put_u32(skb, TCA_P4_PROG_TYPE, prog->p4_prog->type)) + goto nla_put_failure; + + return 0; + +nla_put_failure: + nlmsg_trim(skb, b); + return -1; +} + +static int p4_dump(struct net *net, struct tcf_proto *tp, void *fh, + struct sk_buff *skb, struct tcmsg *t, bool rtnl_held) +{ + struct cls_p4_head *head = fh; + struct nlattr *nest; + + if (!head) + return skb->len; + + t->tcm_handle = head->handle; + + nest = nla_nest_start(skb, TCA_OPTIONS); + if (!nest) + goto nla_put_failure; + + if (nla_put_string(skb, TCA_P4_PNAME, head->pipeline->common.name)) + goto nla_put_failure; + + if (head->res.classid && + nla_put_u32(skb, TCA_P4_CLASSID, head->res.classid)) + goto nla_put_failure; + + if (head->prog && p4_prog_dump(skb, head->prog)) + goto nla_put_failure; + + if (tcf_exts_dump(skb, &head->exts)) + goto nla_put_failure; + + nla_nest_end(skb, nest); + + if (tcf_exts_dump_stats(skb, &head->exts) < 0) + goto nla_put_failure; + + return skb->len; + +nla_put_failure: + nla_nest_cancel(skb, nest); + return -1; +} + +static void p4_bind_class(void *fh, u32 classid, unsigned long cl, void *q, + unsigned long base) +{ + struct cls_p4_head *head = fh; + + if (head && head->res.classid == classid) { + if (cl) + __tcf_bind_filter(q, &head->res, base); + else + __tcf_unbind_filter(q, &head->res); + } +} + +static struct tcf_proto_ops cls_p4_ops __read_mostly = { + .kind = "p4", + .classify = p4_classify, + .init = p4_init, + .destroy = p4_destroy, + .get = p4_get, + .change = p4_change, + .delete = p4_delete, + .walk = p4_walk, + .dump = p4_dump, + .bind_class = p4_bind_class, + .owner = THIS_MODULE, +}; + +static int __init cls_p4_init(void) +{ + return register_tcf_proto_ops(&cls_p4_ops); +} + +static void __exit cls_p4_exit(void) +{ + unregister_tcf_proto_ops(&cls_p4_ops); +} + +module_init(cls_p4_init); +module_exit(cls_p4_exit); + +MODULE_AUTHOR("Mojatatu Networks"); +MODULE_DESCRIPTION("P4 Classifier"); +MODULE_LICENSE("GPL"); diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile index 73ccb53c4..04302a3ac 100644 --- a/net/sched/p4tc/Makefile +++ b/net/sched/p4tc/Makefile @@ -1,6 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 +CFLAGS_trace.o := -I$(src) + obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \ p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \ - p4tc_filter.o p4tc_runtime_api.o + p4tc_filter.o p4tc_runtime_api.o trace.o obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o diff --git a/net/sched/p4tc/trace.c b/net/sched/p4tc/trace.c new file mode 100644 index 000000000..683313407 --- /dev/null +++ b/net/sched/p4tc/trace.c @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +#include <net/p4tc.h> + +#ifndef __CHECKER__ + +#define CREATE_TRACE_POINTS +#include "trace.h" +EXPORT_TRACEPOINT_SYMBOL_GPL(p4_classify); +#endif diff --git a/net/sched/p4tc/trace.h b/net/sched/p4tc/trace.h new file mode 100644 index 000000000..80abec13b --- /dev/null +++ b/net/sched/p4tc/trace.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM p4tc + +#if !defined(__P4TC_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ) +#define __P4TC_TRACE_H + +#include <linux/tracepoint.h> + +struct p4tc_pipeline; + +TRACE_EVENT(p4_classify, + TP_PROTO(struct sk_buff *skb, struct p4tc_pipeline *pipeline), + + TP_ARGS(skb, pipeline), + + TP_STRUCT__entry(__string(pname, pipeline->common.name) + __field(u32, p_id) + __field(u32, ifindex) + __field(u32, ingress) + ), + + TP_fast_assign(__assign_str(pname, pipeline->common.name); + __entry->p_id = pipeline->common.p_id; + __entry->ifindex = skb->dev->ifindex; + __entry->ingress = skb_at_tc_ingress(skb); + ), + + TP_printk("dev=%u dir=%s pipeline=%s p_id=%u", + __entry->ifindex, + __entry->ingress ? "ingress" : "egress", + __get_str(pname), + __entry->p_id + ) +); + +#endif + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace + +#include <trace/define_trace.h>