Message ID | 20230124170346.316866-1-jhs@mojatatu.com (mailing list archive)
---|---
Series | Introducing P4TC
On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> There have been many discussions and meetings since about 2015 in regards to
> P4 over TC and now that the market has chosen P4 as the datapath specification
> lingua franca

Which market?

Barely anyone understands the existing TC offloads. We'd need strong,
and practical reasons to merge this. Speaking with my "have suffered
thru the TC offloads working for a vendor" hat on, not the "junior
maintainer" hat.
On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > There have been many discussions and meetings since about 2015 in regards to
> > P4 over TC and now that the market has chosen P4 as the datapath specification
> > lingua franca
>
> Which market?

Network programmability involving hardware - where at minimal the
specification of the datapath is in P4 and often the implementation is.
For samples of specification using P4 (that are public) see for example
MS Azure: https://github.com/sonic-net/DASH/tree/main/dash-pipeline

If you are a vendor and want to sell a NIC in that space, the spec you
get is in P4. Your underlying hardware doesn't have to be P4 native, but
at minimal the abstraction (as we are trying to provide with P4TC) has
to be able to consume the P4 specification.

For implementations where P4 is in use, there are many - some public,
others not; sample space:
https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks

There are NICs and switches which are P4 native in the market. IOW,
there is beaucoup $ investment in this space that makes it worth
pursuing.

TC is the kernel offload mechanism that has gathered deployment
experience over many years - hence P4TC.

> Barely anyone understands the existing TC offloads.

Hyperboles like these are never helpful in a discussion. TC offloads are
deployed today, they work, and many folks are actively working on them.
Are there challenges? Yes. For one (and this applies to all kernel
offloads) the process gets in the way of exposing new features. So there
are learnings that we try to resolve in P4TC. I'd be curious to hear
about your suffering with TC offloads and see if we can take that
experience and make things better.

> We'd need strong, and practical reasons to merge this. Speaking with my
> "have suffered thru the TC offloads working for a vendor" hat on, not
> the "junior maintainer" hat.

P4TC is "standalone" in that it does not affect the performance of other
TC consumers or any other subsystems; it is also sufficiently isolated
in that you can choose to compile it out altogether, and more
importantly it comes with committed support. And I should emphasize that
this discussion on getting P4 on TC has been going on for a few years in
the community, culminating with this.

cheers,
jamal
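For readers who have not looked at a spec like DASH, the fragment below is a minimal, illustrative P4_16 sketch of what "the spec you get is in P4" means in practice: a match-action table bound to actions that rewrite headers and pick an egress port. It is not taken from DASH; the header/metadata types, field names and the table are invented, and a v1model-style interface is assumed.

```p4
// Illustrative only; not from DASH. headers_t/metadata_t and all field
// names are assumed user-defined types; the interface follows v1model.
control MainControl(inout headers_t hdr,
                    inout metadata_t meta,
                    inout standard_metadata_t std_meta) {

    // Rewrite the destination MAC and choose an egress port.
    action set_nexthop(bit<48> dmac, bit<9> port) {
        hdr.ethernet.dstAddr = dmac;
        std_meta.egress_spec = port;
    }

    action drop_packet() {
        mark_to_drop(std_meta);
    }

    // FIB-style table: longest-prefix match on the destination address.
    table ipv4_fib {
        key = { hdr.ipv4.dstAddr : lpm; }
        actions = { set_nexthop; drop_packet; }
        default_action = drop_packet();
        size = 1024;
    }

    apply {
        if (hdr.ipv4.isValid()) {
            ipv4_fib.apply();
        }
    }
}
```

Table entries are not part of the program; they are installed at runtime by a control plane, which is the part a kernel abstraction such as P4TC would have to expose.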
On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote: > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > There have been many discussions and meetings since about 2015 in regards to > > > P4 over TC and now that the market has chosen P4 as the datapath specification > > > lingua franca > > > > Which market? > > Network programmability involving hardware - where at minimal the > specification of the datapath is in P4 and > often the implementation is. For samples of specification using P4 > (that are public) see for example MS Azure: > https://github.com/sonic-net/DASH/tree/main/dash-pipeline That's an IPU thing? > If you are a vendor and want to sell a NIC in that space, the spec you > get is in P4. s/NIC/IPU/ ? > Your underlying hardware > doesnt have to be P4 native, but at minimal the abstraction (as we are > trying to provide with P4TC) has to be > able to consume the P4 specification. P4 is certainly an option, especially for specs, but I haven't seen much adoption myself. What's the benefit / use case? > For implementations where P4 is in use, there are many - some public > others not, sample space: > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks Hyper-scaler proprietary. > There are NICs and switches which are P4 native in the market. Link to docs? > IOW, there is beacoup $ investment in this space that makes it worth pursuing. Pursuing $ is good! But the community IMO should maximize a different function. > TC is the kernel offload mechanism that has gathered deployment > experience over many years - hence P4TC. I don't wanna argue. I thought it'd be more fair towards you if I made my lack of conviction known, rather than sit quiet and ignore it since it's just an RFC.
Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
>On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
>> There have been many discussions and meetings since about 2015 in regards to
>> P4 over TC and now that the market has chosen P4 as the datapath specification
>> lingua franca
>
>Which market?
>
>Barely anyone understands the existing TC offloads. We'd need strong,
>and practical reasons to merge this. Speaking with my "have suffered
>thru the TC offloads working for a vendor" hat on, not the "junior
>maintainer" hat.

You talk about offload, yet I don't see any offload code in this RFC.
It's pure sw implementation.

But speaking about offload, how exactly do you plan to offload this
Jamal? AFAIK there is some HW-specific compiler magic needed to generate
HW acceptable blob. How exactly do you plan to deliver it to the driver?
If HW offload offload is the motivation for this RFC work and we cannot
pass the TC in kernel objects to drivers, I fail to see why exactly do
you need the SW implementation...
On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote: > > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: [..] > > Network programmability involving hardware - where at minimal the > > specification of the datapath is in P4 and > > often the implementation is. For samples of specification using P4 > > (that are public) see for example MS Azure: > > https://github.com/sonic-net/DASH/tree/main/dash-pipeline > > That's an IPU thing? > Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays a role there. > > If you are a vendor and want to sell a NIC in that space, the spec you > > get is in P4. > > s/NIC/IPU/ ? I do believe that one can write a P4 program to express things a regular NIC could express that may be harder to expose with current interfaces. > > Your underlying hardware > > doesnt have to be P4 native, but at minimal the abstraction (as we are > > trying to provide with P4TC) has to be > > able to consume the P4 specification. > > P4 is certainly an option, especially for specs, but I haven't seen much > adoption myself. The xPU market outside of hyper-scalers is emerging now. Hyperscalers looking at xPUs are looking at P4 as the datapath language - that sets the trend forward to large enterprises. That's my experience. Some of the vendors on the Cc should be able to point to adoption. Anjali? Matty? > What's the benefit / use case? Of P4 or xPUs? Unified approach to standardize how a datapath is defined is a value for P4. Providing a singular abstraction via the kernel (as opposed to every vendor pitching their API) is what the kernel brings. > > For implementations where P4 is in use, there are many - some public > > others not, sample space: > > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks > > Hyper-scaler proprietary. The control abstraction (P4 runtime) is certainly not proprietary. The datapath that is targetted by the runtime is. Hopefully we can fix that with P4TC. The majority of the discussions i have with some of the folks who do kernel bypass have one theme in common: The kernel process is just too long. Trying to add one feature to flower could take anywhere from 6 months to 3 years to finally show up in some supported distro. With P4TC we are taking the approach of scriptability to allow for speacilized datapaths (which P4 excels in). The google datapath maybe proprietary while their hardware may even(or not) be using native P4 - but the important detail is we have _a way_ to abstract those datapaths. > > There are NICs and switches which are P4 native in the market. > > Link to docs? > Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The point is to bring them together under the linux umbrella. > > IOW, there is beacoup $ investment in this space that makes it worth pursuing. > > Pursuing $ is good! But the community IMO should maximize > a different function. While I agree $ is not the primary motivator it is a factor, it is a good indicator. No different than the network stack being tweaked to do certain things that certain hyperscalers need because they invest $. I have no problems with a large harmonious tent. cheers, jamal > > TC is the kernel offload mechanism that has gathered deployment > > experience over many years - hence P4TC. > > I don't wanna argue. 
> I thought it'd be more fair towards you if I made my lack of conviction
> known, rather than sit quiet and ignore it since it's just an RFC.
On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > >> There have been many discussions and meetings since about 2015 in regards to > >> P4 over TC and now that the market has chosen P4 as the datapath specification > >> lingua franca > > > >Which market? > > > >Barely anyone understands the existing TC offloads. We'd need strong, > >and practical reasons to merge this. Speaking with my "have suffered > >thru the TC offloads working for a vendor" hat on, not the "junior > >maintainer" hat. > > You talk about offload, yet I don't see any offload code in this RFC. > It's pure sw implementation. > > But speaking about offload, how exactly do you plan to offload this > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > HW acceptable blob. How exactly do you plan to deliver it to the driver? > If HW offload offload is the motivation for this RFC work and we cannot > pass the TC in kernel objects to drivers, I fail to see why exactly do > you need the SW implementation... Our rule in TC is: _if you want to offload using TC you must have a s/w equivalent_. We enforced this rule multiple times (as you know). P4TC has a sw equivalent to whatever the hardware would do. We are pushing that first. Regardless, it has value on its own merit: I can run P4 equivalent in s/w in a scriptable (as in no compilation in the same spirit as u32 and pedit), by programming the kernel datapath without changing any kernel code. To answer your question in regards to what the interfaces "P4 speaking" hardware or drivers are going to be programmed, there are discussions going on right now: There is a strong leaning towards devlink for the hardware side loading.... The idea from the driver side is to reuse the tc ndos. We have biweekly meetings which are open. We do have Nvidia folks, but would be great if we can have you there. Let me find the link and send it to you. Do note however, our goal is to get s/w first as per tradition of other offloads with TC . cheers, jamal
On 01/27, Jamal Hadi Salim wrote: > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > >> There have been many discussions and meetings since about 2015 in > regards to > > >> P4 over TC and now that the market has chosen P4 as the datapath > specification > > >> lingua franca > > > > > >Which market? > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > >and practical reasons to merge this. Speaking with my "have suffered > > >thru the TC offloads working for a vendor" hat on, not the "junior > > >maintainer" hat. > > > > You talk about offload, yet I don't see any offload code in this RFC. > > It's pure sw implementation. > > > > But speaking about offload, how exactly do you plan to offload this > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > If HW offload offload is the motivation for this RFC work and we cannot > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > you need the SW implementation... > Our rule in TC is: _if you want to offload using TC you must have a > s/w equivalent_. > We enforced this rule multiple times (as you know). > P4TC has a sw equivalent to whatever the hardware would do. We are > pushing that > first. Regardless, it has value on its own merit: > I can run P4 equivalent in s/w in a scriptable (as in no compilation > in the same spirit as u32 and pedit), > by programming the kernel datapath without changing any kernel code. Not to derail too much, but maybe you can clarify the following for me: In my (in)experience, P4 is usually constrained by the vendor specific extensions. So how real is that goal where we can have a generic P4@TC with an option to offload? In my view, the reality (at least currently) is that there are NIC-specific P4 programs which won't have a chance of running generically at TC (unless we implement those vendor extensions). And regarding custom parser, someone has to ask that 'what about bpf question': let's say we have a P4 frontend at TC, can we use bpfilter-like usermode helper to transparently compile it to bpf (for SW path) instead inventing yet another packet parser? Wrestling with the verifier won't be easy here, but I trust it more than this new kParser. > To answer your question in regards to what the interfaces "P4 > speaking" hardware or drivers > are going to be programmed, there are discussions going on right now: > There is a strong > leaning towards devlink for the hardware side loading.... The idea > from the driver side is to > reuse the tc ndos. > We have biweekly meetings which are open. We do have Nvidia folks, but > would be great if > we can have you there. Let me find the link and send it to you. > Do note however, our goal is to get s/w first as per tradition of > other offloads with TC . > cheers, > jamal
On 1/27/23 9:04 PM, Jamal Hadi Salim wrote:
> On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
>>> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
>>>> There have been many discussions and meetings since about 2015 in regards to
>>>> P4 over TC and now that the market has chosen P4 as the datapath specification
>>>> lingua franca
>>>
>>> Which market?
>>>
>>> Barely anyone understands the existing TC offloads. We'd need strong,
>>> and practical reasons to merge this. Speaking with my "have suffered
>>> thru the TC offloads working for a vendor" hat on, not the "junior
>>> maintainer" hat.
>>
>> You talk about offload, yet I don't see any offload code in this RFC.
>> It's pure sw implementation.
>>
>> But speaking about offload, how exactly do you plan to offload this
>> Jamal? AFAIK there is some HW-specific compiler magic needed to generate
>> HW acceptable blob. How exactly do you plan to deliver it to the driver?
>> If HW offload offload is the motivation for this RFC work and we cannot
>> pass the TC in kernel objects to drivers, I fail to see why exactly do
>> you need the SW implementation...
>
> Our rule in TC is: _if you want to offload using TC you must have a
> s/w equivalent_.
> We enforced this rule multiple times (as you know).
> P4TC has a sw equivalent to whatever the hardware would do. We are pushing that
> first. Regardless, it has value on its own merit:
> I can run P4 equivalent in s/w in a scriptable (as in no compilation
> in the same spirit as u32 and pedit),

`62001 insertions(+), 45 deletions(-)` and more to come for a software
datapath which in the end no-one will use (assuming you'll have the hw
offloads) is a pretty heavy lift.. imo the layer of abstraction is wrong
here as Stan hinted. What if tomorrow the P4 programming language is not
the 'lingua franca' anymore and something else comes along? Then all of
it is still baked into uapi instead of having a generic/versatile
intermediate layer.
On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote: > > On 01/27, Jamal Hadi Salim wrote: > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > >> There have been many discussions and meetings since about 2015 in > > regards to > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > specification > > > >> lingua franca > > > > > > > >Which market? > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > >and practical reasons to merge this. Speaking with my "have suffered > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > >maintainer" hat. > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > It's pure sw implementation. > > > > > > But speaking about offload, how exactly do you plan to offload this > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > If HW offload offload is the motivation for this RFC work and we cannot > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > you need the SW implementation... > > > Our rule in TC is: _if you want to offload using TC you must have a > > s/w equivalent_. > > We enforced this rule multiple times (as you know). > > P4TC has a sw equivalent to whatever the hardware would do. We are > > pushing that > > first. Regardless, it has value on its own merit: > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > in the same spirit as u32 and pedit), > > by programming the kernel datapath without changing any kernel code. > > Not to derail too much, but maybe you can clarify the following for me: > In my (in)experience, P4 is usually constrained by the vendor > specific extensions. So how real is that goal where we can have a generic > P4@TC with an option to offload? In my view, the reality (at least > currently) is that there are NIC-specific P4 programs which won't have > a chance of running generically at TC (unless we implement those vendor > extensions). > > And regarding custom parser, someone has to ask that 'what about bpf > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > usermode helper to transparently compile it to bpf (for SW path) instead > inventing yet another packet parser? Wrestling with the verifier won't be > easy here, but I trust it more than this new kParser. Yes, wrestling with the verifier is tricky, however we do have a solution to compile arbitrarily complex parsers into eBFP. We presented this work at Netdev 0x15 https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser. Of course this has the obvious advantage that we don't have to change the kernel (however, as we talk about in the presentation, this method actually produces a faster more extensible parser than flow dissector, so it's still on my radar to replace flow dissector itself with an eBPF parser :-) ) The value of kParser is that it is not compiled code, but dynamically scriptable. It's much easier to change on the fly and depends on a CLI interface which works well with P4TC. 
The front end is the same as what we are using for the PANDA parser,
that is, the same parser frontend (in C code or other) can be compiled
into XDP/eBPF, kParser CLI, or other targets (this is based on
establishing an IR which we talked about in
https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers).

Tom

> >
> > To answer your question in regards to what the interfaces "P4
> > speaking" hardware or drivers are going to be programmed, there are
> > discussions going on right now: There is a strong leaning towards
> > devlink for the hardware side loading.... The idea from the driver
> > side is to reuse the tc ndos.
> > We have biweekly meetings which are open. We do have Nvidia folks, but
> > would be great if we can have you there. Let me find the link and send
> > it to you.
> > Do note however, our goal is to get s/w first as per tradition of
> > other offloads with TC .
>
> > cheers,
> > jamal
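To make the parser discussion concrete, here is a small illustrative P4_16 parser (header and type names invented, v1model-style signature assumed). Per the messages above, a parse graph of this shape can either be compiled into an unrolled XDP/eBPF program (the PANDA approach) or expressed as parse nodes driven through a CLI (the kParser approach); the sketch only shows the source-level shape, not either backend.

```p4
// Illustrative parse graph: Ethernet -> optional VLAN -> IPv4, then hand
// off to match-action. All header/type names are invented.
parser MainParser(packet_in pkt,
                  out headers_t hdr,
                  inout metadata_t meta,
                  inout standard_metadata_t std_meta) {

    state start {
        pkt.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            0x8100: parse_vlan;
            0x0800: parse_ipv4;
            default: accept;
        }
    }

    state parse_vlan {
        pkt.extract(hdr.vlan);
        transition select(hdr.vlan.etherType) {
            0x0800: parse_ipv4;
            default: accept;
        }
    }

    state parse_ipv4 {
        pkt.extract(hdr.ipv4);
        transition accept;
    }
}
```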
On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > On 01/27, Jamal Hadi Salim wrote: > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > >> There have been many discussions and meetings since about 2015 in > > regards to > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > specification > > > >> lingua franca > > > > > > > >Which market? > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > >and practical reasons to merge this. Speaking with my "have suffered > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > >maintainer" hat. > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > It's pure sw implementation. > > > > > > But speaking about offload, how exactly do you plan to offload this > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > If HW offload offload is the motivation for this RFC work and we cannot > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > you need the SW implementation... > > > Our rule in TC is: _if you want to offload using TC you must have a > > s/w equivalent_. > > We enforced this rule multiple times (as you know). > > P4TC has a sw equivalent to whatever the hardware would do. We are > > pushing that > > first. Regardless, it has value on its own merit: > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > in the same spirit as u32 and pedit), > > by programming the kernel datapath without changing any kernel code. > > Not to derail too much, but maybe you can clarify the following for me: > In my (in)experience, P4 is usually constrained by the vendor > specific extensions. So how real is that goal where we can have a generic > P4@TC with an option to offload? In my view, the reality (at least > currently) is that there are NIC-specific P4 programs which won't have > a chance of running generically at TC (unless we implement those vendor > extensions). We are going to implement all the PSA/PNA externs. Most of these programs tend to be set or ALU operations on headers or metadata which we can handle. Do you have any examples of NIC-vendor-specific features that cant be generalized? > And regarding custom parser, someone has to ask that 'what about bpf > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > usermode helper to transparently compile it to bpf (for SW path) instead > inventing yet another packet parser? Wrestling with the verifier won't be > easy here, but I trust it more than this new kParser. > We dont compile anything, the parser (and rest of infra) is scriptable. cheers, jamal
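As a hedged illustration of the "set or ALU operations on headers or metadata" claim above, the sketch below shows the kind of actions most such P4 programs reduce to. All header and metadata fields are invented and a v1model-style interface is assumed; it is not taken from any vendor program.

```p4
// Illustrative only: constant stores, field-to-field copies and simple
// arithmetic on header/metadata fields.
control SetAluExamples(inout headers_t hdr,
                       inout metadata_t meta,
                       inout standard_metadata_t std_meta) {

    // Strip an (assumed) outer header and choose an egress port.
    action decap_and_forward(bit<9> port) {
        hdr.outer_ipv4.setInvalid();
        std_meta.egress_spec = port;
    }

    // ALU op on a header field, constant store, and a metadata update.
    action rewrite_ttl_and_mark(bit<8> dscp) {
        hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
        hdr.ipv4.diffserv = dscp;
        meta.flow_mark = (bit<32>)hdr.ipv4.ttl;
    }

    apply {
        // Tables binding these actions are omitted for brevity.
    }
}
```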
On Fri, Jan 27, 2023 at 6:02 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 1/27/23 9:04 PM, Jamal Hadi Salim wrote: > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > >> Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > >>> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > >>>> There have been many discussions and meetings since about 2015 in regards to > >>>> P4 over TC and now that the market has chosen P4 as the datapath specification > >>>> lingua franca > >>> > >>> Which market? > >>> [..] > > > > Our rule in TC is: _if you want to offload using TC you must have a > > s/w equivalent_. > > We enforced this rule multiple times (as you know). > > P4TC has a sw equivalent to whatever the hardware would do. We are pushing that > > first. Regardless, it has value on its own merit: > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > in the same spirit as u32 and pedit), > > `62001 insertions(+), 45 deletions(-)` and more to come for a software > datapath which in the end no-one will use (assuming you'll have the hw > offloads) is a pretty heavy lift.. I am not sure i fully parsed what you said - but the sw stands on its own merit. The consumption of P4 specification is one - but ability to define arbitrary pipelines without changing the kernel code (u32/pedit like, etc) is of value. Note (in case i misunderstood what you are saying): As mentioned there is commitment to support; its clean standalone and can be compiled out and even when compiled in has no effect on the rest of the code performance or otherwise. > imo the layer of abstraction is wrong > here as Stan hinted. What if tomorrow P4 programming language is not the > 'lingua franca' anymore and something else comes along? Then all of it is > still baked into uapi instead of having a generic/versatile intermediate > later. Match-action pipeline as an approach to defining datapaths is what we implement here. It is what P4 defines. I dont think P4 covers everything that is needed under the shining sun but a lot of effort has gone into standardizing common things. And if there are gaps we fill them. That is a solid, well understood way to build hardware and sw (TC has been around all these years implementing that paradigm). So that is the intended abstraction being implemented. The interface is designed to be scriptable to remove the burden of making kernel (and btw user space as well to iproute2) changes for new processing functions (whether in s/w or hardware). cheers, jamal
On Fri, Jan 27, 2023 at 3:06 PM Tom Herbert <tom@sipanda.io> wrote: > > On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote: > > > > On 01/27, Jamal Hadi Salim wrote: > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > >> There have been many discussions and meetings since about 2015 in > > > regards to > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > specification > > > > >> lingua franca > > > > > > > > > >Which market? > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > >maintainer" hat. > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > It's pure sw implementation. > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > you need the SW implementation... > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > s/w equivalent_. > > > We enforced this rule multiple times (as you know). > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > pushing that > > > first. Regardless, it has value on its own merit: > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > in the same spirit as u32 and pedit), > > > by programming the kernel datapath without changing any kernel code. > > > > Not to derail too much, but maybe you can clarify the following for me: > > In my (in)experience, P4 is usually constrained by the vendor > > specific extensions. So how real is that goal where we can have a generic > > P4@TC with an option to offload? In my view, the reality (at least > > currently) is that there are NIC-specific P4 programs which won't have > > a chance of running generically at TC (unless we implement those vendor > > extensions). > > > > And regarding custom parser, someone has to ask that 'what about bpf > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > usermode helper to transparently compile it to bpf (for SW path) instead > > inventing yet another packet parser? Wrestling with the verifier won't be > > easy here, but I trust it more than this new kParser. > > Yes, wrestling with the verifier is tricky, however we do have a > solution to compile arbitrarily complex parsers into eBFP. We > presented this work at Netdev 0x15 > https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser. Thanks Tom, I'll check it out. 
I've yet to go through the netdev recordings :-( > Of course this has the obvious advantage that we don't have to change > the kernel (however, as we talk about in the presentation, this method > actually produces a faster more extensible parser than flow dissector, > so it's still on my radar to replace flow dissector itself with an > eBPF parser :-) ) Since there is already a bpf flow dissector, I'm assuming you're talking about replacing the existing C flow dissector with a PANDA-based one? I was hoping that at some point, we can have a BPF flow dissector program that supports everything the existing C-one does, and maybe we can ship this program with the kernel and load it by default. We can keep the C-based one for some minimal non-bpf configurations. But idk, the benefit is not 100% clear to me; except maybe bpf-based flow dissector can be treated as more "secure" due to all verifier constraints... > The value of kParser is that it is not compiled code, but dynamically > scriptable. It's much easier to change on the fly and depends on a CLI > interface which works well with P4TC. The front end is the same as > what we are using for PANDA parser, that is the same parser frontend > (in C code or other) can be compiled into XDP/eBPF, kParser CLI, or > other targets (this is based on establishing a IR which we talked > about in https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers That seems like a technicality? A BPF-based parser can also be driven by maps/tables; or, worst case, can be recompiled and replaced on the fly without any downtime. > Tom > > > > > > > > To answer your question in regards to what the interfaces "P4 > > > speaking" hardware or drivers > > > are going to be programmed, there are discussions going on right now: > > > There is a strong > > > leaning towards devlink for the hardware side loading.... The idea > > > from the driver side is to > > > reuse the tc ndos. > > > We have biweekly meetings which are open. We do have Nvidia folks, but > > > would be great if > > > we can have you there. Let me find the link and send it to you. > > > Do note however, our goal is to get s/w first as per tradition of > > > other offloads with TC . > > > > > cheers, > > > jamal
On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > > > On 01/27, Jamal Hadi Salim wrote: > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > >> There have been many discussions and meetings since about 2015 in > > > regards to > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > specification > > > > >> lingua franca > > > > > > > > > >Which market? > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > >maintainer" hat. > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > It's pure sw implementation. > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > you need the SW implementation... > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > s/w equivalent_. > > > We enforced this rule multiple times (as you know). > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > pushing that > > > first. Regardless, it has value on its own merit: > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > in the same spirit as u32 and pedit), > > > by programming the kernel datapath without changing any kernel code. > > > > Not to derail too much, but maybe you can clarify the following for me: > > In my (in)experience, P4 is usually constrained by the vendor > > specific extensions. So how real is that goal where we can have a generic > > P4@TC with an option to offload? In my view, the reality (at least > > currently) is that there are NIC-specific P4 programs which won't have > > a chance of running generically at TC (unless we implement those vendor > > extensions). > > We are going to implement all the PSA/PNA externs. Most of these > programs tend to > be set or ALU operations on headers or metadata which we can handle. > Do you have > any examples of NIC-vendor-specific features that cant be generalized? I don't think I can share more without giving away something that I shouldn't give away :-) But IIUC, and I might be missing something, it's totally within the standard for vendors to differentiate and provide non-standard 'extern' extensions. I'm mostly wondering what are your thoughts on this. If I have a p4 program depending on one of these externs, we can't sw-emulate it unless we also implement the extension. Are we gonna ask NICs that have those custom extensions to provide a SW implementation as well? Or are we going to prohibit vendors to differentiate that way? > > And regarding custom parser, someone has to ask that 'what about bpf > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > usermode helper to transparently compile it to bpf (for SW path) instead > > inventing yet another packet parser? 
Wrestling with the verifier won't be > > easy here, but I trust it more than this new kParser. > > > > We dont compile anything, the parser (and rest of infra) is scriptable. As I've replied to Tom, that seems like a technicality. BPF programs can also be scriptable with some maps/tables. Or it can be made to look like "scriptable" by recompiling it on every configuration change and updating it on the fly. Or am I missing something? Can we have a P4TC frontend and whenever configuration is updated, we upcall into userspace to compile this whatever p4 representation into whatever bpf bytecode that we then run. No new/custom/scriptable parsers needed. > cheers, > jamal
On Fri, Jan 27, 2023 at 4:47 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Fri, Jan 27, 2023 at 3:06 PM Tom Herbert <tom@sipanda.io> wrote: > > > > On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote: > > > > > > On 01/27, Jamal Hadi Salim wrote: > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > > >> There have been many discussions and meetings since about 2015 in > > > > regards to > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > > specification > > > > > >> lingua franca > > > > > > > > > > > >Which market? > > > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > > >maintainer" hat. > > > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > > It's pure sw implementation. > > > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > > you need the SW implementation... > > > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > > s/w equivalent_. > > > > We enforced this rule multiple times (as you know). > > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > > pushing that > > > > first. Regardless, it has value on its own merit: > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > > in the same spirit as u32 and pedit), > > > > by programming the kernel datapath without changing any kernel code. > > > > > > Not to derail too much, but maybe you can clarify the following for me: > > > In my (in)experience, P4 is usually constrained by the vendor > > > specific extensions. So how real is that goal where we can have a generic > > > P4@TC with an option to offload? In my view, the reality (at least > > > currently) is that there are NIC-specific P4 programs which won't have > > > a chance of running generically at TC (unless we implement those vendor > > > extensions). > > > > > > And regarding custom parser, someone has to ask that 'what about bpf > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > > usermode helper to transparently compile it to bpf (for SW path) instead > > > inventing yet another packet parser? Wrestling with the verifier won't be > > > easy here, but I trust it more than this new kParser. > > > > Yes, wrestling with the verifier is tricky, however we do have a > > solution to compile arbitrarily complex parsers into eBFP. We > > presented this work at Netdev 0x15 > > https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser. > > Thanks Tom, I'll check it out. 
I've yet to go through the netdev recordings :-( > > > Of course this has the obvious advantage that we don't have to change > > the kernel (however, as we talk about in the presentation, this method > > actually produces a faster more extensible parser than flow dissector, > > so it's still on my radar to replace flow dissector itself with an > > eBPF parser :-) ) > > Since there is already a bpf flow dissector, I'm assuming you're > talking about replacing the existing C flow dissector with a > PANDA-based one? Yes > I was hoping that at some point, we can have a BPF flow dissector > program that supports everything the existing C-one does, and maybe we > can ship this program with the kernel and load it by default. Yes, we have that. Actually, we can provide a superset to include things like TCP options which flow dissector doesn't support > We can > keep the C-based one for some minimal non-bpf configurations. But idk, > the benefit is not 100% clear to me; except maybe bpf-based flow > dissector can be treated as more "secure" due to all verifier > constraints... Not just more secure, more robust and extensible. I call flow dissector the "function we love to hate". On one hand it has proven to be incredibly useful, on the other hand it's been a major pain to maintain and isn't remotely extensible. We have seen many problems over the years, particularly when people have added support for less common protocols. Collapsing all the protocol layers, ensuring that the bookkeeping is correct, and trying to maintain some reasonable level of performance has led to it being spaghetti code (I wrote the first instantiation of flow dissector for RPS, so I accept my fair share of blame for the predicament of flow dissector :-) ). The optimized eBPF code we're generating also qualifies as spaghetti code (i.e. a whole bunch of loop unrolling, inlining tables, and so on). The difference is that the front end code in PANDA-C, is well organized and abstracts out all the bookkeeping so that the programmer doesn't have to worry about it. > > > The value of kParser is that it is not compiled code, but dynamically > > scriptable. It's much easier to change on the fly and depends on a CLI > > interface which works well with P4TC. The front end is the same as > > what we are using for PANDA parser, that is the same parser frontend > > (in C code or other) can be compiled into XDP/eBPF, kParser CLI, or > > other targets (this is based on establishing a IR which we talked > > about in https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers > > That seems like a technicality? A BPF-based parser can also be driven > by maps/tables; or, worst case, can be recompiled and replaced on the > fly without any downtime. Perhaps. Also, in the spirit of full transparency, kParser is in its nature interpreted, so we have to expect that it will have lower performance than an optimized compiled parser. Tom > > > > Tom > > > > > > > > > > > > To answer your question in regards to what the interfaces "P4 > > > > speaking" hardware or drivers > > > > are going to be programmed, there are discussions going on right now: > > > > There is a strong > > > > leaning towards devlink for the hardware side loading.... The idea > > > > from the driver side is to > > > > reuse the tc ndos. > > > > We have biweekly meetings which are open. We do have Nvidia folks, but > > > > would be great if > > > > we can have you there. Let me find the link and send it to you. 
> > > > Do note however, our goal is to get s/w first as per tradition of > > > > other offloads with TC . > > > > > > > cheers, > > > > jamal
P4 is definitely the language of choice for defining a dataplane in HW
for IPUs/DPUs/FNICs and switches. As a vendor I can definitely say that
the smart devices implement a very programmable ASIC, as each customer
dataplane differs quite a bit, and P4 is the language of choice for
specifying the dataplane definitions. A lot of customers deploy
proprietary protocols that run in HW and there is no good way right now
in the kernel to support these proprietary protocols. If we enable these
protocols in the kernel it takes a huge effort and they don't evolve
well. Being able to define in P4 and offload into HW using the tc
mechanism really helps in supporting the customer's dataplane and
protocols without having to wait months and years to get the kernel
updated.

Here is a link to our IPU offering that is P4 programmable:
https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html

Here are some other useful links: https://ipdk.io/

Anjali

-----Original Message-----
From: Jamal Hadi Salim <hadi@mojatatu.com>
Sent: Friday, January 27, 2023 11:43 AM
To: Jakub Kicinski <kuba@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com>
Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC

On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote: > > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: [..] > > Network programmability involving hardware - where at minimal the > > specification of the datapath is in P4 and often the implementation > > is. For samples of specification using P4 (that are public) see for > > example MS Azure: > > https://github.com/sonic-net/DASH/tree/main/dash-pipeline > > That's an IPU thing? > Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays a role there. > > If you are a vendor and want to sell a NIC in that space, the spec > > you get is in P4. > > s/NIC/IPU/ ? I do believe that one can write a P4 program to express things a regular NIC could express that may be harder to expose with current interfaces. > > Your underlying hardware > > doesnt have to be P4 native, but at minimal the abstraction (as we > > are trying to provide with P4TC) has to be able to consume the P4 > > specification. > > P4 is certainly an option, especially for specs, but I haven't seen > much adoption myself. The xPU market outside of hyper-scalers is emerging now. Hyperscalers looking at xPUs are looking at P4 as the datapath language - that sets the trend forward to large enterprises. That's my experience. Some of the vendors on the Cc should be able to point to adoption. Anjali? Matty? > What's the benefit / use case? Of P4 or xPUs? Unified approach to standardize how a datapath is defined is a value for P4. Providing a singular abstraction via the kernel (as opposed to every vendor pitching their API) is what the kernel brings.
> > For implementations where P4 is in use, there are many - some public > > others not, sample space: > > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runt > > ime-to-build-smart-networks > > Hyper-scaler proprietary. The control abstraction (P4 runtime) is certainly not proprietary. The datapath that is targetted by the runtime is. Hopefully we can fix that with P4TC. The majority of the discussions i have with some of the folks who do kernel bypass have one theme in common: The kernel process is just too long. Trying to add one feature to flower could take anywhere from 6 months to 3 years to finally show up in some supported distro. With P4TC we are taking the approach of scriptability to allow for speacilized datapaths (which P4 excels in). The google datapath maybe proprietary while their hardware may even(or not) be using native P4 - but the important detail is we have _a way_ to abstract those datapaths. > > There are NICs and switches which are P4 native in the market. > > Link to docs? > Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The point is to bring them together under the linux umbrella. > > IOW, there is beacoup $ investment in this space that makes it worth pursuing. > > Pursuing $ is good! But the community IMO should maximize a different > function. While I agree $ is not the primary motivator it is a factor, it is a good indicator. No different than the network stack being tweaked to do certain things that certain hyperscalers need because they invest $. I have no problems with a large harmonious tent. cheers, jamal > > TC is the kernel offload mechanism that has gathered deployment > > experience over many years - hence P4TC. > > I don't wanna argue. I thought it'd be more fair towards you if I made > my lack of conviction known, rather than sit quiet and ignore it since > it's just an RFC.
On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > > > > > On 01/27, Jamal Hadi Salim wrote: > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > > >> There have been many discussions and meetings since about 2015 in > > > > regards to > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > > specification > > > > > >> lingua franca > > > > > > > > > > > >Which market? > > > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > > >maintainer" hat. > > > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > > It's pure sw implementation. > > > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > > you need the SW implementation... > > > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > > s/w equivalent_. > > > > We enforced this rule multiple times (as you know). > > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > > pushing that > > > > first. Regardless, it has value on its own merit: > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > > in the same spirit as u32 and pedit), > > > > by programming the kernel datapath without changing any kernel code. > > > > > > Not to derail too much, but maybe you can clarify the following for me: > > > In my (in)experience, P4 is usually constrained by the vendor > > > specific extensions. So how real is that goal where we can have a generic > > > P4@TC with an option to offload? In my view, the reality (at least > > > currently) is that there are NIC-specific P4 programs which won't have > > > a chance of running generically at TC (unless we implement those vendor > > > extensions). > > > > We are going to implement all the PSA/PNA externs. Most of these > > programs tend to > > be set or ALU operations on headers or metadata which we can handle. > > Do you have > > any examples of NIC-vendor-specific features that cant be generalized? > > I don't think I can share more without giving away something that I > shouldn't give away :-) > But IIUC, and I might be missing something, it's totally within the > standard for vendors to differentiate and provide non-standard > 'extern' extensions. > I'm mostly wondering what are your thoughts on this. If I have a p4 > program depending on one of these externs, we can't sw-emulate it > unless we also implement the extension. Are we gonna ask NICs that > have those custom extensions to provide a SW implementation as well? > Or are we going to prohibit vendors to differentiate that way? 
> > > > And regarding custom parser, someone has to ask that 'what about bpf > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > > usermode helper to transparently compile it to bpf (for SW path) instead > > > inventing yet another packet parser? Wrestling with the verifier won't be > > > easy here, but I trust it more than this new kParser. > > > > > > > We dont compile anything, the parser (and rest of infra) is scriptable. > > As I've replied to Tom, that seems like a technicality. BPF programs > can also be scriptable with some maps/tables. Or it can be made to > look like "scriptable" by recompiling it on every configuration change > and updating it on the fly. Or am I missing something? > > Can we have a P4TC frontend and whenever configuration is updated, we > upcall into userspace to compile this whatever p4 representation into > whatever bpf bytecode that we then run. No new/custom/scriptable > parsers needed. I would also think that if we need another programmable component in the kernel, that this would be based on BPF, and compiled outside the kernel. Is the argument for an explicit TC objects API purely that this API can be passed through to hardware, as well as implemented in the kernel directly? Something that would be lost if the datapath is implement as a single BPF program at the TC hook. Can you elaborate some more why this needs yet another in-kernel parser separate from BPF? The flow dissection case is solved fine by the BPF flow dissector. (I also hope one day the kernel can load a BPF dissector by default and we avoid the majority of the unsafe C code entirely.)
On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > > > > > On 01/27, Jamal Hadi Salim wrote: > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: [..] > > > Not to derail too much, but maybe you can clarify the following for me: > > > In my (in)experience, P4 is usually constrained by the vendor > > > specific extensions. So how real is that goal where we can have a generic > > > P4@TC with an option to offload? In my view, the reality (at least > > > currently) is that there are NIC-specific P4 programs which won't have > > > a chance of running generically at TC (unless we implement those vendor > > > extensions). > > > > We are going to implement all the PSA/PNA externs. Most of these > > programs tend to > > be set or ALU operations on headers or metadata which we can handle. > > Do you have > > any examples of NIC-vendor-specific features that cant be generalized? > > I don't think I can share more without giving away something that I > shouldn't give away :-) Fair enough. > But IIUC, and I might be missing something, it's totally within the > standard for vendors to differentiate and provide non-standard > 'extern' extensions. > I'm mostly wondering what are your thoughts on this. If I have a p4 > program depending on one of these externs, we can't sw-emulate it > unless we also implement the extension. Are we gonna ask NICs that > have those custom extensions to provide a SW implementation as well? > Or are we going to prohibit vendors to differentiate that way? > It will dilute the value to prohibit any extern. What you referred to as "differentiation" is most of the time just implementation differences i.e someone may use a TCAM vs SRAM or some specific hw to implement crypto foobar; however, the "signature" of the extern is no different in its abstraction than an action. IOW, an Input X would produce an output Y in an extern regardless of the black box implementation. I understand the cases where some vendor may have some ASIC features that noone else cares about and that said functions can be exposed as externs. We really dont want these to be part of kernel proper. In our templating above would mean using the command abstraction to create the extern. There are three threads: 1) PSA/PNA externs like crc, checksums, hash etc. Those are part of P4TC as template commands. They are defined in the generic spec, they are not vendor specific and for almost all cases there's already kernel code that implements their features. So we will make them accessible to P4 programs. Vendor specific - we dont want them to be part of P4TC and we provide two ways to address them. 2) We can emulate them without offering the equivalent functionality just so someone can load a P4 program. This will work with P4TC as is today but it means for that extern you dont have functional equivalence to hardware. 3) Commands, to be specific for externs can be written as kernel modules. It's not my favorite option since we want everything to be scriptable but it is an option available. 
cheers, jamal > > > And regarding custom parser, someone has to ask that 'what about bpf > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > > usermode helper to transparently compile it to bpf (for SW path) instead > > > inventing yet another packet parser? Wrestling with the verifier won't be > > > easy here, but I trust it more than this new kParser. > > > > > > > We dont compile anything, the parser (and rest of infra) is scriptable. > > As I've replied to Tom, that seems like a technicality. BPF programs > can also be scriptable with some maps/tables. Or it can be made to > look like "scriptable" by recompiling it on every configuration change > and updating it on the fly. Or am I missing something? > > Can we have a P4TC frontend and whenever configuration is updated, we > upcall into userspace to compile this whatever p4 representation into > whatever bpf bytecode that we then run. No new/custom/scriptable > parsers needed. > > > cheers, > > jamal
On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote: > > On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > > > > > > > On 01/27, Jamal Hadi Salim wrote: > > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > > > >> There have been many discussions and meetings since about 2015 in > > > > > regards to > > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > > > specification > > > > > > >> lingua franca > > > > > > > > > > > > > >Which market? > > > > > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > > > >maintainer" hat. > > > > > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > > > It's pure sw implementation. > > > > > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > > > you need the SW implementation... > > > > > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > > > s/w equivalent_. > > > > > We enforced this rule multiple times (as you know). > > > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > > > pushing that > > > > > first. Regardless, it has value on its own merit: > > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > > > in the same spirit as u32 and pedit), > > > > > by programming the kernel datapath without changing any kernel code. > > > > > > > > Not to derail too much, but maybe you can clarify the following for me: > > > > In my (in)experience, P4 is usually constrained by the vendor > > > > specific extensions. So how real is that goal where we can have a generic > > > > P4@TC with an option to offload? In my view, the reality (at least > > > > currently) is that there are NIC-specific P4 programs which won't have > > > > a chance of running generically at TC (unless we implement those vendor > > > > extensions). > > > > > > We are going to implement all the PSA/PNA externs. Most of these > > > programs tend to > > > be set or ALU operations on headers or metadata which we can handle. > > > Do you have > > > any examples of NIC-vendor-specific features that cant be generalized? > > > > I don't think I can share more without giving away something that I > > shouldn't give away :-) > > But IIUC, and I might be missing something, it's totally within the > > standard for vendors to differentiate and provide non-standard > > 'extern' extensions. > > I'm mostly wondering what are your thoughts on this. If I have a p4 > > program depending on one of these externs, we can't sw-emulate it > > unless we also implement the extension. 
Are we gonna ask NICs that > > have those custom extensions to provide a SW implementation as well? > > Or are we going to prohibit vendors to differentiate that way? > > > > > > And regarding custom parser, someone has to ask that 'what about bpf > > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > > > usermode helper to transparently compile it to bpf (for SW path) instead > > > > inventing yet another packet parser? Wrestling with the verifier won't be > > > > easy here, but I trust it more than this new kParser. > > > > > > > > > > We dont compile anything, the parser (and rest of infra) is scriptable. > > > > As I've replied to Tom, that seems like a technicality. BPF programs > > can also be scriptable with some maps/tables. Or it can be made to > > look like "scriptable" by recompiling it on every configuration change > > and updating it on the fly. Or am I missing something? > > > > Can we have a P4TC frontend and whenever configuration is updated, we > > upcall into userspace to compile this whatever p4 representation into > > whatever bpf bytecode that we then run. No new/custom/scriptable > > parsers needed. > > I would also think that if we need another programmable component in > the kernel, that this would be based on BPF, and compiled outside the > kernel. > > Is the argument for an explicit TC objects API purely that this API > can be passed through to hardware, as well as implemented in the > kernel directly? Something that would be lost if the datapath is > implement as a single BPF program at the TC hook. > We use the skip_sw and skip_hw knobs in tc to indicate whether a policy is targeting hw or sw. Not sure if you are familiar with it but it's been around (and deployed) for a few years now. So a P4 program policy can target either. In regards to the parser - we need a scriptable parser, which is offered by kparser in the kernel. P4 doesn't describe how to offload the parser, just the matches and actions; however, as Tom alluded, there's nothing that prevents us from offering the same tc controls to offload the parser or pieces of it. cheers, jamal > Can you elaborate some more why this needs yet another in-kernel > parser separate from BPF? The flow dissection case is solved fine by > the BPF flow dissector. (I also hope one day the kernel can load a BPF > dissector by default and we avoid the majority of the unsafe C code > entirely.)
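For readers less familiar with the skip_sw/skip_hw knobs referenced above, the per-rule split in the existing flower classifier looks roughly like the following sketch (the device name and addresses are placeholders):

  # offload-only rule: installation fails if the NIC cannot take it
  tc qdisc add dev eth0 ingress
  tc filter add dev eth0 ingress protocol ip flower skip_sw dst_ip 192.0.2.1 action drop
  # software-only rule: never pushed to the NIC
  tc filter add dev eth0 ingress protocol ip flower skip_hw dst_ip 192.0.2.2 action drop
  # with neither flag the kernel installs the rule in software and offloads it where possible

Per the reply above, this is the same per-policy knob a P4TC program would reuse to target hw or sw.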
On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote: > > > > On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote: > > > > > > > > > > On 01/27, Jamal Hadi Salim wrote: > > > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote: > > > > > > > > > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote: > > > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > > > > > > >> There have been many discussions and meetings since about 2015 in > > > > > > regards to > > > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath > > > > > > specification > > > > > > > >> lingua franca > > > > > > > > > > > > > > > >Which market? > > > > > > > > > > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong, > > > > > > > >and practical reasons to merge this. Speaking with my "have suffered > > > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior > > > > > > > >maintainer" hat. > > > > > > > > > > > > > > You talk about offload, yet I don't see any offload code in this RFC. > > > > > > > It's pure sw implementation. > > > > > > > > > > > > > > But speaking about offload, how exactly do you plan to offload this > > > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate > > > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver? > > > > > > > If HW offload offload is the motivation for this RFC work and we cannot > > > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do > > > > > > > you need the SW implementation... > > > > > > > > > > > Our rule in TC is: _if you want to offload using TC you must have a > > > > > > s/w equivalent_. > > > > > > We enforced this rule multiple times (as you know). > > > > > > P4TC has a sw equivalent to whatever the hardware would do. We are > > > > > > pushing that > > > > > > first. Regardless, it has value on its own merit: > > > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation > > > > > > in the same spirit as u32 and pedit), > > > > > > by programming the kernel datapath without changing any kernel code. > > > > > > > > > > Not to derail too much, but maybe you can clarify the following for me: > > > > > In my (in)experience, P4 is usually constrained by the vendor > > > > > specific extensions. So how real is that goal where we can have a generic > > > > > P4@TC with an option to offload? In my view, the reality (at least > > > > > currently) is that there are NIC-specific P4 programs which won't have > > > > > a chance of running generically at TC (unless we implement those vendor > > > > > extensions). > > > > > > > > We are going to implement all the PSA/PNA externs. Most of these > > > > programs tend to > > > > be set or ALU operations on headers or metadata which we can handle. > > > > Do you have > > > > any examples of NIC-vendor-specific features that cant be generalized? > > > > > > I don't think I can share more without giving away something that I > > > shouldn't give away :-) > > > But IIUC, and I might be missing something, it's totally within the > > > standard for vendors to differentiate and provide non-standard > > > 'extern' extensions. 
> > > I'm mostly wondering what are your thoughts on this. If I have a p4 > > > program depending on one of these externs, we can't sw-emulate it > > > unless we also implement the extension. Are we gonna ask NICs that > > > have those custom extensions to provide a SW implementation as well? > > > Or are we going to prohibit vendors to differentiate that way? > > > > > > > > And regarding custom parser, someone has to ask that 'what about bpf > > > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like > > > > > usermode helper to transparently compile it to bpf (for SW path) instead > > > > > inventing yet another packet parser? Wrestling with the verifier won't be > > > > > easy here, but I trust it more than this new kParser. > > > > > > > > > > > > > We dont compile anything, the parser (and rest of infra) is scriptable. > > > > > > As I've replied to Tom, that seems like a technicality. BPF programs > > > can also be scriptable with some maps/tables. Or it can be made to > > > look like "scriptable" by recompiling it on every configuration change > > > and updating it on the fly. Or am I missing something? > > > > > > Can we have a P4TC frontend and whenever configuration is updated, we > > > upcall into userspace to compile this whatever p4 representation into > > > whatever bpf bytecode that we then run. No new/custom/scriptable > > > parsers needed. > > > > I would also think that if we need another programmable component in > > the kernel, that this would be based on BPF, and compiled outside the > > kernel. > > > > Is the argument for an explicit TC objects API purely that this API > > can be passed through to hardware, as well as implemented in the > > kernel directly? Something that would be lost if the datapath is > > implement as a single BPF program at the TC hook. > > > > We use the skip_sw and skip_hw knobs in tc to indicate whether a > policy is targeting hw or sw. Not sure if you are familiar with it but its > been around (and deployed) for a few years now. So a P4 program > policy can target either. I know. So the only reason the kernel ABI needs to be extended with P4 objects is to be able to pass the same commands to hardware. The whole kernel dataplane could be implemented as a BPF program, correct? > In regards to the parser - we need a scriptable parser which is offered > by kparser in kernel. P4 doesnt describe how to offload the parser > just the matches and actions; however, as Tom alluded there's nothing > that obstructs us offer the same tc controls to offload the parser or pieces > of it. And this is the only reason that the parser needs to be in the kernel. Because the API is at the kernel ABI level. If the P4 program is compiled to BPF in userspace, then the parser would be compiled in userspace too. A preferable option, as it would not require adding yet another parser in C in the kernel. I understand the value of PANDA as a high level declarative language to describe network protocols. I'm just trying to get more explicit why compilation from PANDA to BPF is not sufficient for your use-case. > cheers, > jamal > > > Can you elaborate some more why this needs yet another in-kernel > > parser separate from BPF? The flow dissection case is solved fine by > > the BPF flow dissector. (I also hope one day the kernel can load a BPF > > dissector by default and we avoid the majority of the unsafe C code > > entirely.)
On Fri, Jan 27, 2023 at 5:34 PM Singhai, Anjali <anjali.singhai@intel.com> wrote: > > P4 is definitely the language of choice for defining a Dataplane in HW for IPUs/DPUs/FNICs and Switches. As a vendor I can definitely say that the smart devices implement a very programmable ASIC as each customer Dataplane defers quite a bit and P4 is the language of choice for specifying the Dataplane definitions. A lot of customer deploy proprietary protocols that run in HW and there is no good way right now in kernel to support these proprietary protcols. If we enable these protocol in the kernel it takes a huge effort and they don’t evolve well. > Being able to define in P4 and offload into HW using tc mechanism really helps in supporting the customer's Dataplane and protcols without having to wait months and years to get the kernel updated. Here is a link to our IPU offering that is P4 programmable Anjali, P4 may be the language of choice for programming HW datapath, however it's not the language of choice for programming SW datapaths-- that's C over XDP/eBPF. And while XDP/eBPF also doesn't depend on kernel updates, it has a major advantage over P4 in that it doesn't require fancy hardware either. Even at full data center deployment of P4 devices, there will be at least an order of magnitude more deployment of SW programmed datapaths; and unless someone is using P4 hardware, there's zero value in rewriting programs in P4 instead of C. IMO, we will never see networking developers moving to P4 en masse-- P4 will always be a niche market relative to the programmable datapath space and the skill sets required to support serious scalable deployment. That being said, there will be a nontrivial contingent of users who need to run the same programs in both SW and HW environments. Expecting them to maintain two very different code bases to support two disparate models is costly and prohibitive to them. So for their benefit, we need a solution to reconcile these two models. P4TC is one means to accomplish that. We want to consider both the permutations: 1) compile C code to run in P4 hardware 2) compile P4 to run in SW. If we establish a common IR, then we can generalize the problem: programmer writes their datapath in the language of their choosing (P4, C, Python, Rust, etc.), they compile the program to whatever backend they are using (HW, SW, XDP/eBPF, etc.). The P4TC CLI serves as one such IR as there's nothing that prevents someone from compiling a program from another language to the CLI (for instance, we've implemented the compiler to output the parser CLI from PANDA-C). The CLI natively runs in kernel SW, and with the right hooks could be offloaded to HW-- not just P4 hardware but potentially other hardware targets as well. 
Tom > https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html > Here are some other useful links > https://ipdk.io/ > > Anjali > > -----Original Message----- > From: Jamal Hadi Salim <hadi@mojatatu.com> > Sent: Friday, January 27, 2023 11:43 AM > To: Jakub Kicinski <kuba@kernel.org> > Cc: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com> > Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC > > On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote: > > > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote: > > [..] > > > Network programmability involving hardware - where at minimal the > > > specification of the datapath is in P4 and often the implementation > > > is. For samples of specification using P4 (that are public) see for > > > example MS Azure: > > > https://github.com/sonic-net/DASH/tree/main/dash-pipeline > > > > That's an IPU thing? > > > > Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays a role there. > > > > If you are a vendor and want to sell a NIC in that space, the spec > > > you get is in P4. > > > > s/NIC/IPU/ ? > > I do believe that one can write a P4 program to express things a regular NIC could express that may be harder to expose with current interfaces. > > > > Your underlying hardware > > > doesnt have to be P4 native, but at minimal the abstraction (as we > > > are trying to provide with P4TC) has to be able to consume the P4 > > > specification. > > > > P4 is certainly an option, especially for specs, but I haven't seen > > much adoption myself. > > The xPU market outside of hyper-scalers is emerging now. Hyperscalers looking at xPUs are looking at P4 as the datapath language - that sets the trend forward to large enterprises. > That's my experience. > Some of the vendors on the Cc should be able to point to adoption. > Anjali? Matty? > > > What's the benefit / use case? > > Of P4 or xPUs? > Unified approach to standardize how a datapath is defined is a value for P4. > Providing a singular abstraction via the kernel (as opposed to every vendor pitching their API) is what the kernel brings. > > > > For implementations where P4 is in use, there are many - some public > > > others not, sample space: > > > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runt > > > ime-to-build-smart-networks > > > > Hyper-scaler proprietary. > > The control abstraction (P4 runtime) is certainly not proprietary. > The datapath that is targetted by the runtime is. > Hopefully we can fix that with P4TC. > The majority of the discussions i have with some of the folks who do kernel bypass have one theme in common: > The kernel process is just too long. Trying to add one feature to flower could take anywhere from 6 months to 3 years to finally show up in some supported distro. 
With P4TC we are taking the approach of scriptability to allow for speacilized datapaths (which P4 excels in). The google datapath maybe proprietary while their hardware may even(or not) be using native P4 - but the important detail is we have _a way_ to abstract those datapaths. > > > > There are NICs and switches which are P4 native in the market. > > > > Link to docs? > > > > Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The point is to bring them together under the linux umbrella. > > > > IOW, there is beacoup $ investment in this space that makes it worth pursuing. > > > > Pursuing $ is good! But the community IMO should maximize a different > > function. > > While I agree $ is not the primary motivator it is a factor, it is a good indicator. No different than the network stack being tweaked to do certain things that certain hyperscalers need because they invest $. > I have no problems with a large harmonious tent. > > cheers, > jamal > > > > TC is the kernel offload mechanism that has gathered deployment > > > experience over many years - hence P4TC. > > > > I don't wanna argue. I thought it'd be more fair towards you if I made > > my lack of conviction known, rather than sit quiet and ignore it since > > it's just an RFC.
On Sat, 28 Jan 2023 13:17:35 -0800 Tom Herbert <tom@herbertland.com> wrote: > On Fri, Jan 27, 2023 at 5:34 PM Singhai, Anjali > <anjali.singhai@intel.com> wrote: > > > > P4 is definitely the language of choice for defining a Dataplane in HW for IPUs/DPUs/FNICs and Switches. As a vendor I can definitely say that the smart devices implement a very programmable ASIC as each customer Dataplane defers quite a bit and P4 is the language of choice for specifying the Dataplane definitions. A lot of customer deploy proprietary protocols that run in HW and there is no good way right now in kernel to support these proprietary protcols. If we enable these protocol in the kernel it takes a huge effort and they don’t evolve well. > > Being able to define in P4 and offload into HW using tc mechanism really helps in supporting the customer's Dataplane and protcols without having to wait months and years to get the kernel updated. Here is a link to our IPU offering that is P4 programmable > > Anjali, > > P4 may be the language of choice for programming HW datapath, however > it's not the language of choice for programming SW datapaths-- that's > C over XDP/eBPF. And while XDP/eBPF also doesn't depend on kernel > updates, it has a major advantage over P4 in that it doesn't require > fancy hardware either. > > Even at full data center deployment of P4 devices, there will be at > least an order of magnitude more deployment of SW programmed > datapaths; and unless someone is using P4 hardware, there's zero value > in rewriting programs in P4 instead of C. IMO, we will never see > networking developers moving to P4 en masse-- P4 will always be a > niche market relative to the programmable datapath space and the skill > sets required to support serious scalable deployment. That being said, > there will be a nontrivial contingent of users who need to run the > same programs in both SW and HW environments. Expecting them to > maintain two very different code bases to support two disparate models > is costly and prohibitive to them. So for their benefit, we need a > solution to reconcile these two models. P4TC is one means to > accomplish that. > > We want to consider both the permutations: 1) compile C code to run in > P4 hardware 2) compile P4 to run in SW. If we establish a common IR, > then we can generalize the problem: programmer writes their datapath > in the language of their choosing (P4, C, Python, Rust, etc.), they > compile the program to whatever backend they are using (HW, SW, > XDP/eBPF, etc.). The P4TC CLI serves as one such IR as there's nothing > that prevents someone from compiling a program from another language > to the CLI (for instance, we've implemented the compiler to output the > parser CLI from PANDA-C). The CLI natively runs in kernel SW, and with > the right hooks could be offloaded to HW-- not just P4 hardware but > potentially other hardware targets as well. Rather than more kernel network software, if instead this was targeting userspace or eBPF for the SW version; then there would be less exposed security risk and also less long term technical debt here.
Willem de Bruijn wrote: > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > [...] > > > > > > I would also think that if we need another programmable component in > > > the kernel, that this would be based on BPF, and compiled outside the > > > kernel. > > > > > > Is the argument for an explicit TC objects API purely that this API > > > can be passed through to hardware, as well as implemented in the > > > kernel directly? Something that would be lost if the datapath is > > > implement as a single BPF program at the TC hook. > > > > > > > We use the skip_sw and skip_hw knobs in tc to indicate whether a > > policy is targeting hw or sw. Not sure if you are familiar with it but its > > been around (and deployed) for a few years now. So a P4 program > > policy can target either. > > I know. So the only reason the kernel ABI needs to be extended with P4 > objects is to be able to pass the same commands to hardware. The whole > kernel dataplane could be implemented as a BPF program, correct? > > > In regards to the parser - we need a scriptable parser which is offered > > by kparser in kernel. P4 doesnt describe how to offload the parser > > just the matches and actions; however, as Tom alluded there's nothing > > that obstructs us offer the same tc controls to offload the parser or pieces > > of it. > > And this is the only reason that the parser needs to be in the kernel. > Because the API is at the kernel ABI level. If the P4 program is compiled > to BPF in userspace, then the parser would be compiled in userspace > too. A preferable option, as it would not require adding yet another > parser in C in the kernel. Also there already exists a P4 backend that targets BPF. https://github.com/p4lang/p4c So as a SW object we can just do the P4 compilation step in user space and run it in BPF as suggested. Then for hw offload we really would need to see some hardware to have any concrete ideas on how to make it work. Also P4 defines a runtime API so would be good to see how all that works with any proposed offload. > > I understand the value of PANDA as a high level declarative language > to describe network protocols. I'm just trying to get more explicit > why compilation from PANDA to BPF is not sufficient for your use-case. > > > > cheers, > > jamal > > > > > Can you elaborate some more why this needs yet another in-kernel > > > parser separate from BPF? The flow dissection case is solved fine by > > > the BPF flow dissector. (I also hope one day the kernel can load a BPF > > > dissector by default and we avoid the majority of the unsafe C code > > > entirely.) >
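For reference, the flow suggested here with the upstream p4c eBPF backend looks roughly like the sketch below (file and device names are placeholders; the exact include paths and the generated section name depend on the p4c version in use):

  # generate C from the P4 program with p4c's eBPF backend
  p4c-ebpf prog.p4 -o prog.c
  # compile the generated C to a BPF object (add -I for the p4c-provided ebpf runtime headers)
  clang -O2 -target bpf -c prog.c -o prog.o
  # attach the result at the TC hook; the section name depends on the generated code
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress bpf da obj prog.o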
On Sat, Jan 28, 2023 at 10:33 AM Willem de Bruijn <willemb@google.com> wrote: > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote: > > > > [..] > > We use the skip_sw and skip_hw knobs in tc to indicate whether a > > policy is targeting hw or sw. Not sure if you are familiar with it but its > > been around (and deployed) for a few years now. So a P4 program > > policy can target either. > > I know. So the only reason the kernel ABI needs to be extended with P4 > objects is to be able to pass the same commands to hardware. The whole > kernel dataplane could be implemented as a BPF program, correct? > It's more than an ABI (although that is important as well). It is about reuse of the infra which provides a transparent symbiosis between hardware offload and software that has matured over time: for example, you can take a pipeline or a table or actions (lately) and split them between hardware and software transparently, etc. To reiterate, we are reusing and plugging into a proven and deployed mechanism which enables our goal (of HW + SW scripting of arbitrary P4-enabled datapaths which are functionally equivalent). > > In regards to the parser - we need a scriptable parser which is offered > > by kparser in kernel. P4 doesnt describe how to offload the parser > > just the matches and actions; however, as Tom alluded there's nothing > > that obstructs us offer the same tc controls to offload the parser or pieces > > of it. > And this is the only reason that the parser needs to be in the kernel. > Because the API is at the kernel ABI level. If the P4 program is compiled > to BPF in userspace, then the parser would be compiled in userspace > too. A preferable option, as it would not require adding yet another > parser in C in the kernel. > Kparser, while based on PANDA, has one important detail to note: it is an infra for creating arbitrary parsers. The infra sits in the kernel and I can create arbitrary parsers with policy scripts. The emphasis is on scriptability. cheers, jamal > I understand the value of PANDA as a high level declarative language > to describe network protocols. I'm just trying to get more explicit > why compilation from PANDA to BPF is not sufficient for your use-case. >
On Sun, Jan 29, 2023 at 12:39 AM John Fastabend <john.fastabend@gmail.com> wrote: > > Willem de Bruijn wrote: > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > [...] > > > Also there already exists a P4 backend that targets BPF. > > https://github.com/p4lang/p4c There's also one based on rust - does that mean we should rewrite our code in rust? Joking aside - rust was a suggestion made at a talk i did. I ended up adding a slide for the next talk which read: Title: So... how is this better than KDE? Attributed to Rusty Russell Who attributes it to Cort Dougan s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g We have very specific goals - of which the most important is met by what works today and we are reusing that. cheers, jamal > So as a SW object we can just do the P4 compilation step in user > space and run it in BPF as suggested. Then for hw offload we really > would need to see some hardware to have any concrete ideas on how > to make it work. > > Also P4 defines a runtime API so would be good to see how all that > works with any proposed offload.
Sorry, John - to answer your question on P4runtime; that runs on top of netlink. Netlink can express a lot more than P4runtime so we are letting it sit in userspace. I could describe the netlink interfaces but easier if you look at the code and ping me privately unless there are more folks interested in that to which i can respond on the list. cheers, jamal On Sun, Jan 29, 2023 at 6:11 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Sun, Jan 29, 2023 at 12:39 AM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Willem de Bruijn wrote: > > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > [...] > > > > > > Also there already exists a P4 backend that targets BPF. > > > > https://github.com/p4lang/p4c > > There's also one based on rust - does that mean we should rewrite our > code in rust? > Joking aside - rust was a suggestion made at a talk i did. I ended up > adding a slide for the next talk which read: > > Title: So... how is this better than KDE? > Attributed to Rusty Russell > Who attributes it to Cort Dougan > s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g > > We have very specific goals - of which the most important is met by > what works today and we are reusing that. > > cheers, > jamal > > > So as a SW object we can just do the P4 compilation step in user > > space and run it in BPF as suggested. Then for hw offload we really > > would need to see some hardware to have any concrete ideas on how > > to make it work. > > > > > > Also P4 defines a runtime API so would be good to see how all that > > works with any proposed offload.
Jamal Hadi Salim <jhs@mojatatu.com> writes: >> > We use the skip_sw and skip_hw knobs in tc to indicate whether a >> > policy is targeting hw or sw. Not sure if you are familiar with it but its >> > been around (and deployed) for a few years now. So a P4 program >> > policy can target either. >> >> I know. So the only reason the kernel ABI needs to be extended with P4 >> objects is to be able to pass the same commands to hardware. The whole >> kernel dataplane could be implemented as a BPF program, correct? >> > > It's more than an ABI (although that is important as well). > It is about reuse of the infra which provides a transparent symbiosis > between hardware offload and software that has matured over time: For > example, you can take a pipeline or a table or actions (lately) and > split them between hardware and software transparently, etc. To > re-iterate, we are reusing and plugging into a proven and deployed > mechanism which enables our goal (of HW + SW scripting of arbitrary > P4-enabled datapaths which are functionally equivalent). But you're doing this in a way that completely ignores the existing ecosystem for creating programmable software datapaths in the kernel (i.e., eBPF/XDP) in favour of adding *yet another* interpreter to the kernel. In particular, completely excluding the XDP from this is misguided. Programmable networking in Linux operates at three layers: - HW: for stuff that's supported and practical there - XDP: software fast-path for high-performance bits that can't go into HW - TC/rest of stack: SW slow path for functional equivalence I can see P4 playing a role as a higher-level data plane definition language even for Linux SW stacks, but let's have it integrate with the full ecosystem, not be its own little island in a corner... -Toke
I am agreeing with you, Tom. P4TC does not restrict the high-level language to be P4; it can be anything as long as it can be compiled to create an IR that can be used to teach/program the SW and the HW, which is what the scriptability of P4TC provides.

Ultimately the devices are evolving as a combination of highly efficient domain-specific architectures and traditional generic cores, and SW in the kernel has to evolve to program them both, in a way that the user can decide whether to run a particular functionality in domain-specific HW or in SW running on general-purpose cores; in some cases the functionality runs in both places and an intelligent (and someday AI-managed) infrastructure controller decides whether a flow should use the HW path or the SW path. There is no other way forward, because a SW dataplane can only provide an overflow region for flows and the HW will have to run the most demanding flows, as the network demand and capacity of the data center keep reaching higher and higher levels.

From a HW vendor's point of view we have already entered the third epoch of computer architecture. A domain-specific architecture still has to be programmable, but for a specific domain; the Linux kernel, which has traditionally remained fixed-function (and fixed-protocol), needs to evolve to support these domain-specific architectures that are protocol and dataplane programmable. I think P4TC is definitely the right way forward. There were arguments made earlier that the big datacenters are already programming these domain-specific architectures from user space; no doubt, but isn't the whole argument for the Linux kernel about democratizing the goodness the HW brings to all, the small users and the big ones?

There is also an argument being made about using eBPF for implementing the SW path; maybe I am missing the part as to how you offload that, if not to another general-purpose core, even if it is not as evolved as current-day Xeons. And we know that even the simplest of the general-purpose cores (for example RISC-V) right now cannot sustain the rate at which the network needs to feed the business logic running on the CPUs or GPUs or TPUs in an economically viable solution. All data points to the fact that network processing running on general-purpose cores eats up more than half of the cores, and that's expensive, because the performance/power-unit math when using an IPU/DPU/SmartNIC for network workloads is so much better than that of a general-purpose core. So I do not see a way forward for eBPF to be offloaded on anything but general-purpose cores, and in the meantime domain-specific programmable ASICs still need to be programmed, as they are the right solution for the economy of scale. Having said that, we do have to find a good solution for P4 externs in SW, and maybe there is room for some helpers (maybe even eBPF), as long as you don't ask me to offload that in HW.
Jamal Hadi Salim wrote: > On Sun, Jan 29, 2023 at 12:39 AM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Willem de Bruijn wrote: > > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > [...] > > > > > > Also there already exists a P4 backend that targets BPF. > > > > https://github.com/p4lang/p4c > > There's also one based on rust - does that mean we should rewrite our > code in rust? > Joking aside - rust was a suggestion made at a talk i did. I ended up > adding a slide for the next talk which read: > > Title: So... how is this better than KDE? > Attributed to Rusty Russell > Who attributes it to Cort Dougan > s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g > > We have very specific goals - of which the most important is met by > what works today and we are reusing that.

OK, I may have missed your goals; I read the cover letter and merely scanned the patches. But, seeing we've chatted about this before, let me put my critique here.

P4TC as a software datapath:

1. We can already run P4 in software with p4c, which compiles into an existing BPF implementation; nothing new is needed. If we object to the p4c implementation there are others (VMware has one for XDP), or feel free to write any other DSL or abstraction over BPF.

2. The 'tc' layer is not going to be as fast as XDP, so without an XDP implementation we can't get the best possible implementation.

3. Happy to admit I don't have data, but I'm not convinced a match-action pipeline is an ideal implementation for software. It is done specifically in HW to facilitate CAMs/TCAMs and other special logic blocks that do not map well to a general-purpose CPU. BPF or other insns are a better abstraction for software.

So I struggle to find upside as a purely software implementation. If you took an XDP P4 backend and then had this implementation showing performance or some other vector where an XDP implementation underperformed, that would be interesting. Then either we would have good reason to try another datapath or...

P4TC as a hardware datapath:

1. We don't have a hardware/driver implementation to review, so it's difficult to even judge whether this is a good idea or not.

2. I imagine most hardware cannot create TCAMs/CAMs out of nothing. So there is a hard problem that I believe is not addressed here around how the user knows their software blob can ever be offloaded at all, how you move to new hw and the blob can continue to work, and so on and so forth.

3. FPGA P4 implementations, as far as I recall, can use P4 to build the pipeline up front. But once it's built, it's not like you would (re)build it or (re)configure it on the fly. So the workflow doesn't align with how I understand these patches.

4. Has any vendor with a Linux driver (maybe not even in kernel yet) open sourced anything that resembles a P4 pipeline? Without this it's again hard to understand what is possible and what vendors will let users do.

P4TC as SW/HW running the same P4:

1. This doesn't need to be done in kernel. If one compiler runs P4 into XDP or TC-BPF that is good, and another compiler runs it into a hw-specific backend. This satisfies having both software and hardware implementations.

Extra commentary: I agree we've been chatting about this for a long time, but until some vendor (Intel?) OSSes and supports a Linux driver and hardware with an open programmable parser and MAT, I'm not sure how we get P4 for Linux users. Does it exist and I missed it?
Thanks, John > > cheers, > jamal > > > So as a SW object we can just do the P4 compilation step in user > > space and run it in BPF as suggested. Then for hw offload we really > > would need to see some hardware to have any concrete ideas on how > > to make it work. > > > > > > Also P4 defines a runtime API so would be good to see how all that > > works with any proposed offload. Yep, agree with your other comment; it's not really important, it can be built on top of Netlink or BPF today.
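As a point of reference for the XDP comparison in the critique above, a BPF object produced by such a backend can in principle be attached at the XDP hook with stock iproute2 rather than at tc (device, file, and section names are placeholders, and this assumes a program built for the XDP hook):

  # attach the compiled object at the XDP hook instead of the tc hook
  ip link set dev eth0 xdp obj prog.o sec xdp
  # detach it again
  ip link set dev eth0 xdp off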
Mon, Jan 30, 2023 at 05:30:17AM CET, john.fastabend@gmail.com wrote: >Jamal Hadi Salim wrote: >> On Sun, Jan 29, 2023 at 12:39 AM John Fastabend >> <john.fastabend@gmail.com> wrote: >> > >> > Willem de Bruijn wrote: >> > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> > > > >> > >> > [...] >> > >> > >> > Also there already exists a P4 backend that targets BPF. >> > >> > https://github.com/p4lang/p4c >> >> There's also one based on rust - does that mean we should rewrite our >> code in rust? >> Joking aside - rust was a suggestion made at a talk i did. I ended up >> adding a slide for the next talk which read: >> >> Title: So... how is this better than KDE? >> Attributed to Rusty Russell >> Who attributes it to Cort Dougan >> s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g >> >> We have very specific goals - of which the most important is met by >> what works today and we are reusing that. > >OK, I may have missed your goals I read the cover letter and merely >scanned the patches. But, seeing we've chatted about this before >let me put my critique here. > >P4TC as a software datapath: > >1. We can already run P4 in software with P4C which compiles into an > existing BPF implementations, nothing new needed. If we object > to p4c implementation there are others (VMWare has one for XDP) > or feel free to write any other DSL or abstraction over BPF. > >2. 'tc' layer is not going to be as fast as XDP so without an XDP > implementation we can't get best possible implementation. > >3. Happy to admit I don't have data, but I'm not convinced a match > action pipeline is an ideal implementation for software. It is > done specifically in HW to facilitate CAMs/TCAMs and other special > logic blocks that do not map well to general purpose CPU. BPF or > other insn are better abstraction for software. > >So I struggle to find upside as a purely software implementation. >If you took an XDP P4 backend and then had this implementation >showing performance or some other vector where a XDP implementation >underperformed that would be interesting. Then either we would have >good reason to try another datapath or > >P4TC as a hardware datapath: > >1. We don't have a hardware/driver implementation to review so its > difficult to even judge if this is a good idea or not. > >2. I imagine most hardware can not create TCAMs/CAMs out of > nothing. So there is a hard problem that I believe is not > addressed here around how user knows their software blob > can ever be offloaded at all. How you move to new hw and > the blob can continue to work so and an so forth. > >3. FPGA P4 implementations as far as I recall can use P4 to build > the pipeline up front. But, once its built its not like you > would (re)build it or (re)configure it on the fly. But the workflow > doesn't align with how I understand these patches. > >4. Has any vendor with a linux driver (maybe not even in kernel yet) > open sourced anything that resembles a P4 pipeline? Without > this its again hard to understand what is possible and what > vendors will let users do. > >P4TC as SW/HW running same P4: > >1. This doesn't need to be done in kernel. If one compiler runs > P4 into XDP or TC-BPF that is good and another compiler runs > it into hw specific backend. This satisifies having both > software and hardware implementation. > >Extra commentary: I agree we've been chatting about this for a long >time but until some vendor (Intel?) will OSS and support a linux >driver and hardware with open programmable parser and MAT. 
I'm not >sure how we get P4 for Linux users. Does it exist and I missed it? John, I think that your summary is quite accurate. Regarding SW implementation, I have to admit I also fail to see the motivation to have a P4-specific datapath instead of having an XDP/eBPF one that could run a P4-compiled program. The only motivation would be if it somehow helps to offload to HW. But can it? Regarding HW implementation: I believe that every HW implementation is very specific, and finding some common intermediate kernel uAPI is probably not possible (correct me if I'm wrong, but that is the impression I'm getting from all parties). Then the only option is to allow userspace to insert a HW-specific blob that is the output of a per-vendor P4 compiler. Now, is it possible to introduce this blob uAPI channel? How should it look? How to enforce limitations so it is not exploited for other purposes as a kernel bypass? > >Thanks, >John > >> >> cheers, >> jamal >> >> > So as a SW object we can just do the P4 compilation step in user >> > space and run it in BPF as suggested. Then for hw offload we really >> > would need to see some hardware to have any concrete ideas on how >> > to make it work. >> > >> >> >> > Also P4 defines a runtime API so would be good to see how all that >> > works with any proposed offload. > >Yep agree with your other comment not really important can be built >on top of Netlink or BPF today.
Jiri Pirko <jiri@resnulli.us> writes: >>P4TC as SW/HW running same P4: >> >>1. This doesn't need to be done in kernel. If one compiler runs >> P4 into XDP or TC-BPF that is good and another compiler runs >> it into hw specific backend. This satisifies having both >> software and hardware implementation. >> >>Extra commentary: I agree we've been chatting about this for a long >>time but until some vendor (Intel?) will OSS and support a linux >>driver and hardware with open programmable parser and MAT. I'm not >>sure how we get P4 for Linux users. Does it exist and I missed it? > > > John, I think that your summary is quite accurate. Regarding SW > implementation, I have to admit I also fail to see motivation to have P4 > specific datapath instead of having XDP/eBPF one, that could run P4 > compiled program. The only motivation would be that if somehow helps to > offload to HW. But can it? According to the slides from the netdev talk[0], it seems that offloading will have to have a component that goes outside of TC anyway (see "Model 3: Joint loading" where it says "this is impossible"). So I don't really see why having this interpreter in TC help any. Also, any control plane management feature specific to managing P4 state in hardware could just as well manage a BPF-based software path on the kernel side instead of the P4 interpreter stuff... -Toke [0] https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
So I don't have to respond to each email individually, I will respond here in no particular order. First let me provide some context; if that is already clear please skip it. Hopefully providing the context will help us to focus, otherwise that bikeshed's color and shape will take forever to settle on.

__Context__

I hope we all agree that when you have a 2x100G NIC (and I have seen people asking for 2x800G NICs) no XDP or DPDK is going to save you. To visualize: one 25G port is 35Mpps unidirectional. So a "software stack" is not the answer. You need to offload. I would argue further that in the near future a lot of the stuff, including transport, will eventually have to partially or fully move to hardware (see the HOMA keynote for a sample space[0]). CPUs are not going to keep up with the massive IO requirements. I am not talking about offload meaning NIC vendors providing you checksum or clever RSS or some basic metadata or timestamp offload; I think those will continue to be needed - but that is a different scope altogether. Neither are we trying to address transport offload in P4TC.

I hope we also agree that the MAT construct is well understood and that we have good experience in both sw (TC) and hardware deployments over many years. P4 is a _standardized_ specification for addressing these constructs. P4 is by no means perfect but it is an established standard. It is already being used to provide requirements to NIC vendors today (regardless of the underlying implementation).

So what are we trying to achieve with P4TC? John, I could have done a better job in describing the goals in the cover letter: we are going for MAT sw equivalence to what is in hardware. A two-fer that is already provided by the existing TC infrastructure. Scriptability is not a new idea in TC (see u32 and pedit and others in TC). IOW, we are reusing and plugging into a proven and deployed mechanism with a built-in, policy-driven, transparent symbiosis between hardware offload and software that has matured over time. You can take a pipeline or a table or actions and split them between hardware and software transparently, etc. This hammer already meets our goals. It's about using the appropriate tool for the right problem. We are not going to rewrite that infra in rust or ebpf just because. If the argument is about performance (see the point above on 200G ports): we care about sw performance, but more importantly we care about equivalence. I will put it this way: if we are confronted with a design choice of picking whether we forgo equivalence in order to get better sw performance, we are going to trade off performance. If you want wire-speed performance, then offload.

__End Context__

So now let me respond to the points raised.

Jiri, I think one of the concerns you have is that there is no way to generalize the different hardware by using a single abstraction, since all hardware may have different architectures (e.g. whether using RMT vs DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM, etc.) which may necessitate doing things like underlying table reordering, merging, sorting etc. The consensus is that it is the vendor driver that is responsible for “transforming” P4 abstractions into whatever your hardware does. The standardized abstraction is P4. Each P4 object (match or action) has an ID and attributes - just like we do today with flower, with the exception that it is not hardcoded in the kernel as we do today.
So if the tc ndo gets a callback to add an entry that will match header and/or metadata X on table Y and execute action Z, it should take care of figuring out how that transforms into its appropriate hardware layout. IOW, your hardware doesn't have to be P4 native, it just has to accept the constructs. To emphasize again that folks are already doing this: see the MS DASH project where you have many NIC vendors (if I am not mistaken Xilinx, Pensando, Intel MEV, Nvidia BlueField, some startups, etc.) - they all consume P4 and may implement it differently.

The next question is how do you teach the driver what the different P4 object IDs mean and load the P4 objects for the hardware? We need to have a consensus on that for sure - there are multiple approaches that we explored: you could go directly from netlink using the templating DSL; you could go via devlink, or you can have a hybrid of the two. Initially different vendors thought differently but they seem to have settled on devlink. From a TC perspective the ndo callbacks for runtime don't change.

Toke, I labelled that one option as IMpossible as a parody - it is what the vendors are saying today and my play on words is "even impossible says IM possible". The challenge we have is that while some vendor may have a driver and an approach that works, we need to have a consensus instead of one vendor dictating the approach we use.

To John, I hope I have addressed some of your commentary above. The current approach vendors are taking is total bypass of the kernel for offload (we are getting our asses handed to us). The kernel is used to configure control, then it punts to user space, and then you invoke a vendor-proprietary API. And every vendor has their own API. If you are sourcing the NICs from multiple vendors then this is bad for the consumer (unless you are a hyperscaler, in which case almost all are writing their own proprietary user space stacks). Are you pitching that model? The synced hardware + sw is already provided by TC - why punt to user space?

cheers, jamal

[0] https://netdevconf.info/0x16/session.html?keynote-ousterhout

On Mon, Jan 30, 2023 at 6:27 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jiri Pirko <jiri@resnulli.us> writes: > > >>P4TC as SW/HW running same P4: > >> > >>1. This doesn't need to be done in kernel. If one compiler runs > >> P4 into XDP or TC-BPF that is good and another compiler runs > >> it into hw specific backend. This satisifies having both > >> software and hardware implementation. > >> > >>Extra commentary: I agree we've been chatting about this for a long > >>time but until some vendor (Intel?) will OSS and support a linux > >>driver and hardware with open programmable parser and MAT. I'm not > >>sure how we get P4 for Linux users. Does it exist and I missed it? > > > > > > John, I think that your summary is quite accurate. Regarding SW > > implementation, I have to admit I also fail to see motivation to have P4 > > specific datapath instead of having XDP/eBPF one, that could run P4 > > compiled program. The only motivation would be that if somehow helps to > > offload to HW. But can it? > > According to the slides from the netdev talk[0], it seems that > offloading will have to have a component that goes outside of TC anyway > (see "Model 3: Joint loading" where it says "this is impossible"). So I > don't really see why having this interpreter in TC help any. 
> > Also, any control plane management feature specific to managing P4 state > in hardware could just as well manage a BPF-based software path on the > kernel side instead of the P4 interpreter stuff... > > -Toke > > [0] https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted >
Hi Jamal

I'm mostly sat watching and eating popcorn, and I have little knowledge in the area. > Jiri, i think one of the concerns you have is that there is no way to > generalize the different hardware by using a single abstraction since > all hardware may have different architectures (eg whether using RMT vs > DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM, etc) which may > necessitate doing things like underlying table reordering, merging, > sorting etc. The consensus is that it is the vendor driver that is > responsible for “transforming” P4 abstractions into whatever your > hardware does. What is the complexity involved in this 'transformation'? Are we talking about putting a P4 'compiler' into each driver, each vendor having their own compiler? Performing an upcall into user space with a P4 blob and asking the vendor tool to give us back a blob for the hardware? Or is it relatively simple, a few hundred lines of code, simple transformations?

As far as I know, all offloading done so far in the network stack has been purely in the kernel. We transform a kernel representation of networking state into something the hardware understands and pass it to the hardware. That means, except for bugs, what happens in SW should be the same as what happens in HW, just faster. But there has been mention of P4 extensions. Stuff that the SW P4 implementation cannot do, but the hardware can, and vendors appear to think such extensions are part of their magic sauce. How will that work? Is the 'compiler' supposed to recognise the plain P4 equivalent of these extensions and replace it with those extensions?

I suppose what I'm trying to get at is: are we going to enforce the SW and HW equivalence by doing the transformation in the kernel, or could we be heading towards a situation where, in userspace, we take our P4 and compile it with one toolchain for the SW path and another toolchain for the HW path, and we have no guarantee that the resulting blobs actually came from the same sources and are supposed to be equivalent? And doesn't that then make the SW path somewhat pointless?

Andrew
On Mon, Jan 30, 2023 at 9:42 AM Andrew Lunn <andrew@lunn.ch> wrote: > > Hi Jamal > > I'm mostly sat watching and eating popcorn, and i have little > knowledge in the area. > > > Jiri, i think one of the concerns you have is that there is no way to > > generalize the different hardware by using a single abstraction since > > all hardware may have different architectures (eg whether using RMT vs > > DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM, etc) which may > > necessitate doing things like underlying table reordering, merging, > > sorting etc. The consensus is that it is the vendor driver that is > > responsible for “transforming” P4 abstractions into whatever your > > hardware does. > > What is the complexity involved in this 'transformation'? Are we > talking about putting a P4 'compiler' into each driver, each vendor > having there own compiler? Performing an upcall into user space with a > P4 blob and asking the vendor tool to give us back a blob for the > hardware? Or is it relatively simple, a few hundred lines of code, > simple transformations? > The current model is you compile the kernel vs hardware output as two separate files. They are loaded separately. The compiler has a vendor-specific backend and a P4TC one. There has to be an authentication sync that the two are one and the same; essentially each program/pipeline has a name and an ID and some hash for validation. See slide #49 in the presentation at https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted Only the vendor will be able to create something reasonable for their specific hardware. The issue is how to load the hardware part - the three methods that were discussed are listed in slides 50-52. The vendors seem to be in agreement that the best option is #1. BTW, these discussions happen in a high-bandwidth medium at the moment, every two weeks, here: https://www.google.com/url?q=https://teams.microsoft.com/l/meetup-join/1.&sa=D&source=calendar&ust=1675366175958603&usg=AOvVaw1UZo8g5Ir6OcC-SRFM9PF1 It would be helpful if other folks showed up in those meetings. > As far as i know, all offloading done so far in the network stack has > been purely in kernel. We transform a kernel representation of > networking state into something the hardware understands and pass it > to the hardware. That means, except for bugs, what happens in SW > should be the same as what happens in HW, just faster. Exactly - that is what is referred to as "hardcoding" in slides 43-44, with what P4TC would do described in slide #45. > But there have > been mention of P4 extensions. Stuff that the SW P4 implementation > cannot do, but the hardware can, and vendors appear to think such > extensions are part of their magic sauce. How will that work? Is the > 'compiler' supposed to recognise plain P4 equivalent of these > extensions and replace it with those extensions? I think the "magic sauce" angle is mostly the idea of how one would implement foobar differently than the other vendor. If someone uses a little ASIC and the next person uses FW to program a TCAM they may feel they have an advantage in their hardware that the other guy doesn't have. At the end of the day that thing looks like a box with input Y that produces output X. In P4 they call them "externs". From a P4TC backend perspective, we hope that we can allow foobar to be implemented by multiple vendors without caring about the details of the implementation. The vendor backend can describe it to whatever detail it wants to its hardware.
> I suppose what i'm trying to get at, is are we going to enforce the SW
> and HW equivalence by doing the transformation in kernel, or could we
> be heading towards in userspace we take our P4 and compile it with one
> toolchain for the SW path, another toolchain for the HW path, and we
> have no guarantee that the resulting blobs actually came from the same
> sources and are supposed to be equivalent? And that then makes the SW
> path somewhat pointless?

See above - the two have to map to the same equivalence and be validated as such. It is also about providing a single interface through the kernel as opposed to dealing with multiple vendor APIs.

cheers,
jamal
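To make the two-artifact model described above a little more concrete, here is a minimal C sketch of the "name + ID + hash" validation idea. The struct layout, field names and the check itself are assumptions made up for illustration; they are not the actual metadata format used by the P4TC or vendor compiler backends.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata each compiler backend (P4TC and vendor) would
 * attach to its output so the two artifacts can be tied together. */
struct p4_prog_meta {
	char     name[64];     /* pipeline name, e.g. "aP4proggie" */
	uint32_t pipeline_id;  /* numeric pipeline ID              */
	uint8_t  digest[32];   /* hash over the common source/IR   */
};

/* Sketch of the "authentication sync": the software and hardware
 * artifacts are accepted as one and the same program only if the
 * metadata emitted by both backends matches. */
static bool p4_prog_meta_match(const struct p4_prog_meta *sw,
			       const struct p4_prog_meta *hw)
{
	return sw->pipeline_id == hw->pipeline_id &&
	       strncmp(sw->name, hw->name, sizeof(sw->name)) == 0 &&
	       memcmp(sw->digest, hw->digest, sizeof(sw->digest)) == 0;
}
```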
Jamal Hadi Salim <jhs@mojatatu.com> writes: > So i dont have to respond to each email individually, I will respond > here in no particular order. First let me provide some context, if > that was already clear please skip it. Hopefully providing the context > will help us to focus otherwise that bikeshed's color and shape will > take forever to settle on. > > __Context__ > > I hope we all agree that when you have 2x100G NIC (and i have seen > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > is not the answer. You need to offload. I'm not disputing the need to offload, and I'm personally delighted that P4 is breaking open the vendor black boxes to provide a standardised interface for this. However, while it's true that software can't keep up at the high end, not everything runs at the high end, and today's high end is tomorrow's mid end, in which XDP can very much play a role. So being able to move smoothly between the two, and even implement functions that split processing between them, is an essential feature of a programmable networking path in Linux. Which is why I'm objecting to implementing the P4 bits as something that's hanging off the side of the stack in its own thing and is not integrated with the rest of the stack. You were touting this as a feature ("being self-contained"). I consider it a bug. > Scriptability is not a new idea in TC (see u32 and pedit and others in > TC). u32 is notoriously hard to use. The others are neat, but obviously limited to particular use cases. Do you actually expect anyone to use P4 by manually entering TC commands to build a pipeline? I really find that hard to believe... > IOW, we are reusing and plugging into a proven and deployed mechanism > with a built-in policy driven, transparent symbiosis between hardware > offload and software that has matured over time. You can take a > pipeline or a table or actions and split them between hardware and > software transparently, etc. That's a control plane feature though, it's not an argument for adding another interpreter to the kernel. > This hammer already meets our goals. That 60k+ line patch submission of yours says otherwise... > It's about using the appropriate tool for the right problem. We are > not going to rewrite that infra in rust or ebpf just because. "The right tool for the job" also means something that integrates well with the wider ecosystem. For better or worse, in the kernel that ecosystem (of datapath programmability) is BPF-based. Dismissing request to integrate with that as, essentially, empty fanboyism, comes across as incredibly arrogant. > Toke, I labelled that one option as IMpossible as a parody - it is > what the vendors are saying today and my play on words is "even > impossible says IM possible". Side note: I think it would be helpful if you dropped all the sarcasm and snide remarks when communicating this stuff in writing, especially to a new audience. It just confuses things, and doesn't exactly help with the perception of arrogance either... -Toke
On Sun, Jan 29, 2023 at 7:09 PM Singhai, Anjali <anjali.singhai@intel.com> wrote:
>
> I am agreeing with you Tom. P4tc does not restrict the high level language to be P4, it can be anything as long as it can be compiled to create an IR that can be used to teach/program the SW and the HW, which is what the script-ability of p4tc provides.
>
> Ultimately the devices are evolving as combination of highly efficient Domain specific architecture and the traditional Generic cores, and SW in the kernel has to evolve to program them both in a way that the user can decide whether to run a particular functionality in Domain specific HW or SW that runs on general purpose cores or in some cases the functionality runs in both places and the intelligent (and some-day AI managed Infrastructure controller) entity decides whether the flow should use the HW path or the SW path. There is no other way forward because a SW dataplane can only provide an overflow region for flows and the HW will have to run the most demanding flows, as the Network demand and capacity of the data-center keeps reaching higher and higher levels. From a HW vendor's point of view we have already entered the 3rd epoch of computer architecture.
>
> A domain specific architecture still has to be programmable but for a specific domain, linux kernel which has remained fixed function (and fixed protocol)

I believe the majority of people on this list would disagree with that. XDP and eBPF were invented precisely to make the Linux kernel extensible. As Toke said, any proposed solution for programmable datapaths cannot ignore XDP/eBPF.

> traditionally needs to evolve to support these domain specific architecture that are protocol and dataplane programmable. I think p4tc definitely is the right way forward.
>
> There were some arguments made earlier about but the big Datacenters are programming these domain specific architecture from user space already, no doubt but isn't the whole argument for linux kernel is democratizing of the goodness the HW brings to all , the small users and the big ones?
>
> There is also argument that is being made about using ebpf for implementing the SW path, may be I am missing the part as to how do you offload if not to another general purpose core even if it is not as evolved as the current day Xeon's. And we know that even the simplest of the general purpose cores ( example RISC-V) right now cannot sustain the rate at which the network needs to feed the business logic running on the CPUs or GPUs or TPUs in an economically viable solution. All data points to the fact that Network processing running on general purpose cores eats up more than half of the cores and that’s expensive.

You are making the incorrect assumption that we are restricted to using off-the-shelf commodity CPUs. With an open ISA like RISC-V we are free to customize it and build domain-specific CPUs following the same principles of domain specific architectures. I believe it is quite feasible with current technology to build a fully programmable and very high performance datapath through CPUs. The solution involves ripping out things we don't need, like the FPU and MMU, and putting in things like optimized instructions for parsing, primitives for maximizing parallelism, arithmetic instructions optimized for processing specific kinds of data, and inline accelerators.
Running a datapath on a CPU avoids the rigid structure of a hardware pipeline (like John mentioned, a match-action pipeline won't work for all problems).

> Because the performance/power unit math when using an IPU/DPU/Smart NIC for network work load is so much better than that of a General purpose core. So I do not see a way forward for epbf to be offloaded on anything but general purpose cores

We can already do that. The kParser CLI command examples were generated from a parser written in PANDA-C. PANDA is a C library API, and a parser is defined as a set of data structures and macros. It's just plain C code and doesn't even use #pragma the way CUDA, the analogous programming model for GPUs, does. PANDA-C code can be compiled into eBPF, userspace, the kParser CLI, and other targets. Assuming that HW offload for P4TC is supported, the same parser would then be running in both eBPF and the P4 hardware, and hence we have in fact offloaded a parser that runs in eBPF to P4 hardware. This works today for the parser; the rest of the processing can be covered in a similar way.

> and in the meantime Domain specific programmable ASICs need to be still programmed as they are the right solution for the economy of scale.
>
> Having said that we do have to find a good solution for p4 externs in SW and may be there is room for some helpers ( may be even ebpf) ( as long as you don’t ask me to offload that in HW
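To give a flavour of the "parser as plain C data structures" idea above, here is a toy sketch. It deliberately does not use the real PANDA/kParser API; the struct and field names below are invented for illustration only. It just shows the general shape of a declarative parse graph that a backend could lower to eBPF, userspace code, a CLI script, or a hardware image.

```c
#include <stdint.h>
#include <stddef.h>

/* A protocol node: how long the header is, where its next-protocol
 * field lives, and a table mapping next-protocol values to nodes. */
struct proto_table;

struct proto_node {
	size_t hdr_len;                  /* fixed header length            */
	size_t next_proto_off;           /* offset of the next-proto field */
	size_t next_proto_len;           /* size of that field in bytes    */
	const struct proto_table *table; /* lookup for the next node       */
};

struct proto_table_entry {
	uint32_t value;                  /* e.g. ethertype or IP protocol  */
	const struct proto_node *node;
};

struct proto_table {
	size_t nentries;
	const struct proto_table_entry *entries;
};

/* Leaf nodes terminate parsing. */
static const struct proto_node tcp_node = { .hdr_len = 20 };
static const struct proto_node udp_node = { .hdr_len = 8 };

static const struct proto_table_entry ipv4_entries[] = {
	{ .value = 6,  .node = &tcp_node },
	{ .value = 17, .node = &udp_node },
};
static const struct proto_table ipv4_table = {
	.nentries = 2, .entries = ipv4_entries,
};

static const struct proto_node ipv4_node = {
	.hdr_len = 20, .next_proto_off = 9, .next_proto_len = 1,
	.table = &ipv4_table,
};

static const struct proto_table_entry eth_entries[] = {
	{ .value = 0x0800, .node = &ipv4_node },
};
static const struct proto_table eth_table = {
	.nentries = 1, .entries = eth_entries,
};

/* Root of the parse graph: Ethernet -> IPv4 -> {TCP, UDP}. Because the
 * whole parser is constant data, a backend is free to walk it and emit
 * eBPF, plain C, a CLI script, or a hardware parser image. */
static const struct proto_node eth_node = {
	.hdr_len = 14, .next_proto_off = 12, .next_proto_len = 2,
	.table = &eth_table,
};
```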
On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jamal Hadi Salim <jhs@mojatatu.com> writes: > > > So i dont have to respond to each email individually, I will respond > > here in no particular order. First let me provide some context, if > > that was already clear please skip it. Hopefully providing the context > > will help us to focus otherwise that bikeshed's color and shape will > > take forever to settle on. > > > > __Context__ > > > > I hope we all agree that when you have 2x100G NIC (and i have seen > > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > > is not the answer. You need to offload. > > I'm not disputing the need to offload, and I'm personally delighted that > P4 is breaking open the vendor black boxes to provide a standardised > interface for this. > > However, while it's true that software can't keep up at the high end, > not everything runs at the high end, and today's high end is tomorrow's > mid end, in which XDP can very much play a role. So being able to move > smoothly between the two, and even implement functions that split > processing between them, is an essential feature of a programmable > networking path in Linux. Which is why I'm objecting to implementing the > P4 bits as something that's hanging off the side of the stack in its own > thing and is not integrated with the rest of the stack. You were touting > this as a feature ("being self-contained"). I consider it a bug. > > > Scriptability is not a new idea in TC (see u32 and pedit and others in > > TC). > > u32 is notoriously hard to use. The others are neat, but obviously > limited to particular use cases. Despite my love for u32, I admit its user interface is cryptic. I just wanted to point out to existing samples of scriptable and offloadable TC objects. > Do you actually expect anyone to use P4 > by manually entering TC commands to build a pipeline? I really find that > hard to believe... You dont have to manually hand code anything - its the compilers job. But of course for simple P4 programs, yes i think you can handcode something if you understand the templating syntax. > > IOW, we are reusing and plugging into a proven and deployed mechanism > > with a built-in policy driven, transparent symbiosis between hardware > > offload and software that has matured over time. You can take a > > pipeline or a table or actions and split them between hardware and > > software transparently, etc. > > That's a control plane feature though, it's not an argument for adding > another interpreter to the kernel. I am not sure what you mean by control, but what i described is kernel built in. Of course i could do more complex things from user space (if that is what you mean as control). > > This hammer already meets our goals. > > That 60k+ line patch submission of yours says otherwise... This is pretty much covered in the cover letter and a few responses in the thread since. > > It's about using the appropriate tool for the right problem. We are > > not going to rewrite that infra in rust or ebpf just because. > > "The right tool for the job" also means something that integrates well > with the wider ecosystem. For better or worse, in the kernel that > ecosystem (of datapath programmability) is BPF-based. Dismissing request > to integrate with that as, essentially, empty fanboyism, comes across as > incredibly arrogant. 
> > Toke, I labelled that one option as IMpossible as a parody - it is > > what the vendors are saying today and my play on words is "even > > impossible says IM possible". > > Side note: I think it would be helpful if you dropped all the sarcasm > and snide remarks when communicating this stuff in writing, especially > to a new audience. It just confuses things, and doesn't exactly help > with the perception of arrogance either... I apologize if i offended you - you quoted a slide i did and I was describing what that slide was supposed to relay. cheers, jamal > -Toke >
Jamal Hadi Salim <hadi@mojatatu.com> writes: > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: >> >> > So i dont have to respond to each email individually, I will respond >> > here in no particular order. First let me provide some context, if >> > that was already clear please skip it. Hopefully providing the context >> > will help us to focus otherwise that bikeshed's color and shape will >> > take forever to settle on. >> > >> > __Context__ >> > >> > I hope we all agree that when you have 2x100G NIC (and i have seen >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" >> > is not the answer. You need to offload. >> >> I'm not disputing the need to offload, and I'm personally delighted that >> P4 is breaking open the vendor black boxes to provide a standardised >> interface for this. >> >> However, while it's true that software can't keep up at the high end, >> not everything runs at the high end, and today's high end is tomorrow's >> mid end, in which XDP can very much play a role. So being able to move >> smoothly between the two, and even implement functions that split >> processing between them, is an essential feature of a programmable >> networking path in Linux. Which is why I'm objecting to implementing the >> P4 bits as something that's hanging off the side of the stack in its own >> thing and is not integrated with the rest of the stack. You were touting >> this as a feature ("being self-contained"). I consider it a bug. >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in >> > TC). >> >> u32 is notoriously hard to use. The others are neat, but obviously >> limited to particular use cases. > > Despite my love for u32, I admit its user interface is cryptic. I just > wanted to point out to existing samples of scriptable and offloadable > TC objects. > >> Do you actually expect anyone to use P4 >> by manually entering TC commands to build a pipeline? I really find that >> hard to believe... > > You dont have to manually hand code anything - its the compilers job. Right, that was kinda my point: in that case the compiler could just as well generate a (set of) BPF program(s) instead of this TC script thing. >> > IOW, we are reusing and plugging into a proven and deployed mechanism >> > with a built-in policy driven, transparent symbiosis between hardware >> > offload and software that has matured over time. You can take a >> > pipeline or a table or actions and split them between hardware and >> > software transparently, etc. >> >> That's a control plane feature though, it's not an argument for adding >> another interpreter to the kernel. > > I am not sure what you mean by control, but what i described is kernel > built in. Of course i could do more complex things from user space (if > that is what you mean as control). "Control plane" as in SDN parlance. I.e., the bits that keep track of configuration of the flow/pipeline/table configuration. There's no reason you can't have all that infrastructure and use BPF as the datapath language. I.e., instead of: tc p4template create pipeline/aP4proggie numtables 1 ... + all the other stuff to populate it you could just do: tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o and still have all the management infrastructure without the new interpreter and associated complexity in the kernel. >> > This hammer already meets our goals. 
>> >> That 60k+ line patch submission of yours says otherwise... > > This is pretty much covered in the cover letter and a few responses in > the thread since. The only argument for why your current approach makes sense I've seen you make is "I don't want to rewrite it in BPF". Which is not a technical argument. I'm not trying to be disingenuous here, BTW: I really don't see the technical argument for why the P4 data plane has to be implemented as its own interpreter instead of integrating with what we have already (i.e., BPF). -Toke
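As a rough illustration of the alternative being described here, the compiler emitting a BPF program for the software path rather than a P4TC script, below is a minimal tc BPF sketch of a single match-action table keyed on IPv4 destination address. It is hand-written for this thread, not the output of p4c or any existing backend, and the map, program and section names are arbitrary.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* One P4-style table: exact match on IPv4 daddr -> action id.
 * The control plane populates this map from user space. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);   /* destination address (network order) */
	__type(value, __u32); /* 0 = drop, anything else = accept    */
} fwd_table SEC(".maps");

SEC("tc")
int p4_pipeline(struct __sk_buff *skb)
{
	void *data     = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	__u32 *act;

	/* Parser: Ethernet -> IPv4, with verifier-mandated bounds checks. */
	if ((void *)(eth + 1) > data_end)
		return TC_ACT_OK;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_OK;

	/* Table lookup followed by the chosen action. */
	act = bpf_map_lookup_elem(&fwd_table, &iph->daddr);
	if (act && *act == 0)
		return TC_ACT_SHOT;
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

Attaching such an object would then be roughly "tc filter add dev eth0 ingress bpf da obj aP4proggie.bpf.o sec tc", with the exact syntax depending on the iproute2 version in use.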
Toke Høiland-Jørgensen wrote: > Jamal Hadi Salim <hadi@mojatatu.com> writes: > > > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> > >> > So i dont have to respond to each email individually, I will respond > >> > here in no particular order. First let me provide some context, if > >> > that was already clear please skip it. Hopefully providing the context > >> > will help us to focus otherwise that bikeshed's color and shape will > >> > take forever to settle on. > >> > > >> > __Context__ > >> > > >> > I hope we all agree that when you have 2x100G NIC (and i have seen > >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > >> > is not the answer. You need to offload. > >> > >> I'm not disputing the need to offload, and I'm personally delighted that > >> P4 is breaking open the vendor black boxes to provide a standardised > >> interface for this. > >> > >> However, while it's true that software can't keep up at the high end, > >> not everything runs at the high end, and today's high end is tomorrow's > >> mid end, in which XDP can very much play a role. So being able to move > >> smoothly between the two, and even implement functions that split > >> processing between them, is an essential feature of a programmable > >> networking path in Linux. Which is why I'm objecting to implementing the > >> P4 bits as something that's hanging off the side of the stack in its own > >> thing and is not integrated with the rest of the stack. You were touting > >> this as a feature ("being self-contained"). I consider it a bug. > >> > >> > Scriptability is not a new idea in TC (see u32 and pedit and others in > >> > TC). > >> > >> u32 is notoriously hard to use. The others are neat, but obviously > >> limited to particular use cases. > > > > Despite my love for u32, I admit its user interface is cryptic. I just > > wanted to point out to existing samples of scriptable and offloadable > > TC objects. > > > >> Do you actually expect anyone to use P4 > >> by manually entering TC commands to build a pipeline? I really find that > >> hard to believe... > > > > You dont have to manually hand code anything - its the compilers job. > > Right, that was kinda my point: in that case the compiler could just as > well generate a (set of) BPF program(s) instead of this TC script thing. > > >> > IOW, we are reusing and plugging into a proven and deployed mechanism > >> > with a built-in policy driven, transparent symbiosis between hardware > >> > offload and software that has matured over time. You can take a > >> > pipeline or a table or actions and split them between hardware and > >> > software transparently, etc. > >> > >> That's a control plane feature though, it's not an argument for adding > >> another interpreter to the kernel. > > > > I am not sure what you mean by control, but what i described is kernel > > built in. Of course i could do more complex things from user space (if > > that is what you mean as control). > > "Control plane" as in SDN parlance. I.e., the bits that keep track of > configuration of the flow/pipeline/table configuration. > > There's no reason you can't have all that infrastructure and use BPF as > the datapath language. I.e., instead of: > > tc p4template create pipeline/aP4proggie numtables 1 > ... 
+ all the other stuff to populate it
>
> you could just do:
>
> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
>
> and still have all the management infrastructure without the new
> interpreter and associated complexity in the kernel.
>
> >> > This hammer already meets our goals.
> >>
> >> That 60k+ line patch submission of yours says otherwise...
> >
> > This is pretty much covered in the cover letter and a few responses in
> > the thread since.
>
> The only argument for why your current approach makes sense I've seen
> you make is "I don't want to rewrite it in BPF". Which is not a
> technical argument.
>
> I'm not trying to be disingenuous here, BTW: I really don't see the
> technical argument for why the P4 data plane has to be implemented as
> its own interpreter instead of integrating with what we have already
> (i.e., BPF).
>
> -Toke
>

I'll just take this here because I think it's mostly related.

Still not convinced the P4TC has any value for sw. From the
slide you say vendors prefer, you have this picture roughly:

 [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
        |
        |
 [ P4 Vendor backend ]
        |
        |
        V
   [ Devlink ]

Now just replace the P4TC backend with P4C and your only work is to
replace devlink with the current hw-specific bits, and you have
sw and hw components. Then you get XDP-BPF pretty easily from a
P4XDP backend if you like. The compat piece is handled by the compiler,
where it should be. My CPU is not a MAT, so pretending it is seems
not ideal to me; I don't have a TCAM on my cores.

For runtime, get those vendors to write their SDKs over Devlink
and there is no need for this software thing. The runtime for P4c should
already work over BPF. Giving this picture:

 [ P4 compiler ] ------ [ P4C backend ] ----> BPF
        |
        |
 [ P4 Vendor backend ]
        |
        |
        V
   [ Devlink ]

And much less work for us to maintain.

.John
John Fastabend <john.fastabend@gmail.com> writes: > Toke Høiland-Jørgensen wrote: >> Jamal Hadi Salim <hadi@mojatatu.com> writes: >> >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: >> >> >> >> > So i dont have to respond to each email individually, I will respond >> >> > here in no particular order. First let me provide some context, if >> >> > that was already clear please skip it. Hopefully providing the context >> >> > will help us to focus otherwise that bikeshed's color and shape will >> >> > take forever to settle on. >> >> > >> >> > __Context__ >> >> > >> >> > I hope we all agree that when you have 2x100G NIC (and i have seen >> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" >> >> > is not the answer. You need to offload. >> >> >> >> I'm not disputing the need to offload, and I'm personally delighted that >> >> P4 is breaking open the vendor black boxes to provide a standardised >> >> interface for this. >> >> >> >> However, while it's true that software can't keep up at the high end, >> >> not everything runs at the high end, and today's high end is tomorrow's >> >> mid end, in which XDP can very much play a role. So being able to move >> >> smoothly between the two, and even implement functions that split >> >> processing between them, is an essential feature of a programmable >> >> networking path in Linux. Which is why I'm objecting to implementing the >> >> P4 bits as something that's hanging off the side of the stack in its own >> >> thing and is not integrated with the rest of the stack. You were touting >> >> this as a feature ("being self-contained"). I consider it a bug. >> >> >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in >> >> > TC). >> >> >> >> u32 is notoriously hard to use. The others are neat, but obviously >> >> limited to particular use cases. >> > >> > Despite my love for u32, I admit its user interface is cryptic. I just >> > wanted to point out to existing samples of scriptable and offloadable >> > TC objects. >> > >> >> Do you actually expect anyone to use P4 >> >> by manually entering TC commands to build a pipeline? I really find that >> >> hard to believe... >> > >> > You dont have to manually hand code anything - its the compilers job. >> >> Right, that was kinda my point: in that case the compiler could just as >> well generate a (set of) BPF program(s) instead of this TC script thing. >> >> >> > IOW, we are reusing and plugging into a proven and deployed mechanism >> >> > with a built-in policy driven, transparent symbiosis between hardware >> >> > offload and software that has matured over time. You can take a >> >> > pipeline or a table or actions and split them between hardware and >> >> > software transparently, etc. >> >> >> >> That's a control plane feature though, it's not an argument for adding >> >> another interpreter to the kernel. >> > >> > I am not sure what you mean by control, but what i described is kernel >> > built in. Of course i could do more complex things from user space (if >> > that is what you mean as control). >> >> "Control plane" as in SDN parlance. I.e., the bits that keep track of >> configuration of the flow/pipeline/table configuration. >> >> There's no reason you can't have all that infrastructure and use BPF as >> the datapath language. 
I.e., instead of: >> >> tc p4template create pipeline/aP4proggie numtables 1 >> ... + all the other stuff to populate it >> >> you could just do: >> >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o >> >> and still have all the management infrastructure without the new >> interpreter and associated complexity in the kernel. >> >> >> > This hammer already meets our goals. >> >> >> >> That 60k+ line patch submission of yours says otherwise... >> > >> > This is pretty much covered in the cover letter and a few responses in >> > the thread since. >> >> The only argument for why your current approach makes sense I've seen >> you make is "I don't want to rewrite it in BPF". Which is not a >> technical argument. >> >> I'm not trying to be disingenuous here, BTW: I really don't see the >> technical argument for why the P4 data plane has to be implemented as >> its own interpreter instead of integrating with what we have already >> (i.e., BPF). >> >> -Toke >> > > I'll just take this here becaues I think its mostly related. > > Still not convinced the P4TC has any value for sw. From the > slide you say vendors prefer you have this picture roughtly. > > > [ P4 compiler ] ------ [ P4TC backend ] ----> TC API > | > | > [ P4 Vendor backend ] > | > | > V > [ Devlink ] > > > Now just replace P4TC backend with P4C and your only work is to > replace devlink with the current hw specific bits and you have > a sw and hw components. Then you get XDP-BPF pretty easily from > P4XDP backend if you like. The compat piece is handled by compiler > where it should be. My CPU is not a MAT so pretending it is seems > not ideal to me, I don't have a TCAM on my cores. > > For runtime get those vendors to write their SDKs over Devlink > and no need for this software thing. The runtime for P4c should > already work over BPF. Giving this picture > > [ P4 compiler ] ------ [ P4C backend ] ----> BPF > | > | > [ P4 Vendor backend ] > | > | > V > [ Devlink ] > > And much less work for us to maintain. Yes, this was basically my point as well. Thank you for putting it into ASCII diagrams! :) There's still the control plane bit: some kernel component that configures the pieces (pipelines?) created in the top-right and bottom-left corners of your diagram(s), keeping track of which pipelines are in HW/SW, maybe updating some match tables dynamically and extracting statistics. I'm totally OK with having that bit be in the kernel, but that can be added on top of your second diagram just as well as on top of the first one... -Toke
On Mon, Jan 30, 2023 at 1:10 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Toke Høiland-Jørgensen wrote: > > Jamal Hadi Salim <hadi@mojatatu.com> writes: > > > > > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > >> > > >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > > >> > > >> > So i dont have to respond to each email individually, I will respond > > >> > here in no particular order. First let me provide some context, if > > >> > that was already clear please skip it. Hopefully providing the context > > >> > will help us to focus otherwise that bikeshed's color and shape will > > >> > take forever to settle on. > > >> > > > >> > __Context__ > > >> > > > >> > I hope we all agree that when you have 2x100G NIC (and i have seen > > >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > > >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > > >> > is not the answer. You need to offload. > > >> > > >> I'm not disputing the need to offload, and I'm personally delighted that > > >> P4 is breaking open the vendor black boxes to provide a standardised > > >> interface for this. > > >> > > >> However, while it's true that software can't keep up at the high end, > > >> not everything runs at the high end, and today's high end is tomorrow's > > >> mid end, in which XDP can very much play a role. So being able to move > > >> smoothly between the two, and even implement functions that split > > >> processing between them, is an essential feature of a programmable > > >> networking path in Linux. Which is why I'm objecting to implementing the > > >> P4 bits as something that's hanging off the side of the stack in its own > > >> thing and is not integrated with the rest of the stack. You were touting > > >> this as a feature ("being self-contained"). I consider it a bug. > > >> > > >> > Scriptability is not a new idea in TC (see u32 and pedit and others in > > >> > TC). > > >> > > >> u32 is notoriously hard to use. The others are neat, but obviously > > >> limited to particular use cases. > > > > > > Despite my love for u32, I admit its user interface is cryptic. I just > > > wanted to point out to existing samples of scriptable and offloadable > > > TC objects. > > > > > >> Do you actually expect anyone to use P4 > > >> by manually entering TC commands to build a pipeline? I really find that > > >> hard to believe... > > > > > > You dont have to manually hand code anything - its the compilers job. > > > > Right, that was kinda my point: in that case the compiler could just as > > well generate a (set of) BPF program(s) instead of this TC script thing. > > > > >> > IOW, we are reusing and plugging into a proven and deployed mechanism > > >> > with a built-in policy driven, transparent symbiosis between hardware > > >> > offload and software that has matured over time. You can take a > > >> > pipeline or a table or actions and split them between hardware and > > >> > software transparently, etc. > > >> > > >> That's a control plane feature though, it's not an argument for adding > > >> another interpreter to the kernel. > > > > > > I am not sure what you mean by control, but what i described is kernel > > > built in. Of course i could do more complex things from user space (if > > > that is what you mean as control). > > > > "Control plane" as in SDN parlance. I.e., the bits that keep track of > > configuration of the flow/pipeline/table configuration. 
> > > > There's no reason you can't have all that infrastructure and use BPF as > > the datapath language. I.e., instead of: > > > > tc p4template create pipeline/aP4proggie numtables 1 > > ... + all the other stuff to populate it > > > > you could just do: > > > > tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o > > > > and still have all the management infrastructure without the new > > interpreter and associated complexity in the kernel. > > > > >> > This hammer already meets our goals. > > >> > > >> That 60k+ line patch submission of yours says otherwise... > > > > > > This is pretty much covered in the cover letter and a few responses in > > > the thread since. > > > > The only argument for why your current approach makes sense I've seen > > you make is "I don't want to rewrite it in BPF". Which is not a > > technical argument. > > > > I'm not trying to be disingenuous here, BTW: I really don't see the > > technical argument for why the P4 data plane has to be implemented as > > its own interpreter instead of integrating with what we have already > > (i.e., BPF). > > > > -Toke > > > > I'll just take this here becaues I think its mostly related. > > Still not convinced the P4TC has any value for sw. From the > slide you say vendors prefer you have this picture roughtly. > > > [ P4 compiler ] ------ [ P4TC backend ] ----> TC API > | > | > [ P4 Vendor backend ] > | > | > V > [ Devlink ] > > > Now just replace P4TC backend with P4C and your only work is to > replace devlink with the current hw specific bits and you have > a sw and hw components. Then you get XDP-BPF pretty easily from > P4XDP backend if you like. The compat piece is handled by compiler > where it should be. My CPU is not a MAT so pretending it is seems > not ideal to me, I don't have a TCAM on my cores. > > For runtime get those vendors to write their SDKs over Devlink > and no need for this software thing. The runtime for P4c should > already work over BPF. Giving this picture > > [ P4 compiler ] ------ [ P4C backend ] ----> BPF > | > | > [ P4 Vendor backend ] > | > | > V > [ Devlink ] > John, that's a good direction. If we go one step further and define a common Intermediate Representation for programmable datapaths, we can create a general solution that gives the user maximum flexibility and freedom on both the frontend and the backend. For the front end they can use whatever language they want as long as it supports an API that can be compiled into the common IR (this is what PANDA does for defining data paths in C). Similarly, for the backend we want to support multiple targets both hardware and software. This is "write once, run anywhere, run well": the developer writes their program once, the same program runs on different targets, and on any particular target the program runs as fast as possible given the capabilities of the target. There is another problem that a common IR addresses. The salient requirement of kernel offload is that the offloaded functionality is precisely equivalent to the kernel functionality that is being offloaded. The traditional way this has been done is that the kernel has to manage the bits offloaded to the device and provide all the API. The problem is that it doesn't scale and quickly leads to complexities like callouts to a jit compiler. My proposal is that we compute an MD-5 hash of the IR and tag the program compiled from it for the kernel (e.g. eBPF bytecode) and also tag the executable compiled for the hardware (e.g. the P4 run-time). 
At run time, the kernel would query the device to see what program it is running; if the reported hash is equal to that of the loaded eBPF program, then the device is running a functionally equivalent program and the offload can safely be performed (via whatever datapath interfaces are needed). This means that the device can be managed through a side channel, but the kernel retains the necessary transparency to instantiate the offload.

Here is a diagram of what this might look like:

[ P4 program ] ----------- [ P4 compiler ] ---------+
                                                     |
[ PANDA-C program ] ------- [ LLVM ] ----------------+
                                                     |
[ PANDA-Python program ] -- [ Python compiler ] -----+
                                                     |
[ PANDA-Rust program ] ---- [ Rust compiler ] -------+
                                                     |
[ GUI ] ------------------- [ GUI to IR ] -----------+
                                                     |
[ CLI ] ------------------- [ CLI to IR ] -----------+
                                                     |
                                                     V
                                          [ Common IR (.json) ]
                                                     |
                   +---------------------------------+
                   |
                   +---- [ P4 Vendor Backend ] ---- [ Devlink ]
                   |
                   +---- [ IR to eBPF backend compiler ] --- [ eBPF bytecode ]
                   |
                   +---- [ IR to CPU instructions ] --- [ Executable Binary ]
                   |
                   +---- [ IR to P4TC CLI ] --- [ Script of commands ]

> And much less work for us to maintain.

+1

>
> .John
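A minimal C sketch of the run-time check described above follows; all names are made up (neither the driver callback nor any such interface exists today). The point is only to show the shape of "query the device, compare digests, enable the offload".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IR_DIGEST_LEN 16	/* e.g. an MD5 digest of the common IR */

/* Digest attached to the loaded software (eBPF) program at compile time. */
struct sw_prog {
	uint8_t ir_digest[IR_DIGEST_LEN];
};

/* Hypothetical device op: report the digest of the program the hardware
 * is currently running (loaded out of band, e.g. via devlink). */
struct hw_dev_ops {
	int (*get_prog_digest)(void *dev_priv, uint8_t *digest, size_t len);
};

/* The offload is only instantiated when both sides were compiled from
 * the same IR, i.e. when the digests match. */
static bool offload_is_safe(const struct sw_prog *sw,
			    const struct hw_dev_ops *ops, void *dev_priv)
{
	uint8_t hw_digest[IR_DIGEST_LEN];

	if (!ops->get_prog_digest ||
	    ops->get_prog_digest(dev_priv, hw_digest, sizeof(hw_digest)))
		return false;

	return memcmp(sw->ir_digest, hw_digest, sizeof(hw_digest)) == 0;
}
```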
I think we are going in cycles. John I asked you earlier and i think you answered my question: You are actually pitching an out of band runtime using some vendor sdk via devlink (why even bother with devlink interface, not sure). P4TC is saying the runtime API is via the kernel to the drivers. Toke, i dont think i have managed to get across that there is an "autonomous" control built into the kernel. It is not just things that come across netlink. It's about the whole infra. I think without that clarity we are going to speak past each other and it's a frustrating discussion which could get emotional. You cant just displace, for example flower and say "use ebpf because it works on tc", theres a lot of tribal knowledge gluing relationship between hardware and software. Maybe take a look at this patchset below to see an example which shows how part of an action graph will work in hardware and partially in sw under certain conditions: https://www.spinics.net/lists/netdev/msg877507.html and then we can have a better discussion. cheers, jamal On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > John Fastabend <john.fastabend@gmail.com> writes: > > > Toke Høiland-Jørgensen wrote: > >> Jamal Hadi Salim <hadi@mojatatu.com> writes: > >> > >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> >> > >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> >> > >> >> > So i dont have to respond to each email individually, I will respond > >> >> > here in no particular order. First let me provide some context, if > >> >> > that was already clear please skip it. Hopefully providing the context > >> >> > will help us to focus otherwise that bikeshed's color and shape will > >> >> > take forever to settle on. > >> >> > > >> >> > __Context__ > >> >> > > >> >> > I hope we all agree that when you have 2x100G NIC (and i have seen > >> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > >> >> > is not the answer. You need to offload. > >> >> > >> >> I'm not disputing the need to offload, and I'm personally delighted that > >> >> P4 is breaking open the vendor black boxes to provide a standardised > >> >> interface for this. > >> >> > >> >> However, while it's true that software can't keep up at the high end, > >> >> not everything runs at the high end, and today's high end is tomorrow's > >> >> mid end, in which XDP can very much play a role. So being able to move > >> >> smoothly between the two, and even implement functions that split > >> >> processing between them, is an essential feature of a programmable > >> >> networking path in Linux. Which is why I'm objecting to implementing the > >> >> P4 bits as something that's hanging off the side of the stack in its own > >> >> thing and is not integrated with the rest of the stack. You were touting > >> >> this as a feature ("being self-contained"). I consider it a bug. > >> >> > >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in > >> >> > TC). > >> >> > >> >> u32 is notoriously hard to use. The others are neat, but obviously > >> >> limited to particular use cases. > >> > > >> > Despite my love for u32, I admit its user interface is cryptic. I just > >> > wanted to point out to existing samples of scriptable and offloadable > >> > TC objects. > >> > > >> >> Do you actually expect anyone to use P4 > >> >> by manually entering TC commands to build a pipeline? 
I really find that > >> >> hard to believe... > >> > > >> > You dont have to manually hand code anything - its the compilers job. > >> > >> Right, that was kinda my point: in that case the compiler could just as > >> well generate a (set of) BPF program(s) instead of this TC script thing. > >> > >> >> > IOW, we are reusing and plugging into a proven and deployed mechanism > >> >> > with a built-in policy driven, transparent symbiosis between hardware > >> >> > offload and software that has matured over time. You can take a > >> >> > pipeline or a table or actions and split them between hardware and > >> >> > software transparently, etc. > >> >> > >> >> That's a control plane feature though, it's not an argument for adding > >> >> another interpreter to the kernel. > >> > > >> > I am not sure what you mean by control, but what i described is kernel > >> > built in. Of course i could do more complex things from user space (if > >> > that is what you mean as control). > >> > >> "Control plane" as in SDN parlance. I.e., the bits that keep track of > >> configuration of the flow/pipeline/table configuration. > >> > >> There's no reason you can't have all that infrastructure and use BPF as > >> the datapath language. I.e., instead of: > >> > >> tc p4template create pipeline/aP4proggie numtables 1 > >> ... + all the other stuff to populate it > >> > >> you could just do: > >> > >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o > >> > >> and still have all the management infrastructure without the new > >> interpreter and associated complexity in the kernel. > >> > >> >> > This hammer already meets our goals. > >> >> > >> >> That 60k+ line patch submission of yours says otherwise... > >> > > >> > This is pretty much covered in the cover letter and a few responses in > >> > the thread since. > >> > >> The only argument for why your current approach makes sense I've seen > >> you make is "I don't want to rewrite it in BPF". Which is not a > >> technical argument. > >> > >> I'm not trying to be disingenuous here, BTW: I really don't see the > >> technical argument for why the P4 data plane has to be implemented as > >> its own interpreter instead of integrating with what we have already > >> (i.e., BPF). > >> > >> -Toke > >> > > > > I'll just take this here becaues I think its mostly related. > > > > Still not convinced the P4TC has any value for sw. From the > > slide you say vendors prefer you have this picture roughtly. > > > > > > [ P4 compiler ] ------ [ P4TC backend ] ----> TC API > > | > > | > > [ P4 Vendor backend ] > > | > > | > > V > > [ Devlink ] > > > > > > Now just replace P4TC backend with P4C and your only work is to > > replace devlink with the current hw specific bits and you have > > a sw and hw components. Then you get XDP-BPF pretty easily from > > P4XDP backend if you like. The compat piece is handled by compiler > > where it should be. My CPU is not a MAT so pretending it is seems > > not ideal to me, I don't have a TCAM on my cores. > > > > For runtime get those vendors to write their SDKs over Devlink > > and no need for this software thing. The runtime for P4c should > > already work over BPF. Giving this picture > > > > [ P4 compiler ] ------ [ P4C backend ] ----> BPF > > | > > | > > [ P4 Vendor backend ] > > | > > | > > V > > [ Devlink ] > > > > And much less work for us to maintain. > > Yes, this was basically my point as well. Thank you for putting it into > ASCII diagrams! 
:) > > There's still the control plane bit: some kernel component that > configures the pieces (pipelines?) created in the top-right and > bottom-left corners of your diagram(s), keeping track of which pipelines > are in HW/SW, maybe updating some match tables dynamically and > extracting statistics. I'm totally OK with having that bit be in the > kernel, but that can be added on top of your second diagram just as well > as on top of the first one... > > -Toke >
Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane. Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries. Please explain through ebpf model how do I program the HW at runtime? Thanks Anjali -----Original Message----- From: Jamal Hadi Salim <jhs@mojatatu.com> Sent: Monday, January 30, 2023 2:54 PM To: Toke Høiland-Jørgensen <toke@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com>; Jamal Hadi Salim <hadi@mojatatu.com>; Jiri Pirko <jiri@resnulli.us>; Willem de Bruijn <willemb@google.com>; Stanislav Fomichev <sdf@google.com>; Jakub Kicinski <kuba@kernel.org>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com> Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC I think we are going in cycles. John I asked you earlier and i think you answered my question: You are actually pitching an out of band runtime using some vendor sdk via devlink (why even bother with devlink interface, not sure). P4TC is saying the runtime API is via the kernel to the drivers. Toke, i dont think i have managed to get across that there is an "autonomous" control built into the kernel. It is not just things that come across netlink. It's about the whole infra. I think without that clarity we are going to speak past each other and it's a frustrating discussion which could get emotional. You cant just displace, for example flower and say "use ebpf because it works on tc", theres a lot of tribal knowledge gluing relationship between hardware and software. Maybe take a look at this patchset below to see an example which shows how part of an action graph will work in hardware and partially in sw under certain conditions: https://www.spinics.net/lists/netdev/msg877507.html and then we can have a better discussion. cheers, jamal On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > John Fastabend <john.fastabend@gmail.com> writes: > > > Toke Høiland-Jørgensen wrote: > >> Jamal Hadi Salim <hadi@mojatatu.com> writes: > >> > >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> >> > >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> >> > >> >> > So i dont have to respond to each email individually, I will > >> >> > respond here in no particular order. First let me provide some > >> >> > context, if that was already clear please skip it. Hopefully > >> >> > providing the context will help us to focus otherwise that > >> >> > bikeshed's color and shape will take forever to settle on. 
> >> >> > > >> >> > __Context__ > >> >> > > >> >> > I hope we all agree that when you have 2x100G NIC (and i have > >> >> > seen people asking for 2x800G NICs) no XDP or DPDK is going to > >> >> > save you. To > >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > >> >> > is not the answer. You need to offload. > >> >> > >> >> I'm not disputing the need to offload, and I'm personally > >> >> delighted that > >> >> P4 is breaking open the vendor black boxes to provide a > >> >> standardised interface for this. > >> >> > >> >> However, while it's true that software can't keep up at the high > >> >> end, not everything runs at the high end, and today's high end > >> >> is tomorrow's mid end, in which XDP can very much play a role. > >> >> So being able to move smoothly between the two, and even > >> >> implement functions that split processing between them, is an > >> >> essential feature of a programmable networking path in Linux. > >> >> Which is why I'm objecting to implementing the > >> >> P4 bits as something that's hanging off the side of the stack in > >> >> its own thing and is not integrated with the rest of the stack. > >> >> You were touting this as a feature ("being self-contained"). I consider it a bug. > >> >> > >> >> > Scriptability is not a new idea in TC (see u32 and pedit and > >> >> > others in TC). > >> >> > >> >> u32 is notoriously hard to use. The others are neat, but > >> >> obviously limited to particular use cases. > >> > > >> > Despite my love for u32, I admit its user interface is cryptic. I > >> > just wanted to point out to existing samples of scriptable and > >> > offloadable TC objects. > >> > > >> >> Do you actually expect anyone to use P4 by manually entering TC > >> >> commands to build a pipeline? I really find that hard to > >> >> believe... > >> > > >> > You dont have to manually hand code anything - its the compilers job. > >> > >> Right, that was kinda my point: in that case the compiler could > >> just as well generate a (set of) BPF program(s) instead of this TC script thing. > >> > >> >> > IOW, we are reusing and plugging into a proven and deployed > >> >> > mechanism with a built-in policy driven, transparent symbiosis > >> >> > between hardware offload and software that has matured over > >> >> > time. You can take a pipeline or a table or actions and split > >> >> > them between hardware and software transparently, etc. > >> >> > >> >> That's a control plane feature though, it's not an argument for > >> >> adding another interpreter to the kernel. > >> > > >> > I am not sure what you mean by control, but what i described is > >> > kernel built in. Of course i could do more complex things from > >> > user space (if that is what you mean as control). > >> > >> "Control plane" as in SDN parlance. I.e., the bits that keep track > >> of configuration of the flow/pipeline/table configuration. > >> > >> There's no reason you can't have all that infrastructure and use > >> BPF as the datapath language. I.e., instead of: > >> > >> tc p4template create pipeline/aP4proggie numtables 1 ... + all the > >> other stuff to populate it > >> > >> you could just do: > >> > >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o > >> > >> and still have all the management infrastructure without the new > >> interpreter and associated complexity in the kernel. > >> > >> >> > This hammer already meets our goals. > >> >> > >> >> That 60k+ line patch submission of yours says otherwise... 
> >> > > >> > This is pretty much covered in the cover letter and a few > >> > responses in the thread since. > >> > >> The only argument for why your current approach makes sense I've > >> seen you make is "I don't want to rewrite it in BPF". Which is not > >> a technical argument. > >> > >> I'm not trying to be disingenuous here, BTW: I really don't see the > >> technical argument for why the P4 data plane has to be implemented > >> as its own interpreter instead of integrating with what we have > >> already (i.e., BPF). > >> > >> -Toke > >> > > > > I'll just take this here becaues I think its mostly related. > > > > Still not convinced the P4TC has any value for sw. From the slide > > you say vendors prefer you have this picture roughtly. > > > > > > [ P4 compiler ] ------ [ P4TC backend ] ----> TC API > > | > > | > > [ P4 Vendor backend ] > > | > > | > > V > > [ Devlink ] > > > > > > Now just replace P4TC backend with P4C and your only work is to > > replace devlink with the current hw specific bits and you have a sw > > and hw components. Then you get XDP-BPF pretty easily from P4XDP > > backend if you like. The compat piece is handled by compiler where > > it should be. My CPU is not a MAT so pretending it is seems not > > ideal to me, I don't have a TCAM on my cores. > > > > For runtime get those vendors to write their SDKs over Devlink and > > no need for this software thing. The runtime for P4c should already > > work over BPF. Giving this picture > > > > [ P4 compiler ] ------ [ P4C backend ] ----> BPF > > | > > | > > [ P4 Vendor backend ] > > | > > | > > V > > [ Devlink ] > > > > And much less work for us to maintain. > > Yes, this was basically my point as well. Thank you for putting it > into ASCII diagrams! :) > > There's still the control plane bit: some kernel component that > configures the pieces (pipelines?) created in the top-right and > bottom-left corners of your diagram(s), keeping track of which > pipelines are in HW/SW, maybe updating some match tables dynamically > and extracting statistics. I'm totally OK with having that bit be in > the kernel, but that can be added on top of your second diagram just > as well as on top of the first one... > > -Toke >
Jamal Hadi Salim wrote: > I think we are going in cycles. John I asked you earlier and i think > you answered my question: You are actually pitching an out of band > runtime using some vendor sdk via devlink (why even bother with > devlink interface, not sure). P4TC is saying the runtime API is via > the kernel to the drivers. Not out of band, via devlink and a common API for all vendors to implement so userspace applications can abstract away vendor specifics of the runtime API as much as possible. Then runtime component can implement the Linux API and abstract the hardware away this way. runtime API is still via the kernel and the the driver its just going through devlink because its fundamentally a hardware configuration and independent of a software datapath. I think the key here is I see no value in (re)implementing a duplicate software stack when we already have one and even have a backend for the one we have. Its more general. And likely more performant. If you want a software implementation do it in rocker so its clear its a software implementatoin of a switch. > > Toke, i dont think i have managed to get across that there is an > "autonomous" control built into the kernel. It is not just things that > come across netlink. It's about the whole infra. I think without that > clarity we are going to speak past each other and it's a frustrating > discussion which could get emotional. You cant just displace, for > example flower and say "use ebpf because it works on tc", theres a lot > of tribal knowledge gluing relationship between hardware and software. > Maybe take a look at this patchset below to see an example which shows > how part of an action graph will work in hardware and partially in sw > under certain conditions: > https://www.spinics.net/lists/netdev/msg877507.html and then we can > have a better discussion. > > cheers, > jamal > > > On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > > > John Fastabend <john.fastabend@gmail.com> writes: > > > > > Toke Høiland-Jørgensen wrote: > > >> Jamal Hadi Salim <hadi@mojatatu.com> writes: > > >> > > >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > >> >> > > >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > > >> >> > > >> >> > So i dont have to respond to each email individually, I will respond > > >> >> > here in no particular order. First let me provide some context, if > > >> >> > that was already clear please skip it. Hopefully providing the context > > >> >> > will help us to focus otherwise that bikeshed's color and shape will > > >> >> > take forever to settle on. > > >> >> > > > >> >> > __Context__ > > >> >> > > > >> >> > I hope we all agree that when you have 2x100G NIC (and i have seen > > >> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To > > >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack" > > >> >> > is not the answer. You need to offload. > > >> >> > > >> >> I'm not disputing the need to offload, and I'm personally delighted that > > >> >> P4 is breaking open the vendor black boxes to provide a standardised > > >> >> interface for this. > > >> >> > > >> >> However, while it's true that software can't keep up at the high end, > > >> >> not everything runs at the high end, and today's high end is tomorrow's > > >> >> mid end, in which XDP can very much play a role. 
So being able to move > > >> >> smoothly between the two, and even implement functions that split > > >> >> processing between them, is an essential feature of a programmable > > >> >> networking path in Linux. Which is why I'm objecting to implementing the > > >> >> P4 bits as something that's hanging off the side of the stack in its own > > >> >> thing and is not integrated with the rest of the stack. You were touting > > >> >> this as a feature ("being self-contained"). I consider it a bug. > > >> >> > > >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in > > >> >> > TC). > > >> >> > > >> >> u32 is notoriously hard to use. The others are neat, but obviously > > >> >> limited to particular use cases. > > >> > > > >> > Despite my love for u32, I admit its user interface is cryptic. I just > > >> > wanted to point out to existing samples of scriptable and offloadable > > >> > TC objects. > > >> > > > >> >> Do you actually expect anyone to use P4 > > >> >> by manually entering TC commands to build a pipeline? I really find that > > >> >> hard to believe... > > >> > > > >> > You dont have to manually hand code anything - its the compilers job. > > >> > > >> Right, that was kinda my point: in that case the compiler could just as > > >> well generate a (set of) BPF program(s) instead of this TC script thing. > > >> > > >> >> > IOW, we are reusing and plugging into a proven and deployed mechanism > > >> >> > with a built-in policy driven, transparent symbiosis between hardware > > >> >> > offload and software that has matured over time. You can take a > > >> >> > pipeline or a table or actions and split them between hardware and > > >> >> > software transparently, etc. > > >> >> > > >> >> That's a control plane feature though, it's not an argument for adding > > >> >> another interpreter to the kernel. > > >> > > > >> > I am not sure what you mean by control, but what i described is kernel > > >> > built in. Of course i could do more complex things from user space (if > > >> > that is what you mean as control). > > >> > > >> "Control plane" as in SDN parlance. I.e., the bits that keep track of > > >> configuration of the flow/pipeline/table configuration. > > >> > > >> There's no reason you can't have all that infrastructure and use BPF as > > >> the datapath language. I.e., instead of: > > >> > > >> tc p4template create pipeline/aP4proggie numtables 1 > > >> ... + all the other stuff to populate it > > >> > > >> you could just do: > > >> > > >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o > > >> > > >> and still have all the management infrastructure without the new > > >> interpreter and associated complexity in the kernel. > > >> > > >> >> > This hammer already meets our goals. > > >> >> > > >> >> That 60k+ line patch submission of yours says otherwise... > > >> > > > >> > This is pretty much covered in the cover letter and a few responses in > > >> > the thread since. > > >> > > >> The only argument for why your current approach makes sense I've seen > > >> you make is "I don't want to rewrite it in BPF". Which is not a > > >> technical argument. > > >> > > >> I'm not trying to be disingenuous here, BTW: I really don't see the > > >> technical argument for why the P4 data plane has to be implemented as > > >> its own interpreter instead of integrating with what we have already > > >> (i.e., BPF). > > >> > > >> -Toke > > >> > > > > > > I'll just take this here becaues I think its mostly related. 
> > > > > > Still not convinced the P4TC has any value for sw. From the > > > slide you say vendors prefer you have this picture roughtly. > > > > > > > > > [ P4 compiler ] ------ [ P4TC backend ] ----> TC API > > > | > > > | > > > [ P4 Vendor backend ] > > > | > > > | > > > V > > > [ Devlink ] > > > > > > > > > Now just replace P4TC backend with P4C and your only work is to > > > replace devlink with the current hw specific bits and you have > > > a sw and hw components. Then you get XDP-BPF pretty easily from > > > P4XDP backend if you like. The compat piece is handled by compiler > > > where it should be. My CPU is not a MAT so pretending it is seems > > > not ideal to me, I don't have a TCAM on my cores. > > > > > > For runtime get those vendors to write their SDKs over Devlink > > > and no need for this software thing. The runtime for P4c should > > > already work over BPF. Giving this picture > > > > > > [ P4 compiler ] ------ [ P4C backend ] ----> BPF > > > | > > > | > > > [ P4 Vendor backend ] > > > | > > > | > > > V > > > [ Devlink ] > > > > > > And much less work for us to maintain. > > > > Yes, this was basically my point as well. Thank you for putting it into > > ASCII diagrams! :) > > > > There's still the control plane bit: some kernel component that > > configures the pieces (pipelines?) created in the top-right and > > bottom-left corners of your diagram(s), keeping track of which pipelines > > are in HW/SW, maybe updating some match tables dynamically and > > extracting statistics. I'm totally OK with having that bit be in the > > kernel, but that can be added on top of your second diagram just as well > > as on top of the first one... > > > > -Toke > >
Singhai, Anjali wrote: > Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane. > > Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries. > Please explain through ebpf model how do I program the HW at runtime? > > Thanks > Anjali > Didn't see this as it was top posted, but the answer is you don't program hardware with ebpf when your underlying target is a MAT. Use devlink for the runtime programming as well; it's there to program hardware. This "Devlink is NOT for the runtime programming" is just an artifact of the design here, which I disagree with, and it feels like many other folks also disagree. Also, if you have some flows going to SW you want to use the most performant option you have, which would be XDP-BPF at the moment on a standard Linux box, or maybe AF_XDP. So in these cases you should look to divide your P4 pipeline between XDP and HW. Sure, you can say performance doesn't matter for my use case, but surely it does for some things, and anyway you have the performant thing already built so just use it. Thanks, John
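As a point of reference for the ndo_setup_tc hooks mentioned above: the dispatch pattern already used by the flower and matchall offloads would presumably be extended with a new setup type for P4 runtime entries. The sketch below only illustrates that shape; TC_SETUP_P4 and the foo_* helpers are assumptions made for illustration, not existing kernel symbols.

/* Sketch of a driver-side hook for P4 runtime entries.  The ndo_setup_tc()
 * signature, TC_SETUP_BLOCK and the -EOPNOTSUPP fallback are the existing
 * upstream convention; TC_SETUP_P4 and the foo_* helpers are hypothetical.
 */
#include <linux/netdevice.h>

static int foo_setup_tc(struct net_device *dev, enum tc_setup_type type,
			void *type_data)
{
	switch (type) {
	case TC_SETUP_BLOCK:
		/* existing path: register flower/matchall offload callbacks */
		return foo_setup_tc_block(dev, type_data);
	case TC_SETUP_P4:
		/* assumed new type: add/delete a P4 match-action entry */
		return foo_setup_p4(dev, type_data);
	default:
		return -EOPNOTSUPP;
	}
}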
On Mon, Jan 30, 2023 at 7:06 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Singhai, Anjali wrote: > > Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane. > > > > Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries. > > Please explain through ebpf model how do I program the HW at runtime? > > > > Thanks > > Anjali > > > > Didn't see this as it was top posted but, the answer is you don't program > hardware the ebpf when your underlying target is a MAT. > > Use devlink for the runtime programming as well, its there to program > hardware. This "Devlink is NOT for the runtime programming" is > just an artificate of the design here which I disagree with and it feels > like many other folks also disagree. > We are going to need strong justification to use devlink for programming the binary interface to begin with - see the driver models discussion. And let me get this clear: you are suggesting we use it for runtime and redo all that tc ndo and associated infra? cheers, jamal > Also if you have some flows going to SW you want to use the most > performant option you have which would be XDP-BPF at the moment in a > standard linux box or maybe af-xdp. So in these cases you should look > to divide your P4 pipeline between XDP and HW. Sure you can say > performance doesn't matter for my use case, but surely it does for > some things and anyways you have the performant thing already built > so just use it. > Thanks, > John
On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote: > > Didn't see this as it was top posted but, the answer is you don't program > > hardware the ebpf when your underlying target is a MAT. > > > > Use devlink for the runtime programming as well, its there to program > > hardware. This "Devlink is NOT for the runtime programming" is > > just an artificate of the design here which I disagree with and it feels > > like many other folks also disagree. > > We are going to need strong justification to use devlink for > programming the binary interface to begin with We may disagree on direction, but we should agree status quo / reality. What John described is what we suggested to Intel to do (2+ years ago), and what is already implemented upstream. Grep for DDP. IIRC my opinion back then was that unless kernel has any use for whatever the configuration exposes - we should stay out of it. > - see the driver models discussion. > > And let me get this clear: you are suggesting we > use it for runtime and redo all that tc ndo and associated infra?
On Mon, Jan 30, 2023 at 11:12 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote: > > > Didn't see this as it was top posted but, the answer is you don't program > > > hardware the ebpf when your underlying target is a MAT. > > > > > > Use devlink for the runtime programming as well, its there to program > > > hardware. This "Devlink is NOT for the runtime programming" is > > > just an artificate of the design here which I disagree with and it feels > > > like many other folks also disagree. > > > > We are going to need strong justification to use devlink for > > programming the binary interface to begin with > > We may disagree on direction, but we should agree status quo / reality. > > What John described is what we suggested to Intel to do (2+ years ago), > and what is already implemented upstream. Grep for DDP. > I went back and looked at the email threads - I hope i got the right one from 2020. Note, there are two paths in P4TC: DDP loading via devlink is equivalent to loading the P4 binary for the hardware. That is one of the 3 (and currently most popular) driver interfaces suggested. Some of that drew Second is runtime which is via standard TC. John's proposal is equivalent to suggesting moving the flower interface Devlink. That is not the same as loading the config. > IIRC my opinion back then was that unless kernel has any use for > whatever the configuration exposes - we should stay out of it. It does for runtime and the tc infra already takes care of that. The cover letter says: "...one can be more explicit and specify "skip_sw" or "skip_hw" to either offload the entry (if a NIC or switch driver is capable) or make it purely run entirely in the kernel or in a cooperative mode between kernel and user space." cheers, jamal
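For readers less familiar with the skip_sw/skip_hw convention the cover letter excerpt above refers to, this is how the existing flower classifier uses those flags today (device names and addresses are illustrative):

# offload only: fail if the driver cannot install the entry in hardware
tc filter add dev eth0 ingress protocol ip flower skip_sw \
        dst_ip 192.0.2.1 action drop

# software only: never offer the entry to the driver for offload
tc filter add dev eth0 ingress protocol ip flower skip_hw \
        dst_ip 192.0.2.2 action drop

# neither flag: run in software and also offload if the hardware is capable
tc filter add dev eth0 ingress protocol ip flower \
        dst_ip 192.0.2.3 action drop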
On Tue, Jan 31, 2023 at 5:27 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Mon, Jan 30, 2023 at 11:12 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote: > > > > Didn't see this as it was top posted but, the answer is you don't program > > > > hardware the ebpf when your underlying target is a MAT. > > > > > > > > Use devlink for the runtime programming as well, its there to program > > > > hardware. This "Devlink is NOT for the runtime programming" is > > > > just an artificate of the design here which I disagree with and it feels > > > > like many other folks also disagree. > > > > > > We are going to need strong justification to use devlink for > > > programming the binary interface to begin with > > > > We may disagree on direction, but we should agree status quo / reality. > > > > What John described is what we suggested to Intel to do (2+ years ago), > > and what is already implemented upstream. Grep for DDP. > > > > I went back and looked at the email threads - I hope i got the right > one from 2020. > > Note, there are two paths in P4TC: > DDP loading via devlink is equivalent to loading the P4 binary for the hardware. > That is one of the 3 (and currently most popular) driver interfaces > suggested. Some of that drew Sorry didnt finish my thought here, wanted to say: The loading of the P4 binary over devlink drew (to some people) suspicion it is going to be used for loading kernel bypass. cheers, jamal > Second is runtime which is via standard TC. John's proposal is > equivalent to suggesting moving the flower interface Devlink. That is > not the same as loading the config. > > > IIRC my opinion back then was that unless kernel has any use for > > whatever the configuration exposes - we should stay out of it. > > It does for runtime and the tc infra already takes care of that. The > cover letter says: > > "...one can be more explicit and specify "skip_sw" or "skip_hw" to either > offload the entry (if a NIC or switch driver is capable) or make it purely run > entirely in the kernel or in a cooperative mode between kernel and user space." > > cheers, > jamal
Jamal Hadi Salim <jhs@mojatatu.com> writes: > Toke, i dont think i have managed to get across that there is an > "autonomous" control built into the kernel. It is not just things that > come across netlink. It's about the whole infra. I'm not disputing the need for the TC infra to configure the pipelines and their relationship in the hardware. I'm saying that your implementation *of the SW path* is the wrong approach and it would be better done by using BPF (not talking about the existing TC-BPF, either). It's a bit hard to know your thinking for sure here, since your patch series doesn't include any of the offload control bits. But from the slides and your hints in this series, AFAICT, the flow goes something like: hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob); sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id) which will turn into something like: struct p4_cls_offload ofl = { .classid = classid, .pipeline_id = hw_pipeline_id }; if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */ return -EINVAL; netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl); I.e, all that's being passed to the hardware is the ID of the pre-programmed pipeline, because that programming is going to be out-of-band via devlink anyway. In which case, you could just as well replace the above: sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) with sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */ and achieve exactly the same. Having all the P4 data types and concepts exist inside the kernel *might* make sense if the kernel could then translate those into the hardware representations and manage their lifecycle in a uniform way. But as far as I can tell from the slides and what you've been saying in this thread that's not going to be possible anyway, so why do you need anything more granular than the pipeline ID? -Toke
Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote: >Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> Toke, i dont think i have managed to get across that there is an >> "autonomous" control built into the kernel. It is not just things that >> come across netlink. It's about the whole infra. > >I'm not disputing the need for the TC infra to configure the pipelines >and their relationship in the hardware. I'm saying that your >implementation *of the SW path* is the wrong approach and it would be >better done by using BPF (not talking about the existing TC-BPF, >either). > >It's a bit hard to know your thinking for sure here, since your patch >series doesn't include any of the offload control bits. But from the >slides and your hints in this series, AFAICT, the flow goes something >like: > >hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob); I don't think that devlink is the correct iface for this. If you want to tight it together with the SW pipeline configurable by TC, use TC as you do for the BPF binary in this example. If you have the TC-block shared among many netdevs, the HW needs to know that for binding the P4 input. Btw, you can have multiple netdevs of different vendors sharing the same TC-block, then you need to upload multiple HW binary blobs here. What it eventually might result with is that the userspace would upload a list of binaries with indication what is the target: "BPF" -> xxx.o "DRIVERNAMEX" -> aaa.bin "DRIVERNAMEY" -> bbb.bin In theory, there might be a HW to accept the BPF binary :) My point is, userspace provides a list of binaries, individual kernel parts take what they like. >sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id) > >which will turn into something like: > >struct p4_cls_offload ofl = { > .classid = classid, > .pipeline_id = hw_pipeline_id >}; > >if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */ Ha! I would like to see this magic here :) > return -EINVAL; > >netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl); > > >I.e, all that's being passed to the hardware is the ID of the >pre-programmed pipeline, because that programming is going to be >out-of-band via devlink anyway. > >In which case, you could just as well replace the above: > >sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >with > >sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */ > >and achieve exactly the same. > >Having all the P4 data types and concepts exist inside the kernel >*might* make sense if the kernel could then translate those into the >hardware representations and manage their lifecycle in a uniform way. >But as far as I can tell from the slides and what you've been saying in >this thread that's not going to be possible anyway, so why do you need >anything more granular than the pipeline ID? > >-Toke >
Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote: >Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> Toke, i dont think i have managed to get across that there is an >> "autonomous" control built into the kernel. It is not just things that >> come across netlink. It's about the whole infra. > >I'm not disputing the need for the TC infra to configure the pipelines >and their relationship in the hardware. I'm saying that your >implementation *of the SW path* is the wrong approach and it would be >better done by using BPF (not talking about the existing TC-BPF, >either). > >It's a bit hard to know your thinking for sure here, since your patch >series doesn't include any of the offload control bits. But from the >slides and your hints in this series, AFAICT, the flow goes something >like: > >hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob); >sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id) > >which will turn into something like: > >struct p4_cls_offload ofl = { > .classid = classid, > .pipeline_id = hw_pipeline_id >}; > >if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */ > return -EINVAL; > >netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl); > > >I.e, all that's being passed to the hardware is the ID of the >pre-programmed pipeline, because that programming is going to be >out-of-band via devlink anyway. > >In which case, you could just as well replace the above: > >sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >with > >sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */ > >and achieve exactly the same. > >Having all the P4 data types and concepts exist inside the kernel >*might* make sense if the kernel could then translate those into the >hardware representations and manage their lifecycle in a uniform way. >But as far as I can tell from the slides and what you've been saying in >this thread that's not going to be possible anyway, so why do you need >anything more granular than the pipeline ID? Toke, I understand what what you describe above is applicable for the P4 program instantiation (pipeline definition). What is the suggestion for the actual "rule insertions" ? Would it make sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW path and offloaded HW path? > >-Toke >
Jiri Pirko <jiri@resnulli.us> writes: > Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote: >>Jamal Hadi Salim <jhs@mojatatu.com> writes: >> >>> Toke, i dont think i have managed to get across that there is an >>> "autonomous" control built into the kernel. It is not just things that >>> come across netlink. It's about the whole infra. >> >>I'm not disputing the need for the TC infra to configure the pipelines >>and their relationship in the hardware. I'm saying that your >>implementation *of the SW path* is the wrong approach and it would be >>better done by using BPF (not talking about the existing TC-BPF, >>either). >> >>It's a bit hard to know your thinking for sure here, since your patch >>series doesn't include any of the offload control bits. But from the >>slides and your hints in this series, AFAICT, the flow goes something >>like: >> >>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob); >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) >> >>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id) >> >>which will turn into something like: >> >>struct p4_cls_offload ofl = { >> .classid = classid, >> .pipeline_id = hw_pipeline_id >>}; >> >>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */ >> return -EINVAL; >> >>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl); >> >> >>I.e, all that's being passed to the hardware is the ID of the >>pre-programmed pipeline, because that programming is going to be >>out-of-band via devlink anyway. >> >>In which case, you could just as well replace the above: >> >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) >> >>with >> >>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */ >> >>and achieve exactly the same. >> >>Having all the P4 data types and concepts exist inside the kernel >>*might* make sense if the kernel could then translate those into the >>hardware representations and manage their lifecycle in a uniform way. >>But as far as I can tell from the slides and what you've been saying in >>this thread that's not going to be possible anyway, so why do you need >>anything more granular than the pipeline ID? > > Toke, I understand what what you describe above is applicable for the P4 > program instantiation (pipeline definition). > > What is the suggestion for the actual "rule insertions" ? Would it make > sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW > path and offloaded HW path? Hmm, so by "rule insertions" here you're referring to populating what P4 calls 'tables', right? I could see a couple of ways this could be bridged between the BPF side and the HW side: - Create a new BPF map type that is backed by the TC-internal data structure, so updates from userspace go via the TC interface, but BPF programs access the contents via the bpf_map_*() helpers (or we could allow updating via the bpf() syscall as well) - Expose the TC data structures to BPF via their own set of kfuncs, similar to what we did for conntrack - Scrap the TC interface entirely and make this an offload-enabled BPF map type (using the BPF ndo and bpf_map_dev_ops operations to update it). Userspace would then populate it via the bpf() syscall like any other map. I suspect the map interface is the most straight-forward to use from the BPF side, but informing this by what existing implementations do (thinking of the P4->XDP compiler in particular) might be a good idea? -Toke
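As a rough illustration of the second option above (TC-managed state exposed to BPF via kfuncs), the software datapath side might look like the sketch below. bpf_p4tc_tbl_lookup() and the key/entry structs are hypothetical names chosen only to show the shape of such an interface; they are not an existing kernel API.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct p4_mac_key {
	__u8 dmac[6];
};

struct p4_mac_entry {
	__u32 action_id;
	__u32 port;
};

/* hypothetical kfunc giving BPF read access to a TC-managed P4 table */
extern struct p4_mac_entry *
bpf_p4tc_tbl_lookup(__u32 pipeline_id, __u32 table_id,
		    const void *key, __u32 key_len) __ksym;

SEC("tc")
int p4_sw_datapath(struct __sk_buff *skb)
{
	struct p4_mac_key key = {};
	struct p4_mac_entry *entry;

	/* ... parse the packet and fill in key.dmac here ... */

	entry = bpf_p4tc_tbl_lookup(1 /* pipeline */, 1 /* table */,
				    &key, sizeof(key));
	if (!entry)
		return TC_ACT_SHOT;	/* table miss */

	/* ... apply the action described by entry ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";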
On Tue, 31 Jan 2023 05:30:10 -0500 Jamal Hadi Salim wrote: > > Note, there are two paths in P4TC: > > DDP loading via devlink is equivalent to loading the P4 binary for the hardware. > > That is one of the 3 (and currently most popular) driver interfaces > > suggested. Some of that drew > > Sorry didnt finish my thought here, wanted to say: The loading of the > P4 binary over devlink drew (to some people) suspicion it is going to > be used for loading kernel bypass. The only practical use case I heard was the IPU. Worrying about devlink programming being a bypass on an IPU is like rearranging chairs on the Titanic.
So while going through this thought process, things to consider: 1) The autonomy of the tc infra, essentially the skip_sw/hw controls and their packet driven iteration. Perhaps (the patch i pointed to from Paul Blakey) where part of the action graph runs in sw. 2) The dynamicity of being able to trigger table offloads and/or kernel table updates which are packet driven (consider scenario where they have iterated the hardware and ingressed into the kernel). cheers, jamal On Tue, Jan 31, 2023 at 12:01 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jiri Pirko <jiri@resnulli.us> writes: > > > Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote: > >>Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> > >>> Toke, i dont think i have managed to get across that there is an > >>> "autonomous" control built into the kernel. It is not just things that > >>> come across netlink. It's about the whole infra. > >> > >>I'm not disputing the need for the TC infra to configure the pipelines > >>and their relationship in the hardware. I'm saying that your > >>implementation *of the SW path* is the wrong approach and it would be > >>better done by using BPF (not talking about the existing TC-BPF, > >>either). > >> > >>It's a bit hard to know your thinking for sure here, since your patch > >>series doesn't include any of the offload control bits. But from the > >>slides and your hints in this series, AFAICT, the flow goes something > >>like: > >> > >>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob); > >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >> > >>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id) > >> > >>which will turn into something like: > >> > >>struct p4_cls_offload ofl = { > >> .classid = classid, > >> .pipeline_id = hw_pipeline_id > >>}; > >> > >>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */ > >> return -EINVAL; > >> > >>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl); > >> > >> > >>I.e, all that's being passed to the hardware is the ID of the > >>pre-programmed pipeline, because that programming is going to be > >>out-of-band via devlink anyway. > >> > >>In which case, you could just as well replace the above: > >> > >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C) > >> > >>with > >> > >>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */ > >> > >>and achieve exactly the same. > >> > >>Having all the P4 data types and concepts exist inside the kernel > >>*might* make sense if the kernel could then translate those into the > >>hardware representations and manage their lifecycle in a uniform way. > >>But as far as I can tell from the slides and what you've been saying in > >>this thread that's not going to be possible anyway, so why do you need > >>anything more granular than the pipeline ID? > > > > Toke, I understand what what you describe above is applicable for the P4 > > program instantiation (pipeline definition). > > > > What is the suggestion for the actual "rule insertions" ? Would it make > > sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW > > path and offloaded HW path? > > Hmm, so by "rule insertions" here you're referring to populating what P4 > calls 'tables', right? 
> > I could see a couple of ways this could be bridged between the BPF side > and the HW side: > > - Create a new BPF map type that is backed by the TC-internal data > structure, so updates from userspace go via the TC interface, but BPF > programs access the contents via the bpf_map_*() helpers (or we could > allow updating via the bpf() syscall as well) > > - Expose the TC data structures to BPF via their own set of kfuncs, > similar to what we did for conntrack > > - Scrap the TC interface entirely and make this an offload-enabled BPF > map type (using the BPF ndo and bpf_map_dev_ops operations to update > it). Userspace would then populate it via the bpf() syscall like any > other map. > > > I suspect the map interface is the most straight-forward to use from the > BPF side, but informing this by what existing implementations do > (thinking of the P4->XDP compiler in particular) might be a good idea? > > -Toke >
On Tue, Jan 31, 2023 at 2:10 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 31 Jan 2023 05:30:10 -0500 Jamal Hadi Salim wrote: > > > Note, there are two paths in P4TC: > > > DDP loading via devlink is equivalent to loading the P4 binary for the hardware. > > > That is one of the 3 (and currently most popular) driver interfaces > > > suggested. Some of that drew > > > > Sorry didnt finish my thought here, wanted to say: The loading of the > > P4 binary over devlink drew (to some people) suspicion it is going to > > be used for loading kernel bypass. > > The only practical use case I heard was the IPU. Worrying about devlink > programming being a bypass on an IPU is like rearranging chairs on the > Titanic. BTW, I do believe FNICs are heading in that direction as well. I didn't quite follow the Titanic chairs analogy, can you elaborate on that statement? cheers, jamal
On Tue, 31 Jan 2023 17:32:52 -0500 Jamal Hadi Salim wrote: > > > Sorry didnt finish my thought here, wanted to say: The loading of the > > > P4 binary over devlink drew (to some people) suspicion it is going to > > > be used for loading kernel bypass. > > > > The only practical use case I heard was the IPU. Worrying about devlink > > programming being a bypass on an IPU is like rearranging chairs on the > > Titanic. > > BTW, I do believe FNICs are heading in that direction as well. > I didnt quiet follow the titanic chairs analogy, can you elaborate on > that statement? https://en.wiktionary.org/wiki/rearrange_the_deck_chairs_on_the_Titanic
On Tue, Jan 31, 2023 at 5:36 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 31 Jan 2023 17:32:52 -0500 Jamal Hadi Salim wrote: > > > > Sorry didnt finish my thought here, wanted to say: The loading of the > > > > P4 binary over devlink drew (to some people) suspicion it is going to > > > > be used for loading kernel bypass. > > > > > > The only practical use case I heard was the IPU. Worrying about devlink > > > programming being a bypass on an IPU is like rearranging chairs on the > > > Titanic. > > > > BTW, I do believe FNICs are heading in that direction as well. > > I didnt quiet follow the titanic chairs analogy, can you elaborate on > > that statement? > > https://en.wiktionary.org/wiki/rearrange_the_deck_chairs_on_the_Titanic LoL. Let's convince Jiri then. On using devlink for runtime programming, I would respectfully disagree that it is the right interface. cheers, jamal
Jamal Hadi Salim <jhs@mojatatu.com> writes: > So while going through this thought process, things to consider: > 1) The autonomy of the tc infra, essentially the skip_sw/hw controls > and their packet driven iteration. Perhaps (the patch i pointed to > from Paul Blakey) where part of the action graph runs in sw. Yeah, I agree that mixed-mode operation is an important consideration, and presumably attaching metadata directly to a packet on the hardware side, and accessing that in sw, is in scope as well? We seem to have landed on exposing that sort of thing via kfuncs in XDP, so expanding on that seems reasonable at a first glance. > 2) The dynamicity of being able to trigger table offloads and/or > kernel table updates which are packet driven (consider scenario where > they have iterated the hardware and ingressed into the kernel). That could be done by either interface, though: the kernel can propagate a bpf_map_update() from a BPF program to the hardware version of the table as well. I suspect a map-based API at least on the BPF side would be more natural, but I don't really have a strong opinion on this :) -Toke
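To make the map-update path concrete: from user space a table entry would be written through the ordinary bpf() syscall, e.g. via libbpf as sketched below. Whether and how the kernel would then mirror that update into a hardware copy of the table is exactly the open design question in this subthread, so that part is an assumption.

#include <linux/types.h>
#include <bpf/bpf.h>

struct mac_key {
	__u8 dmac[6];
};

struct fwd_action {
	__u32 action_id;
	__u32 port;
};

/* Insert or update one entry in a (possibly offload-mirrored) table map.
 * bpf_map_update_elem() is the standard libbpf wrapper around the bpf()
 * syscall; the hardware mirroring itself is hypothetical.
 */
static int add_fwd_entry(int map_fd, const struct mac_key *key,
			 const struct fwd_action *act)
{
	return bpf_map_update_elem(map_fd, key, act, BPF_ANY);
}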
On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jamal Hadi Salim <jhs@mojatatu.com> writes: > > > So while going through this thought process, things to consider: > > 1) The autonomy of the tc infra, essentially the skip_sw/hw controls > > and their packet driven iteration. Perhaps (the patch i pointed to > > from Paul Blakey) where part of the action graph runs in sw. > > Yeah, I agree that mixed-mode operation is an important consideration, > and presumably attaching metadata directly to a packet on the hardware > side, and accessing that in sw, is in scope as well? We seem to have > landed on exposing that sort of thing via kfuncs in XDP, so expanding on > that seems reasonable at a first glance. There is built-in metadata chain id/prio/protocol (stored in cls common struct) passed when the policy is installed. The hardware may be able to handle received (probably packet encapsulated, but i believe that is vendor specific) metadata and transform it into the appropriate continuation point. Maybe a simpler example is to look at the patch from Paul (since that is the most recent change, so it is sticking in my brain); if you can follow the example, you'll see there's some state that is transferred for the action with a cookie from/to the driver. > > 2) The dynamicity of being able to trigger table offloads and/or > > kernel table updates which are packet driven (consider scenario where > > they have iterated the hardware and ingressed into the kernel). > > That could be done by either interface, though: the kernel can propagate > a bpf_map_update() from a BPF program to the hardware version of the > table as well. I suspect a map-based API at least on the BPF side would > be more natural, but I don't really have a strong opinion on this :) Should have mentioned this earlier as a requirement: Speed of update is _extremely_ important, i.e. how fast you can update could make or break things; see talk from Marcelo/Vlad[1]. My gut feeling is that dealing with feedback from some vendor firmware/driver interface that the entry is really offloaded may cause challenges for ebpf by stalling the program. We have seen up to several ms of delay on occasion. cheers, jamal [1] https://netdevconf.info/0x15/session.html?Where-turbo-boosting-TC-flower-control-path-had-led-us-to
Jamal Hadi Salim <jhs@mojatatu.com> writes: > On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes: >> >> > So while going through this thought process, things to consider: >> > 1) The autonomy of the tc infra, essentially the skip_sw/hw controls >> > and their packet driven iteration. Perhaps (the patch i pointed to >> > from Paul Blakey) where part of the action graph runs in sw. >> >> Yeah, I agree that mixed-mode operation is an important consideration, >> and presumably attaching metadata directly to a packet on the hardware >> side, and accessing that in sw, is in scope as well? We seem to have >> landed on exposing that sort of thing via kfuncs in XDP, so expanding on >> that seems reasonable at a first glance. > > There is built-in metadata chain id/prio/protocol (stored in cls > common struct) passed when the policy is installed. The hardware may > be able to handle received (probably packet encapsulated, but i > believe that is vendor specific) metadata and transform it into the > appropriate continuation point. Maybe a simpler example is to look at > the patch from Paul (since that is the most recent change, so it is > sticking in my brain); if you can follow the example, you'll see > there's some state that is transferred for the action with a cookie > from/to the driver. Right, that roughly fits my understanding. Just adding a kfunc to fetch that cookie would be the obvious way to expose it to BPF. >> > 2) The dynamicity of being able to trigger table offloads and/or >> > kernel table updates which are packet driven (consider scenario where >> > they have iterated the hardware and ingressed into the kernel). >> >> That could be done by either interface, though: the kernel can propagate >> a bpf_map_update() from a BPF program to the hardware version of the >> table as well. I suspect a map-based API at least on the BPF side would >> be more natural, but I don't really have a strong opinion on this :) > > Should have mentioned this earlier as requirement: > Speed of update is _extremely_ important, i.e how fast you can update > could make or break things; see talk from Marcelo/Vlad[1]. My gut > feeling is dealing with feedback from some vendor firmware/driver > interface that the entry is really offloaded may cause challenges for > ebpf by stalling the program. We have seen upto several ms delays on > occasions. Right, understandable. That seems kinda orthogonal to which API is used to expose this data, though? In the end it's all just kernel code, and, well, if updating things in an offloaded map/table is taking too long, we'll have to either fix the underlying code to make it faster, or the application will have keep things only in software? :) -Toke
Sorry I was distracted somewhere else. I am not sure i fully grokked your proposal but I am willing to go through this thought exercise with you (perhaps a higher bandwidth media would help); however, we should put some parameters so it doesnt become a perpetual discussion: The starting premise is that posted code meets our requirements so whatever we do using ebpf has to meet our requirements; we dont want to get into a wrestling match with any of the ebpf constraints. Actually, I am ok with some limited degree of square hole round peg situation but it cant be interfering in getting work done. I would also be ok with small surgeries into the ebpf core if needed to meet our requirements. Performance and maintainability are also on the table. Let me know what you think. cheers, jamal On Wed, Feb 1, 2023 at 1:08 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jamal Hadi Salim <jhs@mojatatu.com> writes: > > > On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > >> > >> > So while going through this thought process, things to consider: > >> > 1) The autonomy of the tc infra, essentially the skip_sw/hw controls > >> > and their packet driven iteration. Perhaps (the patch i pointed to > >> > from Paul Blakey) where part of the action graph runs in sw. > >> > >> Yeah, I agree that mixed-mode operation is an important consideration, > >> and presumably attaching metadata directly to a packet on the hardware > >> side, and accessing that in sw, is in scope as well? We seem to have > >> landed on exposing that sort of thing via kfuncs in XDP, so expanding on > >> that seems reasonable at a first glance. > > > > There is built-in metadata chain id/prio/protocol (stored in cls > > common struct) passed when the policy is installed. The hardware may > > be able to handle received (probably packet encapsulated, but i > > believe that is vendor specific) metadata and transform it into the > > appropriate continuation point. Maybe a simpler example is to look at > > the patch from Paul (since that is the most recent change, so it is > > sticking in my brain); if you can follow the example, you'll see > > there's some state that is transferred for the action with a cookie > > from/to the driver. > > Right, that roughly fits my understanding. Just adding a kfunc to fetch > that cookie would be the obvious way to expose it to BPF. > > >> > 2) The dynamicity of being able to trigger table offloads and/or > >> > kernel table updates which are packet driven (consider scenario where > >> > they have iterated the hardware and ingressed into the kernel). > >> > >> That could be done by either interface, though: the kernel can propagate > >> a bpf_map_update() from a BPF program to the hardware version of the > >> table as well. I suspect a map-based API at least on the BPF side would > >> be more natural, but I don't really have a strong opinion on this :) > > > > Should have mentioned this earlier as requirement: > > Speed of update is _extremely_ important, i.e how fast you can update > > could make or break things; see talk from Marcelo/Vlad[1]. My gut > > feeling is dealing with feedback from some vendor firmware/driver > > interface that the entry is really offloaded may cause challenges for > > ebpf by stalling the program. We have seen upto several ms delays on > > occasions. > > Right, understandable. That seems kinda orthogonal to which API is used > to expose this data, though? 
In the end it's all just kernel code, and, > well, if updating things in an offloaded map/table is taking too long, > we'll have to either fix the underlying code to make it faster, or the > application will have keep things only in software? :) > > -Toke >
On Thu, Feb 2, 2023 at 10:51 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > Sorry I was distracted somewhere else. > I am not sure i fully grokked your proposal but I am willing to go > through this thought exercise with you (perhaps a higher bandwidth > media would help); however, we should put some parameters so it > doesnt become a perpetual discussion: > > The starting premise is that posted code meets our requirements so > whatever we do using ebpf has to meet our requirements; we dont want > to get into a wrestling match with any of the ebpf constraints. > Actually, I am ok with some limited degree of square hole round peg > situation but it cant be interfering in getting work done. I would > also be ok with small surgeries into the ebpf core if needed to meet > our requirements. Can you elaborate on what the problems are with using eBPF? I know there is at least one P4->eBPF compiler, what is lacking that doesn't meet your requirements? > Performance and maintainability are also on the table. Performance of the software datapath is of paramount importance. My fundamental concern here is that if we push an underperforming software solution, then the patches don't just enable offload, they'll be used to *justify* it. That is, the hardware vendors might go to their customers and show how much better the offload is than the slow software solution; whereas if they compared to a higher performing software solution it might meet the performance requirements of the customer thereby saving them the cost and complexity of offload. Note we've already been down this path once with DPDK once being touted as being "10x faster than the kernel" with little regard to whether the kernel could be tuned or adapted-- of course, we subsequently invented XDP and pretty much closed the gap. Tom > > Let me know what you think. > > cheers, > jamal > > On Wed, Feb 1, 2023 at 1:08 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > > > Jamal Hadi Salim <jhs@mojatatu.com> writes: > > > > > On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > >> > > >> Jamal Hadi Salim <jhs@mojatatu.com> writes: > > >> > > >> > So while going through this thought process, things to consider: > > >> > 1) The autonomy of the tc infra, essentially the skip_sw/hw controls > > >> > and their packet driven iteration. Perhaps (the patch i pointed to > > >> > from Paul Blakey) where part of the action graph runs in sw. > > >> > > >> Yeah, I agree that mixed-mode operation is an important consideration, > > >> and presumably attaching metadata directly to a packet on the hardware > > >> side, and accessing that in sw, is in scope as well? We seem to have > > >> landed on exposing that sort of thing via kfuncs in XDP, so expanding on > > >> that seems reasonable at a first glance. > > > > > > There is built-in metadata chain id/prio/protocol (stored in cls > > > common struct) passed when the policy is installed. The hardware may > > > be able to handle received (probably packet encapsulated, but i > > > believe that is vendor specific) metadata and transform it into the > > > appropriate continuation point. Maybe a simpler example is to look at > > > the patch from Paul (since that is the most recent change, so it is > > > sticking in my brain); if you can follow the example, you'll see > > > there's some state that is transferred for the action with a cookie > > > from/to the driver. > > > > Right, that roughly fits my understanding. 
Just adding a kfunc to fetch > > that cookie would be the obvious way to expose it to BPF. > > > > >> > 2) The dynamicity of being able to trigger table offloads and/or > > >> > kernel table updates which are packet driven (consider scenario where > > >> > they have iterated the hardware and ingressed into the kernel). > > >> > > >> That could be done by either interface, though: the kernel can propagate > > >> a bpf_map_update() from a BPF program to the hardware version of the > > >> table as well. I suspect a map-based API at least on the BPF side would > > >> be more natural, but I don't really have a strong opinion on this :) > > > > > > Should have mentioned this earlier as requirement: > > > Speed of update is _extremely_ important, i.e how fast you can update > > > could make or break things; see talk from Marcelo/Vlad[1]. My gut > > > feeling is dealing with feedback from some vendor firmware/driver > > > interface that the entry is really offloaded may cause challenges for > > > ebpf by stalling the program. We have seen upto several ms delays on > > > occasions. > > > > Right, understandable. That seems kinda orthogonal to which API is used > > to expose this data, though? In the end it's all just kernel code, and, > > well, if updating things in an offloaded map/table is taking too long, > > we'll have to either fix the underlying code to make it faster, or the > > application will have keep things only in software? :) > > > > -Toke > >
On 30/01/2023 14:06, Jamal Hadi Salim wrote: > So what are we trying to achieve with P4TC? John, I could have done a > better job in describing the goals in the cover letter: > We are going for MAT sw equivalence to what is in hardware. A two-fer > that is already provided by the existing TC infrastructure. ... > This hammer already meets our goals. I'd like to give a perspective from the AMD/Xilinx/Solarflare SmartNIC project. Though I must stress I'm not speaking for that organisation, and I wasn't the one writing the P4 code; these are just my personal observations based on the view I had from within the project team. We used P4 in the SN1022's datapath, but encountered a number of limitations that prevented a wholly P4-based implementation, in spite of the hardware being MAT/CAM flavoured. Overall I would say that P4 was not a great fit for the problem space; it was usually possible to get it to do what we wanted but only by bending it in unnatural ways. (The advantage was, of course, the strong toolchain for compiling it into optimised logic on the FPGA; writing the whole thing by hand in RTL would have taken far more effort.) Developing a worthwhile P4-based datapath proved to be something of an engineer-time sink; compilation and verification weren't quick, and just because your P4 works in a software model doesn't necessarily mean it will perform well in hardware. Thus P4 is, in my personal opinion, a poor choice for end-user/runtime behaviour specification, at least for FPGA-flavoured devices. It works okay for a multi-month product development project, is just about viable for implementing something like a pipeline plugin, but treating it as a fully flexible software-defined datapath is not something that will fly. > I would argue further that in > the near future a lot of the stuff including transport will eventually > have to partially or fully move to hardware (see the HOMA keynote for > a sample space[0]). I think HOMA is very interesting and I agree hardware doing something like it will eventually be needed. But as you admit, P4TC doesn't address that — unsurprising, since the kind of dynamic imperative behaviour involved is totally outside P4's wheelhouse. So maybe I'm missing your point here but I don't see why you bring it up. Ultimately I think trying to expose the underlying hardware as a P4 platform is the wrong abstraction layer to provide to userspace. It's trying too hard to avoid protocol ossification, by requiring the entire pipeline to be user-definable at a bit level, but in the real world if someone wants to deploy a new low-level protocol they'll be better off upgrading their kernel and drivers to offload the new protocol-specific *feature* onto protocol-agnostic *hardware* than trying to develop and validate a P4 pipeline. It is only protocol ossification in *hardware* that is a problem for this kind of thing (not to be confused with the ossification problem on a network where you can't use new proto because a middlebox somewhere in the path barfs on it); protocol-specific SW APIs are only a problem if they result in vendors designing ossified hardware (to implement exactly those APIs and nothing else), which hopefully we've all learned not to do by now. On 30/01/2023 03:09, Singhai, Anjali wrote: > There is also argument that is being made about using ebpf for > implementing the SW path, may be I am missing the part as to how do > you offload if not to another general purpose core even if it is not > as evolved as the current day Xeon's. 
I have to be a little circumspect here as I don't know how much we've made public, but there are good prospects for FPGA offloads of eBPF with high performance. The instructions can be transformed into a pipeline of logic blocks which look nothing like a Von Neumann architecture, so can get much better perf/area and perf/power than an array of general-purpose cores. My personal belief (which I don't, alas, have hard data to back up) is that this approach will also outperform the 'array of specialised packet-processor cores' that many NPU/DPU products are using. In the situations where you do need a custom datapath (which often involve the kind of dynamic behaviour that's not P4-friendly), eBPF is, I would say, far superior to P4 as an IR. -ed
Hi Ed, On Tue, Feb 14, 2023 at 12:07 PM Edward Cree <ecree.xilinx@gmail.com> wrote: > > On 30/01/2023 14:06, Jamal Hadi Salim wrote: > > So what are we trying to achieve with P4TC? John, I could have done a > > better job in describing the goals in the cover letter: > > We are going for MAT sw equivalence to what is in hardware. A two-fer > > that is already provided by the existing TC infrastructure. > ... > > This hammer already meets our goals. > > I'd like to give a perspective from the AMD/Xilinx/Solarflare SmartNIC > project. Though I must stress I'm not speaking for that organisation, > and I wasn't the one writing the P4 code; these are just my personal > observations based on the view I had from within the project team. > We used P4 in the SN1022's datapath, but encountered a number of > limitations that prevented a wholly P4-based implementation, in spite > of the hardware being MAT/CAM flavoured. > Overall I would say that P4 > was not a great fit for the problem space; it was usually possible to > get it to do what we wanted but only by bending it in unnatural ways. > (The advantage was, of course, the strong toolchain for compiling it > into optimised logic on the FPGA; writing the whole thing by hand in > RTL would have taken far more effort.) > Developing a worthwhile P4-based datapath proved to be something of an > engineer-time sink; compilation and verification weren't quick, and > just because your P4 works in a software model doesn't necessarily > mean it will perform well in hardware. > Thus P4 is, in my personal opinion, a poor choice for end-user/runtime > behaviour specification, at least for FPGA-flavoured devices. I am curios to understand the challenges you came across specific to P4 in what you describe above. My gut feeling is, depending on the P4 program, you ran out of resources. How many LUTs does this device offer? I am going to hazard a guess that 30-40% of the resources on the FPGA were just for P4 abstraction in which case writing a complex P4 program just wont fit. Having said that, tooling is also very important as part of the developer experience - if it takes forever to compile things then that developer experience goes down the tubes. Maybe it is a tooling challenge? IMO: it is also about operational experience (i.e the ops not just the devs) and deployment infra is key. IOW, it's not just about the datapath but also the full package integration, for example, ease of control plane integration, field debuggability, operational usability, etc... If you are doing a one-off you can integrate whatever infrastructure you want. If you are a cloud vendor you have the skills in house and it may be worth investing in them. If you are a second tier operator or large enterprise OTOH it is not part of your business model to stock up with smart people. > It > works okay for a multi-month product development project, is just > about viable for implementing something like a pipeline plugin, but > treating it as a fully flexible software-defined datapath is not > something that will fly. > I would argue that FPGA projects tend to be one-offs mostly (multi-month very specialized solutions). If you want a generic, repeatable solution you will have to pay the cost for abstraction (both performance and resource consumption). Then you can train people to be able to operate the repeatable solutions in some manual. 
> > I would argue further that in > > the near future a lot of the stuff including transport will eventually > > have to partially or fully move to hardware (see the HOMA keynote for > > a sample space[0]). > > I think HOMA is very interesting and I agree hardware doing something > like it will eventually be needed. But as you admit, P4TC doesn't > address that — unsurprising, since the kind of dynamic imperative > behaviour involved is totally outside P4's wheelhouse. So maybe I'm > missing your point here but I don't see why you bring it up. It was a response to the sentiment that XDP or ebpf is needed to solve the performance problem. My response was: i can't count on s/w saving me from 800Gbps ethernet port capacity; i gave that transport offload example as a statement of the inevitability that even things outside the classical L2-L4 datapath infrastructure will eventually move to offload. > Ultimately I think trying to expose the underlying hardware as a P4 > platform is the wrong abstraction layer to provide to userspace. If you mean transport layer exposure via P4 then I would agree. But for L2-L4 the P4 abstraction (TC as well) is a match-action pipeline, which works very well today with control plane abstraction from user space. > It's trying too hard to avoid protocol ossification, by requiring the > entire pipeline to be user-definable at a bit level, but in the real > world if someone wants to deploy a new low-level protocol they'll be > better off upgrading their kernel and drivers to offload the new > protocol-specific *feature* onto protocol-agnostic *hardware* than > trying to develop and validate a P4 pipeline. I agree with your view on low-level bit confusion in P4 (depending on how you write your program); however, I don't agree with the perspective that the right solution is to somehow write that code for your new action or new header processing and then go ahead and upgrade the driver and maybe install some new firmware. If you have the skills, sure. But if you are a second tier consumer, sourcing from multiple NIC vendors, and want to offload a new pipeline/protocol-specific feature across those NICs i would argue that those skills are not within your reach unless you standardize that interface (which is what P4 and P4TC strive for). I am not saying the abstraction is free, rather that it is worth the return on investment for this scenario. > It is only protocol ossification in *hardware* that is a problem for > this kind of thing (not to be confused with the ossification problem > on a network where you can't use new proto because a middlebox > somewhere in the path barfs on it); protocol-specific SW APIs are > only a problem if they result in vendors designing ossified hardware > (to implement exactly those APIs and nothing else), which hopefully > we've all learned not to do by now. It's more of a challenge of velocity-to-feature and getting the whole package with the same effort via specification with P4, i.e. starting with the datapath all the way to the control plane. And instead of multi-vendor APIs for protocol-specific solutions (vendors are pitching DPDK APIs mostly), we are suggesting that the unifying API for all vendors is P4TC. BTW: I am not arguing against the fact that on an FPGA you can generate very optimal RTL code (that is both resource and computation efficient) which is very specific to the target datapath. I am sure there are use cases for that. OTOH, there is a very large set of users who would rather go for the match-action paradigm for generality of abstraction. 
BTW, in your response below to Anjali: Sure, you can start with ebpf - why not any other language? What is the connection to RTL? the frontend you said you have used is P4 for example and you could generate that into RTL. cheers, jamal > On 30/01/2023 03:09, Singhai, Anjali wrote: > > There is also argument that is being made about using ebpf for > > implementing the SW path, may be I am missing the part as to how do > > you offload if not to another general purpose core even if it is not > > as evolved as the current day Xeon's. > > I have to be a little circumspect here as I don't know how much we've > made public, but there are good prospects for FPGA offloads of eBPF > with high performance. The instructions can be transformed into a > pipeline of logic blocks which look nothing like a Von Neumann > architecture, so can get much better perf/area and perf/power than an > array of general-purpose cores. > My personal belief (which I don't, alas, have hard data to back up) is > that this approach will also outperform the 'array of specialised > packet-processor cores' that many NPU/DPU products are using. > > In the situations where you do need a custom datapath (which often > involve the kind of dynamic behaviour that's not P4-friendly), eBPF > is, I would say, far superior to P4 as an IR. > > -ed
Hi, Want to provide an update to this thread and a summary of where we are (typing this on a web browser client so i hope it doesn't come out all mangled): I have had high bandwidth discussions with several people offlist (thanks to everyone who invested their time in trying to smoothen this); sometimes cooler heads prevail this way. We are willing (and are starting) to invest time to see how we can fit ebpf into the software datapath. It should be noted that we did look at ebpf when this project started and we ended up not going that path. I think what is new in this equation is the concept of kfuncs - which we didn't have back then. Perhaps with kfuncs we can make both worlds work together. XDP as well is appealing. As i have stated earlier: The starting premise is that the posted code meets our requirements, so whatever we do using ebpf has to meet our requirements. I am ok with some limited degree of square hole round peg situation but it can't be interfering in meeting our goals. So let me restate those goals so we don't go down some rabbit hole in the discussion: 1) Supporting P4 in the kernel both for the sw and hw datapath, utilizing the well established tc infra which allows both sw equivalence and hw offload. We are _not_ going to reinvent this. Essentially we get the whole package: from the control plane to the tooling infra, netlink messaging to s/w and h/w symbiosis, the autonomous kernel control, etc. The advantage is that we have a singular vendor-neutral interface via the kernel using well understood mechanisms. Behavioral equivalence between hw and sw is a given. 2) Operational usability - this is encoded currently in the scriptability approach. E.g., I can just ship someone a shell script in an email but, more importantly, if they have deployed tc offloads the runtime semantics are unchanged. The "write once, run anywhere" paradigm is easier to state in ascii ;-> The interface is designed to be scriptable to remove the burden of making kernel and user space code changes for any new processing functions (whether in s/w or hardware). 3) Debuggability - developers and ops people who are familiar with tc offloads can continue using the _same existing techniques and tools_. This also eases support. 4) Performance - note our angle on this, based on the niche we are looking at, is "if you want performance then offload". However, one discussion point that has been raised multiple times in the thread and in private is that there are performance gains when using ebpf. This argument is reasonable and a motivator for us to invest our time in evaluating. We have started doing off-the-cuff measurements. We start with a very simple P4 program which receives a packet, looks up a table, and on a hit changes the src mac address and then forwards. We have: A) implemented a handcoded ebpf program, B) generated P4TC sw only, C) flower s/w only (skip_hw) rules, and D) hardware offload (skip_sw), all on tc (so we can do a like-for-like comparison). The SUT has a dual port CX6 NIC capable of offloading pedit and mirred. Trex is connected to one port and sends http GETs which go via the box; the responses come back on the other port and we send them back to trex. The traffic is very asymmetric; data coming back to the client fills up the 25G pipe but the ACKs going back consume a lot less. Unfortunately all 4 scenarios were able to handle the wire rate - we are going to set up nastier traffic generation later; for now we opted to look at cpu utilization for the 4 scenarios. 
We have the following CPU utilization results: A) 35%, B) 39%, C) 36%, D) 0%. This is by no means a good test but I wanted to illustrate the relevance of #D (0%), which is a major itch for us. We need to test more complex programs, which is probably where the performance of ebpf will shine. XDP for sure will beat all the others - but I would rather get the facts in place first. So we are investing effort in this direction and will share results at some point. There may be other low-hanging fruit for ebpf that has been brought up in the discussion (the parser being one); we will be looking at all of those as well. Note: The goal of this exercise for us is to evaluate not just performance but also to consider how it affects the other P4TC goals. There may be a sweet spot somewhere in there but we need to collect the data instead of hypothesizing. cheers, jamal
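For concreteness, scenario C above corresponds roughly to a flower rule of the following shape (scenario D would be the same rule with skip_sw in place of skip_hw); the interface names and addresses are illustrative, not taken from the actual test setup:

tc filter add dev enp1s0f0 ingress protocol ip flower skip_hw \
        dst_ip 198.51.100.10 \
        action pedit ex munge eth src set 02:11:22:33:44:55 pipe \
        action mirred egress redirect dev enp1s0f1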