Message ID: 20240225165447.156954-1-jhs@mojatatu.com (mailing list archive)
Series: Introducing P4TC (series 1)
Jamal Hadi Salim wrote: > This is the first patchset of two. In this patch we are submitting 15 which > cover the minimal viable P4 PNA architecture. > > __Description of these Patches__ > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on > as need basis for the P4 program requirement. This patch makes a small incision > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no > effect the classical tc action (example patch#2 just increases the size of the > action names from 16->64B). > Patch 5 adds infrastructure support for preallocation of dynamic actions. > > The core P4TC code implements several P4 objects. > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands > for P4 pipelines. > 4) Patch #9 introduces the action templates and associated CRUD commands. > 5) Patch #10 introduce the action runtime infrastructure. > 6) Patch #11 introduces the concept of P4 table templates and associated > CRUD commands for tables. > 7) Patch #12 introduces runtime table entry infra and associated CU commands. > 8) Patch #13 introduces runtime table entry infra and associated RD commands. > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc. > 10) Patch #15 introduces the TC classifier P4 used at runtime. > > Daniel, please look again at patch #15. > > There are a few more patches (5) not in this patchset that deal with test > cases, etc. > > What is P4? > ----------- > > The Programming Protocol-independent Packet Processors (P4) is an open source, > domain-specific programming language for specifying data plane behavior. > > The current P4 landscape includes an extensive range of deployments, products, > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11] > currently offer P4-native NICs. P4 is currently curated by the Linux > Foundation[9]. > > On why P4 - see small treatise here:[4]. > > What is P4TC? > ------------- > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program > and its associated objects and state are attachend to a kernel _netns_ structure. > IOW, if we had two programs across netns' or within a netns they have no > visibility to each others objects (unlike for example TC actions whose kinds are > "global" in nature or eBPF maps visavis bpftool). [...]

Although I appreciate that a good amount of work went into building the above, I'll add my concerns here so they are not lost. These are architecture concerns, not "this line of code needs some tweak".

- It encodes a DSL into the kernel. It's unclear how we pick which DSL gets pushed into the kernel and which do not. Do we take any DSL folks can code up? I would prefer a lower-level intermediate language. My view is this is a lesson we should have learned from OVS. OVS had wider adoption and still struggled in some ways; my belief is this is very similar to OVS. (Also OVS was novel/great at a lot of things fwiw.)

- We have a general purpose language in BPF that can already implement the P4 DSL. I don't see any need for another set of code when the end goal, running P4 in the Linux network stack, is already doable. Typically we reject duplicate things when they don't have concrete benefits.

- P4 as a DSL is not optimized for general purpose CPUs, but rather hardware pipelines. Although it can be optimized for CPUs, it's a harder problem. A review of some of the VPP/DPDK work here is useful.

- The P4 infrastructure already has a p4c backend; this is adding another P4 backend. Instead of getting the rather small group of people to work on a single backend, we are now creating another one.

- Common reasons I think would justify a new P4 backend and implementation would be: speed, efficiency, or expressiveness. I think this implementation is neither more efficient nor more expressive. Concrete examples on expressiveness would be interesting, but I don't see any. Loops were mentioned once, but the latest kernels have loop support.

- The main talking point for many slide decks about p4tc is hardware offload. This seems like the main benefit of pushing the P4 DSL into the kernel. But we have no hw implementation, not even a vendor stepping up to comment on this implementation and how it will work for them. HW introduces all sorts of interesting problems that I don't see how we solve in this framework. For example, a few off the top of my head: syncing current state into tc, how does an operator program tc inside constraints, who writes the p4 models for these hardware devices, do they fit into this 'tc' infrastructure, partial updates into hardware seem unlikely to work for most hardware, ...

- The kfuncs are mostly duplicates of map ops we already have in the BPF API. The motivation, by my read, is to use netlink instead of bpf commands. I don't agree with this; optimizing for the low-level debug workflow a developer uses is the wrong design space. Actual users should not be deploying this via ssh into boxes. The workflow will not scale, and really we need tooling and infra to land P4 programs across the network. This is orders of magnitude more pain if it's an endpoint solution and not a middlebox/switch solution. As a switch solution I don't see how p4tc sw scales to even TOR packet rates. So you need tooling on top, and users interact with the tooling, not the Linux widget/debugger at the bottom.

- There is no performance analysis: the comment was "functionality before performance", which I disagree with. If it were a first implementation and we didn't have a way to do the P4 DSL already, then I might agree, but here we have an existing solution, so this should be at least as good and should be better than the existing backend. A software datapath adoption is going to be critically based on performance. I don't see taking even a 5% hit when porting over to P4 from an existing datapath.

Commentary: I think it's 100% correct to debate how the P4 DSL is implemented in the kernel. I can't see why this is somehow off limits; this patch set proposes one approach, and there could be many approaches. BPF comes up not because I'm some BPF zealot that needs the P4 DSL in BPF, but because it exists today and there is even a P4 backend. Fundamentally I don't see the value add we get by creating two P4 pipelines; this is going to create duplication all the way from the P4 tooling/infra through to the kernel. From your side you keep saying I'm bike shedding and demanding BPF, but from my perspective you're introducing another entire toolchain simply because you want some low-level debug commands that 99% of P4 users should not be using or caring about.

To try and be constructive, some things that would change my mind would be a vendor showing how hardware can be used. This would be compelling. Or performance numbers showing it somehow gets a more performant implementation. Or lastly, if the current p4c implementation is fundamentally broken somehow.

Thanks
John
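For context on the "duplicates of map ops" point above: the kind of match-action table used in the cover letter's examples (an IPv4-prefix-keyed table whose action forwards to a port) can already be expressed with a stock BPF LPM map driven through the existing bpf() map commands or bpftool. A minimal sketch, assuming untagged Ethernet/IPv4 and an illustrative action encoding; none of the names below come from the patches:

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    struct ipv4_lpm_key {
        __u32 prefixlen;
        __u32 daddr;            /* IPv4 destination, network byte order */
    };

    struct fwd_action {
        __u32 act;              /* 1 == "send_to_port" for this sketch */
        __u32 ifindex;          /* the "port" parameter */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __uint(max_entries, 1024);
        __type(key, struct ipv4_lpm_key);
        __type(value, struct fwd_action);
    } mytable SEC(".maps");

    SEC("tc")
    int mytable_classify(struct __sk_buff *skb)
    {
        struct ipv4_lpm_key key = { .prefixlen = 32 };
        struct fwd_action *act;

        /* A real parser fills this in; assume plain Ethernet + IPv4,
         * so the destination address sits at offset 14 + 16 = 30.
         */
        if (bpf_skb_load_bytes(skb, 30, &key.daddr, sizeof(key.daddr)) < 0)
            return TC_ACT_OK;

        act = bpf_map_lookup_elem(&mytable, &key);
        if (act && act->act == 1)
            return bpf_redirect(act->ifindex, 0);

        return TC_ACT_OK;       /* table miss: fall through */
    }

    char _license[] SEC("license") = "GPL";

The control-plane counterpart here would be ordinary BPF_MAP_UPDATE_ELEM/LOOKUP_ELEM calls (or bpftool map update/dump), which is the existing API being contrasted with the proposed netlink path.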
On Wed, Feb 28, 2024 at 12:11 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Jamal Hadi Salim wrote: > > This is the first patchset of two. In this patch we are submitting 15 which > > cover the minimal viable P4 PNA architecture. > > > > __Description of these Patches__ > > > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on > > as need basis for the P4 program requirement. This patch makes a small incision > > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no > > effect the classical tc action (example patch#2 just increases the size of the > > action names from 16->64B). > > Patch 5 adds infrastructure support for preallocation of dynamic actions. > > > > The core P4TC code implements several P4 objects. > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code > > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates > > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands > > for P4 pipelines. > > 4) Patch #9 introduces the action templates and associated CRUD commands. > > 5) Patch #10 introduce the action runtime infrastructure. > > 6) Patch #11 introduces the concept of P4 table templates and associated > > CRUD commands for tables. > > 7) Patch #12 introduces runtime table entry infra and associated CU commands. > > 8) Patch #13 introduces runtime table entry infra and associated RD commands. > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc. > > 10) Patch #15 introduces the TC classifier P4 used at runtime. > > > > Daniel, please look again at patch #15. > > > > There are a few more patches (5) not in this patchset that deal with test > > cases, etc. > > > > What is P4? > > ----------- > > > > The Programming Protocol-independent Packet Processors (P4) is an open source, > > domain-specific programming language for specifying data plane behavior. > > > > The current P4 landscape includes an extensive range of deployments, products, > > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11] > > currently offer P4-native NICs. P4 is currently curated by the Linux > > Foundation[9]. > > > > On why P4 - see small treatise here:[4]. > > > > What is P4TC? > > ------------- > > > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program > > and its associated objects and state are attachend to a kernel _netns_ structure. > > IOW, if we had two programs across netns' or within a netns they have no > > visibility to each others objects (unlike for example TC actions whose kinds are > > "global" in nature or eBPF maps visavis bpftool). > > [...] > > Although I appreciate a good amount of work went into building above I'll > add my concerns here so they are not lost. These are architecture concerns > not this line of code needs some tweak. > > - It encodes a DSL into the kernel. Its unclear how we pick which DSL gets > pushed into the kernel and which do not. Do we take any DSL folks can code > up? > I would prefer a lower level intermediate langauge. My view is this is > a lesson we should have learned from OVS. OVS had wider adoption and > still struggled in some ways my belief is this is very similar to OVS. > (Also OVS was novel/great at a lot of things fwiw.) > > - We have a general purpose language in BPF that can implement the P4 DSL > already. I don't see any need for another set of code when the end goal > is running P4 in Linux network stack is doable. 
Typically we reject > duplicate things when they don't have concrete benefits. > > - P4 as a DSL is not optimized for general purpose CPUs, but > rather hardware pipelines. Although it can be optimized for CPUs its > a harder problem. A review of some of the VPP/DPDK work here is useful. > > - P4 infrastructure already has a p4c backend this is adding another P4 > backend instead of getting the rather small group of people to work on > a single backend we are now creating another one. > > - Common reasons I think would justify a new P4 backend and implementation > would be: speed efficiency, or expressiveness. I think this > implementation is neither more efficient nor more expressive. Concrete > examples on expressiveness would be interesting, but I don't see any. > Loops were mentioned once but latest kernels have loop support. > > - The main talking point for many slide decks about p4tc is hardware > offload. This seems like the main benefit of pushing the P4 DSL into the > kernel. But, we have no hw implementation, not even a vendor stepping up > to comment on this implementation and how it will work for them. HW > introduces all sorts of interesting problems that I don't see how we > solve in this framework. For example a few off the top of my head: > syncing current state into tc, how does operator program tc inside > constraints, who writes the p4 models for these hardware devices, do > they fit into this 'tc' infrastructure, partial updates into hardware > seems unlikely to work for most hardware, ... > > - The kfuncs are mostly duplicates of map ops we already have in BPF API. > The motivation by my read is to use netlink instead of bpf commands. I > don't agree with this, optimizing for some low level debug a developer > uses is the wrong design space. Actual users should not be deploying > this via ssh into boxes. The workflow will not scale and really we need > tooling and infra to land P4 programs across the network. This is orders > of more pain if its an endpoint solution and not a middlebox/switch > solution. As a switch solution I don't see how p4tc sw scales to even TOR > packet rates. So you need tooling on top and user interact with the > tooling not the Linux widget/debugger at the bottom. > > - There is no performance analysis: The comment was functionality before > performance which I disagree with. If it was a first implementation and > we didn't have a way to do P4 DSL already than I might agree, but here > we have an existing solution so it should be at least as good and should > be better than existing backend. A software datapath adoption is going > to be critically based on performance. I don't see taking even a 5% hit > when porting over to P4 from existing datapath. > > Commentary: I think its 100% correct to debate how the P4 DSL is > implemented in the kernel. I can't see why this is off limits somehow this > patch set proposes an approach there could be many approaches. BPF comes up > not because I'm some BPF zealot that needs P4 DSL in BPF, but because it > exists today there is even a P4 backend. Fundamentally I don't see the > value add we get by creating two P4 pipelines this is going to create > duplication all the way up to the P4 tooling/infra through to the kernel. > From your side you keep saying I'm bike shedding and demanding BPF, but > from my perspective your introducing another entire toolchain simply > because you want some low level debug commands that 99% of P4 users should > not be using or caring about. 
> > To try and be constructive some things that would change my mind would > be a vendor showing how hardware can be used. This would be compelling. > Or performance showing its somehow gets a more performant implementation. > Or lastly if the current p4c implementation is fundamentally broken > somehow. >

John,
With all due respect, we are going back again over the same points, recycled many times over, to which I have responded to you many times. It's getting tiring. This is exactly why I called it bikeshedding. Let's just agree to disagree.

cheers,
jamal

> Thanks > John
Jamal Hadi Salim wrote: > On Wed, Feb 28, 2024 at 12:11 PM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Jamal Hadi Salim wrote: > > > This is the first patchset of two. In this patch we are submitting 15 which > > > cover the minimal viable P4 PNA architecture. > > > > > > __Description of these Patches__ > > > > > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on > > > as need basis for the P4 program requirement. This patch makes a small incision > > > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no > > > effect the classical tc action (example patch#2 just increases the size of the > > > action names from 16->64B). > > > Patch 5 adds infrastructure support for preallocation of dynamic actions. > > > > > > The core P4TC code implements several P4 objects. > > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code > > > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates > > > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands > > > for P4 pipelines. > > > 4) Patch #9 introduces the action templates and associated CRUD commands. > > > 5) Patch #10 introduce the action runtime infrastructure. > > > 6) Patch #11 introduces the concept of P4 table templates and associated > > > CRUD commands for tables. > > > 7) Patch #12 introduces runtime table entry infra and associated CU commands. > > > 8) Patch #13 introduces runtime table entry infra and associated RD commands. > > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc. > > > 10) Patch #15 introduces the TC classifier P4 used at runtime. > > > > > > Daniel, please look again at patch #15. > > > > > > There are a few more patches (5) not in this patchset that deal with test > > > cases, etc. > > > > > > What is P4? > > > ----------- > > > > > > The Programming Protocol-independent Packet Processors (P4) is an open source, > > > domain-specific programming language for specifying data plane behavior. > > > > > > The current P4 landscape includes an extensive range of deployments, products, > > > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11] > > > currently offer P4-native NICs. P4 is currently curated by the Linux > > > Foundation[9]. > > > > > > On why P4 - see small treatise here:[4]. > > > > > > What is P4TC? > > > ------------- > > > > > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program > > > and its associated objects and state are attachend to a kernel _netns_ structure. > > > IOW, if we had two programs across netns' or within a netns they have no > > > visibility to each others objects (unlike for example TC actions whose kinds are > > > "global" in nature or eBPF maps visavis bpftool). > > > > [...] > > > > Although I appreciate a good amount of work went into building above I'll > > add my concerns here so they are not lost. These are architecture concerns > > not this line of code needs some tweak. > > > > - It encodes a DSL into the kernel. Its unclear how we pick which DSL gets > > pushed into the kernel and which do not. Do we take any DSL folks can code > > up? > > I would prefer a lower level intermediate langauge. My view is this is > > a lesson we should have learned from OVS. OVS had wider adoption and > > still struggled in some ways my belief is this is very similar to OVS. > > (Also OVS was novel/great at a lot of things fwiw.) 
> > > > - We have a general purpose language in BPF that can implement the P4 DSL > > already. I don't see any need for another set of code when the end goal > > is running P4 in Linux network stack is doable. Typically we reject > > duplicate things when they don't have concrete benefits. > > > > - P4 as a DSL is not optimized for general purpose CPUs, but > > rather hardware pipelines. Although it can be optimized for CPUs its > > a harder problem. A review of some of the VPP/DPDK work here is useful. > > > > - P4 infrastructure already has a p4c backend this is adding another P4 > > backend instead of getting the rather small group of people to work on > > a single backend we are now creating another one. > > > > - Common reasons I think would justify a new P4 backend and implementation > > would be: speed efficiency, or expressiveness. I think this > > implementation is neither more efficient nor more expressive. Concrete > > examples on expressiveness would be interesting, but I don't see any. > > Loops were mentioned once but latest kernels have loop support. > > > > - The main talking point for many slide decks about p4tc is hardware > > offload. This seems like the main benefit of pushing the P4 DSL into the > > kernel. But, we have no hw implementation, not even a vendor stepping up > > to comment on this implementation and how it will work for them. HW > > introduces all sorts of interesting problems that I don't see how we > > solve in this framework. For example a few off the top of my head: > > syncing current state into tc, how does operator program tc inside > > constraints, who writes the p4 models for these hardware devices, do > > they fit into this 'tc' infrastructure, partial updates into hardware > > seems unlikely to work for most hardware, ... > > > > - The kfuncs are mostly duplicates of map ops we already have in BPF API. > > The motivation by my read is to use netlink instead of bpf commands. I > > don't agree with this, optimizing for some low level debug a developer > > uses is the wrong design space. Actual users should not be deploying > > this via ssh into boxes. The workflow will not scale and really we need > > tooling and infra to land P4 programs across the network. This is orders > > of more pain if its an endpoint solution and not a middlebox/switch > > solution. As a switch solution I don't see how p4tc sw scales to even TOR > > packet rates. So you need tooling on top and user interact with the > > tooling not the Linux widget/debugger at the bottom. > > > > - There is no performance analysis: The comment was functionality before > > performance which I disagree with. If it was a first implementation and > > we didn't have a way to do P4 DSL already than I might agree, but here > > we have an existing solution so it should be at least as good and should > > be better than existing backend. A software datapath adoption is going > > to be critically based on performance. I don't see taking even a 5% hit > > when porting over to P4 from existing datapath. > > > > Commentary: I think its 100% correct to debate how the P4 DSL is > > implemented in the kernel. I can't see why this is off limits somehow this > > patch set proposes an approach there could be many approaches. BPF comes up > > not because I'm some BPF zealot that needs P4 DSL in BPF, but because it > > exists today there is even a P4 backend. 
Fundamentally I don't see the > > value add we get by creating two P4 pipelines this is going to create > > duplication all the way up to the P4 tooling/infra through to the kernel. > > From your side you keep saying I'm bike shedding and demanding BPF, but > > from my perspective your introducing another entire toolchain simply > > because you want some low level debug commands that 99% of P4 users should > > not be using or caring about. > > > > To try and be constructive some things that would change my mind would > > be a vendor showing how hardware can be used. This would be compelling. > > Or performance showing its somehow gets a more performant implementation. > > Or lastly if the current p4c implementation is fundamentally broken > > somehow. > > > > John, > With all due respect we are going back again over the same points, > recycled many times over to which i have responded to you many times. > It's gettting tiring. This is exactly why i called it bikeshedding. > Let's just agree to disagree.

Yep, we agree to disagree, and I put them as a summary so others can see them and think it over/decide where they stand on it. In the end you don't need my ACK here, but I wanted my opinion summarized.

> > cheers, > jamal > > > Thanks > > John
On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote: > This is the first patchset of two. In this patch we are submitting 15 which > cover the minimal viable P4 PNA architecture. > > __Description of these Patches__ > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on > as need basis for the P4 program requirement. This patch makes a small incision > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no > effect the classical tc action (example patch#2 just increases the size of the > action names from 16->64B). > Patch 5 adds infrastructure support for preallocation of dynamic actions. > > The core P4TC code implements several P4 objects. > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands > for P4 pipelines. > 4) Patch #9 introduces the action templates and associated CRUD commands. > 5) Patch #10 introduce the action runtime infrastructure. > 6) Patch #11 introduces the concept of P4 table templates and associated > CRUD commands for tables. > 7) Patch #12 introduces runtime table entry infra and associated CU commands. > 8) Patch #13 introduces runtime table entry infra and associated RD commands. > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc. > 10) Patch #15 introduces the TC classifier P4 used at runtime. > > Daniel, please look again at patch #15. > > There are a few more patches (5) not in this patchset that deal with test > cases, etc. > > What is P4? > ----------- > > The Programming Protocol-independent Packet Processors (P4) is an open source, > domain-specific programming language for specifying data plane behavior. > > The current P4 landscape includes an extensive range of deployments, products, > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11] > currently offer P4-native NICs. P4 is currently curated by the Linux > Foundation[9]. > > On why P4 - see small treatise here:[4]. > > What is P4TC? > ------------- > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program > and its associated objects and state are attachend to a kernel _netns_ structure. > IOW, if we had two programs across netns' or within a netns they have no > visibility to each others objects (unlike for example TC actions whose kinds are > "global" in nature or eBPF maps visavis bpftool). > > P4TC builds on top of many years of Linux TC experiences of a netlink control > path interface coupled with a software datapath with an equivalent offloadable > hardware datapath. In this patch series we are focussing only on the s/w > datapath. The s/w and h/w path equivalence that TC provides is relevant > for a primary use case of P4 where some (currently) large consumers of NICs > provide vendors their datapath specs in P4. In such a case one could generate > specified datapaths in s/w and test/validate the requirements before hardware > acquisition(example [12]). > > Unlike other approaches such as TC Flower which require kernel and user space > changes when new datapath objects like packet headers are introduced P4TC, with > these patches, provides _kernel and user space code change independence_. 
> Meaning: > A P4 program describes headers, parsers, etc alongside the datapath processing; > the compiler uses the P4 program as input and generates several artifacts which > are then loaded into the kernel to manifest the intended datapath. In addition > to the generated datapath, control path constructs are generated. The process is > described further below in "P4TC Workflow". > > There have been many discussions and meetings within the community since > about 2015 in regards to P4 over TC[2] and we are finally proving to the > naysayers that we do get stuff done! > > A lot more of the P4TC motivation is captured at: > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md > > __P4TC Architecture__ > > The current architecture was described at netdevconf 0x17[14] and if you prefer > academic conference papers, a short paper is available here[15]. > > There are 4 parts: > > 1) A Template CRUD provisioning API for manifesting a P4 program and its > associated objects in the kernel. The template provisioning API uses netlink. > See patch in part 2. > > 2) A Runtime CRUD+ API code which is used for controlling the different runtime > behavior of the P4 objects. The runtime API uses netlink. See notes further > down. See patch description later.. > > 3) P4 objects and their control interfaces: tables, actions, externs, etc. > Any object that requires control plane interaction resides in the TC domain > and is subject to the CRUD runtime API. The intended goal is to make use of the > tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w. > > 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated > by a compiler based on the P4 spec. When accessing any P4 object that requires > control plane interfaces, the eBPF code accesses the P4TC side from #3 above > using kfuncs. > > The generated eBPF code is derived from [13] with enhancements and fixes to meet > our requirements. > > __P4TC Workflow__ > > The Development and instantiation workflow for P4TC is as follows: > > A) A developer writes a P4 program, "myprog" > > B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs: > > a) A shell script which form template definitions for the different P4 > objects "myprog" utilizes (tables, externs, actions etc). See #1 above.. > > b) the parser and the rest of the datapath are generated as eBPF and need > to be compiled into binaries. At the moment the parser and the main control > block are generated as separate eBPF program but this could change in > the future (without affecting any kernel code). See #4 above. > > c) A json introspection file used for the control plane (by iproute2/tc). > > C) At this point the artifacts from #1,#4 could be handed to an operator > (the operator could be the same person as the developer from #A, #B). > > i) For the eBPF part, either the operator is handed an ebpf binary or > source which they compile at this point into a binary. > The operator executes the shell script(s) to manifest the functional > "myprog" into the kernel. > > ii) The operator instantiates "myprog" pipeline via the tc P4 filter > to ingress/egress (depending on P4 arch) of one or more netdevs/ports > (illustrated below as "block 22"). > > Example instantion where the parser is a separate action: > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \ > action bpf obj $PARSER.o section p4tc/parse \ > action bpf obj $PROGNAME.o section p4tc/main" > > See individual patches in partc for more examples tc vs xdp etc. 
Also see > section on "challenges" (further below on this cover letter). > > Once "myprog" P4 program is instantiated one can start performing operations > on table entries and/or actions at runtime as described below. > > __P4TC Runtime Control Path__ > > The control interface builds on past tc experience and tries to get things > right from the beginning (example filtering is separated from depending > on existing object TLVs and made generic); also the code is written in > such a way it is mostly lockless. > > The P4TC control interface, using netlink, provides what we call a CRUDPS > abstraction which stands for: Create, Read(get), Update, Delete, Subscribe, > Publish. From a high level PoV the following describes a conformant high level > API (both on netlink data model and code level): > > Create(</path/to/object, DATA>+) > Read(</path/to/object>, [optional filter]) > Update(</path/to/object>, DATA>+) > Delete(</path/to/object>, [optional filter]) > Subscribe(</path/to/object>, [optional filter]) > > Note, we _dont_ treat "dump" or "flush" as speacial. If "path/to/object" points > to a table then a "Delete" implies "flush" and a "Read" implies dump but if > it points to an entry (by specifying a key) then "Delete" implies deleting > and entry and "Read" implies reading that single entry. It should be noted that > both "Delete" and "Read" take an optional filter parameter. The filter can > define further refinements to what the control plane wants read or deleted. > "Subscribe" uses built in netlink event management. It, as well, takes a filter > which can further refine what events get generated to the control plane (taken > out of this patchset, to be re-added with consideration of [16]). > > Lets show some runtime samples: > > ..create an entry, if we match ip address 10.0.1.2 send packet out eno1 > tc p4ctrl create myprog/table/mytable \ > dstAddr 10.0.1.2/32 action send_to_port param port eno1 > > ..Batch create entries > tc p4ctrl create myprog/table/mytable \ > entry dstAddr 10.1.1.2/32 action send_to_port param port eno1 \ > entry dstAddr 10.1.10.2/32 action send_to_port param port eno10 \ > entry dstAddr 10.0.2.2/32 action send_to_port param port eno2 > > ..Get an entry (note "read" is interchangeably used as "get" which is a common > semantic in tc): > tc p4ctrl read myprog/table/mytable \ > dstAddr 10.0.2.2/32 > > ..dump mytable > tc p4ctrl read myprog/table/mytable > > ..dump mytable for all entries whose key fits within 10.1.0.0/16 > tc p4ctrl read myprog/table/mytable \ > filter key/myprog/mytable/dstAddr = 10.1.0.0/16 > > ..dump all mytable entries which have an action send_to_port with param "eno1" > tc p4ctrl get myprog/table/mytable \ > filter param/act/myprog/send_to_port/port = "eno1" > > The filter expression is powerful, f.e you could say: > > tc p4ctrl get myprog/table/mytable \ > filter param/act/myprog/send_to_port/port = "eno1" && \ > key/myprog/mytable/dstAddr = 10.1.0.0/16 > > It also works on built in metadata, example in the following case dumping > entries from mytable that have seen activity in the last 10 secs: > tc p4ctrl get myprog/table/mytable \ > filter msecs_since < 10000 > > Delete follows the same syntax as get/read, so for sake of brevity we won't > show more example than how to flush mytable: > > tc p4ctrl delete myprog/table/mytable > > Mystery question: How do we achieve iproute2-kernel independence and > how does "tc p4ctrl" as a cli know how to program the kernel given an > arbitrary command line as shown above? 
Answer(s): It queries the > compiler generated json file in "P4TC Workflow" #B.c above. The json file has > enough details to figure out that we have a program called "myprog" which has a > table "mytable" that has a key name "dstAddr" which happens to be type ipv4 > address prefix. The json file also provides details to show that the table > "mytable" supports an action called "send_to_port" which accepts a parameter > "port" of type netdev (see the types patch for all supported P4 data types). > All P4 components have names, IDs, and types - so this makes it very easy to map > into netlink. > Once user space tc/p4ctrl validates the human command input, it creates > standard binary netlink structures (TLVs etc) which are sent to the kernel. > See the runtime table entry patch for more details. > > __P4TC Datapath__ > > The P4TC s/w datapath execution is generated as eBPF. Any objects that require > control interfacing reside in the "P4TC domain" and are controlled via netlink > as described above. Per packet execution and state and even objects that do not > require control interfacing (like the P4 parser) are generated as eBPF. > > A packet arriving on s/w ingress of any of the ports on block 22 will first be > exercised via the (generated eBPF) parser component to extract the headers (the > ip destination address in labelled "dstAddr" above). > The datapath then proceeds to use "dstAddr", table ID and pipeline ID > as a key to do a lookup in myprog's "mytable" which returns the action params > which are then used to execute the action in the eBPF datapath (eventually > sending out packets to eno1). > On a table miss, mytable's default miss action (not described) is executed. > > __Testing__ > > Speaking of testing - we have 2-300 tdc test cases (which will be in the > second patchset). > These tests are run on our CICD system on pull requests and after commits are > approved. The CICD does a lot of other tests (more since v2, thanks to Simon's > input)including: > checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on both > X86, ARM 64 and emulated BE via qemu s390. We trigger performance testing in the > CICD to catch performance regressions (currently only on the control path, but > in the future for the datapath). > Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory > sanitizer but recently added support for concurrency sanitizer. > Before main releases we ensure each patch will compile on its own to help in > git bisect and run the xmas tree tool. We eventually put the code via coverity. > > In addition we are working on enabling a tool that will take a P4 program, run > it through the compiler, and generate permutations of traffic patterns via > symbolic execution that will test both positive and negative datapath code > paths. The test generator tool integration is still work in progress. > Also: We have other code that test parallelization etc which we are trying to > find a fit for in the kernel tree's testing infra. 
> > > __References__ > > [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf > [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc > [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation > [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here > [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6 > [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919 > [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469 > [8]https://github.com/p4lang/p4c/tree/main/backends/tc > [9]https://p4.org/ > [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html > [11]https://www.amd.com/en/accelerators/pensando > [12]https://github.com/sonic-net/DASH/tree/main > [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf > [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html > [15]https://dl.acm.org/doi/10.1145/3630047.3630193 > [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/ > [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier > [17.b]man tc-u32 > [18]man tc-pedit > [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835 > [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html > [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html > > -------- > HISTORY > -------- > > Changes in Version 12 > ---------------------- > > 0) Introduce back 15 patches (v11 had 5) > > 1) From discussions with Daniel: > i) Remove the XDP programs association alltogether. No refcounting. nothing. > ii) Remove prog type tc - everything is now an ebpf tc action. > > 2) s/PAD0/__pad0/g. Thanks to Marcelo. > > 3) Add extack to specify how many entries (N of M) specified in a batch for > any of requested Create/Update/Delete succeeded. Prior to this it would > only tell us the batch failed to complete without giving us details of > which of M failed. Added as a debug aid. > > Changes in Version 11 > ---------------------- > 1) Split the series into two. Original patches 1-5 in this patchset. The rest > will go out after this is merged. > > 2) Change any references of IFNAMSIZ in the action code when referencing the > action name size to ACTNAMSIZ. Thanks to Marcelo. > > Changes in Version 10 > ---------------------- > 1) A couple of patches from the earlier version were clean enough to submit, > so we did. This gave us room to split the two largest patches each into > two. Even though the split is not git-bisactable and really some of it didn't > make much sense (eg spliting a create, and update in one patch and delete and > get into another) we made sure each of the split patches compiled > independently. The idea is to reduce the number of lines of code to review > and when we get sufficient reviews we will put the splits together again. > See patch #12 and #13 as well as patches #7 and #8). > > 2) Add more context in patch 0. Please READ! 
> > 3) Added dump/delete filters back to the code - we had taken them out in the > earlier patches to reduce the amount of code for review - but in retrospect > we feel they are important enough to push earlier rather than later. > > > Changes In version 9 > --------------------- > > 1) Remove the largest patch (externs) to ease review. > > 2) Break up action patches into two to ease review bringing down the patches > that need more scrutiny to 8 (the first 7 are almost trivial). > > 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions > to provide consistency(Jiri). > > 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs > by making them static. TBH, not sure if this is the right solution > but it makes sparse happy and hopefully someone will comment. > > Changes In Version 8 > --------------------- > > 1) Fix all the patchwork warnings and improve our ci to catch them in the future > > 2) Reduce the number of patches to basic max(15) to ease review. > > Changes In Version 7 > ------------------------- > > 0) First time removing the RFC tag! > > 1) Removed XDP cookie. It turns out as was pointed out by Toke(Thanks!) - that > using bpf links was sufficient to protect us from someone replacing or deleting > a eBPF program after it has been bound to a netdev. > > 2) Add some reviewed-bys from Vlad. > > 3) Small bug fixes from v6 based on testing for ebpf. > > 4) Added the counter extern as a sample extern. Illustrating this example because > it is slightly complex since it is possible to invoke it directly from > the P4TC domain (in case of direct counters) or from eBPF (indirect counters). > It is not exactly the most efficient implementation (a reasonable counter impl > should be per-cpu). > > Changes In RFC Version 6 > ------------------------- > > 1) Completed integration from scriptable view to eBPF. Completed integration > of externs integration. > > 2) Small bug fixes from v5 based on testing. > > Changes In RFC Version 5 > ------------------------- > > 1) More integration from scriptable view to eBPF. Small bug fixes from last > integration. > > 2) More streamlining support of externs via kfunc (create-on-miss, etc) > > 3) eBPF linking for XDP. > > There is more eBPF integration/streamlining coming (we are getting close to > conversion from scriptable domain). > > Changes In RFC Version 4 > ------------------------- > > 1) More integration from scriptable to eBPF. Small bug fixes. > > 2) More streamlining support of externs via kfunc (one additional kfunc). > > 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata. > > There is more eBPF integration coming. One thing we looked at but is not in this > patchset but should be in the next is use of eBPF link in our loading (see > "challenge #1" further below). > > Changes In RFC Version 3 > ------------------------- > > These patches are still in a little bit of flux as we adjust to integrating > eBPF. So there are small constructs that are used in V1 and 2 but no longer > used in this version. We will make a V4 which will remove those. > The changes from V2 are as follows: > > 1) Feedback we got in V2 is to try stick to one of the two modes. In this version > we are taking one more step and going the path of mode2 vs v2 where we had 2 modes. > > 2) The P4 Register extern is no longer standalone. Instead, as part of integrating > into eBPF we introduce another kfunc which encapsulates Register as part of the > extern interface. 
> > 3) We have improved our CICD to include tools pointed to us by Simon. See > "Testing" further below. Thanks to Simon for that and other issues he caught. > Simon, we discussed on issue [7] but decided to keep that log since we think > it is useful. > > 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to > re-discuss though; see: [5], [6]. > > 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub. > > 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are > guaranteed that either A or B must exist; however, lets make smatch happy. > Thanks to Simon and Dan Carpenter. > > Changes In RFC Version 2 > ------------------------- > > Version 2 is the initial integration of the eBPF datapath. > We took into consideration suggestions provided to use eBPF and put effort into > analyzing eBPF as datapath which involved extensive testing. > We implemented 6 approaches with eBPF and ran performance analysis and presented > our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6 > vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if > you account for XDP or TC separately). > > Conclusions from the exercise: We lose the simple operational model we had > prior to integrating eBPF. We do gain performance in most cases when the > datapath is less compute-bound. > For more discussion on our requirements vs journeying the eBPF path please > scroll down to "Restating Our Requirements" and "Challenges". > > This patch set presented two modes. > mode1: the parser is entirely based on eBPF - whereas the rest of the > SW datapath stays as _scriptable_ as in Version 1. > mode2: All of the kernel s/w datapath (including parser) is in eBPF. > > The key ingredient for eBPF, that we did not have access to in the past, is > kfunc (it made a big difference for us to reconsider eBPF). > > In V2 the two modes are mutually exclusive (IOW, you get to choose one > or the other via Kconfig).

I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors has supported the code and the feature.

Could the missing H/W offload support in the current form be the root cause of such lack of support? Or are there interested parties that have been quiet so far?

Thanks,

Paolo
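To make the datapath flow in the cover letter quoted above more concrete: the generated eBPF code parses out dstAddr and then consults the P4TC-managed table through a kfunc rather than a BPF map (patch #14). A rough sketch of that shape follows; the kfunc name, its signature and the struct layouts here are placeholders for illustration, not the ones defined by the patches:

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    struct myprog_mytable_key {
        __u32 pipeid;           /* pipeline ID */
        __u32 tblid;            /* table ID */
        __u32 dstAddr;          /* parsed IPv4 destination */
    };

    struct myprog_send_to_port_params {
        __u32 act_id;
        __u32 port_ifindex;     /* the "port" action parameter */
    };

    /* Placeholder declaration only -- the real kfunc and its signature
     * are defined by patch #14, not reproduced here.
     */
    extern struct myprog_send_to_port_params *
    p4tc_tbl_lookup(struct __sk_buff *skb, struct myprog_mytable_key *key,
                    __u32 key_sz) __ksym;

    SEC("tc")
    int myprog_main(struct __sk_buff *skb)
    {
        struct myprog_mytable_key key = { .pipeid = 1, .tblid = 1 };
        struct myprog_send_to_port_params *params;

        /* the generated parser block would fill key.dstAddr here */

        params = p4tc_tbl_lookup(skb, &key, sizeof(key));
        if (!params)
            return TC_ACT_SHOT; /* table miss: the program's default miss
                                 * action runs (drop, for this sketch) */

        return bpf_redirect(params->port_ifindex, 0);
    }

    char _license[] SEC("license") = "GPL";

The point of contention in the thread is exactly this boundary: the table behind such a lookup is created via the netlink template API and populated via "tc p4ctrl", whereas John would rather the same state live in ordinary BPF maps managed through the bpf() syscall.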
On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote: > > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote: > > This is the first patchset of two. In this patch we are submitting 15 which > > cover the minimal viable P4 PNA architecture. > > > > __Description of these Patches__ > > > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on > > as need basis for the P4 program requirement. This patch makes a small incision > > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no > > effect the classical tc action (example patch#2 just increases the size of the > > action names from 16->64B). > > Patch 5 adds infrastructure support for preallocation of dynamic actions. > > > > The core P4TC code implements several P4 objects. > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code > > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates > > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands > > for P4 pipelines. > > 4) Patch #9 introduces the action templates and associated CRUD commands. > > 5) Patch #10 introduce the action runtime infrastructure. > > 6) Patch #11 introduces the concept of P4 table templates and associated > > CRUD commands for tables. > > 7) Patch #12 introduces runtime table entry infra and associated CU commands. > > 8) Patch #13 introduces runtime table entry infra and associated RD commands. > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc. > > 10) Patch #15 introduces the TC classifier P4 used at runtime. > > > > Daniel, please look again at patch #15. > > > > There are a few more patches (5) not in this patchset that deal with test > > cases, etc. > > > > What is P4? > > ----------- > > > > The Programming Protocol-independent Packet Processors (P4) is an open source, > > domain-specific programming language for specifying data plane behavior. > > > > The current P4 landscape includes an extensive range of deployments, products, > > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11] > > currently offer P4-native NICs. P4 is currently curated by the Linux > > Foundation[9]. > > > > On why P4 - see small treatise here:[4]. > > > > What is P4TC? > > ------------- > > > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program > > and its associated objects and state are attachend to a kernel _netns_ structure. > > IOW, if we had two programs across netns' or within a netns they have no > > visibility to each others objects (unlike for example TC actions whose kinds are > > "global" in nature or eBPF maps visavis bpftool). > > > > P4TC builds on top of many years of Linux TC experiences of a netlink control > > path interface coupled with a software datapath with an equivalent offloadable > > hardware datapath. In this patch series we are focussing only on the s/w > > datapath. The s/w and h/w path equivalence that TC provides is relevant > > for a primary use case of P4 where some (currently) large consumers of NICs > > provide vendors their datapath specs in P4. In such a case one could generate > > specified datapaths in s/w and test/validate the requirements before hardware > > acquisition(example [12]). > > > > Unlike other approaches such as TC Flower which require kernel and user space > > changes when new datapath objects like packet headers are introduced P4TC, with > > these patches, provides _kernel and user space code change independence_. 
> > Meaning: > > A P4 program describes headers, parsers, etc alongside the datapath processing; > > the compiler uses the P4 program as input and generates several artifacts which > > are then loaded into the kernel to manifest the intended datapath. In addition > > to the generated datapath, control path constructs are generated. The process is > > described further below in "P4TC Workflow". > > > > There have been many discussions and meetings within the community since > > about 2015 in regards to P4 over TC[2] and we are finally proving to the > > naysayers that we do get stuff done! > > > > A lot more of the P4TC motivation is captured at: > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md > > > > __P4TC Architecture__ > > > > The current architecture was described at netdevconf 0x17[14] and if you prefer > > academic conference papers, a short paper is available here[15]. > > > > There are 4 parts: > > > > 1) A Template CRUD provisioning API for manifesting a P4 program and its > > associated objects in the kernel. The template provisioning API uses netlink. > > See patch in part 2. > > > > 2) A Runtime CRUD+ API code which is used for controlling the different runtime > > behavior of the P4 objects. The runtime API uses netlink. See notes further > > down. See patch description later.. > > > > 3) P4 objects and their control interfaces: tables, actions, externs, etc. > > Any object that requires control plane interaction resides in the TC domain > > and is subject to the CRUD runtime API. The intended goal is to make use of the > > tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w. > > > > 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated > > by a compiler based on the P4 spec. When accessing any P4 object that requires > > control plane interfaces, the eBPF code accesses the P4TC side from #3 above > > using kfuncs. > > > > The generated eBPF code is derived from [13] with enhancements and fixes to meet > > our requirements. > > > > __P4TC Workflow__ > > > > The Development and instantiation workflow for P4TC is as follows: > > > > A) A developer writes a P4 program, "myprog" > > > > B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs: > > > > a) A shell script which form template definitions for the different P4 > > objects "myprog" utilizes (tables, externs, actions etc). See #1 above.. > > > > b) the parser and the rest of the datapath are generated as eBPF and need > > to be compiled into binaries. At the moment the parser and the main control > > block are generated as separate eBPF program but this could change in > > the future (without affecting any kernel code). See #4 above. > > > > c) A json introspection file used for the control plane (by iproute2/tc). > > > > C) At this point the artifacts from #1,#4 could be handed to an operator > > (the operator could be the same person as the developer from #A, #B). > > > > i) For the eBPF part, either the operator is handed an ebpf binary or > > source which they compile at this point into a binary. > > The operator executes the shell script(s) to manifest the functional > > "myprog" into the kernel. > > > > ii) The operator instantiates "myprog" pipeline via the tc P4 filter > > to ingress/egress (depending on P4 arch) of one or more netdevs/ports > > (illustrated below as "block 22"). 
> > > > Example instantion where the parser is a separate action: > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \ > > action bpf obj $PARSER.o section p4tc/parse \ > > action bpf obj $PROGNAME.o section p4tc/main" > > > > See individual patches in partc for more examples tc vs xdp etc. Also see > > section on "challenges" (further below on this cover letter). > > > > Once "myprog" P4 program is instantiated one can start performing operations > > on table entries and/or actions at runtime as described below. > > > > __P4TC Runtime Control Path__ > > > > The control interface builds on past tc experience and tries to get things > > right from the beginning (example filtering is separated from depending > > on existing object TLVs and made generic); also the code is written in > > such a way it is mostly lockless. > > > > The P4TC control interface, using netlink, provides what we call a CRUDPS > > abstraction which stands for: Create, Read(get), Update, Delete, Subscribe, > > Publish. From a high level PoV the following describes a conformant high level > > API (both on netlink data model and code level): > > > > Create(</path/to/object, DATA>+) > > Read(</path/to/object>, [optional filter]) > > Update(</path/to/object>, DATA>+) > > Delete(</path/to/object>, [optional filter]) > > Subscribe(</path/to/object>, [optional filter]) > > > > Note, we _dont_ treat "dump" or "flush" as speacial. If "path/to/object" points > > to a table then a "Delete" implies "flush" and a "Read" implies dump but if > > it points to an entry (by specifying a key) then "Delete" implies deleting > > and entry and "Read" implies reading that single entry. It should be noted that > > both "Delete" and "Read" take an optional filter parameter. The filter can > > define further refinements to what the control plane wants read or deleted. > > "Subscribe" uses built in netlink event management. It, as well, takes a filter > > which can further refine what events get generated to the control plane (taken > > out of this patchset, to be re-added with consideration of [16]). 
> > > > Lets show some runtime samples: > > > > ..create an entry, if we match ip address 10.0.1.2 send packet out eno1 > > tc p4ctrl create myprog/table/mytable \ > > dstAddr 10.0.1.2/32 action send_to_port param port eno1 > > > > ..Batch create entries > > tc p4ctrl create myprog/table/mytable \ > > entry dstAddr 10.1.1.2/32 action send_to_port param port eno1 \ > > entry dstAddr 10.1.10.2/32 action send_to_port param port eno10 \ > > entry dstAddr 10.0.2.2/32 action send_to_port param port eno2 > > > > ..Get an entry (note "read" is interchangeably used as "get" which is a common > > semantic in tc): > > tc p4ctrl read myprog/table/mytable \ > > dstAddr 10.0.2.2/32 > > > > ..dump mytable > > tc p4ctrl read myprog/table/mytable > > > > ..dump mytable for all entries whose key fits within 10.1.0.0/16 > > tc p4ctrl read myprog/table/mytable \ > > filter key/myprog/mytable/dstAddr = 10.1.0.0/16 > > > > ..dump all mytable entries which have an action send_to_port with param "eno1" > > tc p4ctrl get myprog/table/mytable \ > > filter param/act/myprog/send_to_port/port = "eno1" > > > > The filter expression is powerful, f.e you could say: > > > > tc p4ctrl get myprog/table/mytable \ > > filter param/act/myprog/send_to_port/port = "eno1" && \ > > key/myprog/mytable/dstAddr = 10.1.0.0/16 > > > > It also works on built in metadata, example in the following case dumping > > entries from mytable that have seen activity in the last 10 secs: > > tc p4ctrl get myprog/table/mytable \ > > filter msecs_since < 10000 > > > > Delete follows the same syntax as get/read, so for sake of brevity we won't > > show more example than how to flush mytable: > > > > tc p4ctrl delete myprog/table/mytable > > > > Mystery question: How do we achieve iproute2-kernel independence and > > how does "tc p4ctrl" as a cli know how to program the kernel given an > > arbitrary command line as shown above? Answer(s): It queries the > > compiler generated json file in "P4TC Workflow" #B.c above. The json file has > > enough details to figure out that we have a program called "myprog" which has a > > table "mytable" that has a key name "dstAddr" which happens to be type ipv4 > > address prefix. The json file also provides details to show that the table > > "mytable" supports an action called "send_to_port" which accepts a parameter > > "port" of type netdev (see the types patch for all supported P4 data types). > > All P4 components have names, IDs, and types - so this makes it very easy to map > > into netlink. > > Once user space tc/p4ctrl validates the human command input, it creates > > standard binary netlink structures (TLVs etc) which are sent to the kernel. > > See the runtime table entry patch for more details. > > > > __P4TC Datapath__ > > > > The P4TC s/w datapath execution is generated as eBPF. Any objects that require > > control interfacing reside in the "P4TC domain" and are controlled via netlink > > as described above. Per packet execution and state and even objects that do not > > require control interfacing (like the P4 parser) are generated as eBPF. > > > > A packet arriving on s/w ingress of any of the ports on block 22 will first be > > exercised via the (generated eBPF) parser component to extract the headers (the > > ip destination address in labelled "dstAddr" above). 
> > The datapath then proceeds to use "dstAddr", table ID and pipeline ID > > as a key to do a lookup in myprog's "mytable" which returns the action params > > which are then used to execute the action in the eBPF datapath (eventually > > sending out packets to eno1). > > On a table miss, mytable's default miss action (not described) is executed. > > > > __Testing__ > > > > Speaking of testing - we have 200-300 tdc test cases (which will be in the > > second patchset). > > These tests are run on our CICD system on pull requests and after commits are > > approved. The CICD does a lot of other tests (more since v2, thanks to Simon's > > input) including: > > checkpatch, sparse, smatch, coccinelle, 32-bit and 64-bit builds tested on > > X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the > > CICD to catch performance regressions (currently only on the control path, but > > in the future for the datapath). > > Syzkaller runs 24/7 on dedicated hardware; originally we focussed only on the memory > > sanitizer but recently added support for the concurrency sanitizer. > > Before main releases we ensure each patch will compile on its own to help in > > git bisect and run the xmas tree tool. We eventually run the code through Coverity. > > > > In addition we are working on enabling a tool that will take a P4 program, run > > it through the compiler, and generate permutations of traffic patterns via > > symbolic execution that will test both positive and negative datapath code > > paths. The test generator tool integration is still work in progress. > > Also: We have other code that tests parallelization etc which we are trying to > > find a fit for in the kernel tree's testing infra. > > > > > > __References__ > > > > [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf > > [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc > > [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation > > [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here > > [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6 > > [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919 > > [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469 > > [8]https://github.com/p4lang/p4c/tree/main/backends/tc > > [9]https://p4.org/ > > [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html > > [11]https://www.amd.com/en/accelerators/pensando > > [12]https://github.com/sonic-net/DASH/tree/main > > [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf > > [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html > > [15]https://dl.acm.org/doi/10.1145/3630047.3630193 > > [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/ > > [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier > > [17.b]man tc-u32 > > [18]man tc-pedit > > [19]https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835 > > [20.a]https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html > > [20.b]https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html > > > > -------- > > HISTORY > > 
-------- > > > > Changes in Version 12 > > ---------------------- > > > > 0) Introduce back 15 patches (v11 had 5) > > > > 1) From discussions with Daniel: > > i) Remove the XDP programs association altogether. No refcounting, nothing. > > ii) Remove prog type tc - everything is now an ebpf tc action. > > > > 2) s/PAD0/__pad0/g. Thanks to Marcelo. > > > > 3) Add extack to specify how many entries (N of M) specified in a batch for > > any requested Create/Update/Delete succeeded. Prior to this it would > > only tell us the batch failed to complete without giving us details of > > which of M failed. Added as a debug aid. > > > > Changes in Version 11 > > ---------------------- > > 1) Split the series into two. Original patches 1-5 in this patchset. The rest > > will go out after this is merged. > > > > 2) Change any references of IFNAMSIZ in the action code when referencing the > > action name size to ACTNAMSIZ. Thanks to Marcelo. > > > > Changes in Version 10 > > ---------------------- > > 1) A couple of patches from the earlier version were clean enough to submit, > > so we did. This gave us room to split the two largest patches each into > > two. Even though the split is not git-bisectable and really some of it didn't > > make much sense (e.g. splitting create and update into one patch and delete and > > get into another) we made sure each of the split patches compiled > > independently. The idea is to reduce the number of lines of code to review > > and when we get sufficient reviews we will put the splits together again. > > See patches #12 and #13 as well as patches #7 and #8. > > > > 2) Add more context in patch 0. Please READ! > > > > 3) Added dump/delete filters back to the code - we had taken them out in the > > earlier patches to reduce the amount of code for review - but in retrospect > > we feel they are important enough to push earlier rather than later. > > > > > > Changes In version 9 > > --------------------- > > > > 1) Remove the largest patch (externs) to ease review. > > > > 2) Break up action patches into two to ease review, bringing down the patches > > that need more scrutiny to 8 (the first 7 are almost trivial). > > > > 3) Fix up the prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions > > to provide consistency (Jiri). > > > > 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs > > by making them static. TBH, not sure if this is the right solution > > but it makes sparse happy and hopefully someone will comment. > > > > Changes In Version 8 > > --------------------- > > > > 1) Fix all the patchwork warnings and improve our ci to catch them in the future > > > > 2) Reduce the number of patches to a max of 15 to ease review. > > > > Changes In Version 7 > > ------------------------- > > > > 0) First time removing the RFC tag! > > > > 1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!), that > > using bpf links was sufficient to protect us from someone replacing or deleting > > an eBPF program after it has been bound to a netdev. > > > > 2) Add some reviewed-bys from Vlad. > > > > 3) Small bug fixes from v6 based on testing for ebpf. > > > > 4) Added the counter extern as a sample extern. We illustrate this example because > > it is slightly complex since it is possible to invoke it directly from > > the P4TC domain (in case of direct counters) or from eBPF (indirect counters). > > It is not exactly the most efficient implementation (a reasonable counter impl > > should be per-cpu).
> > > > Changes In RFC Version 6 > > ------------------------- > > > > 1) Completed integration from the scriptable view to eBPF. Completed integration > > of externs. > > > > 2) Small bug fixes from v5 based on testing. > > > > Changes In RFC Version 5 > > ------------------------- > > > > 1) More integration from the scriptable view to eBPF. Small bug fixes from the last > > integration. > > > > 2) More streamlining of extern support via kfunc (create-on-miss, etc) > > > > 3) eBPF linking for XDP. > > > > There is more eBPF integration/streamlining coming (we are getting close to > > completing the conversion from the scriptable domain). > > > > Changes In RFC Version 4 > > ------------------------- > > > > 1) More integration from scriptable to eBPF. Small bug fixes. > > > > 2) More streamlining of extern support via kfunc (one additional kfunc). > > > > 3) Removed the per-cpu scratchpad per Toke's suggestion and instead use XDP metadata. > > > > There is more eBPF integration coming. One thing we looked at but is not in this > > patchset but should be in the next is use of eBPF link in our loading (see > > "challenge #1" further below). > > > > Changes In RFC Version 3 > > ------------------------- > > > > These patches are still in a little bit of flux as we adjust to integrating > > eBPF. So there are small constructs that are used in V1 and 2 but no longer > > used in this version. We will make a V4 which will remove those. > > The changes from V2 are as follows: > > > > 1) Feedback we got in V2 was to try to stick to one of the two modes. In this version > > we are taking one more step and going the path of mode2, vs V2 where we had 2 modes. > > > > 2) The P4 Register extern is no longer standalone. Instead, as part of integrating > > into eBPF we introduce another kfunc which encapsulates Register as part of the > > extern interface. > > > > 3) We have improved our CICD to include tools pointed out to us by Simon. See > > "Testing" further below. Thanks to Simon for that and other issues he caught. > > Simon, we discussed issue [7] but decided to keep that log since we think > > it is useful. > > > > 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to > > re-discuss though; see: [5], [6]. > > > > 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub. > > > > 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are > > guaranteed that either A or B must exist; however, let's make smatch happy. > > Thanks to Simon and Dan Carpenter. > > > > Changes In RFC Version 2 > > ------------------------- > > > > Version 2 is the initial integration of the eBPF datapath. > > We took into consideration suggestions provided to use eBPF and put effort into > > analyzing eBPF as a datapath, which involved extensive testing. > > We implemented 6 approaches with eBPF and ran performance analysis and presented > > our results at the P4 2023 workshop in Santa Clara [see: 1, 3] on each of the 6 > > vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if > > you account for XDP or TC separately). > > > > Conclusions from the exercise: We lose the simple operational model we had > > prior to integrating eBPF. We do gain performance in most cases when the > > datapath is less compute-bound. > > For more discussion on our requirements vs journeying the eBPF path please > > scroll down to "Restating Our Requirements" and "Challenges". > > > > This patch set presented two modes.
> > mode1: the parser is entirely based on eBPF - whereas the rest of the > > SW datapath stays as _scriptable_ as in Version 1. > > mode2: All of the kernel s/w datapath (including parser) is in eBPF. > > > > The key ingredient for eBPF, that we did not have access to in the past, is > > kfunc (it made a big difference for us to reconsider eBPF). > > > > In V2 the two modes are mutually exclusive (IOW, you get to choose one > > or the other via Kconfig). > > I think/fear that this series has a "quorum" problem: different voices > raise opposition, and nobody (?) outside the authors supported the > code and the feature. > > Could the missing H/W offload support in the current form be the > root cause for such lack of support? Or are there interested parties that > have been quiet so far? Some of the people who attend our meetings and have a vested interest in this are on Cc. But the cover letter is clear on this (right at the top under "What is P4" and "what is P4TC"). cheers, jamal > Thanks, > > Paolo > >
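To make the "__P4TC Datapath__" flow quoted above easier to picture, here is a minimal hand-written sketch of what the compiler-generated eBPF control block could look like. It is illustrative only: the p4tc_table_lookup() kfunc, the key/action structs and the way the parser hands over "dstAddr" are assumptions for this sketch, not code taken from the patches.

/* Minimal sketch of a generated control block; p4tc_table_lookup(),
 * struct mytable_key and struct mytable_act are hypothetical stand-ins
 * for the kfunc-based table access described in patch #14.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct mytable_key {
        __u32 pipeid;        /* "myprog" pipeline ID */
        __u32 tblid;         /* "mytable" table ID */
        __u32 dstAddr;       /* extracted by the generated parser */
};

struct mytable_act {
        __u32 action_id;     /* e.g. send_to_port */
        __u32 port_ifindex;  /* action parameter "port" */
};

/* Hypothetical kfunc: looks the key up in the netlink-managed table. */
extern struct mytable_act *p4tc_table_lookup(struct __sk_buff *skb,
                                             struct mytable_key *key) __ksym;

SEC("p4tc/main")
int myprog_main(struct __sk_buff *skb)
{
        struct mytable_key key = { .pipeid = 1, .tblid = 1 };
        struct mytable_act *act;

        /* Assume the parser program stashed dstAddr in skb->cb[] for us. */
        key.dstAddr = skb->cb[0];

        act = p4tc_table_lookup(skb, &key);
        if (!act)
                return TC_ACT_SHOT; /* stand-in for the default miss action */

        /* send_to_port: redirect to the ifindex carried as the action param */
        return bpf_redirect(act->port_ifindex, 0);
}

char LICENSE[] SEC("license") = "GPL";

The real generated code is of course larger (parser hand-off, default actions, multiple tables), but this is roughly the shape the cover letter means when it says the eBPF datapath reaches netlink-controlled P4TC state via kfuncs.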
Jamal Hadi Salim wrote: > On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote: > > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote: > > > [...]
> > I think/fear that this series has a "quorum" problem: different voices > > raise opposition, and nobody (?) outside the authors supported the > > code and the feature. > > > > Could the missing H/W offload support in the current form be the > > root cause for such lack of support? Or are there interested parties that > > have been quiet so far? Yeah, agree with the h/w comment; I would be interested to hear from these folks that have h/w. For me to get on board, the obvious things that would be interesting are: (a) hardware offload, (b) some fundamental problem with the existing p4c backend we already have, or (c) a significant performance improvement. > > Some of the people who attend our meetings and have a vested interest in > this are on Cc. But the cover letter is clear on this (right at the > top under "What is P4" and "what is P4TC"). > > cheers, > jamal > > > > Thanks, > > > > Paolo > > > > >
From: Paolo Abeni <pabeni@redhat.com> > I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors > supported the code and the feature. > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there interested parties > that have been quiet so far? Hi, Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipelines (smart switches and smart NICs) prefer standard kernel APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4-capable HW and are invested in P4 as a dataplane specification. - Customers run the P4 dataplane on multiple targets, including the SW pipeline as well as programmable switches and DPUs. - Standardized kernel APIs and implementation bring portability across vendors and across targets (CPU/SW and DPUs). - A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. - This patch series helps create a SW pipeline and standard API. Thanks, Anjali
Singhai, Anjali wrote: > From: Paolo Abeni <pabeni@redhat.com> > > > I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors > > supported the code and the feature. > > > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there interested parties > > that have been quiet so far? > > Hi, > Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipelines (smart switches and smart NICs) prefer standard kernel APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4-capable HW and are invested in P4 as a dataplane specification. Great! What hardware/driver, and how do we get that code here so we can see it working? Is the hardware available, e.g. can I get ahold of one? What is programmable on your devices? Is this 'just' the parser graph or are you slicing up tables and so on? Is it an FPGA, DPU architecture or a TCAM architecture? How do you reprogram the device? I somehow doubt it's through a piecemeal ndo. But let me know if I'm wrong, maybe my internal architecture details are dated. Fully speculating: is the interface a big FW thunk to the device? Without any details it's difficult to get community feedback on how the hw programmable interface should work. The only reason I've even bothered with this thread is I want to see P4 working. Who owns the AMD side, or some other vendor, so we can get something that works across at least two vendors, which is our usual bar for adding hw offload things? Note, if you just want a kernel SW pipeline we already have that, so I'm not seeing that as particularly motivating. Again, my point of view. P4 as a dataplane specification is great but I don't see the connection to this patchset without real hardware in a driver. > > - Customers run the P4 dataplane on multiple targets, including the SW pipeline as well as programmable switches and DPUs. > - Standardized kernel APIs and implementation bring portability across vendors and across targets (CPU/SW and DPUs). > - A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. > - This patch series helps create a SW pipeline and standard API. > > Thanks, > Anjali >
On Thu, Feb 29, 2024 at 5:33 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Singhai, Anjali wrote: > > From: Paolo Abeni <pabeni@redhat.com> > > > > > I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors > > > supported the code and the feature. > > > > > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there interested parties > > > that have been quiet so far? > > > > Hi, > > Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipelines (smart switches and smart NICs) prefer standard kernel APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4-capable HW and are invested in P4 as a dataplane specification. > > Great! What hardware/driver, and how do we get that code here so we can see > it working? Is the hardware available, e.g. can I get ahold of one? > > What is programmable on your devices? Is this 'just' the parser graph or > are you slicing up tables and so on? Is it an FPGA, DPU architecture or a > TCAM architecture? How do you reprogram the device? I somehow doubt it's > through a piecemeal ndo. But let me know if I'm wrong, maybe my internal > architecture details are dated. Fully speculating: is the interface a big > FW thunk to the device? > > Without any details it's difficult to get community feedback on how the > hw programmable interface should work. The only reason I've even > bothered with this thread is I want to see P4 working. > > Who owns the AMD side, or some other vendor, so we can get something that > works across at least two vendors, which is our usual bar for adding hw > offload things? > > Note, if you just want a kernel SW pipeline we already have that, so > I'm not seeing that as particularly motivating. Again, my point of view. > P4 as a dataplane specification is great but I don't see the connection > to this patchset without real hardware in a driver. Here's what you can buy on the market that is native P4 (not that it hasn't been mentioned from day 1 in the patch 0 references): [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html [11]https://www.amd.com/en/accelerators/pensando I want to emphasize again these patches are about the P4 s/w pipeline that is intended to work seamlessly with hw offload. If you are interested in h/w offload and want to contribute just show up at the meetings - they are open to all. The current offloadable piece is the match-action tables. The P4 specs may change to include parsers in the future or other objects etc (but not sure why we should discuss this in the thread). cheers, jamal
On 2/28/24 9:11 AM, John Fastabend wrote: > - The kfuncs are mostly duplicates of map ops we already have in BPF API. > The motivation by my read is to use netlink instead of bpf commands. I I also have a similar thought on the kfuncs (create/update/delete), which are mostly bpf map ops. It could be one single kfunc to allocate a kernel-specific p4 entry/object and then store that in a bpf map. With bpf_rbtree, bpf_list, and other recent advancements, it should be possible to describe them in a bpf map. The reply in v9 was that the p4 table will also be used in the future HW piece/driver, but the HW piece is not ready yet; bpf is the only consumer of the kernel p4 table now, and this makes mimicking the bpf map api with kfuncs unconvincing. A bpf "tc / xdp" program uses netlink to attach/detach, and the policy also stays in the bpf map. When there is a HW piece that consumes the p4 table, that will be a better time to discuss the kfunc interface. > don't agree with this, optimizing for some low level debug a developer > uses is the wrong design space. Actual users should not be deploying > this via ssh into boxes. The workflow will not scale and really we need > tooling and infra to land P4 programs across the network. This is orders > of more pain if it's an endpoint solution and not a middlebox/switch > solution. As a switch solution I don't see how p4tc sw scales to even TOR > packet rates. So you need tooling on top and users interact with the > tooling not the Linux widget/debugger at the bottom.
On Fri, Mar 1, 2024 at 2:02 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 2/28/24 9:11 AM, John Fastabend wrote: > > - The kfuncs are mostly duplicates of map ops we already have in BPF API. > > The motivation by my read is to use netlink instead of bpf commands. I > > I also have a similar thought on the kfuncs (create/update/delete), which are mostly > bpf map ops. It could be one single kfunc to allocate a kernel-specific p4 > entry/object and then store that in a bpf map. With bpf_rbtree, bpf_list, > and other recent advancements, it should be possible to describe them in a bpf map. > The reply in v9 was that the p4 table will also be used in the future HW > piece/driver, but the HW piece is not ready yet; bpf is the only consumer of the > kernel p4 table now, and this makes mimicking the bpf map api with kfuncs > unconvincing. A bpf "tc / xdp" program uses netlink to attach/detach, and the policy > also stays in the bpf map. > It's a lot more complex than just attaching/detaching. Our control plane uses netlink (regardless of whether it is offloaded or not) for all object controls (not just table entries) for the many reasons that have been stated in the cover letters since the beginning. I unfortunately took out some of the text after v10 to try and shorten it. I will be adding it back. If you can't find it I could cut-and-paste and send it privately. cheers, jamal > When there is a HW piece that consumes the p4 table, that will be a better time > to discuss the kfunc interface. > > > don't agree with this, optimizing for some low level debug a developer > > uses is the wrong design space. Actual users should not be deploying > > this via ssh into boxes. The workflow will not scale and really we need > > tooling and infra to land P4 programs across the network. This is orders > > of more pain if it's an endpoint solution and not a middlebox/switch > > solution. As a switch solution I don't see how p4tc sw scales to even TOR > > packet rates. So you need tooling on top and users interact with the > > tooling not the Linux widget/debugger at the bottom. >
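To make the API contrast in this exchange concrete, the following rough sketch puts the two shapes side by side: variant (a) keeps the table in an ordinary BPF map controlled through bpf(2)/bpftool, variant (b) queries a kernel-resident P4TC table through a kfunc so the control plane can stay on netlink. Only the map variant uses interfaces that exist exactly as written; the kfunc name, its parameter struct and semantics are assumptions for illustration.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct mytable_key { __u32 dstAddr; };
struct mytable_val { __u32 action_id; __u32 port_ifindex; };

/* (a) Table as a plain BPF map: entries are created/updated/deleted
 * with bpf(2) map commands (e.g. bpftool map update).
 */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, struct mytable_key);
        __type(value, struct mytable_val);
} mytable SEC(".maps");

static __always_inline struct mytable_val *
lookup_via_map(struct mytable_key *key)
{
        return bpf_map_lookup_elem(&mytable, key);
}

/* (b) Table owned by P4TC inside the kernel: entries are created/updated/
 * deleted over netlink (tc p4ctrl ...); the datapath only reads them.
 * The kfunc below is hypothetical; name and signature are assumptions.
 */
struct p4tc_tbl_params { __u32 pipeid; __u32 tblid; };

extern struct mytable_val *bpf_p4tc_tbl_read(struct __sk_buff *skb,
                                             struct p4tc_tbl_params *params,
                                             void *key, __u32 key__sz) __ksym;

static __always_inline struct mytable_val *
lookup_via_kfunc(struct __sk_buff *skb, struct mytable_key *key)
{
        struct p4tc_tbl_params params = { .pipeid = 1, .tblid = 1 };

        return bpf_p4tc_tbl_read(skb, &params, key, sizeof(*key));
}

As the exchange above suggests, the difference is less about the per-packet lookup itself and more about which control plane owns the entries: bpf(2)/bpftool and per-program maps, versus per-netns named objects driven over netlink.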
On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote: > > I want to emphasize again these patches are about the P4 s/w pipeline > > that is intended to work seamlessly with hw offload. If you are > > interested in h/w offload and want to contribute just show up at the > > meetings - they are open to all. The current offloadable piece is the > > match-action tables. The P4 specs may change to include parsers in the > > future or other objects etc (but not sure why we should discuss this > > in the thread). > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > target? How does going through TC make this seamless? +1 My intuition is that for offload the device would be programmed at start-of-day / probe. By loading the compiled P4 from /lib/firmware. Then the _device_ tells the kernel what tables and parser graph it's got. Plus, if we're talking about offloads, aren't we getting back into the same controversies we had when merging OvS (not that I was around). The "standalone stack to the side" problem. Some of the tables in the pipeline may be for routing, not ACLs. Should they be fed from the routing stack? How is that integration going to work? The parsing graph feels a bit like global device configuration, not a piece of functionality that should sit under a sub-sub-system in the corner.
On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote: > > > I want to emphasize again these patches are about the P4 s/w pipeline > > > that is intended to work seamlessly with hw offload. If you are > > > interested in h/w offload and want to contribute just show up at the > > > meetings - they are open to all. The current offloadable piece is the > > > match-action tables. The P4 specs may change to include parsers in the > > > future or other objects etc (but not sure why we should discuss this > > > in the thread). > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > > target? How does going through TC make this seamless? > > +1 > I should clarify what I meant by "seamless". It means the same control API is used for s/w or h/w. This is a feature of tc, and is not being introduced by P4TC. P4 control only deals with Match-action tables - just as TC does. > My intuition is that for offload the device would be programmed at > start-of-day / probe. By loading the compiled P4 from /lib/firmware. > Then the _device_ tells the kernel what tables and parser graph it's > got. > BTW: I just want to say that these patches are about s/w - not offload. Someone asked about offload so, as happens in normal discussions, we steered in that direction. The hardware piece will require additional patchsets which still require discussions. I hope we don't steer off too much, otherwise I can start a new thread just to discuss the current view of the h/w. It's not the device telling the kernel what it has. It's the other way around. From the P4 program you generate the s/w (the ebpf code and other auxiliary stuff) and h/w pieces using a compiler. You compile ebpf, etc, then load. The current point of discussion is that the hw binary is to be "activated" through the same tc filter that does the s/w. So one could say: tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \ prog type hw filename "simple_l3.o" ... \ action bpf obj $PARSER.o section p4tc/parser \ action bpf obj $PROGNAME.o section p4tc/main And that would, through tc driver callbacks, signal to the driver to find the binary, possibly via /lib/firmware. Some of the original discussion was to use devlink for loading the binary - but that went nowhere. Once you have this in place, you then use netlink with tc skip_sw/skip_hw. This is what I meant by "seamless". > Plus, if we're talking about offloads, aren't we getting back into > the same controversies we had when merging OvS (not that I was around). > The "standalone stack to the side" problem. Some of the tables in the > pipeline may be for routing, not ACLs. Should they be fed from the > routing stack? How is that integration going to work? The parsing > graph feels a bit like global device configuration, not a piece of > functionality that should sit under a sub-sub-system in the corner. The current (maybe I should say initial) thought is that the P4 program does not touch the existing kernel infra such as fdb etc. Of course we can model the kernel datapath using P4 but you won't be using "ip route add..." or "bridge fdb...". In the future, a P4 extern could be used to model existing infra and we should be able to use the same tooling. That is a discussion that comes up on and off (I think it did in the last meeting). cheers, jamal
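For the driver-side flow sketched in the message above, the rough shape being described could look like the following. request_firmware()/release_firmware() are the existing kernel APIs for pulling a blob from /lib/firmware; everything P4-specific here (the offload request struct, the callback name and the device-programming step) is an assumption, since no driver patches exist yet.

/* Hypothetical driver-side sketch; only request_firmware()/release_firmware()
 * are existing kernel APIs, the P4-specific pieces are made up for
 * illustration.
 */
#include <linux/firmware.h>
#include <linux/netdevice.h>

struct p4_offload_req {          /* assumed shape of the tc offload request */
        const char *fw_name;     /* e.g. "simple_l3.o" from the tc command */
};

/* Device-specific programming step, stubbed out for this sketch. */
static int mydrv_program_pipeline(struct net_device *dev,
                                  const u8 *image, size_t len)
{
        return 0;
}

static int mydrv_setup_p4(struct net_device *dev, struct p4_offload_req *req)
{
        const struct firmware *fw;
        int err;

        /* Fetch the compiler-generated hardware binary from /lib/firmware */
        err = request_firmware(&fw, req->fw_name, &dev->dev);
        if (err)
                return err;

        /* Push the pipeline into the device */
        err = mydrv_program_pipeline(dev, fw->data, fw->size);

        release_firmware(fw);
        return err;
}

Once a device pipeline is active this way, the skip_sw/skip_hw split mentioned above is what would let the same netlink table-entry requests target either the eBPF tables or the device.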
>From: Paolo Abeni <pabeni@redhat.com> >Sent: Thursday, February 29, 2024 9:14 AM >To: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org >Cc: deb.chatterjee@intel.com; anjali.singhai@intel.com; namrata.limaye@intel.com; tom@sipanda.io; mleitner@redhat.com; Mahesh.Shirshyad@amd.com; Vipin.Jain@amd.com; tomasz.osinski@intel.com; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; vladbu@nvidia.com; horms@kernel.org; khalidm@nvidia.com; toke@redhat.com; daniel@iogearbox.net; victor@mojatatu.com; pctammela@mojatatu.com; dan.daly@intel.com; andy.fingerhut@gmail.com; Chris Sommers <chris.sommers@keysight.com>; mattyk@nvidia.com; bpf@vger.kernel.org >Subject: Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1) > >On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote: >> [...]
>> ------------- >> >> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program >> and its associated objects and state are attachend to a kernel _netns_ structure. >> IOW, if we had two programs across netns' or within a netns they have no >> visibility to each others objects (unlike for example TC actions whose kinds are >> "global" in nature or eBPF maps visavis bpftool). >> >> P4TC builds on top of many years of Linux TC experiences of a netlink control >> path interface coupled with a software datapath with an equivalent offloadable >> hardware datapath. In this patch series we are focussing only on the s/w >> datapath. The s/w and h/w path equivalence that TC provides is relevant >> for a primary use case of P4 where some (currently) large consumers of NICs >> provide vendors their datapath specs in P4. In such a case one could generate >> specified datapaths in s/w and test/validate the requirements before hardware >> acquisition(example [12]). >> >> Unlike other approaches such as TC Flower which require kernel and user space >> changes when new datapath objects like packet headers are introduced P4TC, with >> these patches, provides _kernel and user space code change independence_. >> Meaning: >> A P4 program describes headers, parsers, etc alongside the datapath processing; >> the compiler uses the P4 program as input and generates several artifacts which >> are then loaded into the kernel to manifest the intended datapath. In addition >> to the generated datapath, control path constructs are generated. The process is >> described further below in "P4TC Workflow". >> >> There have been many discussions and meetings within the community since >> about 2015 in regards to P4 over TC[2] and we are finally proving to the >> naysayers that we do get stuff done! >> >> A lot more of the P4TC motivation is captured at: >> https://urldefense.com/v3/__https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md__;!!I5pVk4LIGAfnvw!nUhTLe4nLBPIqlW8u7cJYnSCbFNQSDW318aG2nqxbMZuptPK41snJiHein5mNazAYWIwAfJ3o8O7ZSCV8wc$ >> >> __P4TC Architecture__ >> >> The current architecture was described at netdevconf 0x17[14] and if you prefer >> academic conference papers, a short paper is available here[15]. >> >> There are 4 parts: >> >> 1) A Template CRUD provisioning API for manifesting a P4 program and its >> associated objects in the kernel. The template provisioning API uses netlink. >> See patch in part 2. >> >> 2) A Runtime CRUD+ API code which is used for controlling the different runtime >> behavior of the P4 objects. The runtime API uses netlink. See notes further >> down. See patch description later.. >> >> 3) P4 objects and their control interfaces: tables, actions, externs, etc. >> Any object that requires control plane interaction resides in the TC domain >> and is subject to the CRUD runtime API. The intended goal is to make use of the >> tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w. >> >> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated >> by a compiler based on the P4 spec. When accessing any P4 object that requires >> control plane interfaces, the eBPF code accesses the P4TC side from #3 above >> using kfuncs. >> >> The generated eBPF code is derived from [13] with enhancements and fixes to meet >> our requirements. >> >> __P4TC Workflow__ >> >> The Development and instantiation workflow for P4TC is as follows: >> >> A) A developer writes a P4 program, "myprog" >> >> B) Compiles it using the P4C compiler[8]. 
The compiler generates 3 outputs: >> >> a) A shell script which form template definitions for the different P4 >> objects "myprog" utilizes (tables, externs, actions etc). See #1 above.. >> >> b) the parser and the rest of the datapath are generated as eBPF and need >> to be compiled into binaries. At the moment the parser and the main control >> block are generated as separate eBPF program but this could change in >> the future (without affecting any kernel code). See #4 above. >> >> c) A json introspection file used for the control plane (by iproute2/tc). >> >> C) At this point the artifacts from #1,#4 could be handed to an operator >> (the operator could be the same person as the developer from #A, #B). >> >> i) For the eBPF part, either the operator is handed an ebpf binary or >> source which they compile at this point into a binary. >> The operator executes the shell script(s) to manifest the functional >> "myprog" into the kernel. >> >> ii) The operator instantiates "myprog" pipeline via the tc P4 filter >> to ingress/egress (depending on P4 arch) of one or more netdevs/ports >> (illustrated below as "block 22"). >> >> Example instantion where the parser is a separate action: >> "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \ >> action bpf obj $PARSER.o section p4tc/parse \ >> action bpf obj $PROGNAME.o section p4tc/main" >> >> See individual patches in partc for more examples tc vs xdp etc. Also see >> section on "challenges" (further below on this cover letter). >> >> Once "myprog" P4 program is instantiated one can start performing operations >> on table entries and/or actions at runtime as described below. >> >> __P4TC Runtime Control Path__ >> >> The control interface builds on past tc experience and tries to get things >> right from the beginning (example filtering is separated from depending >> on existing object TLVs and made generic); also the code is written in >> such a way it is mostly lockless. >> >> The P4TC control interface, using netlink, provides what we call a CRUDPS >> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe, >> Publish. From a high level PoV the following describes a conformant high level >> API (both on netlink data model and code level): >> >> Create(</path/to/object, DATA>+) >> Read(</path/to/object>, [optional filter]) >> Update(</path/to/object>, DATA>+) >> Delete(</path/to/object>, [optional filter]) >> Subscribe(</path/to/object>, [optional filter]) >> >> Note, we _dont_ treat "dump" or "flush" as speacial. If "path/to/object" points >> to a table then a "Delete" implies "flush" and a "Read" implies dump but if >> it points to an entry (by specifying a key) then "Delete" implies deleting >> and entry and "Read" implies reading that single entry. It should be noted that >> both "Delete" and "Read" take an optional filter parameter. The filter can >> define further refinements to what the control plane wants read or deleted. >> "Subscribe" uses built in netlink event management. It, as well, takes a filter >> which can further refine what events get generated to the control plane (taken >> out of this patchset, to be re-added with consideration of [16]). 
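Before the runtime samples below, a minimal sketch of the shape of the two eBPF actions bound by the "tc filter add block 22 ... p4 pname myprog" example above may help. Only the section names p4tc/parse and p4tc/main are taken from that example; everything else (the fixed Ethernet/IPv4 layout, the use of skb->cb[] to hand the extracted key to the main block) is an assumption for illustration and not the compiler's actual output or P4TC's ABI.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only - a hand-rolled stand-in for the compiler-generated
 * parser and main control block referenced by:
 *   action bpf obj $PARSER.o section p4tc/parse
 *   action bpf obj $PROGNAME.o section p4tc/main
 * The use of skb->cb[] as scratch space between the two actions is an
 * assumption for illustration, not P4TC's actual ABI.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("p4tc/parse")
int myprog_parse(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);

	if ((void *)(iph + 1) > data_end)
		return TC_ACT_SHOT;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;	/* not IPv4: let the stack have it */

	/* Hand the extracted "dstAddr" key to the main control block. */
	skb->cb[0] = iph->daddr;
	return TC_ACT_PIPE;		/* continue with the next action */
}

SEC("p4tc/main")
int myprog_main(struct __sk_buff *skb)
{
	__u32 dst_addr = skb->cb[0];

	/* The real control block would look dst_addr up in "mytable" via a
	 * P4TC kfunc and execute the resulting action; see the later sketch. */
	return dst_addr ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";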
>> >> Lets show some runtime samples: >> >> ..create an entry, if we match ip address 10.0.1.2 send packet out eno1 >> tc p4ctrl create myprog/table/mytable \ >> dstAddr 10.0.1.2/32 action send_to_port param port eno1 >> >> ..Batch create entries >> tc p4ctrl create myprog/table/mytable \ >> entry dstAddr 10.1.1.2/32 action send_to_port param port eno1 \ >> entry dstAddr 10.1.10.2/32 action send_to_port param port eno10 \ >> entry dstAddr 10.0.2.2/32 action send_to_port param port eno2 >> >> ..Get an entry (note "read" is interchangeably used as "get" which is a common >> semantic in tc): >> tc p4ctrl read myprog/table/mytable \ >> dstAddr 10.0.2.2/32 >> >> ..dump mytable >> tc p4ctrl read myprog/table/mytable >> >> ..dump mytable for all entries whose key fits within 10.1.0.0/16 >> tc p4ctrl read myprog/table/mytable \ >> filter key/myprog/mytable/dstAddr = 10.1.0.0/16 >> >> ..dump all mytable entries which have an action send_to_port with param "eno1" >> tc p4ctrl get myprog/table/mytable \ >> filter param/act/myprog/send_to_port/port = "eno1" >> >> The filter expression is powerful, f.e you could say: >> >> tc p4ctrl get myprog/table/mytable \ >> filter param/act/myprog/send_to_port/port = "eno1" && \ >> key/myprog/mytable/dstAddr = 10.1.0.0/16 >> >> It also works on built in metadata, example in the following case dumping >> entries from mytable that have seen activity in the last 10 secs: >> tc p4ctrl get myprog/table/mytable \ >> filter msecs_since < 10000 >> >> Delete follows the same syntax as get/read, so for sake of brevity we won't >> show more example than how to flush mytable: >> >> tc p4ctrl delete myprog/table/mytable >> >> Mystery question: How do we achieve iproute2-kernel independence and >> how does "tc p4ctrl" as a cli know how to program the kernel given an >> arbitrary command line as shown above? Answer(s): It queries the >> compiler generated json file in "P4TC Workflow" #B.c above. The json file has >> enough details to figure out that we have a program called "myprog" which has a >> table "mytable" that has a key name "dstAddr" which happens to be type ipv4 >> address prefix. The json file also provides details to show that the table >> "mytable" supports an action called "send_to_port" which accepts a parameter >> "port" of type netdev (see the types patch for all supported P4 data types). >> All P4 components have names, IDs, and types - so this makes it very easy to map >> into netlink. >> Once user space tc/p4ctrl validates the human command input, it creates >> standard binary netlink structures (TLVs etc) which are sent to the kernel. >> See the runtime table entry patch for more details. >> >> __P4TC Datapath__ >> >> The P4TC s/w datapath execution is generated as eBPF. Any objects that require >> control interfacing reside in the "P4TC domain" and are controlled via netlink >> as described above. Per packet execution and state and even objects that do not >> require control interfacing (like the P4 parser) are generated as eBPF. >> >> A packet arriving on s/w ingress of any of the ports on block 22 will first be >> exercised via the (generated eBPF) parser component to extract the headers (the >> ip destination address in labelled "dstAddr" above). >> The datapath then proceeds to use "dstAddr", table ID and pipeline ID >> as a key to do a lookup in myprog's "mytable" which returns the action params >> which are then used to execute the action in the eBPF datapath (eventually >> sending out packets to eno1). 
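To connect the datapath walkthrough above to code, here is a rough sketch of the lookup step as the generated eBPF main block might perform it. The kfunc name, its argument struct and the returned action layout are placeholders standing in for the interface added by patch #14; only the overall flow (build a key from pipeline/table IDs plus dstAddr, call the kfunc, fall back to the table's miss action, otherwise execute send_to_port as a redirect) follows the description in this cover letter.

/* Illustrative sketch only - bpf_p4tc_tbl_lookup(), struct p4tc_table_key and
 * struct p4tc_act_params are hypothetical placeholders for the kfunc
 * interface introduced by patch #14. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct p4tc_table_key {			/* hypothetical key layout */
	__u32 pipeid;			/* pipeline ID */
	__u32 tblid;			/* table ID */
	__u32 dst_addr;			/* "dstAddr" extracted by the parser */
};

struct p4tc_act_params {		/* hypothetical lookup result */
	__u32 act_id;			/* e.g. send_to_port */
	__u32 port_ifindex;		/* action parameter "port" */
};

extern struct p4tc_act_params *
bpf_p4tc_tbl_lookup(struct __sk_buff *skb, struct p4tc_table_key *key,
		    __u32 key__sz) __ksym;	/* hypothetical kfunc */

SEC("p4tc/main")
int myprog_main(struct __sk_buff *skb)
{
	struct p4tc_table_key key = {
		.pipeid   = 1,
		.tblid    = 1,
		.dst_addr = skb->cb[0],	/* set by the parser action */
	};
	struct p4tc_act_params *act;

	act = bpf_p4tc_tbl_lookup(skb, &key, sizeof(key));
	if (!act)
		return TC_ACT_SHOT;	/* stand-in for mytable's default
					 * miss action */

	/* "send_to_port param port eno1" executes as a redirect here. */
	return bpf_redirect(act->port_ifindex, 0);
}

char _license[] SEC("license") = "GPL";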
>> On a table miss, mytable's default miss action (not described) is executed.
>>
>> __Testing__
>>
>> Speaking of testing - we have 2-300 tdc test cases (which will be in the
>> second patchset).
>> These tests are run on our CICD system on pull requests and after commits are
>> approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
>> input)including:
>> checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on both
>> X86, ARM 64 and emulated BE via qemu s390. We trigger performance testing in the
>> CICD to catch performance regressions (currently only on the control path, but
>> in the future for the datapath).
>> Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory
>> sanitizer but recently added support for concurrency sanitizer.
>> Before main releases we ensure each patch will compile on its own to help in
>> git bisect and run the xmas tree tool. We eventually put the code via coverity.
>>
>> In addition we are working on enabling a tool that will take a P4 program, run
>> it through the compiler, and generate permutations of traffic patterns via
>> symbolic execution that will test both positive and negative datapath code
>> paths. The test generator tool integration is still work in progress.
>> Also: We have other code that test parallelization etc which we are trying to
>> find a fit for in the kernel tree's testing infra.
>>
>>
>> __References__
>>
>> [1] https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> [2] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
>> [3] https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
>> [4] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
>> [5] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
>> [6] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
>> [7] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
>> [8] https://github.com/p4lang/p4c/tree/main/backends/tc
>> [9] https://p4.org/
>> [10] https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
>> [11] https://www.amd.com/en/accelerators/pensando
>> [12] https://github.com/sonic-net/DASH/tree/main
>> [13] https://github.com/p4lang/p4c/tree/main/backends/ebpf
>> [14] https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
>> [15] https://dl.acm.org/doi/10.1145/3630047.3630193
>> [16] https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
>> [17.a] https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
>> [17.b] man tc-u32
>> [18] man tc-pedit
>> [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
>> [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
>> [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
>>
>> --------
>> HISTORY
>> --------
>>
>> Changes in Version 12
>> ----------------------
>>
>> 0) Introduce back 15 patches (v11 had 5)
>>
>> 1) From discussions with Daniel:
>> i) Remove the XDP programs association alltogether. No refcounting. nothing.
>> ii) Remove prog type tc - everything is now an ebpf tc action.
>>
>> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
>>
>> 3) Add extack to specify how many entries (N of M) specified in a batch for
>> any of requested Create/Update/Delete succeeded. Prior to this it would
>> only tell us the batch failed to complete without giving us details of
>> which of M failed. Added as a debug aid.
>>
>> Changes in Version 11
>> ----------------------
>> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>> will go out after this is merged.
>>
>> 2) Change any references of IFNAMSIZ in the action code when referencing the
>> action name size to ACTNAMSIZ. Thanks to Marcelo.
>>
>> Changes in Version 10
>> ----------------------
>> 1) A couple of patches from the earlier version were clean enough to submit,
>> so we did. This gave us room to split the two largest patches each into
>> two.
Even though the split is not git-bisactable and really some of it didn't >> make much sense (eg spliting a create, and update in one patch and delete and >> get into another) we made sure each of the split patches compiled >> independently. The idea is to reduce the number of lines of code to review >> and when we get sufficient reviews we will put the splits together again. >> See patch #12 and #13 as well as patches #7 and #8). >> >> 2) Add more context in patch 0. Please READ! >> >> 3) Added dump/delete filters back to the code - we had taken them out in the >> earlier patches to reduce the amount of code for review - but in retrospect >> we feel they are important enough to push earlier rather than later. >> >> >> Changes In version 9 >> --------------------- >> >> 1) Remove the largest patch (externs) to ease review. >> >> 2) Break up action patches into two to ease review bringing down the patches >> that need more scrutiny to 8 (the first 7 are almost trivial). >> >> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions >> to provide consistency(Jiri). >> >> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs >> by making them static. TBH, not sure if this is the right solution >> but it makes sparse happy and hopefully someone will comment. >> >> Changes In Version 8 >> --------------------- >> >> 1) Fix all the patchwork warnings and improve our ci to catch them in the future >> >> 2) Reduce the number of patches to basic max(15) to ease review. >> >> Changes In Version 7 >> ------------------------- >> >> 0) First time removing the RFC tag! >> >> 1) Removed XDP cookie. It turns out as was pointed out by Toke(Thanks!) - that >> using bpf links was sufficient to protect us from someone replacing or deleting >> a eBPF program after it has been bound to a netdev. >> >> 2) Add some reviewed-bys from Vlad. >> >> 3) Small bug fixes from v6 based on testing for ebpf. >> >> 4) Added the counter extern as a sample extern. Illustrating this example because >> it is slightly complex since it is possible to invoke it directly from >> the P4TC domain (in case of direct counters) or from eBPF (indirect counters). >> It is not exactly the most efficient implementation (a reasonable counter impl >> should be per-cpu). >> >> Changes In RFC Version 6 >> ------------------------- >> >> 1) Completed integration from scriptable view to eBPF. Completed integration >> of externs integration. >> >> 2) Small bug fixes from v5 based on testing. >> >> Changes In RFC Version 5 >> ------------------------- >> >> 1) More integration from scriptable view to eBPF. Small bug fixes from last >> integration. >> >> 2) More streamlining support of externs via kfunc (create-on-miss, etc) >> >> 3) eBPF linking for XDP. >> >> There is more eBPF integration/streamlining coming (we are getting close to >> conversion from scriptable domain). >> >> Changes In RFC Version 4 >> ------------------------- >> >> 1) More integration from scriptable to eBPF. Small bug fixes. >> >> 2) More streamlining support of externs via kfunc (one additional kfunc). >> >> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata. >> >> There is more eBPF integration coming. One thing we looked at but is not in this >> patchset but should be in the next is use of eBPF link in our loading (see >> "challenge #1" further below). 
>> >> Changes In RFC Version 3 >> ------------------------- >> >> These patches are still in a little bit of flux as we adjust to integrating >> eBPF. So there are small constructs that are used in V1 and 2 but no longer >> used in this version. We will make a V4 which will remove those. >> The changes from V2 are as follows: >> >> 1) Feedback we got in V2 is to try stick to one of the two modes. In this version >> we are taking one more step and going the path of mode2 vs v2 where we had 2 modes. >> >> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating >> into eBPF we introduce another kfunc which encapsulates Register as part of the >> extern interface. >> >> 3) We have improved our CICD to include tools pointed to us by Simon. See >> "Testing" further below. Thanks to Simon for that and other issues he caught. >> Simon, we discussed on issue [7] but decided to keep that log since we think >> it is useful. >> >> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to >> re-discuss though; see: [5], [6]. >> >> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub. >> >> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are >> guaranteed that either A or B must exist; however, lets make smatch happy. >> Thanks to Simon and Dan Carpenter. >> >> Changes In RFC Version 2 >> ------------------------- >> >> Version 2 is the initial integration of the eBPF datapath. >> We took into consideration suggestions provided to use eBPF and put effort into >> analyzing eBPF as datapath which involved extensive testing. >> We implemented 6 approaches with eBPF and ran performance analysis and presented >> our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6 >> vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if >> you account for XDP or TC separately). >> >> Conclusions from the exercise: We lose the simple operational model we had >> prior to integrating eBPF. We do gain performance in most cases when the >> datapath is less compute-bound. >> For more discussion on our requirements vs journeying the eBPF path please >> scroll down to "Restating Our Requirements" and "Challenges". >> >> This patch set presented two modes. >> mode1: the parser is entirely based on eBPF - whereas the rest of the >> SW datapath stays as _scriptable_ as in Version 1. >> mode2: All of the kernel s/w datapath (including parser) is in eBPF. >> >> The key ingredient for eBPF, that we did not have access to in the past, is >> kfunc (it made a big difference for us to reconsider eBPF). >> >> In V2 the two modes are mutually exclusive (IOW, you get to choose one >> or the other via Kconfig). > >I think/fear that this series has a "quorum" problem: different voices >raises opposition, and nobody (?) outside the authors supported the >code and the feature. > >Could be the missing of H/W offload support in the current form the >root cause for such lack support? Or there are parties interested that >have been quite so far? > >Thanks, > >Paolo > Hi Paolo, thanks. I am one of those "parties interested that have been quite so far." I wanted to voice my staunch support for accepting P4TC into the kernel. None of the present objections in the various threads reduce my enthusiasm. I find the following aspects most compelling: - Performant, highly functional, pure-SW P4 dataplane - Near-ubiquitous availability on all platforms, once it's upstreamed. 
Saves having to install a bunch of other p4 ecosystem tools, lowers the barrier to entry, and increases the likelihood an application can run on any platform. - larger dev community. Anything added to the Linux kernel benefits from a large, thriving community, vast and rigorous regression testing, long-term support, etc. - well-conceived CRUDX northbound API and clever use of existing well-understood netlink, easy to overlay other northbound APIs such as TDI (Table driven interface) used in IPDK; P4Runtime gRPC API; etc. - integration with popular and well-understood tc provides a good impedance match for users. - extensibility, ability to add externs, and interface to eBPF. The ability to add externs is especially compelling. It is not easy to do so in current backends such as bmv2, P4DPDK or p4-ebpf. - roadmap to hardware offload for even greater performance. Even _without_ offload, the above benefits justify it in my mind. There are many applications for a pure-SW P4 dataplane, both in userland like P4DPDK, and the proposed P4TC - running as part of the kernel is _exciting_. Vendors have already voiced their support for offload and this initial set of patches paves the way and lets the community benefit from it and start to make it better, now. It is possible the detractors of P4TC are not active P4 users, so I hope to provide a bit of perspective. Besides the pioneering switch ASIC (Tofino) use-cases which provided the initial impetus, P4 is used extensively in at least two commercial IPUs/DPUs. In addition, there are multiple toolchains to run P4 code on FPGAs. The dream is to write P4 code which can be run in a scalable fashion on a range of targets. It shouldn’t be necessary to “prove” P4 is worthy, those who’ve already embraced it know this. There are several use-cases for a SW implementation of a P4 dataplane, including behavioral modeling and production uses. P4 allows one to write core functionality which can run on multiple platforms: pure SW, FPGAs, offload NICs/DPUs/IPUs, switch ASICs. Behavioral modeling of a pipeline using P4: - The SONiC-DASH project (https://github.com/sonic-net/DASH) is a thriving, multi-vendor collaboration which specifies advanced, high-performance features to accelerate datacenter services. These overlay services are specified using a P4 program which allows all concerned to agree on the packet pipeline and even the control-plane APIs (using SAI, the Switch Abstraction Interface). The actual implementation on a vendor's offload device (DPU/IPU) may or may not use any of the reference P4 code, but that is not important. What is important is that we specify the dataplane in P4, and execute it on the bmv2 backend in a container. We run conformance and regression suites with standard test vectors, which can also be run against actual production implementations to verify compliance. The bmv2 backend has many limitations, including performance and difficulty to extend its functionality. As a major contributor to this project, I am helping to explore alternatives. - Large-scale cloud-service providers use P4 extensively as a dataplane (fabric switch) modeling language. One of the driving use-cases in the P4-API working group (I’m a co-chair) is to control SDN switches using P4-Runtime. The switches’ pipelines are modeled in P4 by some users, similar to the DASH use-case. Having a performant, pure-SW implementation is invaluable for modeling and simulation. 
Running P4 code in pure SW for production use-cases (not just modeling):
There are many use-cases for running a custom dataplane written in P4. The
productivity of P4 code cannot be overstated. With the right framework, P4
apps can be developed (and controlled/managed) in literally hours. It is much
more productive than writing, say, C or eBPF. I can do all three, and P4 is
way more productive for certain applications.

In conclusion, I hope we can upstream P4TC soon. Please move this forward
with all due speed.

Thanks!
Chris Sommers
Keysight Technologies
On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote: > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > > > target? How does going through TC make this seamless? > > > > +1 > > I should clarify what i meant by "seamless". It means the same control > API is used for s/w or h/w. This is a feature of tc, and is not being > introduced by P4TC. P4 control only deals with Match-action tables - > just as TC does. Right, and the compiled P4 pipeline is tacked onto that API. Loading that presumably implies a pipeline reset. There's no precedent for loading things into TC resulting a device datapath reset. > > My intuition is that for offload the device would be programmed at > > start-of-day / probe. By loading the compiled P4 from /lib/firmware. > > Then the _device_ tells the kernel what tables and parser graph it's > > got. > > BTW: I just want to say that these patches are about s/w - not > offload. Someone asked about offload so as in normal discussions we > steered in that direction. The hardware piece will require additional > patchsets which still require discussions. I hope we dont steer off > too much, otherwise i can start a new thread just to discuss current > view of the h/w. > > Its not the device telling the kernel what it has. Its the other way around. Yes, I'm describing how I'd have designed it :) If it was the same as what you've already implemented - why would I be typing it into an email.. ? :) > From the P4 program you generate the s/w (the ebpf code and other > auxillary stuff) and h/w pieces using a compiler. > You compile ebpf, etc, then load. That part is fine. > The current point of discussion is the hw binary is to be "activated" > through the same tc filter that does the s/w. So one could say: > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 > \ > prog type hw filename "simple_l3.o" ... \ > action bpf obj $PARSER.o section p4tc/parser \ > action bpf obj $PROGNAME.o section p4tc/main > > And that would through tc driver callbacks signal to the driver to > find the binary possibly via /lib/firmware > Some of the original discussion was to use devlink for loading the > binary - but that went nowhere. Back to the device reset, unless the load has no impact on inflight traffic the loading doesn't belong in TC, IMO. Plus you're going to run into (what IIRC was Jiri's complaint) that you're loading arbitrary binary blobs, opaque to the kernel. > Once you have this in place then netlink with tc skip_sw/hw. This is > what i meant by "seamless" > > > Plus, if we're talking about offloads, aren't we getting back into > > the same controversies we had when merging OvS (not that I was around). > > The "standalone stack to the side" problem. Some of the tables in the > > pipeline may be for routing, not ACLs. Should they be fed from the > > routing stack? How is that integration going to work? The parsing > > graph feels a bit like global device configuration, not a piece of > > functionality that should sit under sub-sub-system in the corner. > > The current (maybe i should say initial) thought is the P4 program > does not touch the existing kernel infra such as fdb etc. It's off to the side thing. Ignoring the fact that *all*, networking devices already have parsers which would benefit from being accurately described. > Of course we can model the kernel datapath using P4 but you wont be > using "ip route add..." or "bridge fdb...". 
> In the future, P4 extern could be used to model existing infra and we > should be able to use the same tooling. That is a discussion that > comes on/off (i think it did in the last meeting). Maybe, IDK. I thought prevailing wisdom, at least for offloads, is to offload the existing networking stack, and fill in the gaps. Not build a completely new implementation from scratch, and "integrate later". Or at least "fill in the gaps" is how I like to think. I can't quite fit together in my head how this is okay, but OvS was not allowed to add their offload API. And what's supposed to be part of TC and what isn't, where you only expect to have one filter here, and create a whole new object universe inside TC. But that's just my opinions. The way things work we may wake up one day and find out that Dave has applied this :)
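To make the disputed step concrete, here is a rough sketch of the driver-side plumbing this exchange is about: a tc offload callback through which the p4 filter could ask a driver to fetch and activate the compiled hardware blob from /lib/firmware. TC_SETUP_P4, struct p4_prog_offload and mydev_load_p4_pipeline() are assumptions made up purely for illustration; only ndo_setup_tc() and request_firmware() are existing kernel interfaces, and whether such a hook should live under TC at all is exactly what is being debated above.

/* Sketch only: TC_SETUP_P4, struct p4_prog_offload and the mydev_*() helpers
 * are hypothetical; ndo_setup_tc() and request_firmware() are the existing
 * kernel interfaces this discussion revolves around. */
#include <linux/netdevice.h>
#include <linux/firmware.h>
#include <linux/errno.h>

#define TC_SETUP_P4 100			/* hypothetical tc_setup_type value */

struct p4_prog_offload {		/* hypothetical type_data payload */
	const char *pname;		/* e.g. "simple_l3" */
	const char *fw_name;		/* e.g. "simple_l3.o" */
};

/* Stand-in for the device-specific pipeline programming routine. */
static int mydev_load_p4_pipeline(struct net_device *dev, const char *pname,
				  const void *blob, size_t size)
{
	return 0;
}

static int mydev_setup_tc(struct net_device *dev, enum tc_setup_type type,
			  void *type_data)
{
	struct p4_prog_offload *p4 = type_data;
	const struct firmware *fw;
	int err;

	if (type != TC_SETUP_P4)
		return -EOPNOTSUPP;

	/* Fetch the compiler-generated hardware blob from /lib/firmware. */
	err = request_firmware(&fw, p4->fw_name, &dev->dev);
	if (err)
		return err;

	/* Push it to the device; whether this can be done without disturbing
	 * in-flight traffic is the open question in this thread. */
	err = mydev_load_p4_pipeline(dev, p4->pname, fw->data, fw->size);

	release_firmware(fw);
	return err;
}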
On Fri, Mar 1, 2024 at 5:32 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote: > > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > > > > target? How does going through TC make this seamless? > > > > > > +1 > > > > I should clarify what i meant by "seamless". It means the same control > > API is used for s/w or h/w. This is a feature of tc, and is not being > > introduced by P4TC. P4 control only deals with Match-action tables - > > just as TC does. > > Right, and the compiled P4 pipeline is tacked onto that API. > Loading that presumably implies a pipeline reset. There's > no precedent for loading things into TC resulting a device > datapath reset. > > > > My intuition is that for offload the device would be programmed at > > > start-of-day / probe. By loading the compiled P4 from /lib/firmware. > > > Then the _device_ tells the kernel what tables and parser graph it's > > > got. > > > > BTW: I just want to say that these patches are about s/w - not > > offload. Someone asked about offload so as in normal discussions we > > steered in that direction. The hardware piece will require additional > > patchsets which still require discussions. I hope we dont steer off > > too much, otherwise i can start a new thread just to discuss current > > view of the h/w. > > > > Its not the device telling the kernel what it has. Its the other way around. > > Yes, I'm describing how I'd have designed it :) If it was the same > as what you've already implemented - why would I be typing it into > an email.. ? :) > > > From the P4 program you generate the s/w (the ebpf code and other > > auxillary stuff) and h/w pieces using a compiler. > > You compile ebpf, etc, then load. > > That part is fine. > > > The current point of discussion is the hw binary is to be "activated" > > through the same tc filter that does the s/w. So one could say: > > > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 > > \ > > prog type hw filename "simple_l3.o" ... \ > > action bpf obj $PARSER.o section p4tc/parser \ > > action bpf obj $PROGNAME.o section p4tc/main > > > > And that would through tc driver callbacks signal to the driver to > > find the binary possibly via /lib/firmware > > Some of the original discussion was to use devlink for loading the > > binary - but that went nowhere. > > Back to the device reset, unless the load has no impact on inflight > traffic the loading doesn't belong in TC, IMO. Plus you're going to > run into (what IIRC was Jiri's complaint) that you're loading arbitrary > binary blobs, opaque to the kernel. > > > Once you have this in place then netlink with tc skip_sw/hw. This is > > what i meant by "seamless" > > > > > Plus, if we're talking about offloads, aren't we getting back into > > > the same controversies we had when merging OvS (not that I was around). > > > The "standalone stack to the side" problem. Some of the tables in the > > > pipeline may be for routing, not ACLs. Should they be fed from the > > > routing stack? How is that integration going to work? The parsing > > > graph feels a bit like global device configuration, not a piece of > > > functionality that should sit under sub-sub-system in the corner. > > > > The current (maybe i should say initial) thought is the P4 program > > does not touch the existing kernel infra such as fdb etc. > > It's off to the side thing. 
Ignoring the fact that *all*, networking > devices already have parsers which would benefit from being accurately > described. Jakub, This is configurability versus programmability. The table driven approach as input (configurability) might work fine for generic match-action tables up to the point that tables are expressive enough to satisfy the requirements. But parsing doesn't fall into the table driven paradigm: parsers want to be *programmed*. This is why we removed kParser from this patch set and fell back to eBPF for parsing. But the problem we quickly hit that eBPF is not offloadable to network devices, for example when we compile P4 in an eBPF parser we've lost the declarative representation that parsers in the devices could consume (they're not CPUs running eBPF). I think the key here is what we mean by kernel offload. When we do kernel offload, is it the kernel implementation or the kernel functionality that's being offloaded? If it's the latter then we have a lot more flexibility. What we'd need is a safe and secure way to synchronize with that offload device that precisely supports the kernel functionality we'd like to offload. This can be done if both the kernel bits and programmed offload are derived from the same source (i.e. tag source code with a sha-1). For example, if someone writes a parser in P4, we can compile that into both eBPF and a P4 backend using independent tool chains and program download. At runtime, the kernel can safely offload the functionality of the eBPF parser to the device if it matches the hash to that reported by the device Tom > > > Of course we can model the kernel datapath using P4 but you wont be > > using "ip route add..." or "bridge fdb...". > > In the future, P4 extern could be used to model existing infra and we > > should be able to use the same tooling. That is a discussion that > > comes on/off (i think it did in the last meeting). > > Maybe, IDK. I thought prevailing wisdom, at least for offloads, > is to offload the existing networking stack, and fill in the gaps. > Not build a completely new implementation from scratch, and "integrate > later". Or at least "fill in the gaps" is how I like to think. > > I can't quite fit together in my head how this is okay, but OvS > was not allowed to add their offload API. And what's supposed to > be part of TC and what isn't, where you only expect to have one > filter here, and create a whole new object universe inside TC. > > But that's just my opinions. The way things work we may wake up one > day and find out that Dave has applied this :)
On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote: > > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > > > > target? How does going through TC make this seamless? > > > > > > +1 > > > > I should clarify what i meant by "seamless". It means the same control > > API is used for s/w or h/w. This is a feature of tc, and is not being > > introduced by P4TC. P4 control only deals with Match-action tables - > > just as TC does. > > Right, and the compiled P4 pipeline is tacked onto that API. > Loading that presumably implies a pipeline reset. There's > no precedent for loading things into TC resulting a device > datapath reset. Ive changed the subject to reflect this discussion is about h/w offload so we dont drift too much from the intent of the patches. AFAIK, all these devices have some HA built in to do program replacement. i.e. afaik, no device reset. I believe the tofino switch in the earlier generations may have needed resets which caused a few packet drops in a live environment update. Granted there may be devices (not that i am aware) that may not be able to do HA. All this needs to be considered for offloads. > > > My intuition is that for offload the device would be programmed at > > > start-of-day / probe. By loading the compiled P4 from /lib/firmware. > > > Then the _device_ tells the kernel what tables and parser graph it's > > > got. > > > > BTW: I just want to say that these patches are about s/w - not > > offload. Someone asked about offload so as in normal discussions we > > steered in that direction. The hardware piece will require additional > > patchsets which still require discussions. I hope we dont steer off > > too much, otherwise i can start a new thread just to discuss current > > view of the h/w. > > > > Its not the device telling the kernel what it has. Its the other way around. > > Yes, I'm describing how I'd have designed it :) If it was the same > as what you've already implemented - why would I be typing it into > an email.. ? :) > I think i misunderstood you and thought I needed to provide context. The P4 pipelines are meant to be able to be re-programmed multiple times in a live environment. IOW, I should be able to delete/create a pipeline while another is running. Some hardware may require that the parser is shared etc, but you can certainly replace the match action tables or add an entirely new logic. In any case this is all still under discussion and can be further refined. > > From the P4 program you generate the s/w (the ebpf code and other > > auxillary stuff) and h/w pieces using a compiler. > > You compile ebpf, etc, then load. > > That part is fine. > > > The current point of discussion is the hw binary is to be "activated" > > through the same tc filter that does the s/w. So one could say: > > > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 > > \ > > prog type hw filename "simple_l3.o" ... \ > > action bpf obj $PARSER.o section p4tc/parser \ > > action bpf obj $PROGNAME.o section p4tc/main > > > > And that would through tc driver callbacks signal to the driver to > > find the binary possibly via /lib/firmware > > Some of the original discussion was to use devlink for loading the > > binary - but that went nowhere. > > Back to the device reset, unless the load has no impact on inflight > traffic the loading doesn't belong in TC, IMO. 
Plus you're going to > run into (what IIRC was Jiri's complaint) that you're loading arbitrary > binary blobs, opaque to the kernel. > And you said at that time binary blobs are already a way of life. Let's take DDP as a use case: They load the firmware (via ethtool) and we were recently discussing whether they should use flower or u32 etc. I would say this is in the same spirit. Doing ethtool may be a bit disconnected. But that is up for discussion as well. There has been concern that we need to have some authentication in some of the discussions. Is that what you mean? > > Once you have this in place then netlink with tc skip_sw/hw. This is > > what i meant by "seamless" > > > > > Plus, if we're talking about offloads, aren't we getting back into > > > the same controversies we had when merging OvS (not that I was around). > > > The "standalone stack to the side" problem. Some of the tables in the > > > pipeline may be for routing, not ACLs. Should they be fed from the > > > routing stack? How is that integration going to work? The parsing > > > graph feels a bit like global device configuration, not a piece of > > > functionality that should sit under sub-sub-system in the corner. > > > > The current (maybe i should say initial) thought is the P4 program > > does not touch the existing kernel infra such as fdb etc. > > It's off to the side thing. Ignoring the fact that *all*, networking > devices already have parsers which would benefit from being accurately > described. > I am not following this point. > > Of course we can model the kernel datapath using P4 but you wont be > > using "ip route add..." or "bridge fdb...". > > In the future, P4 extern could be used to model existing infra and we > > should be able to use the same tooling. That is a discussion that > > comes on/off (i think it did in the last meeting). > > Maybe, IDK. I thought prevailing wisdom, at least for offloads, > is to offload the existing networking stack, and fill in the gaps. > Not build a completely new implementation from scratch, and "integrate > later". Or at least "fill in the gaps" is how I like to think. > > I can't quite fit together in my head how this is okay, but OvS > was not allowed to add their offload API. And what's supposed to > be part of TC and what isn't, where you only expect to have one > filter here, and create a whole new object universe inside TC. > I was there. Ovs matched what tc already had functionally, 10 years after tc existed, and they were busy rewriting what tc offered. So naturally we pushed for them to use what TC had. You still need to write whatever extensions needed into the kernel etc in order to support what the hardware can offer. I hope i am not stating the obvious: P4 provides a more malleable approach. Assume a blank template in h/w and s/w and where you specify what you need then both the s/w and hardware support it. Flower is analogous to a "fixed pipeline" meaning you can extend flower by changing the kernel and datapath. Often it is not covering all potential hw match actions engines and often we see patches to do one more thing requiring more kernel changes. If you replace flower with P4 you remove the need to update the kernel, user space etc for the same features that flower needs to be extended for today. You just tell the compiler what you need (within hardware capacity of course). So i dont see P4 as "offload the existing kernel infra aka flower" but rather remove the limitations that flower constrains us with today. 
As far as other kernel infra (fdb etc), that can be added as i stated - it is just not a starting point. cheers, jamal > But that's just my opinions. The way things work we may wake up one > day and find out that Dave has applied this :)
On Fri, Mar 1, 2024 at 9:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote: > > > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend > > > > > target? How does going through TC make this seamless? > > > > > > > > +1 > > > > > > I should clarify what i meant by "seamless". It means the same control > > > API is used for s/w or h/w. This is a feature of tc, and is not being > > > introduced by P4TC. P4 control only deals with Match-action tables - > > > just as TC does. > > > > Right, and the compiled P4 pipeline is tacked onto that API. > > Loading that presumably implies a pipeline reset. There's > > no precedent for loading things into TC resulting a device > > datapath reset. > > Ive changed the subject to reflect this discussion is about h/w > offload so we dont drift too much from the intent of the patches. > > AFAIK, all these devices have some HA built in to do program > replacement. i.e. afaik, no device reset. > I believe the tofino switch in the earlier generations may have needed > resets which caused a few packet drops in a live environment update. > Granted there may be devices (not that i am aware) that may not be > able to do HA. All this needs to be considered for offloads. > > > > > My intuition is that for offload the device would be programmed at > > > > start-of-day / probe. By loading the compiled P4 from /lib/firmware. > > > > Then the _device_ tells the kernel what tables and parser graph it's > > > > got. > > > > > > BTW: I just want to say that these patches are about s/w - not > > > offload. Someone asked about offload so as in normal discussions we > > > steered in that direction. The hardware piece will require additional > > > patchsets which still require discussions. I hope we dont steer off > > > too much, otherwise i can start a new thread just to discuss current > > > view of the h/w. > > > > > > Its not the device telling the kernel what it has. Its the other way around. > > > > Yes, I'm describing how I'd have designed it :) If it was the same > > as what you've already implemented - why would I be typing it into > > an email.. ? :) > > > > I think i misunderstood you and thought I needed to provide context. > The P4 pipelines are meant to be able to be re-programmed multiple > times in a live environment. IOW, I should be able to delete/create a > pipeline while another is running. Some hardware may require that the > parser is shared etc, but you can certainly replace the match action > tables or add an entirely new logic. In any case this is all still > under discussion and can be further refined. > > > > From the P4 program you generate the s/w (the ebpf code and other > > > auxillary stuff) and h/w pieces using a compiler. > > > You compile ebpf, etc, then load. > > > > That part is fine. > > > > > The current point of discussion is the hw binary is to be "activated" > > > through the same tc filter that does the s/w. So one could say: > > > > > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 > > > \ > > > prog type hw filename "simple_l3.o" ... 
\ > > > action bpf obj $PARSER.o section p4tc/parser \ > > > action bpf obj $PROGNAME.o section p4tc/main > > > > > > And that would through tc driver callbacks signal to the driver to > > > find the binary possibly via /lib/firmware > > > Some of the original discussion was to use devlink for loading the > > > binary - but that went nowhere. > > > > Back to the device reset, unless the load has no impact on inflight > > traffic the loading doesn't belong in TC, IMO. Plus you're going to > > run into (what IIRC was Jiri's complaint) that you're loading arbitrary > > binary blobs, opaque to the kernel. > > > > And you said at that time binary blobs are already a way of life. > Let's take DDP as a use case: They load the firmware (via ethtool) > and we were recently discussing whether they should use flower or u32 > etc. I would say this is in the same spirit. Doing ethtool may be a > bit disconnected. But that is up for discussion as well. > There has been concern that we need to have some authentication in > some of the discussions. Is that what you mean? > > > > Once you have this in place then netlink with tc skip_sw/hw. This is > > > what i meant by "seamless" > > > > > > > Plus, if we're talking about offloads, aren't we getting back into > > > > the same controversies we had when merging OvS (not that I was around). > > > > The "standalone stack to the side" problem. Some of the tables in the > > > > pipeline may be for routing, not ACLs. Should they be fed from the > > > > routing stack? How is that integration going to work? The parsing > > > > graph feels a bit like global device configuration, not a piece of > > > > functionality that should sit under sub-sub-system in the corner. > > > > > > The current (maybe i should say initial) thought is the P4 program > > > does not touch the existing kernel infra such as fdb etc. > > > > It's off to the side thing. Ignoring the fact that *all*, networking > > devices already have parsers which would benefit from being accurately > > described. > > > > I am not following this point. > > > > Of course we can model the kernel datapath using P4 but you wont be > > > using "ip route add..." or "bridge fdb...". > > > In the future, P4 extern could be used to model existing infra and we > > > should be able to use the same tooling. That is a discussion that > > > comes on/off (i think it did in the last meeting). > > > > Maybe, IDK. I thought prevailing wisdom, at least for offloads, > > is to offload the existing networking stack, and fill in the gaps. > > Not build a completely new implementation from scratch, and "integrate > > later". Or at least "fill in the gaps" is how I like to think. > > > > I can't quite fit together in my head how this is okay, but OvS > > was not allowed to add their offload API. And what's supposed to > > be part of TC and what isn't, where you only expect to have one > > filter here, and create a whole new object universe inside TC. > > > > I was there. > Ovs matched what tc already had functionally, 10 years after tc > existed, and they were busy rewriting what tc offered. So naturally we > pushed for them to use what TC had. You still need to write whatever > extensions needed into the kernel etc in order to support what the > hardware can offer. > > I hope i am not stating the obvious: P4 provides a more malleable > approach. Assume a blank template in h/w and s/w and where you specify > what you need then both the s/w and hardware support it. 
Flower is
> analogous to a "fixed pipeline" meaning you can extend flower by
> changing the kernel and datapath. Often it is not covering all
> potential hw match actions engines and often we see patches to do one
> more thing requiring more kernel changes. If you replace flower with
> P4 you remove the need to update the kernel, user space etc for the
> same features that flower needs to be extended for today. You just
> tell the compiler what you need (within hardware capacity of course).
> So i dont see P4 as "offload the existing kernel infra aka flower" but
> rather remove the limitations that flower constrains us with today. As
> far as other kernel infra (fdb etc), that can be added as i stated -
> it is just not a starting point.
>

Sorry, after getting some coffee i believe I mumbled too much in my
previous email. Let me summarize your points and reduce the mumbling:

1) Your point on: Triggering the pipeline re/programming via the filter
would require a reset of the device on a live environment.

AFAIK, the "P4 native" devices that I know of do allow multiple
programs and have operational schemes to allow updates without resets.
I will gather more info and post it after one of our meetings. Having
said that, we really have not paid much attention to this detail so it
is a valid concern that needs to be ironed out. It becomes even more
pressing if we want to support a device that is not "P4 native", or one
that requires a reset whether it is P4 native or not; in that case what
you referred to as "programmed at start-of-day / probe" is a valid
concern.

2) Your point on: "integrate later", or at least "fill in the gaps"
This part i am probably going to mumble on. I am going to consider
more than just doing ACLs/MAT via flower/u32 for the sake of
discussion.
True, "fill the gaps" has been our model so far. It requires kernel
changes, user space code changes etc justifiably so because most of
the time such datapaths are subject to standardization via IETF, IEEE,
etc and new extensions come in on a regular basis. And sometimes we
do add features that one or two users or a single vendor has need for
at the cost of kernel and user/control extension. Given our work
process, any features added this way take a long time to make it to
the end user.

At the cost of this sounding controversial, i am going
to call things like fdb, fib, etc which have fixed datapaths in the
kernel "legacy". These "legacy" datapaths almost all the time have
very strong user bases with strong infra tooling which took years to
get in shape. So they must be supported. I see two approaches:
- you can leave those "legacy" ndo ops alone and not go via the tc
ndo ops used by P4TC.
- or write a P4 program that looks _exactly_ like what current
bridging looks like and add helpers to allow existing tools to
continue to work via tc ndo and then phase out the "fixed datapath"
ndos. This will take a long long time but it could be a goal.

There is another caveat: Often different vendor hardware has slightly
different features which cant be exposed because either they are very
specific to the vendor or it's just very hard to express with existing
"legacy" without making intrusive changes. So we are going to be able
to allow these vendors/users to expose as much or as little as is
needed for a specific deployment without affecting anyone else with
new kernel/user code.

On the "integrate later" aspect: That is probably because most of the
times we want to avoid doing intrusive niche changes (which is
resolvable with the above).

cheers,
jamal
On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > This is configurability versus programmability. The table driven > approach as input (configurability) might work fine for generic > match-action tables up to the point that tables are expressive enough > to satisfy the requirements. But parsing doesn't fall into the table > driven paradigm: parsers want to be *programmed*. This is why we > removed kParser from this patch set and fell back to eBPF for parsing. > But the problem we quickly hit that eBPF is not offloadable to network > devices, for example when we compile P4 in an eBPF parser we've lost > the declarative representation that parsers in the devices could > consume (they're not CPUs running eBPF). > > I think the key here is what we mean by kernel offload. When we do > kernel offload, is it the kernel implementation or the kernel > functionality that's being offloaded? If it's the latter then we have > a lot more flexibility. What we'd need is a safe and secure way to > synchronize with that offload device that precisely supports the > kernel functionality we'd like to offload. This can be done if both > the kernel bits and programmed offload are derived from the same > source (i.e. tag source code with a sha-1). For example, if someone > writes a parser in P4, we can compile that into both eBPF and a P4 > backend using independent tool chains and program download. At > runtime, the kernel can safely offload the functionality of the eBPF > parser to the device if it matches the hash to that reported by the > device Good points. If I understand you correctly you're saying that parsers are more complex than just a basic parsing tree a'la u32. Then we can take this argument further. P4 has grown to encompass a lot of functionality of quite complex devices. How do we square that with the kernel functionality offload model. If the entire device is modeled, including f.e. TSO, an offload would mean that the user has to write a TSO implementation which they then load into TC? That seems odd. IOW I don't quite know how to square in my head the "total functionality" with being a TC-based "plugin".
On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > 2) Your point on: "integrate later", or at least "fill in the gaps" > This part i am probably going to mumble on. I am going to consider > more than just doing ACLs/MAT via flower/u32 for the sake of > discussion. > True, "fill the gaps" has been our model so far. It requires kernel > changes, user space code changes etc justifiably so because most of > the time such datapaths are subject to standardization via IETF, IEEE, > etc and new extensions come in on a regular basis. And sometimes we > do add features that one or two users or a single vendor has need for > at the cost of kernel and user/control extension. Given our work > process, any features added this way take a long time to make it to > the end user. What I had in mind was more of a DDP model. The device loads it binary blob FW in whatever way it does, then it tells the kernel its parser graph, and tables. The kernel exposes those tables to user space. All dynamic, no need to change the kernel for each new protocol. But that's different in two ways: 1. the device tells kernel the tables, no "dynamic reprogramming" 2. you don't need the SW side, the only use of the API is to interact with the device User can still do BPF kfuncs to look up in the tables (like in FIB), but call them from cls_bpf. I think in P4 terms that may be something more akin to only providing the runtime API? I seem to recall they had some distinction... > At the cost of this sounding controversial, i am going > to call things like fdb, fib, etc which have fixed datapaths in the > kernel "legacy". These "legacy" datapaths almost all the time have The cynic in me sometimes thinks that the biggest problem with "legacy" protocols is that it's hard to make money on them :) > very strong user bases with strong infra tooling which took years to > get in shape. So they must be supported. I see two approaches: > - you can leave those "legacy" ndo ops alone and not go via the tc > ndo ops used by P4TC. > - or write a P4 program that looks _exactly_ like what current > bridging looks like and add helpers to allow existing tools to > continue to work via tc ndo and then phase out the "fixed datapath" > ndos. This will take a long long time but it could be a goal. > > There is another caveat: Often different vendor hardware has slightly > different features which cant be exposed because either they are very > specific to the vendor or it's just very hard to express with existing > "legacy" without making intrusive changes. So we are going to be able > to allow these vendors/users to expose as much or as little as is > needed for a specific deployment without affecting anyone else with > new kernel/user code. > > On the "integrate later" aspect: That is probably because most of the > times we want to avoid doing intrusive niche changes (which is > resolvable with the above).
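The FIB analogy already has a concrete shape in today's kernel: a cls_bpf program can consult the kernel FIB with the existing bpf_fib_lookup() helper, which is roughly the "call kernel tables from cls_bpf" interaction model sketched above. The example below uses only that existing helper and involves no P4TC interfaces.

/* Existing-kernel example of "look up a kernel table from cls_bpf":
 * bpf_fib_lookup(). No P4TC code involved. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("tc")
int fib_fwd(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};
	int rc;

	if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	fib.family   = AF_INET;
	fib.ifindex  = skb->ingress_ifindex;
	fib.ipv4_src = iph->saddr;
	fib.ipv4_dst = iph->daddr;
	fib.tot_len  = bpf_ntohs(iph->tot_len);

	rc = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);
	if (rc != BPF_FIB_LKUP_RET_SUCCESS)
		return TC_ACT_OK;	/* let the stack handle it */

	/* Rewrite MACs from the FIB/neighbour result and redirect. */
	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";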
On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > > This is configurability versus programmability. The table driven > > approach as input (configurability) might work fine for generic > > match-action tables up to the point that tables are expressive enough > > to satisfy the requirements. But parsing doesn't fall into the table > > driven paradigm: parsers want to be *programmed*. This is why we > > removed kParser from this patch set and fell back to eBPF for parsing. > > But the problem we quickly hit that eBPF is not offloadable to network > > devices, for example when we compile P4 in an eBPF parser we've lost > > the declarative representation that parsers in the devices could > > consume (they're not CPUs running eBPF). > > > > I think the key here is what we mean by kernel offload. When we do > > kernel offload, is it the kernel implementation or the kernel > > functionality that's being offloaded? If it's the latter then we have > > a lot more flexibility. What we'd need is a safe and secure way to > > synchronize with that offload device that precisely supports the > > kernel functionality we'd like to offload. This can be done if both > > the kernel bits and programmed offload are derived from the same > > source (i.e. tag source code with a sha-1). For example, if someone > > writes a parser in P4, we can compile that into both eBPF and a P4 > > backend using independent tool chains and program download. At > > runtime, the kernel can safely offload the functionality of the eBPF > > parser to the device if it matches the hash to that reported by the > > device > > Good points. If I understand you correctly you're saying that parsers > are more complex than just a basic parsing tree a'la u32. Yes. Parsing things like TLVs, GRE flag field, or nested protobufs isn't conducive to u32. We also want the advantages of compiler optimizations to unroll loops, squash nodes in the parse graph, etc. > Then we can take this argument further. P4 has grown to encompass a lot > of functionality of quite complex devices. How do we square that with > the kernel functionality offload model. If the entire device is modeled, > including f.e. TSO, an offload would mean that the user has to write > a TSO implementation which they then load into TC? That seems odd. > > IOW I don't quite know how to square in my head the "total > functionality" with being a TC-based "plugin". Hi Jakub, I believe the solution is to replace kernel code with eBPF in cases where we need programmability. This effectively means that we would ship eBPF code as part of the kernel. So in the case of TSO, the kernel would include a standard implementation in eBPF that could be compiled into the kernel by default. The restricted C source code is tagged with a hash, so if someone wants to offload TSO they could compile the source into their target and retain the hash. At runtime it's a matter of querying the driver to see if the device supports the TSO program the kernel is running by comparing hash values. Scaling this, a device could support a catalogue of programs: TSO, LRO, parser, IPtables, etc., If the kernel can match the hash of its eBPF code to one reported by the driver then it can assume functionality is offloadable. 
This is an elaboration of "device features", but instead of the device telling us they think they support an adequate GRO implementation by reporting NETIF_F_GRO, the device would tell the kernel that they not only support GRO but they provide identical functionality of the kernel GRO (which IMO is the first requirement of kernel offload). Even before considering hardware offload, I think this approach addresses a more fundamental problem to make the kernel programmable. Since the code is in eBPF, the kernel can be reprogrammed at runtime which could be controlled by TC. This allows local customization of kernel features, but also is the simplest way to "patch" the kernel with security and bug fixes (nobody is ever excited to do a kernel rebase in their datacenter!). Flow dissector is a prime candidate for this, and I am still planning to replace it with an all eBPF program (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf). Tom
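As a rough sketch of the matching step described above, and nothing more: none of these structures or query paths exist today, they only illustrate comparing the hash of the eBPF program the kernel runs against a catalogue of hashes reported by the device:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PROG_HASH_LEN 20        /* sha-1 of the restricted-C source */

struct offload_prog {
        const char *name;                   /* "gro", "tso", "flow_dissector", ... */
        unsigned char hash[PROG_HASH_LEN];
};

/* catalogue the device reports, e.g. via a hypothetical driver query */
struct offload_catalogue {
        size_t count;
        const struct offload_prog *progs;
};

/* Functionality (not implementation) offload: only trust the device if it
 * claims the exact program the kernel itself is running. */
static bool can_offload(const struct offload_catalogue *dev,
                        const struct offload_prog *kernel_prog)
{
        for (size_t i = 0; i < dev->count; i++) {
                if (!strcmp(dev->progs[i].name, kernel_prog->name) &&
                    !memcmp(dev->progs[i].hash, kernel_prog->hash, PROG_HASH_LEN))
                        return true;    /* same source, so same behaviour */
        }
        return false;   /* fall back to the in-kernel eBPF implementation */
}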
On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > This part i am probably going to mumble on. I am going to consider > > more than just doing ACLs/MAT via flower/u32 for the sake of > > discussion. > > True, "fill the gaps" has been our model so far. It requires kernel > > changes, user space code changes etc justifiably so because most of > > the time such datapaths are subject to standardization via IETF, IEEE, > > etc and new extensions come in on a regular basis. And sometimes we > > do add features that one or two users or a single vendor has need for > > at the cost of kernel and user/control extension. Given our work > > process, any features added this way take a long time to make it to > > the end user. > > What I had in mind was more of a DDP model. The device loads it binary > blob FW in whatever way it does, then it tells the kernel its parser > graph, and tables. The kernel exposes those tables to user space. > All dynamic, no need to change the kernel for each new protocol. > > But that's different in two ways: > 1. the device tells kernel the tables, no "dynamic reprogramming" > 2. you don't need the SW side, the only use of the API is to interact > with the device > > User can still do BPF kfuncs to look up in the tables (like in FIB), > but call them from cls_bpf. > This is not far off from what is envisioned today in the discussions. The main issue is who loads the binary? We went from devlink to the filter doing the loading. DDP is ethtool. We still need to tie a PCI device/tc block to the "program" so we can do skip_sw and it works. Meaning a device that is capable of handling multiple programs can have multiple blobs loaded. A "program" is mapped to a tc filter and MAT control works the same way as it does today (netlink/tc ndo). A program in P4 has a name, ID and people have been suggesting a sha1 identity (or a signature of some kind should be generated by the compiler). So the upward propagation could be tied to discovering these 3 tuples from the driver. Then the control plane targets a program via those tuples via netlink (as we do currently). I do note, using the DDP sample space, currently whatever gets loaded is "trusted" and really you need to have human knowledge of what the NIC's parsing + MAT is to send the control. With P4 that is all visible/programmable by the end user (i am not a proponent of vendors "shipping" things or calling them for support) - so should be sufficient to just discover what is in the binary and send the correct control messages down. > I think in P4 terms that may be something more akin to only providing > the runtime API? I seem to recall they had some distinction... There are several solutions out there (ex: TDI, P4runtime) - our API is netlink and those could be written on top of netlink, there's no controversy there. So the starting point is defining the datapath using P4, generating the binary blob and whatever constraints needed using the vendor backend and for s/w equivalent generating the eBPF datapath. > > At the cost of this sounding controversial, i am going > > to call things like fdb, fib, etc which have fixed datapaths in the > > kernel "legacy". 
These "legacy" datapaths almost all the time have > > The cynic in me sometimes thinks that the biggest problem with "legacy" > protocols is that it's hard to make money on them :) That's a big motivation without a doubt, but also there are people that want to experiment with things. One of the craziest examples we have is someone who created a P4 program for "in network calculator", essentially a calculator in the datapath. You send it two operands and an operator using custom headers, it does the math and responds with a result in a new header. By itself this program is a toy but it demonstrates that if one wanted to, they could have something custom in hardware and/or kernel datapath. cheers, jamal
On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > This part i am probably going to mumble on. I am going to consider > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > discussion. > > > True, "fill the gaps" has been our model so far. It requires kernel > > > changes, user space code changes etc justifiably so because most of > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > etc and new extensions come in on a regular basis. And sometimes we > > > do add features that one or two users or a single vendor has need for > > > at the cost of kernel and user/control extension. Given our work > > > process, any features added this way take a long time to make it to > > > the end user. > > > > What I had in mind was more of a DDP model. The device loads it binary > > blob FW in whatever way it does, then it tells the kernel its parser > > graph, and tables. The kernel exposes those tables to user space. > > All dynamic, no need to change the kernel for each new protocol. > > > > But that's different in two ways: > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > 2. you don't need the SW side, the only use of the API is to interact > > with the device > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > but call them from cls_bpf. > > > > This is not far off from what is envisioned today in the discussions. > The main issue is who loads the binary? We went from devlink to the > filter doing the loading. DDP is ethtool. We still need to tie a PCI > device/tc block to the "program" so we can do skip_sw and it works. > Meaning a device that is capable of handling multiple programs can > have multiple blobs loaded. A "program" is mapped to a tc filter and > MAT control works the same way as it does today (netlink/tc ndo). > > A program in P4 has a name, ID and people have been suggesting a sha1 > identity (or a signature of some kind should be generated by the > compiler). So the upward propagation could be tied to discovering > these 3 tuples from the driver. Then the control plane targets a > program via those tuples via netlink (as we do currently). > > I do note, using the DDP sample space, currently whatever gets loaded > is "trusted" and really you need to have human knowledge of what the > NIC's parsing + MAT is to send the control. With P4 that is all > visible/programmable by the end user (i am not a proponent of vendors > "shipping" things or calling them for support) - so should be > sufficient to just discover what is in the binary and send the correct > control messages down. > > > I think in P4 terms that may be something more akin to only providing > > the runtime API? I seem to recall they had some distinction... > > There are several solutions out there (ex: TDI, P4runtime) - our API > is netlink and those could be written on top of netlink, there's no > controversy there. > So the starting point is defining the datapath using P4, generating > the binary blob and whatever constraints needed using the vendor > backend and for s/w equivalent generating the eBPF datapath. > > > > At the cost of this sounding controversial, i am going > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > kernel "legacy". 
These "legacy" datapaths almost all the time have > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > protocols is that it's hard to make money on them :) > > That's a big motivation without a doubt, but also there are people > that want to experiment with things. One of the craziest examples we > have is someone who created a P4 program for "in network calculator", > essentially a calculator in the datapath. You send it two operands and > an operator using custom headers, it does the math and responds with a > result in a new header. By itself this program is a toy but it > demonstrates that if one wanted to, they could have something custom > in hardware and/or kernel datapath. Jamal, Given how long P4 has been around it's surprising that the best publicly available code example is "the network calculator" toy. At this point in its lifetime, eBPF had far more examples of real world use cases publically available. That being said, there's nothing unique about P4 supporting the network calculator. We could just as easily write this in eBPF (either plain C or P4) and "offload" it to an ARM core on a SmartNIC. If we are going to support programmable device offload in the Linux kernel then I maintain it should be a generic mechanism that's agnostic to *both* the frontend programming language as well as the backend target. For frontend languages we want to let the user program in a language that's convenient for *them*, which honestly in most cases isn't going to be a narrow use case DSL (i.e. typically users want to code in C/C++, Python, Rust, etc.). For the backend it's the same story, maybe we're compiling to run in host, maybe we're offloading to P4 runtime, maybe we're offloading to another CPU, maybe we're offloading some other programmable NPU. The only real requirement is a compiler that can take the frontend code and compile for the desired backend target, but above all we want this to be easy for the programmer, the compiler needs to do the heavy lifting and we should never require the user to understand the nuances of a target. IMO, the model we want for programmable kernel offload is "write once, run anywhere, run well". Which is the Java tagline amended with "run well". Users write one program for their datapath processing, it runs on various targets, for any given target we run to run at the highest performance levels possible given the target's capabilities. Tom > > cheers, > jamal
On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > This part i am probably going to mumble on. I am going to consider > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > discussion. > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > changes, user space code changes etc justifiably so because most of > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > do add features that one or two users or a single vendor has need for > > > > at the cost of kernel and user/control extension. Given our work > > > > process, any features added this way take a long time to make it to > > > > the end user. > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > blob FW in whatever way it does, then it tells the kernel its parser > > > graph, and tables. The kernel exposes those tables to user space. > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > But that's different in two ways: > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > 2. you don't need the SW side, the only use of the API is to interact > > > with the device > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > but call them from cls_bpf. > > > > > > > This is not far off from what is envisioned today in the discussions. > > The main issue is who loads the binary? We went from devlink to the > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > device/tc block to the "program" so we can do skip_sw and it works. > > Meaning a device that is capable of handling multiple programs can > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > MAT control works the same way as it does today (netlink/tc ndo). > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > identity (or a signature of some kind should be generated by the > > compiler). So the upward propagation could be tied to discovering > > these 3 tuples from the driver. Then the control plane targets a > > program via those tuples via netlink (as we do currently). > > > > I do note, using the DDP sample space, currently whatever gets loaded > > is "trusted" and really you need to have human knowledge of what the > > NIC's parsing + MAT is to send the control. With P4 that is all > > visible/programmable by the end user (i am not a proponent of vendors > > "shipping" things or calling them for support) - so should be > > sufficient to just discover what is in the binary and send the correct > > control messages down. > > > > > I think in P4 terms that may be something more akin to only providing > > > the runtime API? I seem to recall they had some distinction... > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > is netlink and those could be written on top of netlink, there's no > > controversy there. 
> > So the starting point is defining the datapath using P4, generating > > the binary blob and whatever constraints needed using the vendor > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > At the cost of this sounding controversial, i am going > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > protocols is that it's hard to make money on them :) > > > > That's a big motivation without a doubt, but also there are people > > that want to experiment with things. One of the craziest examples we > > have is someone who created a P4 program for "in network calculator", > > essentially a calculator in the datapath. You send it two operands and > > an operator using custom headers, it does the math and responds with a > > result in a new header. By itself this program is a toy but it > > demonstrates that if one wanted to, they could have something custom > > in hardware and/or kernel datapath. > > Jamal, > > Given how long P4 has been around it's surprising that the best > publicly available code example is "the network calculator" toy. Come on Tom ;-> That was just an example of something "crazy" to demonstrate freedom. I can run that in any of the P4 friendly NICs today. You are probably being facetious - There are some serious publicly available projects out there, some of which I quote on the cover letter (like DASH). > At > this point in its lifetime, eBPF had far more examples of real world > use cases publically available. That being said, there's nothing > unique about P4 supporting the network calculator. We could just as > easily write this in eBPF (either plain C or P4) and "offload" it to > an ARM core on a SmartNIC. With current port speeds hitting 800gbps you want to use Arm cores as your offload engine?;-> Running the generated ebpf on the arm core is a valid P4 target. i.e there is no contradiction. Note: P4 is a DSL specialized for datapath definition; it is not a competition to ebpf, two different worlds. I see ebpf as an infrastructure tool, nothing more. > If we are going to support programmable device offload in the Linux > kernel then I maintain it should be a generic mechanism that's > agnostic to *both* the frontend programming language as well as the > backend target. For frontend languages we want to let the user program > in a language that's convenient for *them*, which honestly in most > cases isn't going to be a narrow use case DSL (i.e. typically users > want to code in C/C++, Python, Rust, etc.). You and I have never agreed philosophically on this point, ever. Developers are expensive and not economically scalable. IOW, In the era of automation (generative AI, etc) tooling is king. Let's build the right tooling. Whenever you make this statement i get the vision of Steve Balmer ranting on the stage with "developers! developers! developers!" but that was eons ago. To use your strong view: Learn compilers! And the future is probably to replace compilers with AI. > For the backend it's the > same story, maybe we're compiling to run in host, maybe we're > offloading to P4 runtime, maybe we're offloading to another CPU, maybe > we're offloading some other programmable NPU. 
The only real > requirement is a compiler that can take the frontend code and compile > for the desired backend target, but above all we want this to be easy > for the programmer, the compiler needs to do the heavy lifting and we > should never require the user to understand the nuances of a target. > Agreed, it is possible to use other languages in the frontend. It is also possible to extend P4. > IMO, the model we want for programmable kernel offload is "write once, > run anywhere, run well". Which is the Java tagline amended with "run > well". Users write one program for their datapath processing, it runs > on various targets, for any given target we want it to run at the highest > performance levels possible given the target's capabilities. > I would like to emphasize: Our target is P4 - vendors have put out hardware, people are deploying and evolving things. It is real today with deployments, not some science project. I am not arguing you can't do what you suggested but we want to initially focus on P4. Neither am I saying we can't influence P4 to be more Linux friendly. But none of that matters. We are only concerned about P4. cheers, jamal > Tom > > > > cheers, > > jamal
On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote: > Even before considering hardware offload, I think this approach > addresses a more fundamental problem to make the kernel programmable. I like some aspects of what you're describing, but my understanding is that it'd be a noticeable shift in direction. I'm not sure if merging P4TC is the most effective way of taking a first step in that direction. (I mean that in the literal sense of lack of confidence, not a polite way to indicate holding a conviction to the contrary.)
On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote: > > At > > this point in its lifetime, eBPF had far more examples of real world > > use cases publically available. That being said, there's nothing > > unique about P4 supporting the network calculator. We could just as > > easily write this in eBPF (either plain C or P4) and "offload" it to > > an ARM core on a SmartNIC. > > With current port speeds hitting 800gbps you want to use Arm cores as > your offload engine?;-> Running the generated ebpf on the arm core is > a valid P4 target. i.e there is no contradiction. > Note: P4 is a DSL specialized for datapath definition; it is not a > competition to ebpf, two different worlds. I see ebpf as an > infrastructure tool, nothing more. I wonder how much we're benefiting from calling this thing P4 and how much we should focus on filling in the tech gaps. Exactly like you said, BPF is not competition, but neither does the kernel "support P4", any more than it supports bpftrace and: $ git grep --files-with-matches bpftrace Documentation/bpf/redirect.rst tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c Filling in tech gaps would also help DDP, IDK how much DDP is based on or using P4, neither should I have to care, frankly :S
On Mon, Mar 4, 2024 at 12:07 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote: > > Even before considering hardware offload, I think this approach > > addresses a more fundamental problem to make the kernel programmable. > > I like some aspects of what you're describing, but my understanding > is that it'd be a noticeable shift in direction. > I'm not sure if merging P4TC is the most effective way of taking > a first step in that direction. (I mean that in the literal sense > of lack of confidence, not a polite way to indicate holding a conviction > to the contrary.) Jakub, My comments were with regards to making the kernel offloadable by first making it programmable. The P4TC patches are very good for describing processing that is table driven like filtering or IPtables, but I was thinking more of kernel datapath processing that isn't table driven like GSO, GRO, flow dissector, and even up to revisiting TCP offload. Basically, I'm proposing that instead of eBPF always being side functionality, there are cases where it could natively be used to implement the main functionality of the kernel datapath! It is a noticeable shift in direction, but I also think it's the logical outcome of eBPF :-). Tom
On Mon, Mar 4, 2024 at 3:18 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote: > > > At > > > this point in its lifetime, eBPF had far more examples of real world > > > use cases publically available. That being said, there's nothing > > > unique about P4 supporting the network calculator. We could just as > > > easily write this in eBPF (either plain C or P4) and "offload" it to > > > an ARM core on a SmartNIC. > > > > With current port speeds hitting 800gbps you want to use Arm cores as > > your offload engine?;-> Running the generated ebpf on the arm core is > > a valid P4 target. i.e there is no contradiction. > > Note: P4 is a DSL specialized for datapath definition; it is not a > > competition to ebpf, two different worlds. I see ebpf as an > > infrastructure tool, nothing more. > > I wonder how much we're benefiting from calling this thing P4 and how > much we should focus on filling in the tech gaps. We are implementing based on the P4 standard specification. I fear it is confusing to call it something else if everyone else is calling it P4 (including the vendors whose devices are being targeted in case of offload). If the name is an issue, sure we can change. It just so happens that TC has similar semantics to P4 (match action tables) - hence the name P4TC and implementation encompassing code that fits nicely with TC. > Exactly like you said, BPF is not competition, but neither does > the kernel "support P4", any more than it supports bpftrace and: > Like I said if name is an issue, let's change the name;-> > $ git grep --files-with-matches bpftrace > Documentation/bpf/redirect.rst > tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c > > Filling in tech gaps would also help DDP, IDK how much DDP is based on > or using P4, neither should I have to care, frankly :S DDP is an Intel specific approach, pre-P4. P4: at least two vendors (on Cc), including AMD, have NICs with P4 support, and there are FPGA variants out there as well. From my discussions with folks at Intel it is easy to transform DDP to P4. My understanding is it is the same compiler folks. The beauty being you don't have to use the Intel version of the loaded program to offload if you wanted to change what the hardware does custom to you (within constraints of what hardware can do). cheers, jamal
On 03/03, Tom Herbert wrote: > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > > > This is configurability versus programmability. The table driven > > > approach as input (configurability) might work fine for generic > > > match-action tables up to the point that tables are expressive enough > > > to satisfy the requirements. But parsing doesn't fall into the table > > > driven paradigm: parsers want to be *programmed*. This is why we > > > removed kParser from this patch set and fell back to eBPF for parsing. > > > But the problem we quickly hit that eBPF is not offloadable to network > > > devices, for example when we compile P4 in an eBPF parser we've lost > > > the declarative representation that parsers in the devices could > > > consume (they're not CPUs running eBPF). > > > > > > I think the key here is what we mean by kernel offload. When we do > > > kernel offload, is it the kernel implementation or the kernel > > > functionality that's being offloaded? If it's the latter then we have > > > a lot more flexibility. What we'd need is a safe and secure way to > > > synchronize with that offload device that precisely supports the > > > kernel functionality we'd like to offload. This can be done if both > > > the kernel bits and programmed offload are derived from the same > > > source (i.e. tag source code with a sha-1). For example, if someone > > > writes a parser in P4, we can compile that into both eBPF and a P4 > > > backend using independent tool chains and program download. At > > > runtime, the kernel can safely offload the functionality of the eBPF > > > parser to the device if it matches the hash to that reported by the > > > device > > > > Good points. If I understand you correctly you're saying that parsers > > are more complex than just a basic parsing tree a'la u32. > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs > isn't conducive to u32. We also want the advantages of compiler > optimizations to unroll loops, squash nodes in the parse graph, etc. > > > Then we can take this argument further. P4 has grown to encompass a lot > > of functionality of quite complex devices. How do we square that with > > the kernel functionality offload model. If the entire device is modeled, > > including f.e. TSO, an offload would mean that the user has to write > > a TSO implementation which they then load into TC? That seems odd. > > > > IOW I don't quite know how to square in my head the "total > > functionality" with being a TC-based "plugin". > > Hi Jakub, > > I believe the solution is to replace kernel code with eBPF in cases > where we need programmability. This effectively means that we would > ship eBPF code as part of the kernel. So in the case of TSO, the > kernel would include a standard implementation in eBPF that could be > compiled into the kernel by default. The restricted C source code is > tagged with a hash, so if someone wants to offload TSO they could > compile the source into their target and retain the hash. At runtime > it's a matter of querying the driver to see if the device supports the > TSO program the kernel is running by comparing hash values. Scaling > this, a device could support a catalogue of programs: TSO, LRO, > parser, IPtables, etc., If the kernel can match the hash of its eBPF > code to one reported by the driver then it can assume functionality is > offloadable. 
This is an elaboration of "device features", but instead > of the device telling us they think they support an adequate GRO > implementation by reporting NETIF_F_GRO, the device would tell the > kernel that they not only support GRO but they provide identical > functionality of the kernel GRO (which IMO is the first requirement of > kernel offload). > > Even before considering hardware offload, I think this approach > addresses a more fundamental problem to make the kernel programmable. > Since the code is in eBPF, the kernel can be reprogrammed at runtime > which could be controlled by TC. This allows local customization of > kernel features, but also is the simplest way to "patch" the kernel > with security and bug fixes (nobody is ever excited to do a kernel [..] > rebase in their datacenter!). Flow dissector is a prime candidate for > this, and I am still planning to replace it with an all eBPF program > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf). So you're suggesting to bundle (and extend) tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along similar lines here. We load this program manually right now, shipping and autoloading with the kernel will be easer.
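For reference, the manual loading step mentioned above looks roughly like this with libbpf; the object path is an assumption, and the program name assumes bpf_flow.c's _dissect entry point:

#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int load_flow_dissector(void)
{
        struct bpf_object *obj;
        struct bpf_program *prog;
        int prog_fd;

        obj = bpf_object__open_file("bpf_flow.bpf.o", NULL);    /* path is an assumption */
        if (!obj || bpf_object__load(obj))
                return -1;

        prog = bpf_object__find_program_by_name(obj, "_dissect");
        if (!prog)
                return -1;
        prog_fd = bpf_program__fd(prog);

        /* attach type BPF_FLOW_DISSECTOR; target fd 0 means the current netns */
        return bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0);
}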
On 03/03, Jamal Hadi Salim wrote: > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > > This part i am probably going to mumble on. I am going to consider > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > > discussion. > > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > > changes, user space code changes etc justifiably so because most of > > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > > do add features that one or two users or a single vendor has need for > > > > > at the cost of kernel and user/control extension. Given our work > > > > > process, any features added this way take a long time to make it to > > > > > the end user. > > > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > > blob FW in whatever way it does, then it tells the kernel its parser > > > > graph, and tables. The kernel exposes those tables to user space. > > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > > > But that's different in two ways: > > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > > 2. you don't need the SW side, the only use of the API is to interact > > > > with the device > > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > > but call them from cls_bpf. > > > > > > > > > > This is not far off from what is envisioned today in the discussions. > > > The main issue is who loads the binary? We went from devlink to the > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > > device/tc block to the "program" so we can do skip_sw and it works. > > > Meaning a device that is capable of handling multiple programs can > > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > > MAT control works the same way as it does today (netlink/tc ndo). > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > > identity (or a signature of some kind should be generated by the > > > compiler). So the upward propagation could be tied to discovering > > > these 3 tuples from the driver. Then the control plane targets a > > > program via those tuples via netlink (as we do currently). > > > > > > I do note, using the DDP sample space, currently whatever gets loaded > > > is "trusted" and really you need to have human knowledge of what the > > > NIC's parsing + MAT is to send the control. With P4 that is all > > > visible/programmable by the end user (i am not a proponent of vendors > > > "shipping" things or calling them for support) - so should be > > > sufficient to just discover what is in the binary and send the correct > > > control messages down. > > > > > > > I think in P4 terms that may be something more akin to only providing > > > > the runtime API? I seem to recall they had some distinction... > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > > is netlink and those could be written on top of netlink, there's no > > > controversy there. 
> > > So the starting point is defining the datapath using P4, generating > > > the binary blob and whatever constraints needed using the vendor > > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > > > At the cost of this sounding controversial, i am going > > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > > protocols is that it's hard to make money on them :) > > > > > > That's a big motivation without a doubt, but also there are people > > > that want to experiment with things. One of the craziest examples we > > > have is someone who created a P4 program for "in network calculator", > > > essentially a calculator in the datapath. You send it two operands and > > > an operator using custom headers, it does the math and responds with a > > > result in a new header. By itself this program is a toy but it > > > demonstrates that if one wanted to, they could have something custom > > > in hardware and/or kernel datapath. > > > > Jamal, > > > > Given how long P4 has been around it's surprising that the best > > publicly available code example is "the network calculator" toy. > > Come on Tom ;-> That was just an example of something "crazy" to > demonstrate freedom. I can run that in any of the P4 friendly NICs > today. You are probably being facetious - There are some serious > publicly available projects out there, some of which I quote on the > cover letter (like DASH). Shameless plug. I have a more crazy example with bpf: https://github.com/fomichev/xdp-btc-miner A good way to ensure all those smartnic cycles are not wasted :-D I wish we had more nics with xdp bpf offloads :-(
On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/03, Jamal Hadi Salim wrote: > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > > > This part i am probably going to mumble on. I am going to consider > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > > > discussion. > > > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > > > changes, user space code changes etc justifiably so because most of > > > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > > > do add features that one or two users or a single vendor has need for > > > > > > at the cost of kernel and user/control extension. Given our work > > > > > > process, any features added this way take a long time to make it to > > > > > > the end user. > > > > > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > > > blob FW in whatever way it does, then it tells the kernel its parser > > > > > graph, and tables. The kernel exposes those tables to user space. > > > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > > > > > But that's different in two ways: > > > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > > > 2. you don't need the SW side, the only use of the API is to interact > > > > > with the device > > > > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > > > but call them from cls_bpf. > > > > > > > > > > > > > This is not far off from what is envisioned today in the discussions. > > > > The main issue is who loads the binary? We went from devlink to the > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > > > device/tc block to the "program" so we can do skip_sw and it works. > > > > Meaning a device that is capable of handling multiple programs can > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > > > MAT control works the same way as it does today (netlink/tc ndo). > > > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > > > identity (or a signature of some kind should be generated by the > > > > compiler). So the upward propagation could be tied to discovering > > > > these 3 tuples from the driver. Then the control plane targets a > > > > program via those tuples via netlink (as we do currently). > > > > > > > > I do note, using the DDP sample space, currently whatever gets loaded > > > > is "trusted" and really you need to have human knowledge of what the > > > > NIC's parsing + MAT is to send the control. With P4 that is all > > > > visible/programmable by the end user (i am not a proponent of vendors > > > > "shipping" things or calling them for support) - so should be > > > > sufficient to just discover what is in the binary and send the correct > > > > control messages down. > > > > > > > > > I think in P4 terms that may be something more akin to only providing > > > > > the runtime API? I seem to recall they had some distinction... 
> > > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > > > is netlink and those could be written on top of netlink, there's no > > > > controversy there. > > > > So the starting point is defining the datapath using P4, generating > > > > the binary blob and whatever constraints needed using the vendor > > > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > > > > > At the cost of this sounding controversial, i am going > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > > > protocols is that it's hard to make money on them :) > > > > > > > > That's a big motivation without a doubt, but also there are people > > > > that want to experiment with things. One of the craziest examples we > > > > have is someone who created a P4 program for "in network calculator", > > > > essentially a calculator in the datapath. You send it two operands and > > > > an operator using custom headers, it does the math and responds with a > > > > result in a new header. By itself this program is a toy but it > > > > demonstrates that if one wanted to, they could have something custom > > > > in hardware and/or kernel datapath. > > > > > > Jamal, > > > > > > Given how long P4 has been around it's surprising that the best > > > publicly available code example is "the network calculator" toy. > > > > Come on Tom ;-> That was just an example of something "crazy" to > > demonstrate freedom. I can run that in any of the P4 friendly NICs > > today. You are probably being facetious - There are some serious > > publicly available projects out there, some of which I quote on the > > cover letter (like DASH). > > Shameless plug. I have a more crazy example with bpf: > > https://github.com/fomichev/xdp-btc-miner > Hrm - this looks crazy interesting;-> Tempting. I guess to port this to P4 we'd need the sha256 in h/w (which most of these vendors have already). Is there any other acceleration would you need? Would have been more fun if you invented you own headers too ;-> cheers, jamal > A good way to ensure all those smartnic cycles are not wasted :-D > I wish we had more nics with xdp bpf offloads :-(
On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/03, Tom Herbert wrote: > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > > > > This is configurability versus programmability. The table driven > > > > approach as input (configurability) might work fine for generic > > > > match-action tables up to the point that tables are expressive enough > > > > to satisfy the requirements. But parsing doesn't fall into the table > > > > driven paradigm: parsers want to be *programmed*. This is why we > > > > removed kParser from this patch set and fell back to eBPF for parsing. > > > > But the problem we quickly hit that eBPF is not offloadable to network > > > > devices, for example when we compile P4 in an eBPF parser we've lost > > > > the declarative representation that parsers in the devices could > > > > consume (they're not CPUs running eBPF). > > > > > > > > I think the key here is what we mean by kernel offload. When we do > > > > kernel offload, is it the kernel implementation or the kernel > > > > functionality that's being offloaded? If it's the latter then we have > > > > a lot more flexibility. What we'd need is a safe and secure way to > > > > synchronize with that offload device that precisely supports the > > > > kernel functionality we'd like to offload. This can be done if both > > > > the kernel bits and programmed offload are derived from the same > > > > source (i.e. tag source code with a sha-1). For example, if someone > > > > writes a parser in P4, we can compile that into both eBPF and a P4 > > > > backend using independent tool chains and program download. At > > > > runtime, the kernel can safely offload the functionality of the eBPF > > > > parser to the device if it matches the hash to that reported by the > > > > device > > > > > > Good points. If I understand you correctly you're saying that parsers > > > are more complex than just a basic parsing tree a'la u32. > > > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs > > isn't conducive to u32. We also want the advantages of compiler > > optimizations to unroll loops, squash nodes in the parse graph, etc. > > > > > Then we can take this argument further. P4 has grown to encompass a lot > > > of functionality of quite complex devices. How do we square that with > > > the kernel functionality offload model. If the entire device is modeled, > > > including f.e. TSO, an offload would mean that the user has to write > > > a TSO implementation which they then load into TC? That seems odd. > > > > > > IOW I don't quite know how to square in my head the "total > > > functionality" with being a TC-based "plugin". > > > > Hi Jakub, > > > > I believe the solution is to replace kernel code with eBPF in cases > > where we need programmability. This effectively means that we would > > ship eBPF code as part of the kernel. So in the case of TSO, the > > kernel would include a standard implementation in eBPF that could be > > compiled into the kernel by default. The restricted C source code is > > tagged with a hash, so if someone wants to offload TSO they could > > compile the source into their target and retain the hash. At runtime > > it's a matter of querying the driver to see if the device supports the > > TSO program the kernel is running by comparing hash values. 
Scaling > > this, a device could support a catalogue of programs: TSO, LRO, > > parser, IPtables, etc., If the kernel can match the hash of its eBPF > > code to one reported by the driver then it can assume functionality is > > offloadable. This is an elaboration of "device features", but instead > > of the device telling us they think they support an adequate GRO > > implementation by reporting NETIF_F_GRO, the device would tell the > > kernel that they not only support GRO but they provide identical > > functionality of the kernel GRO (which IMO is the first requirement of > > kernel offload). > > > > Even before considering hardware offload, I think this approach > > addresses a more fundamental problem to make the kernel programmable. > > Since the code is in eBPF, the kernel can be reprogrammed at runtime > > which could be controlled by TC. This allows local customization of > > kernel features, but also is the simplest way to "patch" the kernel > > with security and bug fixes (nobody is ever excited to do a kernel > > [..] > > > rebase in their datacenter!). Flow dissector is a prime candidate for > > this, and I am still planning to replace it with an all eBPF program > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf). > > So you're suggesting to bundle (and extend) > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along > similar lines here. We load this program manually right now, shipping > and autoloading with the kernel will be easer. Hi Stanislav, Yes, I envision that we would have a standard implementation of flow-dissector in eBPF that is shipped with the kernel and autoloaded. However, for the front end source I want to move away from imperative code. As I mentioned in the presentation flow_dissector.c is spaghetti code and has been prone to bugs over the years especially whenever someone adds support for a new fringe protocol (I take the liberty to call it spaghetti code since I'm partially responsible for creating this mess ;-) ). The problem is that parsers are much better represented by a declarative rather than an imperative representation. To that end, we defined PANDA which allows constructing a parser (parse graph) in data structures in C. We use the "PANDA parser" to compile C to restricted C code which looks more like eBPF in imperative code. With this method we abstract out all the bookkeeping that was often the source of bugs (like pulling up skbufs, checking length limits, etc.). The other advantage is that we're able to find a lot more optimizations if we start with a right representation of the problem. If you're interested, the video presentation on this is in https://www.youtube.com/watch?v=zVnmVDSEoXc. Tom
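To make the declarative idea concrete, here is a toy illustration (not the actual PANDA API) of a parse graph expressed as plain C data structures that a generic engine could walk; adding a protocol then means adding a node and an edge rather than editing imperative dissector code:

#include <stddef.h>
#include <stdint.h>

struct parse_node;

struct proto_edge {
        uint32_t key;                           /* e.g. EtherType or IP protocol */
        const struct parse_node *next;
};

struct parse_node {
        const char *name;
        size_t min_len;                         /* header length to pull */
        uint32_t (*next_key)(const void *hdr);  /* demux field extractor */
        const struct proto_edge *edges;
        size_t num_edges;
};

/* Example: Ethernet -> IPv4, expressed purely as data. */
static uint32_t eth_type(const void *hdr)
{
        const uint8_t *p = hdr;

        return (uint32_t)p[12] << 8 | p[13];    /* EtherType, network byte order */
}

static const struct parse_node ipv4_node = {
        .name = "ipv4", .min_len = 20,
};

static const struct proto_edge eth_edges[] = {
        { .key = 0x0800, .next = &ipv4_node },
};

static const struct parse_node eth_node = {
        .name = "eth", .min_len = 14,
        .next_key = eth_type,
        .edges = eth_edges, .num_edges = 1,
};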
On 03/04, Jamal Hadi Salim wrote: > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On 03/03, Jamal Hadi Salim wrote: > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > > > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > > > > This part i am probably going to mumble on. I am going to consider > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > > > > discussion. > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > > > > changes, user space code changes etc justifiably so because most of > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > > > > do add features that one or two users or a single vendor has need for > > > > > > > at the cost of kernel and user/control extension. Given our work > > > > > > > process, any features added this way take a long time to make it to > > > > > > > the end user. > > > > > > > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > > > > blob FW in whatever way it does, then it tells the kernel its parser > > > > > > graph, and tables. The kernel exposes those tables to user space. > > > > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > > > > > > > But that's different in two ways: > > > > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > > > > 2. you don't need the SW side, the only use of the API is to interact > > > > > > with the device > > > > > > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > > > > but call them from cls_bpf. > > > > > > > > > > > > > > > > This is not far off from what is envisioned today in the discussions. > > > > > The main issue is who loads the binary? We went from devlink to the > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > > > > device/tc block to the "program" so we can do skip_sw and it works. > > > > > Meaning a device that is capable of handling multiple programs can > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > > > > MAT control works the same way as it does today (netlink/tc ndo). > > > > > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > > > > identity (or a signature of some kind should be generated by the > > > > > compiler). So the upward propagation could be tied to discovering > > > > > these 3 tuples from the driver. Then the control plane targets a > > > > > program via those tuples via netlink (as we do currently). > > > > > > > > > > I do note, using the DDP sample space, currently whatever gets loaded > > > > > is "trusted" and really you need to have human knowledge of what the > > > > > NIC's parsing + MAT is to send the control. With P4 that is all > > > > > visible/programmable by the end user (i am not a proponent of vendors > > > > > "shipping" things or calling them for support) - so should be > > > > > sufficient to just discover what is in the binary and send the correct > > > > > control messages down. 
> > > > > > > > > > > I think in P4 terms that may be something more akin to only providing > > > > > > the runtime API? I seem to recall they had some distinction... > > > > > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > > > > is netlink and those could be written on top of netlink, there's no > > > > > controversy there. > > > > > So the starting point is defining the datapath using P4, generating > > > > > the binary blob and whatever constraints needed using the vendor > > > > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > > > > > > > At the cost of this sounding controversial, i am going > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > > > > protocols is that it's hard to make money on them :) > > > > > > > > > > That's a big motivation without a doubt, but also there are people > > > > > that want to experiment with things. One of the craziest examples we > > > > > have is someone who created a P4 program for "in network calculator", > > > > > essentially a calculator in the datapath. You send it two operands and > > > > > an operator using custom headers, it does the math and responds with a > > > > > result in a new header. By itself this program is a toy but it > > > > > demonstrates that if one wanted to, they could have something custom > > > > > in hardware and/or kernel datapath. > > > > > > > > Jamal, > > > > > > > > Given how long P4 has been around it's surprising that the best > > > > publicly available code example is "the network calculator" toy. > > > > > > Come on Tom ;-> That was just an example of something "crazy" to > > > demonstrate freedom. I can run that in any of the P4 friendly NICs > > > today. You are probably being facetious - There are some serious > > > publicly available projects out there, some of which I quote on the > > > cover letter (like DASH). > > > > Shameless plug. I have a more crazy example with bpf: > > > > https://github.com/fomichev/xdp-btc-miner > > > > Hrm - this looks crazy interesting;-> Tempting. I guess to port this > to P4 we'd need the sha256 in h/w (which most of these vendors have > already). Is there any other acceleration would you need? Would have > been more fun if you invented you own headers too ;-> Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes)) is one thing. And the other is some way to compare that sha256 vs some hard-coded (difficulty) number (as a 256-byte uint). But I have no clue how well that maps into declarative p4 language. Most likely possible if you're saying that the calculator is possible? I'm assuming that even sha256 can possibly be implemented in p4 without any extra support from the vendor? It's just a bunch of xors and rotations over a fix-sized input buffer.
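In plain C the two pieces listed above are roughly the following; sha256() is an assumed helper standing in for whatever the kernel, a kfunc or the hardware provides, the digest and the difficulty target are 32-byte (256-bit) values, and the final compare treats them as big-endian integers for simplicity:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN  80
#define HASH_LEN 32

/* assumed helper: one-shot SHA-256 over a buffer */
void sha256(const uint8_t *data, size_t len, uint8_t out[HASH_LEN]);

static bool header_meets_target(const uint8_t hdr[HDR_LEN],
                                const uint8_t target[HASH_LEN])
{
        uint8_t h1[HASH_LEN], h2[HASH_LEN];

        sha256(hdr, HDR_LEN, h1);       /* sha256(header)         */
        sha256(h1, HASH_LEN, h2);       /* sha256(sha256(header)) */

        /* 256-bit unsigned compare: digest must be <= difficulty target */
        return memcmp(h2, target, HASH_LEN) <= 0;
}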
On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/04, Jamal Hadi Salim wrote: > > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On 03/03, Jamal Hadi Salim wrote: > > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > > > > > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > > > > > This part i am probably going to mumble on. I am going to consider > > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > > > > > discussion. > > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > > > > > changes, user space code changes etc justifiably so because most of > > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > > > > > do add features that one or two users or a single vendor has need for > > > > > > > > at the cost of kernel and user/control extension. Given our work > > > > > > > > process, any features added this way take a long time to make it to > > > > > > > > the end user. > > > > > > > > > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > > > > > blob FW in whatever way it does, then it tells the kernel its parser > > > > > > > graph, and tables. The kernel exposes those tables to user space. > > > > > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > > > > > > > > > But that's different in two ways: > > > > > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > > > > > 2. you don't need the SW side, the only use of the API is to interact > > > > > > > with the device > > > > > > > > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > > > > > but call them from cls_bpf. > > > > > > > > > > > > > > > > > > > This is not far off from what is envisioned today in the discussions. > > > > > > The main issue is who loads the binary? We went from devlink to the > > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > > > > > device/tc block to the "program" so we can do skip_sw and it works. > > > > > > Meaning a device that is capable of handling multiple programs can > > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > > > > > MAT control works the same way as it does today (netlink/tc ndo). > > > > > > > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > > > > > identity (or a signature of some kind should be generated by the > > > > > > compiler). So the upward propagation could be tied to discovering > > > > > > these 3 tuples from the driver. Then the control plane targets a > > > > > > program via those tuples via netlink (as we do currently). > > > > > > > > > > > > I do note, using the DDP sample space, currently whatever gets loaded > > > > > > is "trusted" and really you need to have human knowledge of what the > > > > > > NIC's parsing + MAT is to send the control. 
With P4 that is all > > > > > > visible/programmable by the end user (i am not a proponent of vendors > > > > > > "shipping" things or calling them for support) - so should be > > > > > > sufficient to just discover what is in the binary and send the correct > > > > > > control messages down. > > > > > > > > > > > > > I think in P4 terms that may be something more akin to only providing > > > > > > > the runtime API? I seem to recall they had some distinction... > > > > > > > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > > > > > is netlink and those could be written on top of netlink, there's no > > > > > > controversy there. > > > > > > So the starting point is defining the datapath using P4, generating > > > > > > the binary blob and whatever constraints needed using the vendor > > > > > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > > > > > > > > > At the cost of this sounding controversial, i am going > > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > > > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > > > > > protocols is that it's hard to make money on them :) > > > > > > > > > > > > That's a big motivation without a doubt, but also there are people > > > > > > that want to experiment with things. One of the craziest examples we > > > > > > have is someone who created a P4 program for "in network calculator", > > > > > > essentially a calculator in the datapath. You send it two operands and > > > > > > an operator using custom headers, it does the math and responds with a > > > > > > result in a new header. By itself this program is a toy but it > > > > > > demonstrates that if one wanted to, they could have something custom > > > > > > in hardware and/or kernel datapath. > > > > > > > > > > Jamal, > > > > > > > > > > Given how long P4 has been around it's surprising that the best > > > > > publicly available code example is "the network calculator" toy. > > > > > > > > Come on Tom ;-> That was just an example of something "crazy" to > > > > demonstrate freedom. I can run that in any of the P4 friendly NICs > > > > today. You are probably being facetious - There are some serious > > > > publicly available projects out there, some of which I quote on the > > > > cover letter (like DASH). > > > > > > Shameless plug. I have a more crazy example with bpf: > > > > > > https://github.com/fomichev/xdp-btc-miner > > > > > > > Hrm - this looks crazy interesting;-> Tempting. I guess to port this > > to P4 we'd need the sha256 in h/w (which most of these vendors have > > already). Is there any other acceleration would you need? Would have > > been more fun if you invented you own headers too ;-> > > Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes)) This part is straight forward. > is one thing. And the other is some way to compare that sha256 vs some > hard-coded (difficulty) number (as a 256-byte uint). The compiler may have issues with this comparison - will have to look (I am pretty sure it's fixable though). > But I have no > clue how well that maps into declarative p4 language. Most likely > possible if you're saying that the calculator is possible? The calculator basically is written as a set of match-action tables. 
You parse your header, construct a key based on the operator field of the header (eg "+"), invoke an action which takes the operands from the headers (eg "1" and "2"), and the action returns the result (eg "3"). You stash the result in a new packet and send it back to the source. So my thinking is the computation you need would be modelled as an action. > I'm assuming that even sha256 can possibly be implemented in p4 without > any extra support from the vendor? It's just a bunch of xors and > rotations over a fix-sized input buffer. True, and I think those would be fast. But if the h/w offers it as an interface, why not. It's not that you are running out of instruction space - and my memory is hazy - but iirc, there is sha256 support in the kernel Crypto API - does it not make sense to kfunc into that? cheers, jamal
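[Editorial note: for illustration, the double-hash check being discussed could look roughly like the sketch below from a TC/cls_bpf program. bpf_sha256() is a hypothetical kfunc wrapping the kernel Crypto API (no such kfunc exists upstream today), and the header offset, 80-byte length and target value are placeholders taken from the thread, not a tested layout.]

/*
 * Sketch only: bpf_sha256() is a hypothetical kfunc; HDR_OFF, the 80-byte
 * header length and the target value are placeholder assumptions.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define HDR_OFF 42	/* assumed fixed offset of the 80-byte block header */

extern int bpf_sha256(const void *data, __u32 len, __u8 *out, __u32 out_len) __ksym;

/* Hard-coded difficulty target as a big-endian 256-bit integer (placeholder). */
static const __u8 target[32] = { 0x00, 0x00, 0x00, 0xff };

SEC("tc")
int check_pow(struct __sk_buff *skb)
{
	__u8 hdr[80], h1[32], h2[32];
	int i;

	if (bpf_skb_load_bytes(skb, HDR_OFF, hdr, sizeof(hdr)))
		return TC_ACT_OK;

	/* sha256(sha256(header)), as in the btc-miner example */
	if (bpf_sha256(hdr, sizeof(hdr), h1, sizeof(h1)) ||
	    bpf_sha256(h1, sizeof(h1), h2, sizeof(h2)))
		return TC_ACT_OK;

	/*
	 * Compare digest and target as big-endian 256-bit integers:
	 * the hash only "wins" if it is strictly below the target.
	 */
	for (i = 0; i < 32; i++) {
		if (h2[i] < target[i])
			return TC_ACT_SHOT;	/* placeholder "found a share" action */
		if (h2[i] > target[i])
			break;
	}
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";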
On 03/04, Jamal Hadi Salim wrote: > On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On 03/04, Jamal Hadi Salim wrote: > > > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > On 03/03, Jamal Hadi Salim wrote: > > > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote: > > > > > > > > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote: > > > > > > > > > 2) Your point on: "integrate later", or at least "fill in the gaps" > > > > > > > > > This part i am probably going to mumble on. I am going to consider > > > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of > > > > > > > > > discussion. > > > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel > > > > > > > > > changes, user space code changes etc justifiably so because most of > > > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE, > > > > > > > > > etc and new extensions come in on a regular basis. And sometimes we > > > > > > > > > do add features that one or two users or a single vendor has need for > > > > > > > > > at the cost of kernel and user/control extension. Given our work > > > > > > > > > process, any features added this way take a long time to make it to > > > > > > > > > the end user. > > > > > > > > > > > > > > > > What I had in mind was more of a DDP model. The device loads it binary > > > > > > > > blob FW in whatever way it does, then it tells the kernel its parser > > > > > > > > graph, and tables. The kernel exposes those tables to user space. > > > > > > > > All dynamic, no need to change the kernel for each new protocol. > > > > > > > > > > > > > > > > But that's different in two ways: > > > > > > > > 1. the device tells kernel the tables, no "dynamic reprogramming" > > > > > > > > 2. you don't need the SW side, the only use of the API is to interact > > > > > > > > with the device > > > > > > > > > > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB), > > > > > > > > but call them from cls_bpf. > > > > > > > > > > > > > > > > > > > > > > This is not far off from what is envisioned today in the discussions. > > > > > > > The main issue is who loads the binary? We went from devlink to the > > > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI > > > > > > > device/tc block to the "program" so we can do skip_sw and it works. > > > > > > > Meaning a device that is capable of handling multiple programs can > > > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and > > > > > > > MAT control works the same way as it does today (netlink/tc ndo). > > > > > > > > > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1 > > > > > > > identity (or a signature of some kind should be generated by the > > > > > > > compiler). So the upward propagation could be tied to discovering > > > > > > > these 3 tuples from the driver. Then the control plane targets a > > > > > > > program via those tuples via netlink (as we do currently). 
> > > > > > > > > > > > > > I do note, using the DDP sample space, currently whatever gets loaded > > > > > > > is "trusted" and really you need to have human knowledge of what the > > > > > > > NIC's parsing + MAT is to send the control. With P4 that is all > > > > > > > visible/programmable by the end user (i am not a proponent of vendors > > > > > > > "shipping" things or calling them for support) - so should be > > > > > > > sufficient to just discover what is in the binary and send the correct > > > > > > > control messages down. > > > > > > > > > > > > > > > I think in P4 terms that may be something more akin to only providing > > > > > > > > the runtime API? I seem to recall they had some distinction... > > > > > > > > > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API > > > > > > > is netlink and those could be written on top of netlink, there's no > > > > > > > controversy there. > > > > > > > So the starting point is defining the datapath using P4, generating > > > > > > > the binary blob and whatever constraints needed using the vendor > > > > > > > backend and for s/w equivalent generating the eBPF datapath. > > > > > > > > > > > > > > > > At the cost of this sounding controversial, i am going > > > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the > > > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have > > > > > > > > > > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy" > > > > > > > > protocols is that it's hard to make money on them :) > > > > > > > > > > > > > > That's a big motivation without a doubt, but also there are people > > > > > > > that want to experiment with things. One of the craziest examples we > > > > > > > have is someone who created a P4 program for "in network calculator", > > > > > > > essentially a calculator in the datapath. You send it two operands and > > > > > > > an operator using custom headers, it does the math and responds with a > > > > > > > result in a new header. By itself this program is a toy but it > > > > > > > demonstrates that if one wanted to, they could have something custom > > > > > > > in hardware and/or kernel datapath. > > > > > > > > > > > > Jamal, > > > > > > > > > > > > Given how long P4 has been around it's surprising that the best > > > > > > publicly available code example is "the network calculator" toy. > > > > > > > > > > Come on Tom ;-> That was just an example of something "crazy" to > > > > > demonstrate freedom. I can run that in any of the P4 friendly NICs > > > > > today. You are probably being facetious - There are some serious > > > > > publicly available projects out there, some of which I quote on the > > > > > cover letter (like DASH). > > > > > > > > Shameless plug. I have a more crazy example with bpf: > > > > > > > > https://github.com/fomichev/xdp-btc-miner > > > > > > > > > > Hrm - this looks crazy interesting;-> Tempting. I guess to port this > > > to P4 we'd need the sha256 in h/w (which most of these vendors have > > > already). Is there any other acceleration would you need? Would have > > > been more fun if you invented you own headers too ;-> > > > > Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes)) > > This part is straight forward. > > > is one thing. And the other is some way to compare that sha256 vs some > > hard-coded (difficulty) number (as a 256-byte uint). 
> > The compiler may have issues with this comparison - will have to look > (I am pretty sure it's fixable though). > > > > But I have no > > clue how well that maps into declarative p4 language. Most likely > > possible if you're saying that the calculator is possible? > > The calculator basically is written as a set of match-action tables. > You parse your header, construct a key based on the operator field of > the header (eg "+"), invoke an action which takes the operands from > the headers (eg "1" and "2"), and the action returns the result (eg "3"). You > stash the result in a new packet and send it back to the source. > > So my thinking is the computation you need would be modelled as an action. > > > I'm assuming that even sha256 can possibly be implemented in p4 without > > any extra support from the vendor? It's just a bunch of xors and > > rotations over a fix-sized input buffer. [..] > True, and I think those would be fast. But if the h/w offers it as an > interface, why not. > It's not that you are running out of instruction space - and my memory > is hazy - but iirc, there is sha256 support in the kernel Crypto API - > does it not make sense to kfunc into that? Oh yeah, that's definitely a better path if somebody were to do it "properly". It's still fun, though, to see how far we can push the bpf vm/verifier without using any extra helpers :-D
On 03/04, Tom Herbert wrote: > On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On 03/03, Tom Herbert wrote: > > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > > > > > This is configurability versus programmability. The table driven > > > > > approach as input (configurability) might work fine for generic > > > > > match-action tables up to the point that tables are expressive enough > > > > > to satisfy the requirements. But parsing doesn't fall into the table > > > > > driven paradigm: parsers want to be *programmed*. This is why we > > > > > removed kParser from this patch set and fell back to eBPF for parsing. > > > > > But the problem we quickly hit that eBPF is not offloadable to network > > > > > devices, for example when we compile P4 in an eBPF parser we've lost > > > > > the declarative representation that parsers in the devices could > > > > > consume (they're not CPUs running eBPF). > > > > > > > > > > I think the key here is what we mean by kernel offload. When we do > > > > > kernel offload, is it the kernel implementation or the kernel > > > > > functionality that's being offloaded? If it's the latter then we have > > > > > a lot more flexibility. What we'd need is a safe and secure way to > > > > > synchronize with that offload device that precisely supports the > > > > > kernel functionality we'd like to offload. This can be done if both > > > > > the kernel bits and programmed offload are derived from the same > > > > > source (i.e. tag source code with a sha-1). For example, if someone > > > > > writes a parser in P4, we can compile that into both eBPF and a P4 > > > > > backend using independent tool chains and program download. At > > > > > runtime, the kernel can safely offload the functionality of the eBPF > > > > > parser to the device if it matches the hash to that reported by the > > > > > device > > > > > > > > Good points. If I understand you correctly you're saying that parsers > > > > are more complex than just a basic parsing tree a'la u32. > > > > > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs > > > isn't conducive to u32. We also want the advantages of compiler > > > optimizations to unroll loops, squash nodes in the parse graph, etc. > > > > > > > Then we can take this argument further. P4 has grown to encompass a lot > > > > of functionality of quite complex devices. How do we square that with > > > > the kernel functionality offload model. If the entire device is modeled, > > > > including f.e. TSO, an offload would mean that the user has to write > > > > a TSO implementation which they then load into TC? That seems odd. > > > > > > > > IOW I don't quite know how to square in my head the "total > > > > functionality" with being a TC-based "plugin". > > > > > > Hi Jakub, > > > > > > I believe the solution is to replace kernel code with eBPF in cases > > > where we need programmability. This effectively means that we would > > > ship eBPF code as part of the kernel. So in the case of TSO, the > > > kernel would include a standard implementation in eBPF that could be > > > compiled into the kernel by default. The restricted C source code is > > > tagged with a hash, so if someone wants to offload TSO they could > > > compile the source into their target and retain the hash. 
At runtime > > > it's a matter of querying the driver to see if the device supports the > > > TSO program the kernel is running by comparing hash values. Scaling > > > this, a device could support a catalogue of programs: TSO, LRO, > > > parser, IPtables, etc., If the kernel can match the hash of its eBPF > > > code to one reported by the driver then it can assume functionality is > > > offloadable. This is an elaboration of "device features", but instead > > > of the device telling us they think they support an adequate GRO > > > implementation by reporting NETIF_F_GRO, the device would tell the > > > kernel that they not only support GRO but they provide identical > > > functionality of the kernel GRO (which IMO is the first requirement of > > > kernel offload). > > > > > > Even before considering hardware offload, I think this approach > > > addresses a more fundamental problem to make the kernel programmable. > > > Since the code is in eBPF, the kernel can be reprogrammed at runtime > > > which could be controlled by TC. This allows local customization of > > > kernel features, but also is the simplest way to "patch" the kernel > > > with security and bug fixes (nobody is ever excited to do a kernel > > > > [..] > > > > > rebase in their datacenter!). Flow dissector is a prime candidate for > > > this, and I am still planning to replace it with an all eBPF program > > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf). > > > > So you're suggesting to bundle (and extend) > > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along > > similar lines here. We load this program manually right now, shipping > > and autoloading with the kernel will be easer. > > Hi Stanislav, > > Yes, I envision that we would have a standard implementation of > flow-dissector in eBPF that is shipped with the kernel and autoloaded. > However, for the front end source I want to move away from imperative > code. As I mentioned in the presentation flow_dissector.c is spaghetti > code and has been prone to bugs over the years especially whenever > someone adds support for a new fringe protocol (I take the liberty to > call it spaghetti code since I'm partially responsible for creating > this mess ;-) ). > > The problem is that parsers are much better represented by a > declarative rather than an imperative representation. To that end, we > defined PANDA which allows constructing a parser (parse graph) in data > structures in C. We use the "PANDA parser" to compile C to restricted > C code which looks more like eBPF in imperative code. With this method > we abstract out all the bookkeeping that was often the source of bugs > (like pulling up skbufs, checking length limits, etc.). The other > advantage is that we're able to find a lot more optimizations if we > start with a right representation of the problem. > > If you're interested, the video presentation on this is in > https://www.youtube.com/watch?v=zVnmVDSEoXc. Oh, yeah, I've seen this one. Agreed that the C implementation is not pleasant and generating a parser from some declarative spec is a better idea. From my pow, the biggest win we get from making bpf flow dissector pluggable is the fact that we can now actually write some tests for it (and, maybe, fuzz it?). We should also probably spend more time properly defining the behavior of the existing C implementation. 
We've seen some interesting bugs like this one: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7
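[Editorial note: as a concrete example of the pluggable flow dissector being discussed, a bundled bpf_flow.c-style object is attached per network namespace with the BPF_FLOW_DISSECTOR attach type. A rough user-space sketch follows; the object path and the "_dissect" program name are assumptions mirroring tools/testing/selftests/bpf/progs/bpf_flow.c.]

/*
 * Rough sketch: load a bpf_flow.c-style object and attach it as the flow
 * dissector of the caller's network namespace.
 */
#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int err;

	obj = bpf_object__open_file("bpf_flow.bpf.o", NULL);
	if (!obj || bpf_object__load(obj)) {
		fprintf(stderr, "failed to open/load object\n");
		return 1;
	}

	/* "_dissect" is assumed to be the entry program, as in the selftest. */
	prog = bpf_object__find_program_by_name(obj, "_dissect");
	if (!prog) {
		fprintf(stderr, "no _dissect program in object\n");
		return 1;
	}

	/* Attaches to the current network namespace. */
	err = bpf_prog_attach(bpf_program__fd(prog), 0, BPF_FLOW_DISSECTOR, 0);
	if (err)
		fprintf(stderr, "attach failed: %d\n", err);

	return err ? 1 : 0;
}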
On Mon, Mar 4, 2024 at 3:24 PM Stanislav Fomichev <sdf@google.com> wrote: > > On 03/04, Tom Herbert wrote: > > On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On 03/03, Tom Herbert wrote: > > > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote: > > > > > > This is configurability versus programmability. The table driven > > > > > > approach as input (configurability) might work fine for generic > > > > > > match-action tables up to the point that tables are expressive enough > > > > > > to satisfy the requirements. But parsing doesn't fall into the table > > > > > > driven paradigm: parsers want to be *programmed*. This is why we > > > > > > removed kParser from this patch set and fell back to eBPF for parsing. > > > > > > But the problem we quickly hit that eBPF is not offloadable to network > > > > > > devices, for example when we compile P4 in an eBPF parser we've lost > > > > > > the declarative representation that parsers in the devices could > > > > > > consume (they're not CPUs running eBPF). > > > > > > > > > > > > I think the key here is what we mean by kernel offload. When we do > > > > > > kernel offload, is it the kernel implementation or the kernel > > > > > > functionality that's being offloaded? If it's the latter then we have > > > > > > a lot more flexibility. What we'd need is a safe and secure way to > > > > > > synchronize with that offload device that precisely supports the > > > > > > kernel functionality we'd like to offload. This can be done if both > > > > > > the kernel bits and programmed offload are derived from the same > > > > > > source (i.e. tag source code with a sha-1). For example, if someone > > > > > > writes a parser in P4, we can compile that into both eBPF and a P4 > > > > > > backend using independent tool chains and program download. At > > > > > > runtime, the kernel can safely offload the functionality of the eBPF > > > > > > parser to the device if it matches the hash to that reported by the > > > > > > device > > > > > > > > > > Good points. If I understand you correctly you're saying that parsers > > > > > are more complex than just a basic parsing tree a'la u32. > > > > > > > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs > > > > isn't conducive to u32. We also want the advantages of compiler > > > > optimizations to unroll loops, squash nodes in the parse graph, etc. > > > > > > > > > Then we can take this argument further. P4 has grown to encompass a lot > > > > > of functionality of quite complex devices. How do we square that with > > > > > the kernel functionality offload model. If the entire device is modeled, > > > > > including f.e. TSO, an offload would mean that the user has to write > > > > > a TSO implementation which they then load into TC? That seems odd. > > > > > > > > > > IOW I don't quite know how to square in my head the "total > > > > > functionality" with being a TC-based "plugin". > > > > > > > > Hi Jakub, > > > > > > > > I believe the solution is to replace kernel code with eBPF in cases > > > > where we need programmability. This effectively means that we would > > > > ship eBPF code as part of the kernel. So in the case of TSO, the > > > > kernel would include a standard implementation in eBPF that could be > > > > compiled into the kernel by default. 
The restricted C source code is > > > > tagged with a hash, so if someone wants to offload TSO they could > > > > compile the source into their target and retain the hash. At runtime > > > > it's a matter of querying the driver to see if the device supports the > > > > TSO program the kernel is running by comparing hash values. Scaling > > > > this, a device could support a catalogue of programs: TSO, LRO, > > > > parser, IPtables, etc., If the kernel can match the hash of its eBPF > > > > code to one reported by the driver then it can assume functionality is > > > > offloadable. This is an elaboration of "device features", but instead > > > > of the device telling us they think they support an adequate GRO > > > > implementation by reporting NETIF_F_GRO, the device would tell the > > > > kernel that they not only support GRO but they provide identical > > > > functionality of the kernel GRO (which IMO is the first requirement of > > > > kernel offload). > > > > > > > > Even before considering hardware offload, I think this approach > > > > addresses a more fundamental problem to make the kernel programmable. > > > > Since the code is in eBPF, the kernel can be reprogrammed at runtime > > > > which could be controlled by TC. This allows local customization of > > > > kernel features, but also is the simplest way to "patch" the kernel > > > > with security and bug fixes (nobody is ever excited to do a kernel > > > > > > [..] > > > > > > > rebase in their datacenter!). Flow dissector is a prime candidate for > > > > this, and I am still planning to replace it with an all eBPF program > > > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf). > > > > > > So you're suggesting to bundle (and extend) > > > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along > > > similar lines here. We load this program manually right now, shipping > > > and autoloading with the kernel will be easer. > > > > Hi Stanislav, > > > > Yes, I envision that we would have a standard implementation of > > flow-dissector in eBPF that is shipped with the kernel and autoloaded. > > However, for the front end source I want to move away from imperative > > code. As I mentioned in the presentation flow_dissector.c is spaghetti > > code and has been prone to bugs over the years especially whenever > > someone adds support for a new fringe protocol (I take the liberty to > > call it spaghetti code since I'm partially responsible for creating > > this mess ;-) ). > > > > The problem is that parsers are much better represented by a > > declarative rather than an imperative representation. To that end, we > > defined PANDA which allows constructing a parser (parse graph) in data > > structures in C. We use the "PANDA parser" to compile C to restricted > > C code which looks more like eBPF in imperative code. With this method > > we abstract out all the bookkeeping that was often the source of bugs > > (like pulling up skbufs, checking length limits, etc.). The other > > advantage is that we're able to find a lot more optimizations if we > > start with a right representation of the problem. > > > > If you're interested, the video presentation on this is in > > https://www.youtube.com/watch?v=zVnmVDSEoXc. > > Oh, yeah, I've seen this one. Agreed that the C implementation is not > pleasant and generating a parser from some declarative spec is a better > idea. 
> > From my pow, the biggest win we get from making bpf flow dissector > pluggable is the fact that we can now actually write some tests for it Yes, extracting out functions from the kernel allows them to be independently unit tested. It's an even bigger win if the same source code is used for offloading the functionality as I described. We can call this "Test once, run anywhere!" Tom > (and, maybe, fuzz it?). We should also probably spend more time properly > defining the behavior of the existing C implementation. We've seen > some interesting bugs like this one: > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7
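[Editorial note: the "device features keyed by program hash" idea above can be illustrated with a small sketch. Everything in it, the structure, the digest length and the driver callback, is hypothetical; no such interface exists today.]

/*
 * Illustrative sketch only: offload a kernel function (e.g. the eBPF flow
 * dissector) only if the device reports a digest matching the digest of the
 * source the kernel's own copy was built from.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PROG_DIGEST_LEN 20	/* e.g. sha-1 of the restricted-C source */

struct offloadable_prog {
	const char *name;			/* "flow_dissector", "tso", ... */
	uint8_t digest[PROG_DIGEST_LEN];	/* digest the kernel's copy was built from */
};

/* Hypothetical driver hook: report the digest of the function the FW implements. */
typedef int (*get_offload_digest_t)(const char *name, uint8_t *digest);

static bool prog_is_offloadable(const struct offloadable_prog *p,
				get_offload_digest_t get_digest)
{
	uint8_t dev_digest[PROG_DIGEST_LEN];

	if (get_digest(p->name, dev_digest))
		return false;	/* device does not implement this function at all */

	/* Offload only on an exact functional match, not a coarse feature bit. */
	return memcmp(p->digest, dev_digest, PROG_DIGEST_LEN) == 0;
}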