[net-next,v12,00/15] Introducing P4TC (series 1)

Message ID 20240225165447.156954-1-jhs@mojatatu.com (mailing list archive)

Message

Jamal Hadi Salim Feb. 25, 2024, 4:54 p.m. UTC
This is the first of two patchsets. In this patchset we are submitting 15 patches
which cover the minimal viable P4 PNA architecture.

__Description of these Patches__

Patch #1 adds infrastructure for per-netns P4 actions that can be created on
an as-needed basis per the P4 program's requirements. This patch makes a small
incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
effect on the classical tc actions (for example, patch #2 just increases the size
of the action names from 16->64B).
Patch 5 adds infrastructure support for preallocation of dynamic actions.

The core P4TC code implements several P4 objects.
1) Patch #6 introduces P4 data types which are consumed by the rest of the code.
2) Patch #7 introduces the templating API, i.e. CRUD commands for templates.
3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
   for P4 pipelines.
4) Patch #9 introduces the action templates and associated CRUD commands.
5) Patch #10 introduces the action runtime infrastructure.
6) Patch #11 introduces the concept of P4 table templates and associated
   CRUD commands for tables.
7) Patch #12 introduces the runtime table entry infra and associated CU commands.
8) Patch #13 introduces the runtime table entry infra and associated RD commands.
9) Patch #14 introduces interaction of eBPF with P4TC tables via kfuncs.
10) Patch #15 introduces the TC classifier P4 used at runtime.

Daniel, please look again at patch #15.

There are a few more patches (5) not in this patchset that deal with test
cases, etc.

What is P4?
-----------

Programming Protocol-independent Packet Processors (P4) is an open-source,
domain-specific programming language for specifying data plane behavior.

The current P4 landscape includes an extensive range of deployments, products,
projects, services, etc.[9][12]. Two major NIC vendors, Intel[10] and AMD[11],
currently offer P4-native NICs. P4 is currently curated by the Linux
Foundation[9].

On why P4, see the small treatise here: [4].

What is P4TC?
-------------

P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
and its associated objects and state are attached to a kernel _netns_ structure.
IOW, if we had two programs, whether across netns' or within the same netns, they
would have no visibility into each other's objects (unlike, for example, TC
actions whose kinds are "global" in nature, or eBPF maps vis-a-vis bpftool).

P4TC builds on top of many years of Linux TC experience with a netlink control
path interface coupled with a software datapath that has an equivalent
offloadable hardware datapath. In this patch series we are focusing only on the
s/w datapath. The s/w and h/w path equivalence that TC provides is relevant
for a primary use case of P4, where some (currently) large consumers of NICs
provide vendors their datapath specs in P4. In such a case one could generate
the specified datapaths in s/w and test/validate the requirements before
hardware acquisition (example: [12]).

Unlike other approaches such as TC Flower, which require kernel and user space
changes when new datapath objects like packet headers are introduced, P4TC, with
these patches, provides _kernel and user space code change independence_.
Meaning:
A P4 program describes headers, parsers, etc. alongside the datapath processing;
the compiler uses the P4 program as input and generates several artifacts which
are then loaded into the kernel to manifest the intended datapath. In addition
to the generated datapath, control path constructs are generated. The process is
described further below in "P4TC Workflow".

There have been many discussions and meetings within the community since
about 2015 regarding P4 over TC[2], and we are finally proving to the
naysayers that we do get stuff done!

A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

__P4TC Architecture__

The current architecture was described at netdevconf 0x17[14], and if you prefer
academic conference papers, a short paper is available here[15].

There are 4 parts:

1) A Template CRUD provisioning API for manifesting a P4 program and its
associated objects in the kernel. The template provisioning API uses netlink.
See patch in part 2.

2) A Runtime CRUD+ API which is used for controlling the runtime behavior of
the different P4 objects. The runtime API uses netlink. See notes further
down and the patch descriptions later.

3) P4 objects and their control interfaces: tables, actions, externs, etc.
Any object that requires control plane interaction resides in the TC domain
and is subject to the CRUD runtime API.  The intended goal is to make use of the
tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.

4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
by a compiler based on the P4 spec. When accessing any P4 object that requires
control plane interfaces, the eBPF code accesses the P4TC side from #3 above
using kfuncs.

The generated eBPF code is derived from [13] with enhancements and fixes to meet
our requirements.

__P4TC Workflow__

The Development and instantiation workflow for P4TC is as follows:

  A) A developer writes a P4 program, "myprog"

  B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:

     a) A shell script which forms the template definitions for the different P4
     objects that "myprog" utilizes (tables, externs, actions, etc). See #1 above.

     b) The parser and the rest of the datapath are generated as eBPF and need
     to be compiled into binaries. At the moment the parser and the main control
     block are generated as separate eBPF programs, but this could change in
     the future (without affecting any kernel code). See #4 above.

     c) A json introspection file used for the control plane (by iproute2/tc).

  C) At this point the artifacts from #1 and #4 could be handed to an operator
     (the operator could be the same person as the developer from #A and #B).

     i) For the eBPF part, the operator is either handed an eBPF binary or
     source which they compile at this point into a binary (a compile sketch
     is shown after the instantiation example below).
     The operator executes the shell script(s) to manifest the functional
     "myprog" into the kernel.

     ii) The operator instantiates the "myprog" pipeline via the tc P4 filter
     on ingress/egress (depending on the P4 arch) of one or more netdevs/ports
     (illustrated below as "block 22").

     Example instantiation where the parser is a separate action:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        action bpf obj $PARSER.o section p4tc/parse \
        action bpf obj $PROGNAME.o section p4tc/main"
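
     For C.i above, a minimal compile sketch (assuming the compiler emitted the
     eBPF sources as parser.c and main.c; the file names are illustrative and
     real builds will need the appropriate include paths):
        ..parser.c/main.c are stand-in names for the compiler-emitted sources
        clang -O2 -g -target bpf -c parser.c -o $PARSER.o
        clang -O2 -g -target bpf -c main.c -o $PROGNAME.o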

See the individual patches in part 2 for more examples (tc vs xdp, etc). Also
see the section on "challenges" (further below in this cover letter).

Once "myprog" P4 program is instantiated one can start performing operations
on table entries and/or actions at runtime as described below.

__P4TC Runtime Control Path__

The control interface builds on past tc experience and tries to get things
right from the beginning (for example, filtering is made generic instead of
depending on existing object TLVs); also, the code is written in
such a way that it is mostly lockless.

The P4TC control interface, using netlink, provides what we call a CRUDPS
abstraction, which stands for: Create, Read (get), Update, Delete, Subscribe,
Publish.  From a high level PoV the following describes a conformant high level
API (both at the netlink data model and code level):

	Create(</path/to/object, DATA>+)
	Read(</path/to/object>, [optional filter])
	Update(</path/to/object, DATA>+)
	Delete(</path/to/object>, [optional filter])
	Subscribe(</path/to/object>, [optional filter])

Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
to a table then a "Delete" implies "flush" and a "Read" implies a dump, but if
it points to an entry (by specifying a key) then "Delete" implies deleting
an entry and "Read" implies reading that single entry. It should be noted that
both "Delete" and "Read" take an optional filter parameter. The filter can
define further refinements to what the control plane wants read or deleted.
"Subscribe" uses built-in netlink event management. It, as well, takes a filter
which can further refine what events get generated to the control plane (taken
out of this patchset, to be re-added with consideration of [16]).

Let's show some runtime samples:

..create an entry, if we match ip address 10.0.1.2 send packet out eno1
  tc p4ctrl create myprog/table/mytable \
   dstAddr 10.0.1.2/32 action send_to_port param port eno1

..Batch create entries
  tc p4ctrl create myprog/table/mytable \
  entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
  entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
  entry dstAddr 10.0.2.2/32  action send_to_port param port eno2

..Get an entry (note: "read" is used interchangeably with "get", which is a
		common semantic in tc):
  tc p4ctrl read myprog/table/mytable \
   dstAddr 10.0.2.2/32

..dump mytable
  tc p4ctrl read myprog/table/mytable

..dump mytable for all entries whose key fits within 10.1.0.0/16
  tc p4ctrl read myprog/table/mytable \
  filter key/myprog/mytable/dstAddr = 10.1.0.0/16

..dump all mytable entries which have an action send_to_port with param "eno1"
  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1"

The filter expression is powerful; for example, you could say:

  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1" && \
         key/myprog/mytable/dstAddr = 10.1.0.0/16

It also works on built-in metadata; for example, the following dumps entries
from mytable that have seen activity in the last 10 secs:
  tc p4ctrl get myprog/table/mytable \
  filter msecs_since < 10000

Delete follows the same syntax as get/read, so for the sake of brevity we only
show how to flush mytable and one filtered delete (sketched after it):

  tc p4ctrl delete myprog/table/mytable
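
..a sketch of a filtered delete (following the same filter grammar as the
..get/read examples above): remove only entries whose key falls within 10.1.0.0/16
  tc p4ctrl delete myprog/table/mytable \
  filter key/myprog/mytable/dstAddr = 10.1.0.0/16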

Mystery question: How do we achieve iproute2-kernel independence, and how does
"tc p4ctrl", as a cli, know how to program the kernel given an arbitrary command
line as shown above? Answer(s): It queries the compiler-generated json file from
"P4TC Workflow" #B.c above. The json file has enough details to figure out that
we have a program called "myprog" which has a table "mytable" with a key named
"dstAddr" which happens to be of type ipv4 address prefix. The json file also
provides details to show that the table "mytable" supports an action called
"send_to_port" which accepts a parameter "port" of type netdev (see the types
patch for all supported P4 data types).
All P4 components have names, IDs, and types - so this makes it very easy to map
into netlink.
Once user space tc/p4ctrl validates the human command input, it creates
standard binary netlink structures (TLVs etc) which are sent to the kernel.
See the runtime table entry patch for more details.

__P4TC Datapath__

The P4TC s/w datapath execution is generated as eBPF. Any objects that require
control interfacing reside in the "P4TC domain" and are controlled via netlink
as described above. Per-packet execution and state, and even objects that do not
require control interfacing (like the P4 parser), are generated as eBPF.

A packet arriving on s/w ingress of any of the ports on block 22 will first be
exercised via the (generated eBPF) parser component to extract the headers (the
ip destination address labelled "dstAddr" above).
The datapath then proceeds to use "dstAddr", the table ID and the pipeline ID
as a key to do a lookup in myprog's "mytable", which returns the action params
that are then used to execute the action in the eBPF datapath (eventually
sending out packets to eno1).
On a table miss, mytable's default miss action (not described here) is executed.

__Testing__

Speaking of testing - we have 200-300 tdc test cases (which will be in the
second patchset).
These tests are run on our CICD system on pull requests and after commits are
approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
input) including:
checkpatch, sparse, smatch, coccinelle, 32-bit and 64-bit builds tested on
x86 and ARM64, plus emulated BE via qemu s390. We trigger performance testing in
the CICD to catch performance regressions (currently only on the control path,
but in the future also for the datapath).
Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
memory sanitizer but recently added support for the concurrency sanitizer.
Before main releases we ensure each patch compiles on its own to help with
git bisect, and we run the xmas tree tool. We eventually run the code through
Coverity.
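
For reference, a sketch of driving the tdc suite from the kernel tree (the
"p4tc" category name here is an assumption; the actual tests land with the
second patchset):

  cd tools/testing/selftests/tc-testing
  ..the category name below is assumed, pending the second patchset
  ./tdc.py -c p4tc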

In addition, we are working on enabling a tool that will take a P4 program, run
it through the compiler, and generate permutations of traffic patterns via
symbolic execution that will exercise both positive and negative datapath code
paths. The test generator tool integration is still a work in progress.
Also: we have other code that tests parallelization, etc., which we are trying
to find a fit for in the kernel tree's testing infra.


__References__

[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main
[13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
[14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
[15]https://dl.acm.org/doi/10.1145/3630047.3630193
[16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
[17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
[17.b]man tc-u32
[18]man tc-pedit
[19]https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
[20.a]https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
[20.b]https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html

--------
HISTORY
--------

Changes in Version 12
----------------------

0) Bring back 15 patches (v11 had 5).

1) From discussions with Daniel:
   i) Remove the XDP program association altogether. No refcounting, nothing.
   ii) Remove prog type tc - everything is now an eBPF tc action.

2) s/PAD0/__pad0/g. Thanks to Marcelo.

3) Add extack messages to specify how many of the entries (N of M) specified in
   a batched Create/Update/Delete request succeeded. Prior to this it would
   only tell us that the batch failed to complete, without giving us details of
   which of the M failed. Added as a debug aid.

Changes in Version 11
----------------------
1) Split the series into two. Original patches 1-5 in this patchset. The rest
   will go out after this is merged.

2) Change any references of IFNAMSIZ in the action code when referencing the
   action name size to ACTNAMSIZ. Thanks to Marcelo.

Changes in Version 10
----------------------
1) A couple of patches from the earlier version were clean enough to submit,
   so we did. This gave us room to split the two largest patches each into
   two. Even though the split is not git-bisectable and some of it didn't
   make much sense (e.g. splitting create and update into one patch and delete
   and get into another), we made sure each of the split patches compiled
   independently. The idea is to reduce the number of lines of code to review,
   and when we get sufficient reviews we will put the splits together again.
   See patches #12 and #13 as well as patches #7 and #8.

2) Add more context in patch 0. Please READ!

3) Added dump/delete filters back to the code - we had taken them out in the
   earlier patches to reduce the amount of code for review - but in retrospect
   we feel they are important enough to push earlier rather than later.


Changes In version 9
---------------------

1) Remove the largest patch (externs) to ease review.

2) Break up the action patches into two to ease review, bringing down the
   patches that need more scrutiny to 8 (the first 7 are almost trivial).

3) Fix up the prefix naming convention to p4tc_xxx for uapi and p4a_xxx for
   actions to provide consistency (Jiri).

4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
   by making them static. TBH, not sure if this is the right solution
   but it makes sparse happy and hopefully someone will comment.

Changes In Version 8
---------------------

1) Fix all the patchwork warnings and improve our ci to catch them in the future

2) Reduce the number of patches to a basic max (15) to ease review.

Changes In Version 7
-------------------------

0) First time removing the RFC tag!

1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!),
that using bpf links is sufficient to protect us from someone replacing or
deleting an eBPF program after it has been bound to a netdev.

2) Add some reviewed-bys from Vlad.

3) Small bug fixes from v6 based on testing for ebpf.

4) Added the counter extern as a sample extern. We illustrate this example
   because it is slightly complex, since it is possible to invoke it directly
   from the P4TC domain (in the case of direct counters) or from eBPF (indirect
   counters). It is not exactly the most efficient implementation (a reasonable
   counter impl should be per-cpu).

Changes In RFC Version 6
-------------------------

1) Completed integration from the scriptable view to eBPF. Completed the
   integration of externs.

2) Small bug fixes from v5 based on testing.

Changes In RFC Version 5
-------------------------

1) More integration from scriptable view to eBPF. Small bug fixes from last
   integration.

2) More streamlining support of externs via kfunc (create-on-miss, etc)

3) eBPF linking for XDP.

There is more eBPF integration/streamlining coming (we are getting close to
completing the conversion from the scriptable domain).

Changes In RFC Version 4
-------------------------

1) More integration from scriptable to eBPF. Small bug fixes.

2) More streamlining support of externs via kfunc (one additional kfunc).

3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.

There is more eBPF integration coming. One thing we looked at, but which is not
in this patchset and should be in the next, is the use of eBPF links in our
loading (see "challenge #1" further below).

Changes In RFC Version 3
-------------------------

These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that are used in V1 and V2 but are no
longer used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:

1) The feedback we got in V2 was to try to stick to one of the two modes. In
this version we take one more step and go the path of mode 2, vs V2 where we
had 2 modes.

2) The P4 Register extern is no longer standalone. Instead, as part of
integrating into eBPF, we introduce another kfunc which encapsulates Register
as part of the extern interface.

3) We have improved our CICD to include tools pointed out to us by Simon. See
   "Testing" further below. Thanks to Simon for that and the other issues he
   caught. Simon, we discussed issue [7] but decided to keep that log since we
   think it is useful.

4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
   re-discuss though; see: [5], [6].

5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.

6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
   guaranteed that either A or B must exist; however, let's make smatch happy.
   Thanks to Simon and Dan Carpenter.

Changes In RFC Version 2
-------------------------

Version 2 is the initial integration of the eBPF datapath.
We took into consideration the suggestions to use eBPF and put effort into
analyzing eBPF as a datapath, which involved extensive testing.
We implemented 6 approaches with eBPF, ran performance analysis, and presented
our results at the P4 2023 workshop in Santa Clara [see: 1, 3] on each of the 6
vs the scriptable P4TC, and concluded that 2 of the approaches are sensible (4
if you account for XDP and TC separately).

Conclusions from the exercise: we lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying down the eBPF path,
please scroll down to "Restating Our Requirements" and "Challenges".

This patch set presented two modes.
mode1: the parser is entirely based on eBPF, whereas the rest of the
SW datapath stays _scriptable_ as in Version 1.
mode2: all of the kernel s/w datapath (including the parser) is in eBPF.

The key ingredient for eBPF that we did not have access to in the past is
kfunc (it made a big difference for us to reconsider eBPF).

In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).


Jamal Hadi Salim (15):
  net: sched: act_api: Introduce P4 actions list
  net/sched: act_api: increase action kind string length
  net/sched: act_api: Update tc_action_ops to account for P4 actions
  net/sched: act_api: add struct p4tc_action_ops as a parameter to
    lookup callback
  net: sched: act_api: Add support for preallocated P4 action instances
  p4tc: add P4 data types
  p4tc: add template API
  p4tc: add template pipeline create, get, update, delete
  p4tc: add template action create, update, delete, get, flush and dump
  p4tc: add runtime action support
  p4tc: add template table create, update, delete, get, flush and dump
  p4tc: add runtime table entry create and update
  p4tc: add runtime table entry get, delete, flush and dump
  p4tc: add set of P4TC table kfuncs
  p4tc: add P4 classifier

 include/linux/bitops.h            |    1 +
 include/net/act_api.h             |   23 +-
 include/net/p4tc.h                |  675 +++++++
 include/net/p4tc_types.h          |   91 +
 include/net/tc_act/p4tc.h         |   78 +
 include/uapi/linux/p4tc.h         |  442 +++++
 include/uapi/linux/pkt_cls.h      |   15 +
 include/uapi/linux/rtnetlink.h    |   18 +
 include/uapi/linux/tc_act/tc_p4.h |   11 +
 net/sched/Kconfig                 |   23 +
 net/sched/Makefile                |    3 +
 net/sched/act_api.c               |  192 +-
 net/sched/cls_api.c               |    2 +-
 net/sched/cls_p4.c                |  305 +++
 net/sched/p4tc/Makefile           |    8 +
 net/sched/p4tc/p4tc_action.c      | 2397 +++++++++++++++++++++++
 net/sched/p4tc/p4tc_bpf.c         |  342 ++++
 net/sched/p4tc/p4tc_filter.c      |  870 ++++++++
 net/sched/p4tc/p4tc_pipeline.c    |  700 +++++++
 net/sched/p4tc/p4tc_runtime_api.c |  145 ++
 net/sched/p4tc/p4tc_table.c       | 1834 +++++++++++++++++
 net/sched/p4tc/p4tc_tbl_entry.c   | 3047 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c    |  440 +++++
 net/sched/p4tc/p4tc_types.c       | 1407 +++++++++++++
 net/sched/p4tc/trace.c            |   10 +
 net/sched/p4tc/trace.h            |   44 +
 security/selinux/nlmsgtab.c       |   10 +-
 27 files changed, 13097 insertions(+), 36 deletions(-)
 create mode 100644 include/net/p4tc.h
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 include/uapi/linux/tc_act/tc_p4.h
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_action.c
 create mode 100644 net/sched/p4tc/p4tc_bpf.c
 create mode 100644 net/sched/p4tc/p4tc_filter.c
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c
 create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
 create mode 100644 net/sched/p4tc/p4tc_table.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
 create mode 100644 net/sched/p4tc/p4tc_types.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

Comments

John Fastabend Feb. 28, 2024, 5:11 p.m. UTC | #1
Jamal Hadi Salim wrote:
> This is the first patchset of two. In this patch we are submitting 15 which
> cover the minimal viable P4 PNA architecture.
> 
[...]

Although I appreciate a good amount of work went into building above I'll
add my concerns here so they are not lost. These are architecture concerns
not this line of code needs some tweak.

 - It encodes a DSL into the kernel. It's unclear how we pick which DSLs get
   pushed into the kernel and which do not. Do we take any DSL folks can code
   up?
   I would prefer a lower level intermediate language. My view is this is
   a lesson we should have learned from OVS. OVS had wider adoption and
   still struggled in some ways; my belief is this is very similar to OVS.
   (Also OVS was novel/great at a lot of things fwiw.)

 - We have a general purpose language in BPF that can implement the P4 DSL
   already. I don't see any need for another set of code when the end goal
   is running P4 in Linux network stack is doable. Typically we reject
   duplicate things when they don't have concrete benefits.

 - P4 as a DSL is not optimized for general purpose CPUs, but
   rather hardware pipelines. Although it can be optimized for CPUs, it's
   a harder problem. A review of some of the VPP/DPDK work here is useful.

 - The P4 infrastructure already has a p4c backend; this is adding another P4
   backend. Instead of getting the rather small group of people to work on
   a single backend, we are now creating another one.

 - Common reasons I think would justify a new P4 backend and implementation
   would be: speed/efficiency, or expressiveness. I think this
   implementation is neither more efficient nor more expressive. Concrete
   examples on expressiveness would be interesting, but I don't see any.
   Loops were mentioned once but latest kernels have loop support.

 - The main talking point for many slide decks about p4tc is hardware
   offload. This seems like the main benefit of pushing the P4 DSL into the
   kernel. But, we have no hw implementation, not even a vendor stepping up
   to comment on this implementation and how it will work for them. HW
   introduces all sorts of interesting problems that I don't see how we
   solve in this framework. For example a few off the top of my head:
   syncing current state into tc, how does operator program tc inside
   constraints, who writes the p4 models for these hardware devices, do
   they fit into this 'tc' infrastructure, partial updates into hardware
   seems unlikely to work for most hardware, ...

 - The kfuncs are mostly duplicates of map ops we already have in BPF API.
   The motivation by my read is to use netlink instead of bpf commands. I
   don't agree with this, optimizing for some low level debug a developer
   uses is the wrong design space. Actual users should not be deploying
   this via ssh into boxes. The workflow will not scale and really we need
   tooling and infra to land P4 programs across the network. This is orders
   of magnitude more pain if it's an endpoint solution and not a middlebox/switch
   solution. As a switch solution I don't see how p4tc sw scales to even TOR
   packet rates. So you need tooling on top, and users interact with the
   tooling, not the Linux widget/debugger at the bottom.

 - There is no performance analysis: The comment was functionality before
   performance which I disagree with. If it was a first implementation and
   we didn't have a way to do the P4 DSL already then I might agree, but here
   we have an existing solution so it should be at least as good and should
   be better than existing backend. A software datapath adoption is going
   to be critically based on performance. I don't see taking even a 5% hit
   when porting over to P4 from existing datapath.

Commentary: I think it's 100% correct to debate how the P4 DSL is
implemented in the kernel. I can't see why this is somehow off limits; this
patch set proposes one approach, and there could be many approaches. BPF comes
up not because I'm some BPF zealot that needs the P4 DSL in BPF, but because it
exists today and there is even a P4 backend. Fundamentally I don't see the
value add we get by creating two P4 pipelines; this is going to create
duplication all the way up from the P4 tooling/infra through to the kernel.
From your side you keep saying I'm bike shedding and demanding BPF, but
from my perspective you're introducing another entire toolchain simply
because you want some low level debug commands that 99% of P4 users should
not be using or caring about.

To try and be constructive some things that would change my mind would
be a vendor showing how hardware can be used. This would be compelling.
Or performance numbers showing it somehow gets a more performant implementation.
Or lastly if the current p4c implementation is fundamentally broken
somehow.

Thanks
John
Jamal Hadi Salim Feb. 28, 2024, 6:23 p.m. UTC | #2
On Wed, Feb 28, 2024 at 12:11 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > [...]
>
> [...]
>

John,
With all due respect, we are going back again over the same points,
recycled many times over, to which I have responded to you many times.
It's getting tiring.  This is exactly why I called it bikeshedding.
Let's just agree to disagree.

cheers,
jamal

> Thanks
> John
John Fastabend Feb. 28, 2024, 9:13 p.m. UTC | #3
Jamal Hadi Salim wrote:
> On Wed, Feb 28, 2024 at 12:11 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > [...]
> >
> 
> John,
> With all due respect, we are going back again over the same points,
> recycled many times over, to which I have responded to you many times.
> It's getting tiring.  This is exactly why I called it bikeshedding.
> Let's just agree to disagree.

Yep, we agree to disagree and I put them as a summary so others
can see them and think it over/decide where they stand on it. In the
end you don't need my ACK here, but I wanted my opinion summarized.

> 
> cheers,
> jamal
> 
> > Thanks
> > John
Paolo Abeni Feb. 29, 2024, 5:13 p.m. UTC | #4
On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> This is the first patchset of two. In this patch we are submitting 15 which
> cover the minimal viable P4 PNA architecture.
> 
> __Description of these Patches__
> 
> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> as need basis for the P4 program requirement. This patch makes a small incision
> into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> effect the classical tc action (example patch#2 just increases the size of the
> action names from 16->64B).
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
> 
> The core P4TC code implements several P4 objects.
> 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
>    for P4 pipelines.
> 4) Patch #9 introduces the action templates and associated CRUD commands.
> 5) Patch #10 introduce the action runtime infrastructure.
> 6) Patch #11 introduces the concept of P4 table templates and associated
>    CRUD commands for tables.
> 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> 10) Patch #15 introduces the TC classifier P4 used at runtime.
> 
> Daniel, please look again at patch #15.
> 
> There are a few more patches (5) not in this patchset that deal with test
> cases, etc.
> 
> What is P4?
> -----------
> 
> The Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
> 
> The current P4 landscape includes an extensive range of deployments, products,
> projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> currently offer P4-native NICs. P4 is currently curated by the Linux
> Foundation[9].
> 
> On why P4 - see small treatise here:[4].
> 
> What is P4TC?
> -------------
> 
> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> and its associated objects and state are attachend to a kernel _netns_ structure.
> IOW, if we had two programs across netns' or within a netns they have no
> visibility to each others objects (unlike for example TC actions whose kinds are
> "global" in nature or eBPF maps visavis bpftool).
> 
> P4TC builds on top of many years of Linux TC experiences of a netlink control
> path interface coupled with a software datapath with an equivalent offloadable
> hardware datapath. In this patch series we are focussing only on the s/w
> datapath. The s/w and h/w path equivalence that TC provides is relevant
> for a primary use case of P4 where some (currently) large consumers of NICs
> provide vendors their datapath specs in P4. In such a case one could generate
> specified datapaths in s/w and test/validate the requirements before hardware
> acquisition(example [12]).
> 
> Unlike other approaches such as TC Flower which require kernel and user space
> changes when new datapath objects like packet headers are introduced P4TC, with
> these patches, provides _kernel and user space code change independence_.
> Meaning:
> A P4 program describes headers, parsers, etc alongside the datapath processing;
> the compiler uses the P4 program as input and generates several artifacts which
> are then loaded into the kernel to manifest the intended datapath. In addition
> to the generated datapath, control path constructs are generated. The process is
> described further below in "P4TC Workflow".
> 
> There have been many discussions and meetings within the community since
> about 2015 in regards to P4 over TC[2] and we are finally proving to the
> naysayers that we do get stuff done!
> 
> A lot more of the P4TC motivation is captured at:
> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> 
> __P4TC Architecture__
> 
> The current architecture was described at netdevconf 0x17[14] and if you prefer
> academic conference papers, a short paper is available here[15].
> 
> There are 4 parts:
> 
> 1) A Template CRUD provisioning API for manifesting a P4 program and its
> associated objects in the kernel. The template provisioning API uses netlink.
> See patch in part 2.
> 
> 2) A Runtime CRUD+ API code which is used for controlling the different runtime
> behavior of the P4 objects. The runtime API uses netlink. See notes further
> down. See patch description later..
> 
> 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> Any object that requires control plane interaction resides in the TC domain
> and is subject to the CRUD runtime API.  The intended goal is to make use of the
> tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
> 
> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> by a compiler based on the P4 spec. When accessing any P4 object that requires
> control plane interfaces, the eBPF code accesses the P4TC side from #3 above
> using kfuncs.
> 
> The generated eBPF code is derived from [13] with enhancements and fixes to meet
> our requirements.
> 
> __P4TC Workflow__
> 
> The Development and instantiation workflow for P4TC is as follows:
> 
>   A) A developer writes a P4 program, "myprog"
> 
>   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> 
>      a) A shell script which form template definitions for the different P4
>      objects "myprog" utilizes (tables, externs, actions etc). See #1 above..
> 
>      b) the parser and the rest of the datapath are generated as eBPF and need
>      to be compiled into binaries. At the moment the parser and the main control
>      block are generated as separate eBPF program but this could change in
>      the future (without affecting any kernel code). See #4 above.
> 
>      c) A json introspection file used for the control plane (by iproute2/tc).
> 
>   C) At this point the artifacts from #1,#4 could be handed to an operator
>      (the operator could be the same person as the developer from #A, #B).
> 
>      i) For the eBPF part, either the operator is handed an ebpf binary or
>      source which they compile at this point into a binary.
>      The operator executes the shell script(s) to manifest the functional
>      "myprog" into the kernel.
> 
>      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
>      (illustrated below as "block 22").
> 
>      Example instantion where the parser is a separate action:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         action bpf obj $PARSER.o section p4tc/parse \
>         action bpf obj $PROGNAME.o section p4tc/main"
> 
> See individual patches in partc for more examples tc vs xdp etc. Also see
> section on "challenges" (further below on this cover letter).
> 
> Once "myprog" P4 program is instantiated one can start performing operations
> on table entries and/or actions at runtime as described below.
> 
> __P4TC Runtime Control Path__
> 
> The control interface builds on past tc experience and tries to get things
> right from the beginning (example filtering is separated from depending
> on existing object TLVs and made generic); also the code is written in
> such a way it is mostly lockless.
> 
> The P4TC control interface, using netlink, provides what we call a CRUDPS
> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> Publish.  From a high level PoV the following describes a conformant high level
> API (both on netlink data model and code level):
> 
> 	Create(</path/to/object, DATA>+)
> 	Read(</path/to/object>, [optional filter])
> 	Update(</path/to/object>, DATA>+)
> 	Delete(</path/to/object>, [optional filter])
> 	Subscribe(</path/to/object>, [optional filter])
> 
> Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
> to a table then a "Delete" implies "flush" and a "Read" implies dump, but if
> it points to an entry (by specifying a key) then "Delete" implies deleting
> that entry and "Read" implies reading that single entry. It should be noted that
> both "Delete" and "Read" take an optional filter parameter. The filter can
> define further refinements to what the control plane wants read or deleted.
> "Subscribe" uses built-in netlink event management. It, as well, takes a filter
> which can further refine what events get generated to the control plane (taken
> out of this patchset, to be re-added with consideration of [16]).
> 
> Let's show some runtime samples:
> 
> ..create an entry: if we match IP address 10.0.1.2, send the packet out eno1
>   tc p4ctrl create myprog/table/mytable \
>    dstAddr 10.0.1.2/32 action send_to_port param port eno1
> 
> ..Batch create entries
>   tc p4ctrl create myprog/table/mytable \
>   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
>   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
>   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
> 
> ..Get an entry (note "read" is interchangeably used as "get" which is a common
> 		semantic in tc):
>   tc p4ctrl read myprog/table/mytable \
>    dstAddr 10.0.2.2/32
> 
> ..dump mytable
>   tc p4ctrl read myprog/table/mytable
> 
> ..dump mytable for all entries whose key fits within 10.1.0.0/16
>   tc p4ctrl read myprog/table/mytable \
>   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
> 
> ..dump all mytable entries which have an action send_to_port with param "eno1"
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1"
> 
> The filter expression is powerful, e.g. you could say:
> 
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1" && \
>          key/myprog/mytable/dstAddr = 10.1.0.0/16
> 
> It also works on built-in metadata; for example, the following dumps
> entries from mytable that have seen activity in the last 10 secs:
>   tc p4ctrl get myprog/table/mytable \
>   filter msecs_since < 10000
> 
> Delete follows the same syntax as get/read, so for the sake of brevity we won't
> show more examples beyond how to flush mytable:
> 
>   tc p4ctrl delete myprog/table/mytable
> 
> Mystery question: How do we achieve iproute2-kernel independence and
> how does "tc p4ctrl" as a CLI know how to program the kernel given an
> arbitrary command line as shown above? Answer(s): It queries the
> compiler-generated json file from "P4TC Workflow" #B.c above. The json file has
> enough details to figure out that we have a program called "myprog" which has a
> table "mytable" whose key, named "dstAddr", happens to be of type ipv4
> address prefix. The json file also provides details to show that the table
> "mytable" supports an action called "send_to_port" which accepts a parameter
> "port" of type netdev (see the types patch for all supported P4 data types).
> All P4 components have names, IDs, and types - so this makes it very easy to map
> into netlink.
> Once user space tc/p4ctrl validates the human command input, it creates
> standard binary netlink structures (TLVs etc) which are sent to the kernel.
> See the runtime table entry patch for more details.
> 
> __P4TC Datapath__
> 
> The P4TC s/w datapath execution is generated as eBPF. Any objects that require
> control interfacing reside in the "P4TC domain" and are controlled via netlink
> as described above. Per-packet execution and state, and even objects that do not
> require control interfacing (like the P4 parser), are generated as eBPF.
> 
> A packet arriving on s/w ingress of any of the ports on block 22 will first be
> exercised via the (generated eBPF) parser component to extract the headers (the
> IP destination address, labelled "dstAddr" above).
> The datapath then proceeds to use "dstAddr", table ID and pipeline ID
> as a key to do a lookup in myprog's "mytable" which returns the action params
> which are then used to execute the action in the eBPF datapath (eventually
> sending out packets to eno1).
> On a table miss, mytable's default miss action (not described) is executed.
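>
> To tie the walk-through above to code, below is a simplified, illustrative
> sketch of what the generated control block's table lookup could look like.
> The struct layouts, the kfunc name and its signature are placeholders for
> exposition only (the actual table kfunc interface is the one introduced in
> patch #14), and the numeric IDs stand in for the ones assigned when the
> template script instantiates "myprog" and "mytable":
>
>   #include <linux/bpf.h>
>   #include <linux/pkt_cls.h>
>   #include <bpf/bpf_helpers.h>
>
>   #define MYPROG_PIPEID  1        /* placeholder pipeline ID for "myprog" */
>   #define MYTABLE_TBLID  1        /* placeholder table ID for "mytable" */
>
>   struct mytable_key {
>           __u32 pipeid;           /* pipeline ID */
>           __u32 tblid;            /* table ID */
>           __be32 dstAddr;         /* IP destination address from the parser */
>   };
>
>   struct send_to_port_params {
>           __u32 port;             /* egress ifindex, e.g. the one for eno1 */
>   };
>
>   /* Placeholder standing in for the P4TC table-read kfunc from patch #14 */
>   extern struct send_to_port_params *
>   p4tc_table_lookup(struct __sk_buff *skb, struct mytable_key *key,
>                     __u32 key_sz) __ksym;
>
>   static __always_inline int run_mytable(struct __sk_buff *skb, __be32 dstAddr)
>   {
>           struct mytable_key key = {
>                   .pipeid  = MYPROG_PIPEID,
>                   .tblid   = MYTABLE_TBLID,
>                   .dstAddr = dstAddr,
>           };
>           struct send_to_port_params *act;
>
>           act = p4tc_table_lookup(skb, &key, sizeof(key));
>           if (!act)
>                   /* miss: the real program runs mytable's default miss action */
>                   return TC_ACT_SHOT;
>
>           /* hit: execute send_to_port with the control-plane-provided param */
>           return bpf_redirect(act->port, 0);
>   }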
> 
> __Testing__
> 
> Speaking of testing - we have 200-300 tdc test cases (which will be in the
> second patchset).
> These tests are run on our CICD system on pull requests and after commits are
> approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
> input) including:
> checkpatch, sparse, smatch, coccinelle, and 32-bit and 64-bit builds tested on
> X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
> CICD to catch performance regressions (currently only on the control path, but
> in the future for the datapath).
> Syzkaller runs 24/7 on dedicated hardware; originally we focussed only on the
> memory sanitizer but recently added support for the concurrency sanitizer.
> Before main releases we ensure each patch will compile on its own to help with
> git bisect, and we run the xmas tree tool. We eventually run the code through Coverity.
> 
> In addition we are working on enabling a tool that will take a P4 program, run
> it through the compiler, and generate permutations of traffic patterns via
> symbolic execution that will test both positive and negative datapath code
> paths. The test generator tool integration is still work in progress.
> Also: We have other code that tests parallelization etc which we are trying to
> find a fit for in the kernel tree's testing infra.
> 
> 
> __References__
> 
> [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> [9]https://p4.org/
> [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> [11]https://www.amd.com/en/accelerators/pensando
> [12]https://github.com/sonic-net/DASH/tree/main
> [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> [17.b]man tc-u32
> [18]man tc-pedit
> [19]https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> [20.a]https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> [20.b]https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
> 
> --------
> HISTORY
> --------
> 
> Changes in Version 12
> ----------------------
> 
> 0) Introduce back 15 patches (v11 had 5)
> 
> 1) From discussions with Daniel:
>    i) Remove the XDP programs association altogether. No refcounting, nothing.
>    ii) Remove prog type tc - everything is now an eBPF tc action.
> 
> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
> 
> 3) Add extack to specify how many entries (N of M) specified in a batch for
>    any requested Create/Update/Delete succeeded. Prior to this it would
>    only tell us that the batch failed to complete without giving us details of
>    which of the M failed. Added as a debug aid.
> 
> Changes in Version 11
> ----------------------
> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>    will go out after this is merged.
> 
> 2) Change any references of IFNAMSIZ in the action code when referencing the
>    action name size to ACTNAMSIZ. Thanks to Marcelo.
> 
> Changes in Version 10
> ----------------------
> 1) A couple of patches from the earlier version were clean enough to submit,
>    so we did. This gave us room to split the two largest patches each into
>    two. Even though the split is not git-bisectable and really some of it didn't
>    make much sense (e.g. splitting a create and update into one patch and a delete
>    and get into another) we made sure each of the split patches compiled
>    independently. The idea is to reduce the number of lines of code to review
>    and when we get sufficient reviews we will put the splits together again.
>    See patches #12 and #13 as well as patches #7 and #8.
> 
> 2) Add more context in patch 0. Please READ!
> 
> 3) Added dump/delete filters back to the code - we had taken them out in the
>    earlier patches to reduce the amount of code for review - but in retrospect
>    we feel they are important enough to push earlier rather than later.
> 
> 
> Changes In version 9
> ---------------------
> 
> 1) Remove the largest patch (externs) to ease review.
> 
> 2) Break up action patches into two to ease review, bringing down the patches
>    that need more scrutiny to 8 (the first 7 are almost trivial).
> 
> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
>    to provide consistency (Jiri).
> 
> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
>    by making them static. TBH, not sure if this is the right solution
>    but it makes sparse happy and hopefully someone will comment.
> 
> Changes In Version 8
> ---------------------
> 
> 1) Fix all the patchwork warnings and improve our CI to catch them in the future.
> 
> 2) Reduce the number of patches to a basic max (15) to ease review.
> 
> Changes In Version 7
> -------------------------
> 
> 0) First time removing the RFC tag!
> 
> 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
> using bpf links was sufficient to protect us from someone replacing or deleting
> an eBPF program after it has been bound to a netdev.
> 
> 2) Add some reviewed-bys from Vlad.
> 
> 3) Small bug fixes from v6 based on testing for ebpf.
> 
> 4) Added the counter extern as a sample extern. We illustrate this example because
>    it is slightly complex, since it is possible to invoke it directly from
>    the P4TC domain (in the case of direct counters) or from eBPF (indirect counters).
>    It is not exactly the most efficient implementation (a reasonable counter impl
>    should be per-cpu).
> 
> Changes In RFC Version 6
> -------------------------
> 
> 1) Completed integration from the scriptable view to eBPF. Completed integration
>    of externs.
> 
> 2) Small bug fixes from v5 based on testing.
> 
> Changes In RFC Version 5
> -------------------------
> 
> 1) More integration from scriptable view to eBPF. Small bug fixes from last
>    integration.
> 
> 2) More streamlining of extern support via kfuncs (create-on-miss, etc)
> 
> 3) eBPF linking for XDP.
> 
> There is more eBPF integration/streamlining coming (we are getting close to
> conversion from scriptable domain).
> 
> Changes In RFC Version 4
> -------------------------
> 
> 1) More integration from scriptable to eBPF. Small bug fixes.
> 
> 2) More streamlining of extern support via kfuncs (one additional kfunc).
> 
> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
> 
> There is more eBPF integration coming. One thing we looked at, which is not in this
> patchset but should be in the next, is the use of eBPF links in our loading (see
> "challenge #1" further below).
> 
> Changes In RFC Version 3
> -------------------------
> 
> These patches are still in a little bit of flux as we adjust to integrating
> eBPF. So there are small constructs that are used in V1 and 2 but no longer
> used in this version. We will make a V4 which will remove those.
> The changes from V2 are as follows:
> 
> 1) Feedback we got in V2 was to try to stick to one of the two modes. In this version
> we take one more step and go with mode2, vs V2 where we had 2 modes.
> 
> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> into eBPF we introduce another kfunc which encapsulates Register as part of the
> extern interface.
> 
> 3) We have improved our CICD to include tools pointed out to us by Simon. See
>    "Testing" further below. Thanks to Simon for that and other issues he caught.
>    Simon, we discussed issue [7] but decided to keep that log since we think
>    it is useful.
> 
> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
>    re-discuss though; see: [5], [6].
> 
> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
> 
> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
>    guaranteed that either A or B must exist; however, let's make smatch happy.
>    Thanks to Simon and Dan Carpenter.
> 
> Changes In RFC Version 2
> -------------------------
> 
> Version 2 is the initial integration of the eBPF datapath.
> We took into consideration suggestions provided to use eBPF and put effort into
> analyzing eBPF as the datapath, which involved extensive testing.
> We implemented 6 approaches with eBPF, ran performance analysis on each of the 6
> vs the scriptable P4TC, and presented our results at the P4 2023 workshop in
> Santa Clara [see: 1, 3]. We concluded that 2 of the approaches are sensible (4 if
> you count XDP and TC separately).
> 
> Conclusions from the exercise: We lose the simple operational model we had
> prior to integrating eBPF. We do gain performance in most cases when the
> datapath is less compute-bound.
> For more discussion on our requirements vs journeying the eBPF path please
> scroll down to "Restating Our Requirements" and "Challenges".
> 
> This patch set presented two modes.
> mode1: the parser is entirely based on eBPF - whereas the rest of the
> SW datapath stays as _scriptable_ as in Version 1.
> mode2: All of the kernel s/w datapath (including parser) is in eBPF.
> 
> The key ingredient for eBPF, that we did not have access to in the past, is
> kfunc (it made a big difference for us to reconsider eBPF).
> 
> In V2 the two modes are mutually exclusive (IOW, you get to choose one
> or the other via Kconfig).

I think/fear that this series has a "quorum" problem: different voices
raise opposition, and nobody (?) outside the authors has supported the
code and the feature.

Could the missing H/W offload support in the current form be the
root cause for such lack of support? Or are there interested parties that
have been quiet so far?

Thanks,

Paolo
Jamal Hadi Salim Feb. 29, 2024, 6:49 p.m. UTC | #5
On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > [...]
> I think/fear that this series has a "quorum" problem: different voices
> raise opposition, and nobody (?) outside the authors has supported the
> code and the feature.
>
> Could the missing H/W offload support in the current form be the
> root cause for such lack of support? Or are there interested parties that
> have been quiet so far?

Some of the people who attend our meetings and have a vested interest in
this are on Cc. But the cover letter is clear on this (right at the
top under "What is P4" and "What is P4TC").

cheers,
jamal


> Thanks,
>
> Paolo
>
>
John Fastabend Feb. 29, 2024, 8:52 p.m. UTC | #6
Jamal Hadi Salim wrote:
> On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > > This is the first patchset of two. In this patch we are submitting 15 which
> > > cover the minimal viable P4 PNA architecture.
> > >
> > > __Description of these Patches__
> > >
> > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> > > as need basis for the P4 program requirement. This patch makes a small incision
> > > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> > > effect the classical tc action (example patch#2 just increases the size of the
> > > action names from 16->64B).
> > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > >
> > > The core P4TC code implements several P4 objects.
> > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> > > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> > > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
> > >    for P4 pipelines.
> > > 4) Patch #9 introduces the action templates and associated CRUD commands.
> > > 5) Patch #10 introduce the action runtime infrastructure.
> > > 6) Patch #11 introduces the concept of P4 table templates and associated
> > >    CRUD commands for tables.
> > > 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> > > 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> > > 10) Patch #15 introduces the TC classifier P4 used at runtime.
> > >
> > > Daniel, please look again at patch #15.
> > >
> > > There are a few more patches (5) not in this patchset that deal with test
> > > cases, etc.
> > >
> > > What is P4?
> > > -----------
> > >
> > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > domain-specific programming language for specifying data plane behavior.
> > >
> > > The current P4 landscape includes an extensive range of deployments, products,
> > > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> > > currently offer P4-native NICs. P4 is currently curated by the Linux
> > > Foundation[9].
> > >
> > > On why P4 - see small treatise here:[4].
> > >
> > > What is P4TC?
> > > -------------
> > >
> > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> > > and its associated objects and state are attachend to a kernel _netns_ structure.
> > > IOW, if we had two programs across netns' or within a netns they have no
> > > visibility to each others objects (unlike for example TC actions whose kinds are
> > > "global" in nature or eBPF maps visavis bpftool).
> > >
> > > P4TC builds on top of many years of Linux TC experiences of a netlink control
> > > path interface coupled with a software datapath with an equivalent offloadable
> > > hardware datapath. In this patch series we are focussing only on the s/w
> > > datapath. The s/w and h/w path equivalence that TC provides is relevant
> > > for a primary use case of P4 where some (currently) large consumers of NICs
> > > provide vendors their datapath specs in P4. In such a case one could generate
> > > specified datapaths in s/w and test/validate the requirements before hardware
> > > acquisition(example [12]).
> > >
> > > Unlike other approaches such as TC Flower which require kernel and user space
> > > changes when new datapath objects like packet headers are introduced P4TC, with
> > > these patches, provides _kernel and user space code change independence_.
> > > Meaning:
> > > A P4 program describes headers, parsers, etc alongside the datapath processing;
> > > the compiler uses the P4 program as input and generates several artifacts which
> > > are then loaded into the kernel to manifest the intended datapath. In addition
> > > to the generated datapath, control path constructs are generated. The process is
> > > described further below in "P4TC Workflow".
> > >
> > > There have been many discussions and meetings within the community since
> > > about 2015 in regards to P4 over TC[2] and we are finally proving to the
> > > naysayers that we do get stuff done!
> > >
> > > A lot more of the P4TC motivation is captured at:
> > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > >
> > > __P4TC Architecture__
> > >
> > > The current architecture was described at netdevconf 0x17[14] and if you prefer
> > > academic conference papers, a short paper is available here[15].
> > >
> > > There are 4 parts:
> > >
> > > 1) A Template CRUD provisioning API for manifesting a P4 program and its
> > > associated objects in the kernel. The template provisioning API uses netlink.
> > > See patch in part 2.
> > >
> > > 2) A Runtime CRUD+ API code which is used for controlling the different runtime
> > > behavior of the P4 objects. The runtime API uses netlink. See notes further
> > > down. See patch description later..
> > >
> > > 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> > > Any object that requires control plane interaction resides in the TC domain
> > > and is subject to the CRUD runtime API.  The intended goal is to make use of the
> > > tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
> > >
> > > 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> > > by a compiler based on the P4 spec. When accessing any P4 object that requires
> > > control plane interfaces, the eBPF code accesses the P4TC side from #3 above
> > > using kfuncs.
> > >
> > > The generated eBPF code is derived from [13] with enhancements and fixes to meet
> > > our requirements.
> > >
> > > __P4TC Workflow__
> > >
> > > The Development and instantiation workflow for P4TC is as follows:
> > >
> > >   A) A developer writes a P4 program, "myprog"
> > >
> > >   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > >
> > >      a) A shell script which form template definitions for the different P4
> > >      objects "myprog" utilizes (tables, externs, actions etc). See #1 above..
> > >
> > >      b) the parser and the rest of the datapath are generated as eBPF and need
> > >      to be compiled into binaries. At the moment the parser and the main control
> > >      block are generated as separate eBPF program but this could change in
> > >      the future (without affecting any kernel code). See #4 above.
> > >
> > >      c) A json introspection file used for the control plane (by iproute2/tc).
> > >
> > >   C) At this point the artifacts from #1,#4 could be handed to an operator
> > >      (the operator could be the same person as the developer from #A, #B).
> > >
> > >      i) For the eBPF part, either the operator is handed an ebpf binary or
> > >      source which they compile at this point into a binary.
> > >      The operator executes the shell script(s) to manifest the functional
> > >      "myprog" into the kernel.
> > >
> > >      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
> > >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
> > >      (illustrated below as "block 22").
> > >
> > >      Example instantion where the parser is a separate action:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         action bpf obj $PARSER.o section p4tc/parse \
> > >         action bpf obj $PROGNAME.o section p4tc/main"
> > >
> > > See individual patches in partc for more examples tc vs xdp etc. Also see
> > > section on "challenges" (further below on this cover letter).
> > >
> > > Once "myprog" P4 program is instantiated one can start performing operations
> > > on table entries and/or actions at runtime as described below.
> > >
> > > __P4TC Runtime Control Path__
> > >
> > > The control interface builds on past tc experience and tries to get things
> > > right from the beginning (example filtering is separated from depending
> > > on existing object TLVs and made generic); also the code is written in
> > > such a way it is mostly lockless.
> > >
> > > The P4TC control interface, using netlink, provides what we call a CRUDPS
> > > abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> > > Publish.  From a high level PoV the following describes a conformant high level
> > > API (both on netlink data model and code level):
> > >
> > >       Create(</path/to/object, DATA>+)
> > >       Read(</path/to/object>, [optional filter])
> > >       Update(</path/to/object>, DATA>+)
> > >       Delete(</path/to/object>, [optional filter])
> > >       Subscribe(</path/to/object>, [optional filter])
> > >
> > > Note, we _dont_ treat "dump" or "flush" as speacial. If "path/to/object" points
> > > to a table then a "Delete" implies "flush" and a "Read" implies dump but if
> > > it points to an entry (by specifying a key) then "Delete" implies deleting
> > > and entry and "Read" implies reading that single entry. It should be noted that
> > > both "Delete" and "Read" take an optional filter parameter. The filter can
> > > define further refinements to what the control plane wants read or deleted.
> > > "Subscribe" uses built in netlink event management. It, as well, takes a filter
> > > which can further refine what events get generated to the control plane (taken
> > > out of this patchset, to be re-added with consideration of [16]).
> > >
> > > Lets show some runtime samples:
> > >
> > > ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
> > >   tc p4ctrl create myprog/table/mytable \
> > >    dstAddr 10.0.1.2/32 action send_to_port param port eno1
> > >
> > > ..Batch create entries
> > >   tc p4ctrl create myprog/table/mytable \
> > >   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
> > >   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
> > >   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
> > >
> > > ..Get an entry (note "read" is interchangeably used as "get" which is a common
> > >               semantic in tc):
> > >   tc p4ctrl read myprog/table/mytable \
> > >    dstAddr 10.0.2.2/32
> > >
> > > ..dump mytable
> > >   tc p4ctrl read myprog/table/mytable
> > >
> > > ..dump mytable for all entries whose key fits within 10.1.0.0/16
> > >   tc p4ctrl read myprog/table/mytable \
> > >   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
> > >
> > > ..dump all mytable entries which have an action send_to_port with param "eno1"
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter param/act/myprog/send_to_port/port = "eno1"
> > >
> > > The filter expression is powerful, f.e you could say:
> > >
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter param/act/myprog/send_to_port/port = "eno1" && \
> > >          key/myprog/mytable/dstAddr = 10.1.0.0/16
> > >
> > > It also works on built in metadata, example in the following case dumping
> > > entries from mytable that have seen activity in the last 10 secs:
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter msecs_since < 10000
> > >
> > > Delete follows the same syntax as get/read, so for sake of brevity we won't
> > > show more example than how to flush mytable:
> > >
> > >   tc p4ctrl delete myprog/table/mytable
> > >
> > > Mystery question: How do we achieve iproute2-kernel independence and
> > > how does "tc p4ctrl" as a cli know how to program the kernel given an
> > > arbitrary command line as shown above? Answer(s): It queries the
> > > compiler generated json file in "P4TC Workflow" #B.c above. The json file has
> > > enough details to figure out that we have a program called "myprog" which has a
> > > table "mytable" that has a key name "dstAddr" which happens to be type ipv4
> > > address prefix. The json file also provides details to show that the table
> > > "mytable" supports an action called "send_to_port" which accepts a parameter
> > > "port" of type netdev (see the types patch for all supported P4 data types).
> > > All P4 components have names, IDs, and types - so this makes it very easy to map
> > > into netlink.
> > > Once user space tc/p4ctrl validates the human command input, it creates
> > > standard binary netlink structures (TLVs etc) which are sent to the kernel.
> > > See the runtime table entry patch for more details.
> > >
> > > __P4TC Datapath__
> > >
> > > The P4TC s/w datapath execution is generated as eBPF. Any objects that require
> > > control interfacing reside in the "P4TC domain" and are controlled via netlink
> > > as described above. Per packet execution and state and even objects that do not
> > > require control interfacing (like the P4 parser) are generated as eBPF.
> > >
> > > A packet arriving on s/w ingress of any of the ports on block 22 will first be
> > > exercised via the (generated eBPF) parser component to extract the headers (the
> > > ip destination address in labelled "dstAddr" above).
> > > The datapath then proceeds to use "dstAddr", table ID and pipeline ID
> > > as a key to do a lookup in myprog's "mytable" which returns the action params
> > > which are then used to execute the action in the eBPF datapath (eventually
> > > sending out packets to eno1).
> > > On a table miss, mytable's default miss action (not described) is executed.
> > >
> > > __Testing__
> > >
> > > Speaking of testing - we have 2-300 tdc test cases (which will be in the
> > > second patchset).
> > > These tests are run on our CICD system on pull requests and after commits are
> > > approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
> > > input)including:
> > > checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on both
> > > X86, ARM 64 and emulated BE via qemu s390. We trigger performance testing in the
> > > CICD to catch performance regressions (currently only on the control path, but
> > > in the future for the datapath).
> > > Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory
> > > sanitizer but recently added support for concurrency sanitizer.
> > > Before main releases we ensure each patch will compile on its own to help in
> > > git bisect and run the xmas tree tool. We eventually put the code via coverity.
> > >
> > > In addition we are working on enabling a tool that will take a P4 program, run
> > > it through the compiler, and generate permutations of traffic patterns via
> > > symbolic execution that will test both positive and negative datapath code
> > > paths. The test generator tool integration is still work in progress.
> > > Also: We have other code that test parallelization etc which we are trying to
> > > find a fit for in the kernel tree's testing infra.
> > >
> > >
> > > __References__
> > >
> > > [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > > [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> > > [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> > > [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> > > [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> > > [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> > > [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> > > [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> > > [9]https://p4.org/
> > > [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> > > [11]https://www.amd.com/en/accelerators/pensando
> > > [12]https://github.com/sonic-net/DASH/tree/main
> > > [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> > > [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> > > [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> > > [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> > > [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> > > [17.b]man tc-u32
> > > [18]man tc-pedit
> > > [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> > > [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> > > [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
> > >
> > > --------
> > > HISTORY
> > > --------
> > >
> > > Changes in Version 12
> > > ----------------------
> > >
> > > 0) Introduce back 15 patches (v11 had 5)
> > >
> > > 1) From discussions with Daniel:
> > >    i) Remove the XDP programs association alltogether. No refcounting. nothing.
> > >    ii) Remove prog type tc - everything is now an ebpf tc action.
> > >
> > > 2) s/PAD0/__pad0/g. Thanks to Marcelo.
> > >
> > > 3) Add extack to specify how many entries (N of M) specified in a batch for
> > >    any of requested Create/Update/Delete succeeded. Prior to this it would
> > >    only tell us the batch failed to complete without giving us details of
> > >    which of M failed. Added as a debug aid.
> > >
> > > Changes in Version 11
> > > ----------------------
> > > 1) Split the series into two. Original patches 1-5 in this patchset. The rest
> > >    will go out after this is merged.
> > >
> > > 2) Change any references of IFNAMSIZ in the action code when referencing the
> > >    action name size to ACTNAMSIZ. Thanks to Marcelo.
> > >
> > > Changes in Version 10
> > > ----------------------
> > > 1) A couple of patches from the earlier version were clean enough to submit,
> > >    so we did. This gave us room to split the two largest patches each into
> > >    two. Even though the split is not git-bisectable and really some of it didn't
> > >    make much sense (e.g. splitting create and update into one patch and delete and
> > >    get into another), we made sure each of the split patches compiled
> > >    independently. The idea is to reduce the number of lines of code to review,
> > >    and when we get sufficient reviews we will put the splits together again.
> > >    See patches #12 and #13 as well as patches #7 and #8.
> > >
> > > 2) Add more context in patch 0. Please READ!
> > >
> > > 3) Added dump/delete filters back to the code - we had taken them out in the
> > >    earlier patches to reduce the amount of code for review - but in retrospect
> > >    we feel they are important enough to push earlier rather than later.
> > >
> > >
> > > Changes In version 9
> > > ---------------------
> > >
> > > 1) Remove the largest patch (externs) to ease review.
> > >
> > > 2) Break up action patches into two to ease review bringing down the patches
> > >    that need more scrutiny to 8 (the first 7 are almost trivial).
> > >
> > > 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
> > >    to provide consistency (Jiri).
> > >
> > > 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
> > >    by making them static. TBH, not sure if this is the right solution
> > >    but it makes sparse happy and hopefully someone will comment.
> > >
> > > Changes In Version 8
> > > ---------------------
> > >
> > > 1) Fix all the patchwork warnings and improve our ci to catch them in the future
> > >
> > > 2) Reduce the number of patches to a basic max (15) to ease review.
> > >
> > > Changes In Version 7
> > > -------------------------
> > >
> > > 0) First time removing the RFC tag!
> > >
> > > 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
> > > using bpf links was sufficient to protect us from someone replacing or deleting
> > > an eBPF program after it has been bound to a netdev.
> > >
> > > 2) Add some reviewed-bys from Vlad.
> > >
> > > 3) Small bug fixes from v6 based on testing for ebpf.
> > >
> > > 4) Added the counter extern as a sample extern. Illustrating this example because
> > >    it is slightly complex since it is possible to invoke it directly from
> > >    the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
> > >    It is not exactly the most efficient implementation (a reasonable counter impl
> > >    should be per-cpu).
> > >
> > > Changes In RFC Version 6
> > > -------------------------
> > >
> > > 1) Completed integration from the scriptable view to eBPF. Completed the
> > >    externs integration.
> > >
> > > 2) Small bug fixes from v5 based on testing.
> > >
> > > Changes In RFC Version 5
> > > -------------------------
> > >
> > > 1) More integration from scriptable view to eBPF. Small bug fixes from last
> > >    integration.
> > >
> > > 2) More streamlining of extern support via kfuncs (create-on-miss, etc).
> > >
> > > 3) eBPF linking for XDP.
> > >
> > > There is more eBPF integration/streamlining coming (we are getting close to
> > > conversion from scriptable domain).
> > >
> > > Changes In RFC Version 4
> > > -------------------------
> > >
> > > 1) More integration from scriptable to eBPF. Small bug fixes.
> > >
> > > 2) More streamlining of extern support via kfuncs (one additional kfunc).
> > >
> > > 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
> > >
> > > There is more eBPF integration coming. One thing we looked at but is not in this
> > > patchset but should be in the next is use of eBPF link in our loading (see
> > > "challenge #1" further below).
> > >
> > > Changes In RFC Version 3
> > > -------------------------
> > >
> > > These patches are still in a little bit of flux as we adjust to integrating
> > > eBPF. So there are small constructs that are used in V1 and 2 but no longer
> > > used in this version. We will make a V4 which will remove those.
> > > The changes from V2 are as follows:
> > >
> > > 1) Feedback we got in V2 was to try to stick to one of the two modes. In this version
> > > we are taking one more step and going the path of mode2, vs V2 where we had 2 modes.
> > >
> > > 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> > > into eBPF we introduce another kfunc which encapsulates Register as part of the
> > > extern interface.
> > >
> > > 3) We have improved our CICD to include tools pointed to us by Simon. See
> > >    "Testing" further below. Thanks to Simon for that and other issues he caught.
> > >    Simon, we discussed on issue [7] but decided to keep that log since we think
> > >    it is useful.
> > >
> > > 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
> > >    re-discuss though; see: [5], [6].
> > >
> > > 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
> > >
> > > 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
> > >    guaranteed that either A or B must exist; however, let's make smatch happy.
> > >    Thanks to Simon and Dan Carpenter.
> > >
> > > Changes In RFC Version 2
> > > -------------------------
> > >
> > > Version 2 is the initial integration of the eBPF datapath.
> > > We took into consideration suggestions provided to use eBPF and put effort into
> > > analyzing eBPF as datapath which involved extensive testing.
> > > We implemented 6 approaches with eBPF and ran performance analysis and presented
> > > our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
> > > vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
> > > you account for XDP or TC separately).
> > >
> > > Conclusions from the exercise: We lose the simple operational model we had
> > > prior to integrating eBPF. We do gain performance in most cases when the
> > > datapath is less compute-bound.
> > > For more discussion on our requirements vs journeying the eBPF path please
> > > scroll down to "Restating Our Requirements" and "Challenges".
> > >
> > > This patch set presented two modes.
> > > mode1: the parser is entirely based on eBPF - whereas the rest of the
> > > SW datapath stays as _scriptable_ as in Version 1.
> > > mode2: All of the kernel s/w datapath (including parser) is in eBPF.
> > >
> > > The key ingredient for eBPF, that we did not have access to in the past, is
> > > kfunc (it made a big difference for us to reconsider eBPF).
> > >
> > > In V2 the two modes are mutually exclusive (IOW, you get to choose one
> > > or the other via Kconfig).
> >
> > I think/fear that this series has a "quorum" problem: different voices
> > raise opposition, and nobody (?) outside the authors has supported the
> > code and the feature.
> >
> > Could the missing H/W offload support in the current form be the root
> > cause for such lack of support? Or are there parties interested that
> > have been quiet so far?

Yeah, agree with the h/w comment; I would be interested to hear from these
folks that have h/w. For me to get on board, the obvious things that would
be interesting are: (a) hardware offload, (b) some fundamental problem with
the existing p4c backend we already have, or (c) a significant performance
improvement.

> 
> Some of the people who attend our meetings and have vested interest in
> this are on Cc.  But the cover letter is clear on this (right at the
> top under "What is P4" and "what is P4TC").
> 
> cheers,
> jamal
> 
> 
> > Thanks,
> >
> > Paolo
> >
> >
>
Singhai, Anjali Feb. 29, 2024, 9:49 p.m. UTC | #7
From: Paolo Abeni <pabeni@redhat.com> 

> I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors
> has supported the code and the feature.

> Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> interested that have been quiet so far?

Hi,
   Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipeline (smart switch and smart NIC) prefer kernel standard APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4 capable HW and are invested in P4 as a dataplane specification.

- Customers run P4 dataplane in multiple targets including SW pipeline as well as programmable Switches and DPUs.
- A standardized kernel API and implementation brings in portability across vendors and across targets (CPU/SW and DPUs).
- A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. 
- This patch series helps create a SW pipeline and standard API.

Thanks,
Anjali
John Fastabend Feb. 29, 2024, 10:33 p.m. UTC | #8
Singhai, Anjali wrote:
> From: Paolo Abeni <pabeni@redhat.com> 
> 
> > I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors
> > has supported the code and the feature.
> 
> > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> > interested that have been quiet so far?
> 
> Hi,
>    Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipeline (smart switch and smart NIC) prefer kernel standard APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4 capable HW and are invested in P4 as a dataplane specification.

Great what hardware/driver and how do we get that code here so we can see
it working? Is the hardware available e.g. can I get ahold of one?

What is programmable on your devices? Is this 'just' the parser graph or
are you slicing up tables and so on? Is it an FPGA, DPU architecture or a
TCAM architecture? How do you reprogram the device? I somehow doubt it's
through a piecemeal ndo. But let me know if I'm wrong; maybe my internal
architecture details are dated. Fully speculating: is the interface a big
FW thunk to the device?

Without any details it's difficult to get community feedback on how the
hw programmable interface should work. The only reason I've even
bothered with this thread is I want to see P4 working.

Who owns the AMD side or some other vendor so we can get something that
works across at least two vendors, which is our usual bar for adding hw
offload things?

Note if you just want a kernel SW pipeline we already have that, so
I'm not seeing that as particularly motivating. Again, my point of view.
P4 as a dataplane specification is great but I don't see the connection
to this patchset without real hardware in a driver.

> 
> - Customers run P4 dataplane in multiple targets including SW pipeline as well as programmable Switches and DPUs.
> - A standardized kernel API and implementation brings in portability across vendors and across targets (CPU/SW and DPUs).
> - A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. 
> - This patch series helps create a SW pipeline and standard API.
> 
> Thanks,
> Anjali
>
Jamal Hadi Salim Feb. 29, 2024, 10:48 p.m. UTC | #9
On Thu, Feb 29, 2024 at 5:33 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Singhai, Anjali wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> >
> > > I think/fear that this series has a "quorum" problem: different voices raise opposition, and nobody (?) outside the authors
> > > has supported the code and the feature.
> >
> > > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> > > interested that have been quiet so far?
> >
> > Hi,
> >    Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipeline (smart switch and smart NIC) prefer kernel standard APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4 capable HW and are invested in P4 as a dataplane specification.
>
> Great what hardware/driver and how do we get that code here so we can see
> it working? Is the hardware available e.g. can I get ahold of one?
>
> What is programmable on your devices? Is this 'just' the parser graph or
> are you slicing up tables and so on? Is it an FPGA, DPU architecture or a
> TCAM architecture? How do you reprogram the device? I somehow doubt it's
> through a piecemeal ndo. But let me know if I'm wrong; maybe my internal
> architecture details are dated. Fully speculating: is the interface a big
> FW thunk to the device?
>
> Without any details it's difficult to get community feedback on how the
> hw programmable interface should work. The only reason I've even
> bothered with this thread is I want to see P4 working.
>
> Who owns the AMD side or some other vendor so we can get something that
> works across at least two vendors, which is our usual bar for adding hw
> offload things?
>
> Note if you just want a kernel SW pipeline we already have that, so
> I'm not seeing that as particularly motivating. Again, my point of view.
> P4 as a dataplane specification is great but I don't see the connection
> to this patchset without real hardware in a driver.

Here's what you can buy on the market that is native P4 (not that it
hasn't been mentioned from day 1 in the patch 0 references):
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando

I want to emphasize again these patches are about the P4 s/w pipeline
that is intended to work seamlessly with hw offload. If you are
interested in h/w offload and want to contribute just show up at the
meetings - they are open to all. The current offloadable piece is the
match-action tables. The P4 specs may change to include parsers in the
future or other objects etc (but not sure why we should discuss this
in the thread).

cheers,
jamal
Martin KaFai Lau March 1, 2024, 7:02 a.m. UTC | #10
On 2/28/24 9:11 AM, John Fastabend wrote:
>   - The kfuncs are mostly duplicates of map ops we already have in BPF API.
>     The motivation by my read is to use netlink instead of bpf commands. I

I also have a similar thought on the kfuncs (create/update/delete), which are mostly
bpf map ops. It could have one single kfunc to allocate a kernel-specific p4
entry/object and then store that in a bpf map. With the bpf_rbtree, bpf_list,
and other recent advancements, it should be possible to describe them in a bpf map.
The reply in v9 was that the p4 table will also be used in the future HW
piece/driver, but the HW piece is not ready yet; bpf is the only consumer of the
kernel p4 table now, and this makes mimicking the bpf map api with kfuncs not
convincing. A bpf "tc / xdp" program uses netlink to attach/detach and the policy
also stays in the bpf map.
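
To make that concrete, here is a minimal sketch (the map layout, the names and
the little parser stub are purely illustrative, not taken from the patchset) of
keeping a P4-style match-action table in a plain bpf hash map from a tc program:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  struct mytable_key {
          __u32 dst_addr;         /* e.g. the "dstAddr" key from the examples */
  };

  struct mytable_act {
          __u32 act_id;           /* e.g. send_to_port */
          __u32 port_ifindex;     /* action parameter */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, struct mytable_key);
          __type(value, struct mytable_act);
  } mytable SEC(".maps");

  SEC("tc")
  int p4_main(struct __sk_buff *skb)
  {
          struct mytable_key key = {};
          struct mytable_act *act;

          /* a generated parser would fill in key.dst_addr from the packet
           * headers; elided here
           */
          act = bpf_map_lookup_elem(&mytable, &key);
          if (!act)
                  return TC_ACT_OK;       /* table miss -> default action */

          return bpf_redirect(act->port_ifindex, 0);
  }

  char _license[] SEC("license") = "GPL";

Entries would then be created/updated/deleted through the ordinary bpf map
interfaces rather than a parallel netlink path.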

When there is a HW piece that consumes the p4 table, that will be a better time 
to discuss the kfunc interface.

>     don't agree with this, optimizing for some low level debug a developer
>     uses is the wrong design space. Actual users should not be deploying
>     this via ssh into boxes. The workflow will not scale and really we need
>     tooling and infra to land P4 programs across the network. This is orders
>     of magnitude more pain if it's an endpoint solution and not a middlebox/switch
>     solution. As a switch solution I don't see how p4tc sw scales to even TOR
>     packet rates. So you need tooling on top, and users interact with the
>     tooling, not the Linux widget/debugger at the bottom.
Jamal Hadi Salim March 1, 2024, 12:36 p.m. UTC | #11
On Fri, Mar 1, 2024 at 2:02 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/28/24 9:11 AM, John Fastabend wrote:
> >   - The kfuncs are mostly duplicates of map ops we already have in BPF API.
> >     The motivation by my read is to use netlink instead of bpf commands. I
>
> I also have a similar thought on the kfuncs (create/update/delete), which are mostly
> bpf map ops. It could have one single kfunc to allocate a kernel-specific p4
> entry/object and then store that in a bpf map. With the bpf_rbtree, bpf_list,
> and other recent advancements, it should be possible to describe them in a bpf map.
> The reply in v9 was that the p4 table will also be used in the future HW
> piece/driver, but the HW piece is not ready yet; bpf is the only consumer of the
> kernel p4 table now, and this makes mimicking the bpf map api with kfuncs not
> convincing. A bpf "tc / xdp" program uses netlink to attach/detach and the policy
> also stays in the bpf map.
>

It's a lot more complex than just attaching/detaching. Our control
plane uses netlink (regardless of whether it is offloaded or not) for
all object controls (not just table entries) for the many reasons that
have been stated in the cover letters since the beginning. I
unfortunately took out some of that text after v10 to try to shorten
the cover letter. I will be adding it back. If you can't find it I could
cut-and-paste it and send it privately.

cheers,
jamal

> When there is a HW piece that consumes the p4 table, that will be a better time
> to discuss the kfunc interface.
>
> >     don't agree with this, optimizing for some low level debug a developer
> >     uses is the wrong design space. Actual users should not be deploying
> >     this via ssh into boxes. The workflow will not scale and really we need
> >     tooling and infra to land P4 programs across the network. This is orders
> >     of magnitude more pain if it's an endpoint solution and not a middlebox/switch
> >     solution. As a switch solution I don't see how p4tc sw scales to even TOR
> >     packet rates. So you need tooling on top, and users interact with the
> >     tooling, not the Linux widget/debugger at the bottom.
>
Jakub Kicinski March 1, 2024, 5 p.m. UTC | #12
On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote:
> > I want to emphasize again these patches are about the P4 s/w pipeline
> > that is intended to work seamlessly with hw offload. If you are
> > interested in h/w offload and want to contribute just show up at the
> > meetings - they are open to all. The current offloadable piece is the
> > match-action tables. The P4 specs may change to include parsers in the
> > future or other objects etc (but not sure why we should discuss this
> > in the thread).
> 
> Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> target? How does going through TC make this seamless?

+1

My intuition is that for offload the device would be programmed at
start-of-day / probe. By loading the compiled P4 from /lib/firmware.
Then the _device_ tells the kernel what tables and parser graph it's
got.

Plus, if we're talking about offloads, aren't we getting back into
the same controversies we had when merging OvS (not that I was around).
The "standalone stack to the side" problem. Some of the tables in the
pipeline may be for routing, not ACLs. Should they be fed from the
routing stack? How is that integration going to work? The parsing
graph feels a bit like global device configuration, not a piece of
functionality that should sit under sub-sub-system in the corner.
Jamal Hadi Salim March 1, 2024, 5:39 p.m. UTC | #13
On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote:
> > > I want to emphasize again these patches are about the P4 s/w pipeline
> > > that is intended to work seamlessly with hw offload. If you are
> > > interested in h/w offload and want to contribute just show up at the
> > > meetings - they are open to all. The current offloadable piece is the
> > > match-action tables. The P4 specs may change to include parsers in the
> > > future or other objects etc (but not sure why we should discuss this
> > > in the thread).
> >
> > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > target? How does going through TC make this seamless?
>
> +1
>

I should clarify what I meant by "seamless". It means the same control
API is used for s/w or h/w. This is a feature of tc, and is not being
introduced by P4TC. P4 control only deals with Match-action tables -
just as TC does.

> My intuition is that for offload the device would be programmed at
> start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> Then the _device_ tells the kernel what tables and parser graph it's
> got.
>

BTW: I just want to say that these patches are about s/w - not
offload. Someone asked about offload so as in normal discussions we
steered in that direction. The hardware piece will require additional
patchsets which still require discussion. I hope we don't steer off
too much, otherwise I can start a new thread just to discuss the current
view of the h/w.

It's not the device telling the kernel what it has. It's the other way around.
From the P4 program you generate the s/w (the ebpf code and other
auxiliary stuff) and h/w pieces using a compiler.
You compile ebpf, etc, then load.

The current point of discussion is that the hw binary is to be "activated"
through the same tc filter that does the s/w. So one could say:

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "simple_l3.o" ... \
   action bpf obj $PARSER.o section p4tc/parser \
   action bpf obj $PROGNAME.o section p4tc/main

And that would, through tc driver callbacks, signal the driver to
find the binary, possibly via /lib/firmware.
Some of the original discussion was to use devlink for loading the
binary - but that went nowhere.
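
Just to illustrate the mechanics (the callback name and the flash helper below
are hypothetical; only request_firmware()/release_firmware() are existing
kernel APIs), the driver side of "find the binary via /lib/firmware" could look
roughly like:

  #include <linux/firmware.h>
  #include <linux/netdevice.h>

  /* Device-specific transport of the blob to the NIC; stubbed for this sketch */
  static int mydrv_flash_pipeline(struct net_device *dev, const void *data,
                                  size_t size)
  {
          return 0;
  }

  /* Hypothetical driver hook invoked when the p4 filter with
   * "prog type hw filename ..." is installed; name and signature are
   * illustrative only.
   */
  static int mydrv_p4_pipeline_load(struct net_device *dev, const char *fname)
  {
          const struct firmware *fw;
          int err;

          /* resolves the blob under /lib/firmware via the firmware loader */
          err = request_firmware(&fw, fname, &dev->dev);
          if (err)
                  return err;

          err = mydrv_flash_pipeline(dev, fw->data, fw->size);

          release_firmware(fw);
          return err;
  }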

Once you have this in place, it is then just netlink with tc skip_sw/hw. This is
what I meant by "seamless".
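
For example (illustrative only; exactly where the skip_sw/skip_hw keywords end
up for the p4 filter and for table entries has not been settled in this
thread), the same commands would simply grow the usual tc flags:

  tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 skip_sw \
     prog type hw filename "simple_l3.o" ...

  tc p4ctrl create simple_l3/table/mytable dstAddr 10.0.1.2/32 \
     action send_to_port param port eno1 skip_sw

i.e. the operator keeps the one netlink/tc control path and only chooses where
a given object (or table entry) gets instantiated.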

> Plus, if we're talking about offloads, aren't we getting back into
> the same controversies we had when merging OvS (not that I was around).
> The "standalone stack to the side" problem. Some of the tables in the
> pipeline may be for routing, not ACLs. Should they be fed from the
> routing stack? How is that integration going to work? The parsing
> graph feels a bit like global device configuration, not a piece of
> functionality that should sit under sub-sub-system in the corner.

The current (maybe I should say initial) thought is that the P4 program
does not touch the existing kernel infra such as fdb etc.
Of course we can model the kernel datapath using P4 but you won't be
using "ip route add..." or "bridge fdb...".
In the future, a P4 extern could be used to model existing infra and we
should be able to use the same tooling. That is a discussion that
comes up on and off (I think it did in the last meeting).

cheers,
jamal
Chris Sommers March 1, 2024, 6:53 p.m. UTC | #14
>From: Paolo Abeni <pabeni@redhat.com>
>Sent: Thursday, February 29, 2024 9:14 AM
>To: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org
>Cc: deb.chatterjee@intel.com; anjali.singhai@intel.com; namrata.limaye@intel.com; tom@sipanda.io; mleitner@redhat.com; Mahesh.Shirshyad@amd.com; Vipin.Jain@amd.com; tomasz.osinski@intel.com; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; vladbu@nvidia.com; horms@kernel.org; khalidm@nvidia.com; toke@redhat.com; daniel@iogearbox.net; victor@mojatatu.com; pctammela@mojatatu.com; dan.daly@intel.com; andy.fingerhut@gmail.com; Chris Sommers <chris.sommers@keysight.com>; mattyk@nvidia.com; bpf@vger.kernel.org
>Subject: Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
>
>On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
>> [cover letter quoted in full; snipped]
>
>I think/fear that this series has a "quorum" problem: different voices
>raise opposition, and nobody (?) outside the authors has supported the
>code and the feature.
>
>Could the missing H/W offload support in the current form be the root
>cause for such lack of support? Or are there parties interested that
>have been quiet so far?
>
>Thanks,
>
>Paolo
>

Hi Paolo, thanks. I am one of those "parties interested that have been quiet so far."

I wanted to voice my staunch support for accepting P4TC into the kernel. None of the present objections in the various threads reduce my enthusiasm. I find the following aspects most compelling:

- Performant, highly functional, pure-SW P4 dataplane

- Near-ubiquitous availability on all platforms, once it's upstreamed. Saves having to install a bunch of other p4 ecosystem tools, lowers the barrier to entry, and increases the likelihood an application can run on any platform.

- larger dev community. Anything added to the Linux kernel benefits from a large, thriving community, vast and rigorous regression testing, long-term support, etc.

- well-conceived CRUDX northbound API and clever use of existing well-understood netlink, easy to overlay other northbound APIs such as TDI (Table driven interface) used in IPDK; P4Runtime gRPC API; etc.

- integration with popular and well-understood tc provides a good impedance match for users.

- extensibility, ability to add externs, and interface to eBPF. The ability to add externs is especially compelling. It is not easy to do so in current backends such as bmv2, P4DPDK or p4-ebpf. 

- roadmap to hardware offload for even greater performance. Even _without_ offload, the above benefits justify it in my mind. There are many applications for a pure-SW P4 dataplane, both in userland like P4DPDK, and the proposed P4TC - running as part of the kernel is _exciting_. Vendors have already voiced their support for offload and this initial set of patches paves the way and lets the community benefit from it and start to make it better, now.

It is possible the detractors of P4TC are not active P4 users, so I hope to provide a bit of perspective. Besides the pioneering switch ASIC (Tofino) use-cases which provided the initial impetus, P4 is used extensively in at least two commercial IPUs/DPUs. In addition, there are multiple toolchains to run P4 code on FPGAs. The dream is to write P4 code which can be run in a scalable fashion on a range of targets. It shouldn’t be necessary to “prove” P4 is worthy; those who’ve already embraced it know this.

There are several use-cases for a SW implementation of a P4 dataplane, including behavioral modeling and production uses. P4 allows one to write core functionality which can run on multiple platforms: pure SW, FPGAs, offload NICs/DPUs/IPUs, switch ASICs.

Behavioral modeling of a pipeline using P4:

- The SONiC-DASH project (https://github.com/sonic-net/DASH) is a thriving, multi-vendor collaboration which specifies advanced, high-performance features to accelerate datacenter services. These overlay services are specified using a P4 program which allows all concerned to agree on the packet pipeline and even the control-plane APIs (using SAI, the Switch Abstraction Interface). The actual implementation on a vendor's offload device (DPU/IPU) may or may not use any of the reference P4 code, but that is not important. What is important is that we specify the dataplane in P4, and execute it on the bmv2 backend in a container. We run conformance and regression suites with standard test vectors, which can also be run against actual production implementations to verify compliance. The bmv2 backend has many limitations, including performance and difficulty to extend its functionality. As a major contributor to this project, I am helping to explore alternatives.

- Large-scale cloud-service providers use P4 extensively as a dataplane (fabric switch) modeling language. One of the driving use-cases in the P4-API working group (I’m a co-chair) is to control SDN switches using P4-Runtime. The switches’ pipelines are modeled in P4 by some users, similar to the DASH use-case. Having a performant, pure-SW implementation is invaluable for modeling and simulation.

Running P4 code in pure SW for production use-cases (not just modeling):

There are many use-cases for running a custom dataplane written in P4. The productivity of P4 code cannot be overstated. With the right framework, P4 apps can be developed (and controlled/managed) in literally hours. It is much more productive than writing, say, C or eBPF. I can do all three, and P4 is way more productive for certain applications.

In conclusion, I hope we can upstream P4TC soon. Please move this forward with all due speed. Thanks!

Chris Sommers
Keysight Technologies
Jakub Kicinski March 2, 2024, 1:32 a.m. UTC | #15
On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > target? How does going through TC make this seamless?  
> >
> > +1
> 
> I should clarify what I meant by "seamless". It means the same control
> API is used for s/w or h/w. This is a feature of tc, and is not being
> introduced by P4TC. P4 control only deals with Match-action tables -
> just as TC does.

Right, and the compiled P4 pipeline is tacked onto that API.
Loading that presumably implies a pipeline reset. There's 
no precedent for loading things into TC resulting in a device
datapath reset.

> > My intuition is that for offload the device would be programmed at
> > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > Then the _device_ tells the kernel what tables and parser graph it's
> > got.
> 
> BTW: I just want to say that these patches are about s/w - not
> offload. Someone asked about offload so as in normal discussions we
> steered in that direction. The hardware piece will require additional
> patchsets which still require discussion. I hope we don't steer off
> too much, otherwise I can start a new thread just to discuss the current
> view of the h/w.
> 
> It's not the device telling the kernel what it has. It's the other way around.

Yes, I'm describing how I'd have designed it :) If it was the same
as what you've already implemented - why would I be typing it into
an email.. ? :)

> From the P4 program you generate the s/w (the ebpf code and other
> auxiliary stuff) and h/w pieces using a compiler.
> You compile ebpf, etc, then load.

That part is fine.

> The current point of discussion is that the hw binary is to be "activated"
> through the same tc filter that does the s/w. So one could say:
> 
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
>    prog type hw filename "simple_l3.o" ... \
>    action bpf obj $PARSER.o section p4tc/parser \
>    action bpf obj $PROGNAME.o section p4tc/main
> 
> And that would through tc driver callbacks signal to the driver to
> find the binary possibly via  /lib/firmware
> Some of the original discussion was to use devlink for loading the
> binary - but that went nowhere.

Back to the device reset, unless the load has no impact on inflight
traffic the loading doesn't belong in TC, IMO. Plus you're going to
run into (what IIRC was Jiri's complaint) that you're loading arbitrary
binary blobs, opaque to the kernel.

> Once you have this in place then netlink with tc skip_sw/hw. This is
> what i meant by "seamless"
> 
> > Plus, if we're talking about offloads, aren't we getting back into
> > the same controversies we had when merging OvS (not that I was around).
> > The "standalone stack to the side" problem. Some of the tables in the
> > pipeline may be for routing, not ACLs. Should they be fed from the
> > routing stack? How is that integration going to work? The parsing
> > graph feels a bit like global device configuration, not a piece of
> > functionality that should sit under sub-sub-system in the corner.  
> 
> The current (maybe i should say initial) thought is the P4 program
> does not touch the existing kernel infra such as fdb etc.

It's an off-to-the-side thing, ignoring the fact that *all* networking
devices already have parsers which would benefit from being accurately
described.

> Of course we can model the kernel datapath using P4 but you wont be
> using "ip route add..." or "bridge fdb...".
> In the future, P4 extern could be used to model existing infra and we
> should be able to use the same tooling. That is a discussion that
> comes on/off (i think it did in the last meeting).

Maybe, IDK. I thought prevailing wisdom, at least for offloads,
is to offload the existing networking stack, and fill in the gaps.
Not build a completely new implementation from scratch, and "integrate
later". Or at least "fill in the gaps" is how I like to think.

I can't quite fit together in my head how this is okay, but OvS
was not allowed to add their offload API. And what's supposed to
be part of TC and what isn't, where you only expect to have one 
filter here, and create a whole new object universe inside TC.

But that's just my opinion. The way things work, we may wake up one
day and find out that Dave has applied this :)
Tom Herbert March 2, 2024, 2:20 a.m. UTC | #16
On Fri, Mar 1, 2024 at 5:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > target? How does going through TC make this seamless?
> > >
> > > +1
> >
> > I should clarify what i meant by "seamless". It means the same control
> > API is used for s/w or h/w. This is a feature of tc, and is not being
> > introduced by P4TC. P4 control only deals with Match-action tables -
> > just as TC does.
>
> Right, and the compiled P4 pipeline is tacked onto that API.
> Loading that presumably implies a pipeline reset. There's
> no precedent for loading things into TC resulting a device
> datapath reset.
>
> > > My intuition is that for offload the device would be programmed at
> > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > Then the _device_ tells the kernel what tables and parser graph it's
> > > got.
> >
> > BTW: I just want to say that these patches are about s/w - not
> > offload. Someone asked about offload so as in normal discussions we
> > steered in that direction. The hardware piece will require additional
> > patchsets which still require discussions. I hope we dont steer off
> > too much, otherwise i can start a new thread just to discuss current
> > view of the h/w.
> >
> > Its not the device telling the kernel what it has. Its the other way around.
>
> Yes, I'm describing how I'd have designed it :) If it was the same
> as what you've already implemented - why would I be typing it into
> an email.. ? :)
>
> > From the P4 program you generate the s/w (the ebpf code and other
> > auxillary stuff) and h/w pieces using a compiler.
> > You compile ebpf, etc, then load.
>
> That part is fine.
>
> > The current point of discussion is the hw binary is to be "activated"
> > through the same tc filter that does the s/w. So one could say:
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > \
> >    prog type hw filename "simple_l3.o" ... \
> >    action bpf obj $PARSER.o section p4tc/parser \
> >    action bpf obj $PROGNAME.o section p4tc/main
> >
> > And that would through tc driver callbacks signal to the driver to
> > find the binary possibly via  /lib/firmware
> > Some of the original discussion was to use devlink for loading the
> > binary - but that went nowhere.
>
> Back to the device reset, unless the load has no impact on inflight
> traffic the loading doesn't belong in TC, IMO. Plus you're going to
> run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> binary blobs, opaque to the kernel.
>
> > Once you have this in place then netlink with tc skip_sw/hw. This is
> > what i meant by "seamless"
> >
> > > Plus, if we're talking about offloads, aren't we getting back into
> > > the same controversies we had when merging OvS (not that I was around).
> > > The "standalone stack to the side" problem. Some of the tables in the
> > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > routing stack? How is that integration going to work? The parsing
> > > graph feels a bit like global device configuration, not a piece of
> > > functionality that should sit under sub-sub-system in the corner.
> >
> > The current (maybe i should say initial) thought is the P4 program
> > does not touch the existing kernel infra such as fdb etc.
>
> It's off to the side thing. Ignoring the fact that *all*, networking
> devices already have parsers which would benefit from being accurately
> described.

Jakub,

This is configurability versus programmability. The table driven
approach as input (configurability) might work fine for generic
match-action tables up to the point that tables are expressive enough
to satisfy the requirements. But parsing doesn't fall into the table
driven paradigm: parsers want to be *programmed*. This is why we
removed kParser from this patch set and fell back to eBPF for parsing.
But the problem we quickly hit is that eBPF is not offloadable to network
devices; for example, when we compile P4 into an eBPF parser we've lost
the declarative representation that parsers in the devices could
consume (they're not CPUs running eBPF).

I think the key here is what we mean by kernel offload. When we do
kernel offload, is it the kernel implementation or the kernel
functionality that's being offloaded? If it's the latter then we have
a lot more flexibility. What we'd need is a safe and secure way to
synchronize with that offload device that precisely supports the
kernel functionality we'd like to offload. This can be done if both
the kernel bits and programmed offload are derived from the same
source (i.e. tag source code with a sha-1). For example, if someone
writes a parser in P4, we can compile that into both eBPF and a P4
backend using independent tool chains and program download. At
runtime, the kernel can safely offload the functionality of the eBPF
parser to the device if its hash matches the one reported by the
device.
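
To make that concrete, at its simplest the runtime check is just a
comparison of two digests. A minimal userspace-style sketch (the two
hash getters below are hypothetical placeholders for the
compiler-embedded hash and the driver-reported hash):

/* Sketch only: both "get hash" helpers are made-up placeholders for
 * (a) the sha-1 the compiler tagged into the eBPF parser object and
 * (b) the sha-1 the device/driver reports for its loaded parser.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SHA1_LEN 20

static void get_kernel_parser_hash(unsigned char out[SHA1_LEN])
{
        memset(out, 0xab, SHA1_LEN);    /* placeholder digest */
}

static void get_device_parser_hash(unsigned char out[SHA1_LEN])
{
        memset(out, 0xab, SHA1_LEN);    /* placeholder digest */
}

static bool parser_offloadable(void)
{
        unsigned char sw[SHA1_LEN], hw[SHA1_LEN];

        get_kernel_parser_hash(sw);
        get_device_parser_hash(hw);
        /* offload only if both were derived from the same source */
        return memcmp(sw, hw, SHA1_LEN) == 0;
}

int main(void)
{
        printf("parser offload: %s\n", parser_offloadable() ? "ok" : "no");
        return 0;
}

If the digests differ, the kernel simply keeps running the eBPF parser
in s/w.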

Tom

>
> > Of course we can model the kernel datapath using P4 but you wont be
> > using "ip route add..." or "bridge fdb...".
> > In the future, P4 extern could be used to model existing infra and we
> > should be able to use the same tooling. That is a discussion that
> > comes on/off (i think it did in the last meeting).
>
> Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> is to offload the existing networking stack, and fill in the gaps.
> Not build a completely new implementation from scratch, and "integrate
> later". Or at least "fill in the gaps" is how I like to think.
>
> I can't quite fit together in my head how this is okay, but OvS
> was not allowed to add their offload API. And what's supposed to
> be part of TC and what isn't, where you only expect to have one
> filter here, and create a whole new object universe inside TC.
>
> But that's just my opinions. The way things work we may wake up one
> day and find out that Dave has applied this :)
Jamal Hadi Salim March 2, 2024, 2:59 a.m. UTC | #17
On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > target? How does going through TC make this seamless?
> > >
> > > +1
> >
> > I should clarify what i meant by "seamless". It means the same control
> > API is used for s/w or h/w. This is a feature of tc, and is not being
> > introduced by P4TC. P4 control only deals with Match-action tables -
> > just as TC does.
>
> Right, and the compiled P4 pipeline is tacked onto that API.
> Loading that presumably implies a pipeline reset. There's
> no precedent for loading things into TC resulting a device
> datapath reset.

I've changed the subject to reflect that this discussion is about h/w
offload so we don't drift too much from the intent of the patches.

AFAIK, all these devices have some HA built in to do program
replacement, i.e., AFAIK, no device reset.
I believe the Tofino switch in the earlier generations may have needed
resets which caused a few packet drops during a live environment update.
Granted, there may be devices (none that I am aware of) that may not be
able to do HA. All this needs to be considered for offloads.

> > > My intuition is that for offload the device would be programmed at
> > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > Then the _device_ tells the kernel what tables and parser graph it's
> > > got.
> >
> > BTW: I just want to say that these patches are about s/w - not
> > offload. Someone asked about offload so as in normal discussions we
> > steered in that direction. The hardware piece will require additional
> > patchsets which still require discussions. I hope we dont steer off
> > too much, otherwise i can start a new thread just to discuss current
> > view of the h/w.
> >
> > Its not the device telling the kernel what it has. Its the other way around.
>
> Yes, I'm describing how I'd have designed it :) If it was the same
> as what you've already implemented - why would I be typing it into
> an email.. ? :)
>

I think I misunderstood you and thought I needed to provide context.
The P4 pipelines are meant to be re-programmable multiple
times in a live environment. IOW, I should be able to delete/create a
pipeline while another is running. Some hardware may require that the
parser is shared, etc., but you can certainly replace the match-action
tables or add entirely new logic. In any case this is all still
under discussion and can be further refined.

> > From the P4 program you generate the s/w (the ebpf code and other
> > auxillary stuff) and h/w pieces using a compiler.
> > You compile ebpf, etc, then load.
>
> That part is fine.
>
> > The current point of discussion is the hw binary is to be "activated"
> > through the same tc filter that does the s/w. So one could say:
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > \
> >    prog type hw filename "simple_l3.o" ... \
> >    action bpf obj $PARSER.o section p4tc/parser \
> >    action bpf obj $PROGNAME.o section p4tc/main
> >
> > And that would through tc driver callbacks signal to the driver to
> > find the binary possibly via  /lib/firmware
> > Some of the original discussion was to use devlink for loading the
> > binary - but that went nowhere.
>
> Back to the device reset, unless the load has no impact on inflight
> traffic the loading doesn't belong in TC, IMO. Plus you're going to
> run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> binary blobs, opaque to the kernel.
>

And you said at that time that binary blobs are already a way of life.
Let's take DDP as a use case: they load the firmware (via ethtool)
and we were recently discussing whether they should use flower or u32,
etc. I would say this is in the same spirit. Doing ethtool may be a
bit disconnected, but that is up for discussion as well.
In some of the discussions there has been concern that we need some
authentication. Is that what you mean?

> > Once you have this in place then netlink with tc skip_sw/hw. This is
> > what i meant by "seamless"
> >
> > > Plus, if we're talking about offloads, aren't we getting back into
> > > the same controversies we had when merging OvS (not that I was around).
> > > The "standalone stack to the side" problem. Some of the tables in the
> > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > routing stack? How is that integration going to work? The parsing
> > > graph feels a bit like global device configuration, not a piece of
> > > functionality that should sit under sub-sub-system in the corner.
> >
> > The current (maybe i should say initial) thought is the P4 program
> > does not touch the existing kernel infra such as fdb etc.
>
> It's off to the side thing. Ignoring the fact that *all*, networking
> devices already have parsers which would benefit from being accurately
> described.
>

I am not following this point.

> > Of course we can model the kernel datapath using P4 but you wont be
> > using "ip route add..." or "bridge fdb...".
> > In the future, P4 extern could be used to model existing infra and we
> > should be able to use the same tooling. That is a discussion that
> > comes on/off (i think it did in the last meeting).
>
> Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> is to offload the existing networking stack, and fill in the gaps.
> Not build a completely new implementation from scratch, and "integrate
> later". Or at least "fill in the gaps" is how I like to think.
>
> I can't quite fit together in my head how this is okay, but OvS
> was not allowed to add their offload API. And what's supposed to
> be part of TC and what isn't, where you only expect to have one
> filter here, and create a whole new object universe inside TC.
>

I was there.
OvS matched what tc already had functionally, 10 years after tc
existed, and they were busy rewriting what tc offered. So naturally we
pushed for them to use what TC had. You still need to write whatever
extensions are needed into the kernel, etc., in order to support what the
hardware can offer.

I hope I am not stating the obvious: P4 provides a more malleable
approach. Assume a blank template in h/w and s/w where you specify
what you need, and then both the s/w and hardware support it. Flower is
analogous to a "fixed pipeline", meaning you can only extend flower by
changing the kernel and datapath. Often it does not cover all
potential h/w match-action engines, and we often see patches to do one
more thing, requiring more kernel changes. If you replace flower with
P4 you remove the need to update the kernel, user space, etc. for the
same features that flower needs to be extended for today. You just
tell the compiler what you need (within hardware capacity, of course).
So I don't see P4 as "offload the existing kernel infra aka flower" but
rather as removing the limitations that flower constrains us with today. As
far as other kernel infra (fdb etc.) goes, that can be added as I stated -
it is just not a starting point.

cheers,
jamal


> But that's just my opinions. The way things work we may wake up one
> day and find out that Dave has applied this :)
Jamal Hadi Salim March 2, 2024, 2:36 p.m. UTC | #18
On Fri, Mar 1, 2024 at 9:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > > target? How does going through TC make this seamless?
> > > >
> > > > +1
> > >
> > > I should clarify what i meant by "seamless". It means the same control
> > > API is used for s/w or h/w. This is a feature of tc, and is not being
> > > introduced by P4TC. P4 control only deals with Match-action tables -
> > > just as TC does.
> >
> > Right, and the compiled P4 pipeline is tacked onto that API.
> > Loading that presumably implies a pipeline reset. There's
> > no precedent for loading things into TC resulting a device
> > datapath reset.
>
> Ive changed the subject to reflect this discussion is about h/w
> offload so we dont drift too much from the intent of the patches.
>
> AFAIK, all these devices have some HA built in to do program
> replacement. i.e. afaik, no device reset.
> I believe the tofino switch in the earlier generations may have needed
> resets which caused a few packet drops in a live environment update.
> Granted there may be devices (not that i am aware) that may not be
> able to do HA. All this needs to be considered for offloads.
>
> > > > My intuition is that for offload the device would be programmed at
> > > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > > Then the _device_ tells the kernel what tables and parser graph it's
> > > > got.
> > >
> > > BTW: I just want to say that these patches are about s/w - not
> > > offload. Someone asked about offload so as in normal discussions we
> > > steered in that direction. The hardware piece will require additional
> > > patchsets which still require discussions. I hope we dont steer off
> > > too much, otherwise i can start a new thread just to discuss current
> > > view of the h/w.
> > >
> > > Its not the device telling the kernel what it has. Its the other way around.
> >
> > Yes, I'm describing how I'd have designed it :) If it was the same
> > as what you've already implemented - why would I be typing it into
> > an email.. ? :)
> >
>
> I think i misunderstood you and thought I needed to provide context.
> The P4 pipelines are meant to be able to be re-programmed multiple
> times in a live environment. IOW, I should be able to delete/create a
> pipeline while another is running. Some hardware may require that the
> parser is shared etc, but you can certainly replace the match action
> tables or add an entirely new logic. In any case this is all still
> under discussion and can be further refined.
>
> > > From the P4 program you generate the s/w (the ebpf code and other
> > > auxillary stuff) and h/w pieces using a compiler.
> > > You compile ebpf, etc, then load.
> >
> > That part is fine.
> >
> > > The current point of discussion is the hw binary is to be "activated"
> > > through the same tc filter that does the s/w. So one could say:
> > >
> > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > > \
> > >    prog type hw filename "simple_l3.o" ... \
> > >    action bpf obj $PARSER.o section p4tc/parser \
> > >    action bpf obj $PROGNAME.o section p4tc/main
> > >
> > > And that would through tc driver callbacks signal to the driver to
> > > find the binary possibly via  /lib/firmware
> > > Some of the original discussion was to use devlink for loading the
> > > binary - but that went nowhere.
> >
> > Back to the device reset, unless the load has no impact on inflight
> > traffic the loading doesn't belong in TC, IMO. Plus you're going to
> > run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> > binary blobs, opaque to the kernel.
> >
>
> And you said at that time binary blobs are already a way of life.
> Let's take DDP as a use case:  They load the firmware (via ethtool)
> and we were recently discussing whether they should use flower or u32
> etc.  I would say this is in the same spirit. Doing ethtool may be a
> bit disconnected. But that is up for discussion as well.
> There has been concern that we need to have some authentication in
> some of the discussions. Is that what you mean?
>
> > > Once you have this in place then netlink with tc skip_sw/hw. This is
> > > what i meant by "seamless"
> > >
> > > > Plus, if we're talking about offloads, aren't we getting back into
> > > > the same controversies we had when merging OvS (not that I was around).
> > > > The "standalone stack to the side" problem. Some of the tables in the
> > > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > > routing stack? How is that integration going to work? The parsing
> > > > graph feels a bit like global device configuration, not a piece of
> > > > functionality that should sit under sub-sub-system in the corner.
> > >
> > > The current (maybe i should say initial) thought is the P4 program
> > > does not touch the existing kernel infra such as fdb etc.
> >
> > It's off to the side thing. Ignoring the fact that *all*, networking
> > devices already have parsers which would benefit from being accurately
> > described.
> >
>
> I am not following this point.
>
> > > Of course we can model the kernel datapath using P4 but you wont be
> > > using "ip route add..." or "bridge fdb...".
> > > In the future, P4 extern could be used to model existing infra and we
> > > should be able to use the same tooling. That is a discussion that
> > > comes on/off (i think it did in the last meeting).
> >
> > Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> > is to offload the existing networking stack, and fill in the gaps.
> > Not build a completely new implementation from scratch, and "integrate
> > later". Or at least "fill in the gaps" is how I like to think.
> >
> > I can't quite fit together in my head how this is okay, but OvS
> > was not allowed to add their offload API. And what's supposed to
> > be part of TC and what isn't, where you only expect to have one
> > filter here, and create a whole new object universe inside TC.
> >
>
> I was there.
> Ovs matched what tc already had functionally, 10 years after tc
> existed, and they were busy rewriting what tc offered. So naturally we
> pushed for them to use what TC had. You still need to write whatever
> extensions needed into the kernel etc in order to support what the
> hardware can offer.
>
> I hope i am not stating the obvious: P4 provides a more malleable
> approach. Assume a blank template in h/w and s/w and where you specify
> what you need then both the s/w and hardware support it. Flower is
> analogous to a "fixed pipeline" meaning you can extend flower by
> changing the kernel and datapath. Often it is not covering all
> potential hw match actions engines and often we see patches to do one
> more thing requiring more kernel changes.  If you replace flower with
> P4 you remove the need to update the kernel, user space etc for the
> same features that flower needs to be extended for today. You just
> tell the compiler what you need (within hardware capacity of course).
> So i dont see P4 as "offload the existing kernel infra aka flower" but
> rather remove the limitations that flower constrains us with today. As
> far as other kernel infra (fdb etc), that can be added as i stated -
> it is just not a starting point.
>

Sorry, after getting some coffee I believe I mumbled too much in my
previous email. Let me summarize your points and reduce the mumbling:
1) Your point on: triggering the pipeline re/programming via the filter
would require a reset of the device in a live environment.
AFAIK, the "P4 native" devices that I know of do allow multiple
programs and have operational schemes to allow updates without resets.
I will gather more info and post it after one of our meetings.
Having said that, we really have not paid much attention to this
detail, so it is a valid concern that needs to be ironed out.
It is even more imperative if we want to support a device that is not
"P4 native", or one that requires a reset whether it is P4 native or
not; in that case what you referred to as "programmed at start-of-day /
probe" is a valid concern.

2) Your point on: "integrate later", or at least "fill in the gaps".
This part I am probably going to mumble on. I am going to consider
more than just doing ACLs/MAT via flower/u32 for the sake of
discussion.
True, "fill the gaps" has been our model so far. It requires kernel
changes, user space code changes, etc., justifiably so, because most of
the time such datapaths are subject to standardization via IETF, IEEE,
etc., and new extensions come in on a regular basis. And sometimes we
do add features that one or two users or a single vendor has need for,
at the cost of kernel and user/control extensions. Given our work
process, any features added this way take a long time to make it to
the end user. At the risk of sounding controversial, I am going
to call things like fdb, fib, etc., which have fixed datapaths in the
kernel, "legacy". These "legacy" datapaths almost all the time have
very strong user bases with strong infra tooling which took years to
get in shape. So they must be supported. I see two approaches:
-  you can leave those "legacy" ndo ops alone and not go via the tc
ndo ops used by P4TC.
-  or write a P4 program that looks _exactly_ like what current
bridging looks like, add helpers to allow existing tools to
continue to work via the tc ndo, and then phase out the "fixed datapath"
ndos. This will take a long, long time but it could be a goal.

There is another caveat: often different vendor hardware has slightly
different features which can't be exposed, because either they are very
specific to the vendor or it's just very hard to express them with existing
"legacy" infra without making intrusive changes. So we are going to be able
to allow these vendors/users to expose as much or as little as is
needed for a specific deployment without affecting anyone else with
new kernel/user code.

On the "integrate later" aspect: That is probably because most of the
times we want to avoid doing intrusive niche changes (which is
resolvable with the above).

cheers,
jamal
Jakub Kicinski March 3, 2024, 3:15 a.m. UTC | #19
On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> This is configurability versus programmability. The table driven
> approach as input (configurability) might work fine for generic
> match-action tables up to the point that tables are expressive enough
> to satisfy the requirements. But parsing doesn't fall into the table
> driven paradigm: parsers want to be *programmed*. This is why we
> removed kParser from this patch set and fell back to eBPF for parsing.
> But the problem we quickly hit that eBPF is not offloadable to network
> devices, for example when we compile P4 in an eBPF parser we've lost
> the declarative representation that parsers in the devices could
> consume (they're not CPUs running eBPF).
> 
> I think the key here is what we mean by kernel offload. When we do
> kernel offload, is it the kernel implementation or the kernel
> functionality that's being offloaded? If it's the latter then we have
> a lot more flexibility. What we'd need is a safe and secure way to
> synchronize with that offload device that precisely supports the
> kernel functionality we'd like to offload. This can be done if both
> the kernel bits and programmed offload are derived from the same
> source (i.e. tag source code with a sha-1). For example, if someone
> writes a parser in P4, we can compile that into both eBPF and a P4
> backend using independent tool chains and program download. At
> runtime, the kernel can safely offload the functionality of the eBPF
> parser to the device if it matches the hash to that reported by the
> device

Good points. If I understand you correctly you're saying that parsers
are more complex than just a basic parsing tree a la u32.
Then we can take this argument further. P4 has grown to encompass a lot
of the functionality of quite complex devices. How do we square that with
the kernel functionality offload model? If the entire device is modeled,
including f.e. TSO, an offload would mean that the user has to write
a TSO implementation which they then load into TC? That seems odd.

IOW I don't quite know how to square in my head the "total
functionality" with being a TC-based "plugin".
Jakub Kicinski March 3, 2024, 3:27 a.m. UTC | #20
On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> 2) Your point on:  "integrate later", or at least "fill in the gaps"
> This part i am probably going to mumble on. I am going to consider
> more than just doing ACLs/MAT via flower/u32 for the sake of
> discussion.
> True, "fill the gaps" has been our model so far. It requires kernel
> changes, user space code changes etc justifiably so because most of
> the time such datapaths are subject to standardization via IETF, IEEE,
> etc and new extensions come in on a regular basis.  And sometimes we
> do add features that one or two users or a single vendor has need for
> at the cost of kernel and user/control extension. Given our work
> process, any features added this way take a long time to make it to
> the end user.

What I had in mind was more of a DDP model. The device loads its binary
blob FW in whatever way it does, then it tells the kernel its parser
graph, and tables. The kernel exposes those tables to user space.
All dynamic, no need to change the kernel for each new protocol.

But that's different in two ways:
 1. the device tells kernel the tables, no "dynamic reprogramming"
 2. you don't need the SW side, the only use of the API is to interact
    with the device

User can still do BPF kfuncs to look up in the tables (like in FIB), 
but call them from cls_bpf.

I think in P4 terms that may be something more akin to only providing
the runtime API? I seem to recall they had some distinction...

> At the cost of this sounding controversial, i am going
> to call things like fdb, fib, etc which have fixed datapaths in the
> kernel "legacy". These "legacy" datapaths almost all the time have

The cynic in me sometimes thinks that the biggest problem with "legacy"
protocols is that it's hard to make money on them :)

> very strong user bases with strong infra tooling which took years to
> get in shape. So they must be supported. I see two approaches:
> -  you can leave those "legacy" ndo ops alone and not go via the tc
> ndo ops used by P4TC.
> -  or write a P4 program that looks _exactly_ like what current
> bridging looks like and add helpers to allow existing tools to
> continue to work via tc ndo and then phase out the "fixed datapath"
> ndos. This will take a long long time but it could be a goal.
> 
> There is another caveat: Often different vendor hardware has slightly
> different features which cant be exposed because either they are very
> specific to the vendor or it's just very hard to express with existing
> "legacy" without making intrusive changes. So we are going to be able
> to allow these vendors/users to expose as much or as little as is
> needed for a specific deployment without affecting anyone else with
> new kernel/user code.
> 
> On the "integrate later" aspect: That is probably because most of the
> times we want to avoid doing intrusive niche changes (which is
> resolvable with the above).
Tom Herbert March 3, 2024, 4:31 p.m. UTC | #21
On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > This is configurability versus programmability. The table driven
> > approach as input (configurability) might work fine for generic
> > match-action tables up to the point that tables are expressive enough
> > to satisfy the requirements. But parsing doesn't fall into the table
> > driven paradigm: parsers want to be *programmed*. This is why we
> > removed kParser from this patch set and fell back to eBPF for parsing.
> > But the problem we quickly hit that eBPF is not offloadable to network
> > devices, for example when we compile P4 in an eBPF parser we've lost
> > the declarative representation that parsers in the devices could
> > consume (they're not CPUs running eBPF).
> >
> > I think the key here is what we mean by kernel offload. When we do
> > kernel offload, is it the kernel implementation or the kernel
> > functionality that's being offloaded? If it's the latter then we have
> > a lot more flexibility. What we'd need is a safe and secure way to
> > synchronize with that offload device that precisely supports the
> > kernel functionality we'd like to offload. This can be done if both
> > the kernel bits and programmed offload are derived from the same
> > source (i.e. tag source code with a sha-1). For example, if someone
> > writes a parser in P4, we can compile that into both eBPF and a P4
> > backend using independent tool chains and program download. At
> > runtime, the kernel can safely offload the functionality of the eBPF
> > parser to the device if it matches the hash to that reported by the
> > device
>
> Good points. If I understand you correctly you're saying that parsers
> are more complex than just a basic parsing tree a'la u32.

Yes. Parsing things like TLVs, the GRE flag field, or nested protobufs
isn't conducive to u32. We also want the advantages of compiler
optimizations to unroll loops, squash nodes in the parse graph, etc.
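
Just to illustrate (nothing more than a sketch), even the simplest TLV
walk is a loop whose stride depends on data in the packet; that is the
part a pure match table can't express, but which a compiler can
unroll/bound for a given target:

/* Sketch of why TLV parsing is a loop, not a match table: walk a
 * buffer of (type, len, value...) records. Purely illustrative. */
#include <stddef.h>
#include <stdio.h>

struct tlv {
        unsigned char type;
        unsigned char len;      /* length of data[] in bytes */
        unsigned char data[];
};

static void parse_tlvs(const unsigned char *buf, size_t buflen)
{
        size_t off = 0;

        while (off + sizeof(struct tlv) <= buflen) {
                const struct tlv *t = (const struct tlv *)(buf + off);
                size_t rec = sizeof(struct tlv) + t->len;

                if (off + rec > buflen)
                        break;          /* truncated record */
                printf("type %u len %u\n", t->type, t->len);
                off += rec;             /* variable stride: data-dependent */
        }
}

int main(void)
{
        const unsigned char pkt[] = { 1, 2, 0xaa, 0xbb, 3, 1, 0xcc };

        parse_tlvs(pkt, sizeof(pkt));
        return 0;
}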

> Then we can take this argument further. P4 has grown to encompass a lot
> of functionality of quite complex devices. How do we square that with
> the kernel functionality offload model. If the entire device is modeled,
> including f.e. TSO, an offload would mean that the user has to write
> a TSO implementation which they then load into TC? That seems odd.
>
> IOW I don't quite know how to square in my head the "total
> functionality" with being a TC-based "plugin".

Hi Jakub,

I believe the solution is to replace kernel code with eBPF in cases
where we need programmability. This effectively means that we would
ship eBPF code as part of the kernel. So in the case of TSO, the
kernel would include a standard implementation in eBPF that could be
compiled into the kernel by default. The restricted C source code is
tagged with a hash, so if someone wants to offload TSO they could
compile the source into their target and retain the hash. At runtime
it's a matter of querying the driver to see if the device supports the
TSO program the kernel is running by comparing hash values. Scaling
this, a device could support a catalogue of programs: TSO, LRO,
parser, IPtables, etc. If the kernel can match the hash of its eBPF
code to one reported by the driver then it can assume the functionality is
offloadable. This is an elaboration of "device features", but instead
of the device telling us it thinks it supports an adequate GRO
implementation by reporting NETIF_F_GRO, the device would tell the
kernel that it not only supports GRO but provides functionality identical
to the kernel's GRO (which IMO is the first requirement of
kernel offload).
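
As a sketch of what that "catalogue" lookup could look like (the names
and values here are purely illustrative, not a proposed API):

/* Sketch: a per-function "catalogue" the driver could expose, keyed by
 * the hash of the kernel's eBPF implementation. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SHA1_LEN 20

struct offload_entry {
        const char      *func;          /* "tso", "gro", "parser", ... */
        unsigned char   hash[SHA1_LEN]; /* digest of the offloaded program */
};

/* placeholder: what the device reports it has loaded */
static const struct offload_entry device_catalogue[] = {
        { "parser", { 0xab } },
        { "tso",    { 0xcd } },
};

static bool device_offloads(const char *func,
                            const unsigned char kernel_hash[SHA1_LEN])
{
        size_t i, n = sizeof(device_catalogue) / sizeof(device_catalogue[0]);

        for (i = 0; i < n; i++) {
                if (!strcmp(device_catalogue[i].func, func) &&
                    !memcmp(device_catalogue[i].hash, kernel_hash, SHA1_LEN))
                        return true;
        }
        return false;
}

int main(void)
{
        unsigned char h[SHA1_LEN] = { 0xab };   /* kernel's eBPF parser hash */

        printf("parser offloadable: %s\n",
               device_offloads("parser", h) ? "yes" : "no");
        return 0;
}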

Even before considering hardware offload, I think this approach
addresses a more fundamental problem to make the kernel programmable.
Since the code is in eBPF, the kernel can be reprogrammed at runtime
which could be controlled by TC. This allows local customization of
kernel features, but also is the simplest way to "patch" the kernel
with security and bug fixes (nobody is ever excited to do a kernel
rebase in their datacenter!). Flow dissector is a prime candidate for
this, and I am still planning to replace it with an all eBPF program
(https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).

Tom
Jamal Hadi Salim March 3, 2024, 5 p.m. UTC | #22
On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > This part i am probably going to mumble on. I am going to consider
> > more than just doing ACLs/MAT via flower/u32 for the sake of
> > discussion.
> > True, "fill the gaps" has been our model so far. It requires kernel
> > changes, user space code changes etc justifiably so because most of
> > the time such datapaths are subject to standardization via IETF, IEEE,
> > etc and new extensions come in on a regular basis.  And sometimes we
> > do add features that one or two users or a single vendor has need for
> > at the cost of kernel and user/control extension. Given our work
> > process, any features added this way take a long time to make it to
> > the end user.
>
> What I had in mind was more of a DDP model. The device loads it binary
> blob FW in whatever way it does, then it tells the kernel its parser
> graph, and tables. The kernel exposes those tables to user space.
> All dynamic, no need to change the kernel for each new protocol.
>
> But that's different in two ways:
>  1. the device tells kernel the tables, no "dynamic reprogramming"
>  2. you don't need the SW side, the only use of the API is to interact
>     with the device
>
> User can still do BPF kfuncs to look up in the tables (like in FIB),
> but call them from cls_bpf.
>

This is not far off from what is envisioned today in the discussions.
The main issue is who loads the binary? We went from devlink to the
filter doing the loading. DDP is ethtool. We still need to tie a PCI
device/tc block to the "program" so we can do skip_sw and it works.
Meaning a device that is capable of handling multiple programs can
have multiple blobs loaded. A "program" is mapped to a tc filter and
MAT control works the same way as it does today (netlink/tc ndo).

A program in P4 has a name and ID, and people have been suggesting a sha1
identity as well (or a signature of some kind generated by the
compiler). So the upward propagation could be tied to discovering
this 3-tuple from the driver. Then the control plane targets a
program via that tuple over netlink (as we do currently).
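
Purely for illustration (the struct and field names below are made up,
not something the patches define), the kind of identity tuple I have in
mind is roughly:

/* Illustration only: a 3-tuple a driver could report per loaded P4
 * program, and which netlink control messages would then target. */
#include <stdio.h>

struct p4_prog_ident {
        char            name[64];       /* program name from the compiler */
        unsigned int    id;             /* program id */
        unsigned char   sha1[20];       /* compiler-generated digest */
};

int main(void)
{
        struct p4_prog_ident p = { .name = "simple_l3", .id = 1 };

        printf("targeting program %s (id %u)\n", p.name, p.id);
        return 0;
}

The control plane would then address the program by that tuple, the
same way we address tc objects today.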

I do note, using the DDP sample space, that currently whatever gets loaded
is "trusted" and really you need to have human knowledge of what the
NIC's parsing + MAT is in order to send the control. With P4 that is all
visible/programmable by the end user (I am not a proponent of vendors
"shipping" things or of calling them for support) - so it should be
sufficient to just discover what is in the binary and send the correct
control messages down.

> I think in P4 terms that may be something more akin to only providing
> the runtime API? I seem to recall they had some distinction...

There are several solutions out there (e.g. TDI, P4Runtime) - our API
is netlink, and those could be written on top of netlink; there's no
controversy there.
So the starting point is defining the datapath using P4, generating
the binary blob and whatever constraints are needed using the vendor
backend, and, for the s/w equivalent, generating the eBPF datapath.

> > At the cost of this sounding controversial, i am going
> > to call things like fdb, fib, etc which have fixed datapaths in the
> > kernel "legacy". These "legacy" datapaths almost all the time have
>
> The cynic in me sometimes thinks that the biggest problem with "legacy"
> protocols is that it's hard to make money on them :)

That's a big motivation without a doubt, but there are also people
that want to experiment with things. One of the craziest examples we
have is someone who created a P4 program for an "in-network calculator",
essentially a calculator in the datapath. You send it two operands and
an operator using custom headers; it does the math and responds with a
result in a new header. By itself this program is a toy, but it
demonstrates that if one wanted to, they could have something custom
in the hardware and/or kernel datapath.

cheers,
jamal
Tom Herbert March 3, 2024, 6:10 p.m. UTC | #23
On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > This part i am probably going to mumble on. I am going to consider
> > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > discussion.
> > > True, "fill the gaps" has been our model so far. It requires kernel
> > > changes, user space code changes etc justifiably so because most of
> > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > etc and new extensions come in on a regular basis.  And sometimes we
> > > do add features that one or two users or a single vendor has need for
> > > at the cost of kernel and user/control extension. Given our work
> > > process, any features added this way take a long time to make it to
> > > the end user.
> >
> > What I had in mind was more of a DDP model. The device loads it binary
> > blob FW in whatever way it does, then it tells the kernel its parser
> > graph, and tables. The kernel exposes those tables to user space.
> > All dynamic, no need to change the kernel for each new protocol.
> >
> > But that's different in two ways:
> >  1. the device tells kernel the tables, no "dynamic reprogramming"
> >  2. you don't need the SW side, the only use of the API is to interact
> >     with the device
> >
> > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > but call them from cls_bpf.
> >
>
> This is not far off from what is envisioned today in the discussions.
> The main issue is who loads the binary? We went from devlink to the
> filter doing the loading. DDP is ethtool. We still need to tie a PCI
> device/tc block to the "program" so we can do skip_sw and it works.
> Meaning a device that is capable of handling multiple programs can
> have multiple blobs loaded. A "program" is mapped to a tc filter and
> MAT control works the same way as it does today (netlink/tc ndo).
>
> A program in P4 has a name, ID and people have been suggesting a sha1
> identity (or a signature of some kind should be generated by the
> compiler). So the upward propagation could be tied to discovering
> these 3 tuples from the driver. Then the control plane targets a
> program via those tuples via netlink (as we do currently).
>
> I do note, using the DDP sample space, currently whatever gets loaded
> is "trusted" and really you need to have human knowledge of what the
> NIC's parsing + MAT is to send the control. With P4 that is all
> visible/programmable by the end user (i am not a proponent of vendors
> "shipping" things or calling them for support) - so should be
> sufficient to just discover what is in the binary and send the correct
> control messages down.
>
> > I think in P4 terms that may be something more akin to only providing
> > the runtime API? I seem to recall they had some distinction...
>
> There are several solutions out there (ex: TDI, P4runtime) - our API
> is netlink and those could be written on top of netlink, there's no
> controversy there.
> So the starting point is defining the datapath using P4, generating
> the binary blob and whatever constraints needed using the vendor
> backend and for s/w equivalent generating the eBPF datapath.
>
> > > At the cost of this sounding controversial, i am going
> > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > kernel "legacy". These "legacy" datapaths almost all the time have
> >
> > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > protocols is that it's hard to make money on them :)
>
> That's a big motivation without a doubt, but also there are people
> that want to experiment with things. One of the craziest examples we
> have is someone who created a P4 program for "in network calculator",
> essentially a calculator in the datapath. You send it two operands and
> an operator using custom headers, it does the math and responds with a
> result in a new header. By itself this program is a toy but it
> demonstrates that if one wanted to, they could have something custom
> in hardware and/or kernel datapath.

Jamal,

Given how long P4 has been around, it's surprising that the best
publicly available code example is "the network calculator" toy. At
this point in its lifetime, eBPF had far more examples of real-world
use cases publicly available. That being said, there's nothing
unique about P4 supporting the network calculator. We could just as
easily write this in eBPF (either plain C or P4) and "offload" it to
an ARM core on a SmartNIC.

If we are going to support programmable device offload in the Linux
kernel then I maintain it should be a generic mechanism that's
agnostic to *both* the frontend programming language and the
backend target. For frontend languages we want to let the user program
in a language that's convenient for *them*, which honestly in most
cases isn't going to be a narrow use-case DSL (i.e. typically users
want to code in C/C++, Python, Rust, etc.). For the backend it's the
same story: maybe we're compiling to run on the host, maybe we're
offloading to a P4 runtime, maybe we're offloading to another CPU, maybe
we're offloading to some other programmable NPU. The only real
requirement is a compiler that can take the frontend code and compile
it for the desired backend target, but above all we want this to be easy
for the programmer; the compiler needs to do the heavy lifting and we
should never require the user to understand the nuances of a target.

IMO, the model we want for programmable kernel offload is "write once,
run anywhere, run well", which is the Java tagline amended with "run
well". Users write one program for their datapath processing, it runs
on various targets, and for any given target we want it to run at the highest
performance levels possible given the target's capabilities.

Tom

>
> cheers,
> jamal
Jamal Hadi Salim March 3, 2024, 7:04 p.m. UTC | #24
On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
>
> On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > This part i am probably going to mumble on. I am going to consider
> > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > discussion.
> > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > changes, user space code changes etc justifiably so because most of
> > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > do add features that one or two users or a single vendor has need for
> > > > at the cost of kernel and user/control extension. Given our work
> > > > process, any features added this way take a long time to make it to
> > > > the end user.
> > >
> > > What I had in mind was more of a DDP model. The device loads it binary
> > > blob FW in whatever way it does, then it tells the kernel its parser
> > > graph, and tables. The kernel exposes those tables to user space.
> > > All dynamic, no need to change the kernel for each new protocol.
> > >
> > > But that's different in two ways:
> > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > >  2. you don't need the SW side, the only use of the API is to interact
> > >     with the device
> > >
> > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > but call them from cls_bpf.
> > >
> >
> > This is not far off from what is envisioned today in the discussions.
> > The main issue is who loads the binary? We went from devlink to the
> > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > device/tc block to the "program" so we can do skip_sw and it works.
> > Meaning a device that is capable of handling multiple programs can
> > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > MAT control works the same way as it does today (netlink/tc ndo).
> >
> > A program in P4 has a name, ID and people have been suggesting a sha1
> > identity (or a signature of some kind should be generated by the
> > compiler). So the upward propagation could be tied to discovering
> > these 3 tuples from the driver. Then the control plane targets a
> > program via those tuples via netlink (as we do currently).
> >
> > I do note, using the DDP sample space, currently whatever gets loaded
> > is "trusted" and really you need to have human knowledge of what the
> > NIC's parsing + MAT is to send the control. With P4 that is all
> > visible/programmable by the end user (i am not a proponent of vendors
> > "shipping" things or calling them for support) - so should be
> > sufficient to just discover what is in the binary and send the correct
> > control messages down.
> >
> > > I think in P4 terms that may be something more akin to only providing
> > > the runtime API? I seem to recall they had some distinction...
> >
> > There are several solutions out there (ex: TDI, P4runtime) - our API
> > is netlink and those could be written on top of netlink, there's no
> > controversy there.
> > So the starting point is defining the datapath using P4, generating
> > the binary blob and whatever constraints needed using the vendor
> > backend and for s/w equivalent generating the eBPF datapath.
> >
> > > > At the cost of this sounding controversial, i am going
> > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > >
> > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > protocols is that it's hard to make money on them :)
> >
> > That's a big motivation without a doubt, but also there are people
> > that want to experiment with things. One of the craziest examples we
> > have is someone who created a P4 program for "in network calculator",
> > essentially a calculator in the datapath. You send it two operands and
> > an operator using custom headers, it does the math and responds with a
> > result in a new header. By itself this program is a toy but it
> > demonstrates that if one wanted to, they could have something custom
> > in hardware and/or kernel datapath.
>
> Jamal,
>
> Given how long P4 has been around it's surprising that the best
> publicly available code example is "the network calculator" toy.

Come on, Tom ;-> That was just an example of something "crazy" to
demonstrate freedom. I can run that on any of the P4-friendly NICs
today. You are probably being facetious - there are some serious
publicly available projects out there, some of which I quote in the
cover letter (like DASH).

> At
> this point in its lifetime, eBPF had far more examples of real world
> use cases publically available. That being said, there's nothing
> unique about P4 supporting the network calculator. We could just as
> easily write this in eBPF (either plain C or P4)  and "offload" it to
> an ARM core on a SmartNIC.

With current port speeds hitting 800Gbps you want to use ARM cores as
your offload engine? ;-> Running the generated eBPF on the ARM core is
a valid P4 target, i.e., there is no contradiction.
Note: P4 is a DSL specialized for datapath definition; it is not in
competition with eBPF - two different worlds. I see eBPF as an
infrastructure tool, nothing more.

> If we are going to support programmable device offload in the Linux
> kernel then I maintain it should be a generic mechanism that's
> agnostic to *both* the frontend programming language as well as the
> backend target. For frontend languages we want to let the user program
> in a language that's convenient for *them*, which honestly in most
> cases isn't going to be a narrow use case DSL (i.e. typically users
> want to code in C/C++, Python, Rust, etc.).

You and I have never agreed philosophically on this point, ever.
Developers are expensive and not economically scalable. IOW, in the
era of automation (generative AI, etc.) tooling is king. Let's build
the right tooling. Whenever you make this statement I get the vision
of Steve Ballmer ranting on the stage with "developers! developers!
developers!" - but that was eons ago. To use your strong view: learn
compilers! And the future is probably to replace compilers with AI.

> For the backend it's the
> same story, maybe we're compiling to run in host, maybe we're
> offloading to P4 runtime, maybe we're offloading to another CPU, maybe
> we're offloading some other programmable NPU. The only real
> requirement is a compiler that can take the frontend code and compile
> for the desired backend target, but above all we want this to be easy
> for the programmer, the compiler needs to do the heavy lifting and we
> should never require the user to understand the nuances of a target.
>

Agreed, it is possible to use other languages in the frontend. It is
also possible to extend P4.

> IMO, the model we want for programmable kernel offload is "write once,
> run anywhere, run well". Which is the Java tagline amended with "run
> well". Users write one program for their datapath processing, it runs
> on various targets, for any given target we run to run at the highest
> performance levels possible given the target's capabilities.
>

I would like to emphasize: our target is P4 - vendors have put out
hardware, people are deploying and evolving things. It is real today
with deployments, not some science project. I am not arguing you can't
do what you suggested, but we want to initially focus on P4. Neither am
I saying we can't influence P4 to be more Linux-friendly. But none of
that matters. We are only concerned with P4.

cheers,
jamal



> Tom
>
> >
> > cheers,
> > jamal
Jakub Kicinski March 4, 2024, 8:07 p.m. UTC | #25
On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote:
> Even before considering hardware offload, I think this approach
> addresses a more fundamental problem to make the kernel programmable.

I like some aspects of what you're describing, but my understanding
is that it'd be a noticeable shift in direction.
I'm not sure if merging P4TC is the most effective way of taking
a first step in that direction. (I mean that in the literal sense
of lack of confidence, not a polite way to indicate holding a conviction
to the contrary.)
Jakub Kicinski March 4, 2024, 8:18 p.m. UTC | #26
On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote:
> > At
> > this point in its lifetime, eBPF had far more examples of real world
> > use cases publically available. That being said, there's nothing
> > unique about P4 supporting the network calculator. We could just as
> > easily write this in eBPF (either plain C or P4)  and "offload" it to
> > an ARM core on a SmartNIC.  
> 
> With current port speeds hitting 800gbps you want to use Arm cores as
> your offload engine?;-> Running the generated ebpf on the arm core is
> a valid P4 target.  i.e there is no contradiction.
> Note: P4 is a DSL specialized for datapath definition; it is not a
> competition to ebpf, two different worlds. I see ebpf as an
> infrastructure tool, nothing more.

I wonder how much we're benefiting from calling this thing P4 and how
much we should focus on filling in the tech gaps.
Exactly like you said, BPF is not competition, but neither does 
the kernel "support P4", any more than it supports bpftrace and:

$ git grep --files-with-matches bpftrace
Documentation/bpf/redirect.rst
tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c

Filling in tech gaps would also help DDP; IDK how much DDP is based
on or uses P4, nor should I have to care, frankly :S
Tom Herbert March 4, 2024, 8:58 p.m. UTC | #27
On Mon, Mar 4, 2024 at 12:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote:
> > Even before considering hardware offload, I think this approach
> > addresses a more fundamental problem to make the kernel programmable.
>
> I like some aspects of what you're describing, but my understanding
> is that it'd be a noticeable shift in direction.
> I'm not sure if merging P4TC is the most effective way of taking
> a first step in that direction. (I mean that in the literal sense
> of lack of confidence, not polite way to indicate holding a conviction
> to the contrary.)

Jakub,

My comments were with regard to making the kernel offloadable by
first making it programmable. The P4TC patches are very good for
describing processing that is table driven, like filtering or IPtables,
but I was thinking more of kernel datapath processing that isn't table
driven, like GSO, GRO, flow dissector, and even up to revisiting TCP
offload.

Basically, I'm proposing that instead of eBPF always being side
functionality, there are cases where it could natively be used to
implement the main functionality of the kernel datapath! It is a
noticeable shift in direction, but I also think it's the logical
outcome of eBPF :-).

Tom
Jamal Hadi Salim March 4, 2024, 9:02 p.m. UTC | #28
On Mon, Mar 4, 2024 at 3:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote:
> > > At
> > > this point in its lifetime, eBPF had far more examples of real world
> > > use cases publically available. That being said, there's nothing
> > > unique about P4 supporting the network calculator. We could just as
> > > easily write this in eBPF (either plain C or P4)  and "offload" it to
> > > an ARM core on a SmartNIC.
> >
> > With current port speeds hitting 800gbps you want to use Arm cores as
> > your offload engine?;-> Running the generated ebpf on the arm core is
> > a valid P4 target.  i.e there is no contradiction.
> > Note: P4 is a DSL specialized for datapath definition; it is not a
> > competition to ebpf, two different worlds. I see ebpf as an
> > infrastructure tool, nothing more.
>
> I wonder how much we're benefiting of calling this thing P4 and how
> much we should focus on filling in the tech gaps.

We are implementing based on the P4 standard specification. I fear it
would be confusing to call it something else when everyone else is
calling it P4 (including the vendors whose devices are being targeted
in the case of offload).
If the name is an issue, sure, we can change it.
It just so happens that TC has similar semantics to P4 (match-action
tables) - hence the name P4TC and an implementation whose code fits
nicely with TC.

> Exactly like you said, BPF is not competition, but neither does
> the kernel "support P4", any more than it supports bpftrace and:
>

Like I said, if the name is an issue, let's change the name ;->

> $ git grep --files-with-matches bpftrace
> Documentation/bpf/redirect.rst
> tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c
>
> Filling in tech gaps would also help DPP, IDK how much DPP is based
> or using P4, neither should I have to care, frankly :S

DDP is an Intel-specific approach, pre-P4. For P4: at least two vendors
(on Cc), Intel and AMD, have NICs supporting the P4 specification, and
there are FPGA variants out there as well.
From my discussions with folks at Intel it is easy to transform DDP to
P4. My understanding is it is the same compiler folks. The beauty being
you don't have to use the Intel version of the loaded program for the
offload if you wanted to change what the hardware does custom to you
(within the constraints of what the hardware can do).

cheers,
jamal
Stanislav Fomichev March 4, 2024, 9:19 p.m. UTC | #29
On 03/03, Tom Herbert wrote:
> On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > This is configurability versus programmability. The table driven
> > > approach as input (configurability) might work fine for generic
> > > match-action tables up to the point that tables are expressive enough
> > > to satisfy the requirements. But parsing doesn't fall into the table
> > > driven paradigm: parsers want to be *programmed*. This is why we
> > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > But the problem we quickly hit that eBPF is not offloadable to network
> > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > the declarative representation that parsers in the devices could
> > > consume (they're not CPUs running eBPF).
> > >
> > > I think the key here is what we mean by kernel offload. When we do
> > > kernel offload, is it the kernel implementation or the kernel
> > > functionality that's being offloaded? If it's the latter then we have
> > > a lot more flexibility. What we'd need is a safe and secure way to
> > > synchronize with that offload device that precisely supports the
> > > kernel functionality we'd like to offload. This can be done if both
> > > the kernel bits and programmed offload are derived from the same
> > > source (i.e. tag source code with a sha-1). For example, if someone
> > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > backend using independent tool chains and program download. At
> > > runtime, the kernel can safely offload the functionality of the eBPF
> > > parser to the device if it matches the hash to that reported by the
> > > device
> >
> > Good points. If I understand you correctly you're saying that parsers
> > are more complex than just a basic parsing tree a'la u32.
> 
> Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> isn't conducive to u32. We also want the advantages of compiler
> optimizations to unroll loops, squash nodes in the parse graph, etc.
> 
> > Then we can take this argument further. P4 has grown to encompass a lot
> > of functionality of quite complex devices. How do we square that with
> > the kernel functionality offload model. If the entire device is modeled,
> > including f.e. TSO, an offload would mean that the user has to write
> > a TSO implementation which they then load into TC? That seems odd.
> >
> > IOW I don't quite know how to square in my head the "total
> > functionality" with being a TC-based "plugin".
> 
> Hi Jakub,
> 
> I believe the solution is to replace kernel code with eBPF in cases
> where we need programmability. This effectively means that we would
> ship eBPF code as part of the kernel. So in the case of TSO, the
> kernel would include a standard implementation in eBPF that could be
> compiled into the kernel by default. The restricted C source code is
> tagged with a hash, so if someone wants to offload TSO they could
> compile the source into their target and retain the hash. At runtime
> it's a matter of querying the driver to see if the device supports the
> TSO program the kernel is running by comparing hash values. Scaling
> this, a device could support a catalogue of programs: TSO, LRO,
> parser, IPtables, etc., If the kernel can match the hash of its eBPF
> code to one reported by the driver then it can assume functionality is
> offloadable. This is an elaboration of "device features", but instead
> of the device telling us they think they support an adequate GRO
> implementation by reporting NETIF_F_GRO, the device would tell the
> kernel that they not only support GRO but they provide identical
> functionality of the kernel GRO (which IMO is the first requirement of
> kernel offload).
> 
> Even before considering hardware offload, I think this approach
> addresses a more fundamental problem to make the kernel programmable.
> Since the code is in eBPF, the kernel can be reprogrammed at runtime
> which could be controlled by TC. This allows local customization of
> kernel features, but also is the simplest way to "patch" the kernel
> with security and bug fixes (nobody is ever excited to do a kernel

[..]

> rebase in their datacenter!). Flow dissector is a prime candidate for
> this, and I am still planning to replace it with an all eBPF program
> (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).

So you're suggesting to bundle (and extend)
tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
similar lines here. We load this program manually right now; shipping
and autoloading it with the kernel will be easier.
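
For context, today's manual load/attach is just a few libbpf calls. A
minimal sketch (error handling trimmed; I'm assuming the selftest object
file and its "_dissect" entry point, so the names may be off):

#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int prog_fd;

	/* Object built from tools/testing/selftests/bpf/progs/bpf_flow.c */
	obj = bpf_object__open_file("bpf_flow.o", NULL);
	if (!obj || bpf_object__load(obj))
		return 1;

	/* Entry point of the selftest dissector (SEC("flow_dissector")) */
	prog = bpf_object__find_program_by_name(obj, "_dissect");
	if (!prog)
		return 1;
	prog_fd = bpf_program__fd(prog);

	/* Attach to the current netns; one dissector per netns */
	if (bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0))
		return 1;

	return 0;
}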
Stanislav Fomichev March 4, 2024, 9:23 p.m. UTC | #30
On 03/03, Jamal Hadi Salim wrote:
> On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> >
> > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > This part i am probably going to mumble on. I am going to consider
> > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > discussion.
> > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > changes, user space code changes etc justifiably so because most of
> > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > do add features that one or two users or a single vendor has need for
> > > > > at the cost of kernel and user/control extension. Given our work
> > > > > process, any features added this way take a long time to make it to
> > > > > the end user.
> > > >
> > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > graph, and tables. The kernel exposes those tables to user space.
> > > > All dynamic, no need to change the kernel for each new protocol.
> > > >
> > > > But that's different in two ways:
> > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > >  2. you don't need the SW side, the only use of the API is to interact
> > > >     with the device
> > > >
> > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > but call them from cls_bpf.
> > > >
> > >
> > > This is not far off from what is envisioned today in the discussions.
> > > The main issue is who loads the binary? We went from devlink to the
> > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > device/tc block to the "program" so we can do skip_sw and it works.
> > > Meaning a device that is capable of handling multiple programs can
> > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > MAT control works the same way as it does today (netlink/tc ndo).
> > >
> > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > identity (or a signature of some kind should be generated by the
> > > compiler). So the upward propagation could be tied to discovering
> > > these 3 tuples from the driver. Then the control plane targets a
> > > program via those tuples via netlink (as we do currently).
> > >
> > > I do note, using the DDP sample space, currently whatever gets loaded
> > > is "trusted" and really you need to have human knowledge of what the
> > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > visible/programmable by the end user (i am not a proponent of vendors
> > > "shipping" things or calling them for support) - so should be
> > > sufficient to just discover what is in the binary and send the correct
> > > control messages down.
> > >
> > > > I think in P4 terms that may be something more akin to only providing
> > > > the runtime API? I seem to recall they had some distinction...
> > >
> > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > is netlink and those could be written on top of netlink, there's no
> > > controversy there.
> > > So the starting point is defining the datapath using P4, generating
> > > the binary blob and whatever constraints needed using the vendor
> > > backend and for s/w equivalent generating the eBPF datapath.
> > >
> > > > > At the cost of this sounding controversial, i am going
> > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > >
> > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > protocols is that it's hard to make money on them :)
> > >
> > > That's a big motivation without a doubt, but also there are people
> > > that want to experiment with things. One of the craziest examples we
> > > have is someone who created a P4 program for "in network calculator",
> > > essentially a calculator in the datapath. You send it two operands and
> > > an operator using custom headers, it does the math and responds with a
> > > result in a new header. By itself this program is a toy but it
> > > demonstrates that if one wanted to, they could have something custom
> > > in hardware and/or kernel datapath.
> >
> > Jamal,
> >
> > Given how long P4 has been around it's surprising that the best
> > publicly available code example is "the network calculator" toy.
> 
> Come on Tom ;-> That was just an example of something "crazy" to
> demonstrate freedom. I can run that in any of the P4 friendly NICs
> today. You are probably being facetious - There are some serious
> publicly available projects out there, some of which I quote on the
> cover letter (like DASH).

Shameless plug. I have a more crazy example with bpf:

https://github.com/fomichev/xdp-btc-miner

A good way to ensure all those smartnic cycles are not wasted :-D
I wish we had more nics with xdp bpf offloads :-(
Jamal Hadi Salim March 4, 2024, 9:44 p.m. UTC | #31
On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/03, Jamal Hadi Salim wrote:
> > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > >
> > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > discussion.
> > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > do add features that one or two users or a single vendor has need for
> > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > process, any features added this way take a long time to make it to
> > > > > > the end user.
> > > > >
> > > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > >
> > > > > But that's different in two ways:
> > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > >     with the device
> > > > >
> > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > but call them from cls_bpf.
> > > > >
> > > >
> > > > This is not far off from what is envisioned today in the discussions.
> > > > The main issue is who loads the binary? We went from devlink to the
> > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > Meaning a device that is capable of handling multiple programs can
> > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > >
> > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > identity (or a signature of some kind should be generated by the
> > > > compiler). So the upward propagation could be tied to discovering
> > > > these 3 tuples from the driver. Then the control plane targets a
> > > > program via those tuples via netlink (as we do currently).
> > > >
> > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > is "trusted" and really you need to have human knowledge of what the
> > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > "shipping" things or calling them for support) - so should be
> > > > sufficient to just discover what is in the binary and send the correct
> > > > control messages down.
> > > >
> > > > > I think in P4 terms that may be something more akin to only providing
> > > > > the runtime API? I seem to recall they had some distinction...
> > > >
> > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > is netlink and those could be written on top of netlink, there's no
> > > > controversy there.
> > > > So the starting point is defining the datapath using P4, generating
> > > > the binary blob and whatever constraints needed using the vendor
> > > > backend and for s/w equivalent generating the eBPF datapath.
> > > >
> > > > > > At the cost of this sounding controversial, i am going
> > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > >
> > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > protocols is that it's hard to make money on them :)
> > > >
> > > > That's a big motivation without a doubt, but also there are people
> > > > that want to experiment with things. One of the craziest examples we
> > > > have is someone who created a P4 program for "in network calculator",
> > > > essentially a calculator in the datapath. You send it two operands and
> > > > an operator using custom headers, it does the math and responds with a
> > > > result in a new header. By itself this program is a toy but it
> > > > demonstrates that if one wanted to, they could have something custom
> > > > in hardware and/or kernel datapath.
> > >
> > > Jamal,
> > >
> > > Given how long P4 has been around it's surprising that the best
> > > publicly available code example is "the network calculator" toy.
> >
> > Come on Tom ;-> That was just an example of something "crazy" to
> > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > today. You are probably being facetious - There are some serious
> > publicly available projects out there, some of which I quote on the
> > cover letter (like DASH).
>
> Shameless plug. I have a more crazy example with bpf:
>
> https://github.com/fomichev/xdp-btc-miner
>

Hrm - this looks crazy interesting ;-> Tempting. I guess to port this
to P4 we'd need the sha256 in h/w (which most of these vendors have
already). Is there any other acceleration you would need? It would have
been more fun if you had invented your own headers too ;->

cheers,
jamal

> A good way to ensure all those smartnic cycles are not wasted :-D
> I wish we had more nics with xdp bpf offloads :-(
Tom Herbert March 4, 2024, 10:01 p.m. UTC | #32
On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/03, Tom Herbert wrote:
> > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > This is configurability versus programmability. The table driven
> > > > approach as input (configurability) might work fine for generic
> > > > match-action tables up to the point that tables are expressive enough
> > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > But the problem we quickly hit that eBPF is not offloadable to network
> > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > the declarative representation that parsers in the devices could
> > > > consume (they're not CPUs running eBPF).
> > > >
> > > > I think the key here is what we mean by kernel offload. When we do
> > > > kernel offload, is it the kernel implementation or the kernel
> > > > functionality that's being offloaded? If it's the latter then we have
> > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > synchronize with that offload device that precisely supports the
> > > > kernel functionality we'd like to offload. This can be done if both
> > > > the kernel bits and programmed offload are derived from the same
> > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > backend using independent tool chains and program download. At
> > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > parser to the device if it matches the hash to that reported by the
> > > > device
> > >
> > > Good points. If I understand you correctly you're saying that parsers
> > > are more complex than just a basic parsing tree a'la u32.
> >
> > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > isn't conducive to u32. We also want the advantages of compiler
> > optimizations to unroll loops, squash nodes in the parse graph, etc.
> >
> > > Then we can take this argument further. P4 has grown to encompass a lot
> > > of functionality of quite complex devices. How do we square that with
> > > the kernel functionality offload model. If the entire device is modeled,
> > > including f.e. TSO, an offload would mean that the user has to write
> > > a TSO implementation which they then load into TC? That seems odd.
> > >
> > > IOW I don't quite know how to square in my head the "total
> > > functionality" with being a TC-based "plugin".
> >
> > Hi Jakub,
> >
> > I believe the solution is to replace kernel code with eBPF in cases
> > where we need programmability. This effectively means that we would
> > ship eBPF code as part of the kernel. So in the case of TSO, the
> > kernel would include a standard implementation in eBPF that could be
> > compiled into the kernel by default. The restricted C source code is
> > tagged with a hash, so if someone wants to offload TSO they could
> > compile the source into their target and retain the hash. At runtime
> > it's a matter of querying the driver to see if the device supports the
> > TSO program the kernel is running by comparing hash values. Scaling
> > this, a device could support a catalogue of programs: TSO, LRO,
> > parser, IPtables, etc., If the kernel can match the hash of its eBPF
> > code to one reported by the driver then it can assume functionality is
> > offloadable. This is an elaboration of "device features", but instead
> > of the device telling us they think they support an adequate GRO
> > implementation by reporting NETIF_F_GRO, the device would tell the
> > kernel that they not only support GRO but they provide identical
> > functionality of the kernel GRO (which IMO is the first requirement of
> > kernel offload).
> >
> > Even before considering hardware offload, I think this approach
> > addresses a more fundamental problem to make the kernel programmable.
> > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > which could be controlled by TC. This allows local customization of
> > kernel features, but also is the simplest way to "patch" the kernel
> > with security and bug fixes (nobody is ever excited to do a kernel
>
> [..]
>
> > rebase in their datacenter!). Flow dissector is a prime candidate for
> > this, and I am still planning to replace it with an all eBPF program
> > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
>
> So you're suggesting to bundle (and extend)
> tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> similar lines here. We load this program manually right now, shipping
> and autoloading with the kernel will be easer.

Hi Stanislav,

Yes, I envision that we would have a standard implementation of the
flow dissector in eBPF that is shipped with the kernel and autoloaded.
However, for the front-end source I want to move away from imperative
code. As I mentioned in the presentation, flow_dissector.c is spaghetti
code and has been prone to bugs over the years, especially whenever
someone adds support for a new fringe protocol (I take the liberty of
calling it spaghetti code since I'm partially responsible for creating
this mess ;-) ).

The problem is that parsers are much better represented by a
declarative rather than an imperative representation. To that end, we
defined PANDA, which allows constructing a parser (parse graph) as data
structures in C. We use the "PANDA parser" to compile that into
restricted C code, i.e. imperative code that looks much more like
typical eBPF. With this method we abstract out all the bookkeeping that
was often the source of bugs (like pulling up skbufs, checking length
limits, etc.). The other advantage is that we're able to find a lot
more optimizations if we start with the right representation of the
problem.
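
To make the declarative idea concrete, here is a minimal plain-C sketch
of a parse graph expressed as data plus a single generic walker. The
types and names are made up for illustration - this is not the actual
PANDA API - but it shows how the bookkeeping (length checks, header
advance) ends up living in exactly one place:

#include <stdint.h>
#include <stddef.h>

struct parse_node;

/* One row of a next-protocol table: selector value -> next node */
struct proto_table_entry {
	unsigned int value;
	const struct parse_node *node;
};

/* A protocol node: minimum length, how to read the selector and the
 * (possibly variable) header length, and where to go next */
struct parse_node {
	unsigned int min_len;
	unsigned int (*next_proto)(const uint8_t *hdr);
	unsigned int (*hdr_len)(const uint8_t *hdr);
	const struct proto_table_entry *table;
	size_t table_size;
};

static unsigned int eth_next_proto(const uint8_t *h)
{
	return ((unsigned int)h[12] << 8) | h[13];	/* EtherType */
}
static unsigned int ipv4_next_proto(const uint8_t *h) { return h[9]; }
static unsigned int ipv4_hdr_len(const uint8_t *h) { return (h[0] & 0x0f) * 4; }

static const struct parse_node tcp_node = { .min_len = 20 };
static const struct parse_node udp_node = { .min_len = 8 };

static const struct proto_table_entry ipv4_table[] = {
	{ 6, &tcp_node }, { 17, &udp_node },		/* TCP, UDP */
};
static const struct parse_node ipv4_node = {
	.min_len = 20, .next_proto = ipv4_next_proto, .hdr_len = ipv4_hdr_len,
	.table = ipv4_table, .table_size = 2,
};

static const struct proto_table_entry eth_table[] = {
	{ 0x0800, &ipv4_node },				/* ETH_P_IP */
};
static const struct parse_node eth_node = {
	.min_len = 14, .next_proto = eth_next_proto,
	.table = eth_table, .table_size = 1,
};

/* Generic walker: all the length bookkeeping lives here, once */
static const struct parse_node *parse(const struct parse_node *node,
				      const uint8_t *pkt, size_t len)
{
	while (node && node->table) {
		unsigned int hlen = node->hdr_len ? node->hdr_len(pkt)
						  : node->min_len;
		const struct parse_node *next = NULL;
		unsigned int sel;
		size_t i;

		if (hlen < node->min_len || hlen > len)
			return NULL;			/* truncated/malformed */
		sel = node->next_proto(pkt);
		for (i = 0; i < node->table_size; i++)
			if (node->table[i].value == sel)
				next = node->table[i].node;
		pkt += hlen;
		len -= hlen;
		node = next;
	}
	return node;					/* leaf, or NULL on miss */
}

A compiler (or the walker above) consumes the graph; adding a protocol
means adding a node and a table row, not touching the traversal logic.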

If you're interested, the video presentation on this is in
https://www.youtube.com/watch?v=zVnmVDSEoXc.

Tom
Stanislav Fomichev March 4, 2024, 10:23 p.m. UTC | #33
On 03/04, Jamal Hadi Salim wrote:
> On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/03, Jamal Hadi Salim wrote:
> > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > >
> > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > >
> > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > discussion.
> > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > process, any features added this way take a long time to make it to
> > > > > > > the end user.
> > > > > >
> > > > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > >
> > > > > > But that's different in two ways:
> > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > >     with the device
> > > > > >
> > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > but call them from cls_bpf.
> > > > > >
> > > > >
> > > > > This is not far off from what is envisioned today in the discussions.
> > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > Meaning a device that is capable of handling multiple programs can
> > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > >
> > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > identity (or a signature of some kind should be generated by the
> > > > > compiler). So the upward propagation could be tied to discovering
> > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > program via those tuples via netlink (as we do currently).
> > > > >
> > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > "shipping" things or calling them for support) - so should be
> > > > > sufficient to just discover what is in the binary and send the correct
> > > > > control messages down.
> > > > >
> > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > the runtime API? I seem to recall they had some distinction...
> > > > >
> > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > is netlink and those could be written on top of netlink, there's no
> > > > > controversy there.
> > > > > So the starting point is defining the datapath using P4, generating
> > > > > the binary blob and whatever constraints needed using the vendor
> > > > > backend and for s/w equivalent generating the eBPF datapath.
> > > > >
> > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > >
> > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > protocols is that it's hard to make money on them :)
> > > > >
> > > > > That's a big motivation without a doubt, but also there are people
> > > > > that want to experiment with things. One of the craziest examples we
> > > > > have is someone who created a P4 program for "in network calculator",
> > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > an operator using custom headers, it does the math and responds with a
> > > > > result in a new header. By itself this program is a toy but it
> > > > > demonstrates that if one wanted to, they could have something custom
> > > > > in hardware and/or kernel datapath.
> > > >
> > > > Jamal,
> > > >
> > > > Given how long P4 has been around it's surprising that the best
> > > > publicly available code example is "the network calculator" toy.
> > >
> > > Come on Tom ;-> That was just an example of something "crazy" to
> > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > today. You are probably being facetious - There are some serious
> > > publicly available projects out there, some of which I quote on the
> > > cover letter (like DASH).
> >
> > Shameless plug. I have a more crazy example with bpf:
> >
> > https://github.com/fomichev/xdp-btc-miner
> >
> 
> Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> to P4 we'd need the sha256 in h/w (which most of these vendors have
> already). Is there any other acceleration would you need? Would have
> been more fun if you invented you own headers too ;->

Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))
is one thing. And the other is some way to compare that sha256 against a
hard-coded (difficulty) number (as a 256-bit uint). But I have no
clue how well that maps onto the declarative P4 language. Most likely
possible, if you're saying that the calculator is possible?
I'm assuming that even sha256 could be implemented in P4 without
any extra support from the vendor? It's just a bunch of xors and
rotations over a fixed-size input buffer.
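
Roughly what I mean, as a plain-C sketch; sha256() below is a stand-in
for whichever implementation ends up being used (not a specific API),
and I'm treating the digest as a little-endian 256-bit integer the way
bitcoin does:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

void sha256(const uint8_t *data, size_t len, uint8_t out[32]); /* assumed */

/* Compare two 32-byte digests as little-endian 256-bit integers,
 * most significant byte (index 31) first */
static bool hash_meets_target(const uint8_t hash[32], const uint8_t target[32])
{
	for (int i = 31; i >= 0; i--) {
		if (hash[i] < target[i])
			return true;
		if (hash[i] > target[i])
			return false;
	}
	return true;	/* equal still meets the target */
}

static bool check_share(const uint8_t *pkt, size_t hdr_off,
			const uint8_t target[32])
{
	uint8_t h1[32], h2[32];

	sha256(pkt + hdr_off, 80, h1);	/* over the 80-byte block header */
	sha256(h1, 32, h2);		/* second pass over the digest */
	return hash_meets_target(h2, target);
}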
Jamal Hadi Salim March 4, 2024, 10:59 p.m. UTC | #34
On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/04, Jamal Hadi Salim wrote:
> > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/03, Jamal Hadi Salim wrote:
> > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > > >
> > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > >
> > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > > discussion.
> > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > > process, any features added this way take a long time to make it to
> > > > > > > > the end user.
> > > > > > >
> > > > > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > > >
> > > > > > > But that's different in two ways:
> > > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > > >     with the device
> > > > > > >
> > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > > but call them from cls_bpf.
> > > > > > >
> > > > > >
> > > > > > This is not far off from what is envisioned today in the discussions.
> > > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > > Meaning a device that is capable of handling multiple programs can
> > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > > >
> > > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > > identity (or a signature of some kind should be generated by the
> > > > > > compiler). So the upward propagation could be tied to discovering
> > > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > > program via those tuples via netlink (as we do currently).
> > > > > >
> > > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > > "shipping" things or calling them for support) - so should be
> > > > > > sufficient to just discover what is in the binary and send the correct
> > > > > > control messages down.
> > > > > >
> > > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > > the runtime API? I seem to recall they had some distinction...
> > > > > >
> > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > > is netlink and those could be written on top of netlink, there's no
> > > > > > controversy there.
> > > > > > So the starting point is defining the datapath using P4, generating
> > > > > > the binary blob and whatever constraints needed using the vendor
> > > > > > backend and for s/w equivalent generating the eBPF datapath.
> > > > > >
> > > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > > >
> > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > > protocols is that it's hard to make money on them :)
> > > > > >
> > > > > > That's a big motivation without a doubt, but also there are people
> > > > > > that want to experiment with things. One of the craziest examples we
> > > > > > have is someone who created a P4 program for "in network calculator",
> > > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > > an operator using custom headers, it does the math and responds with a
> > > > > > result in a new header. By itself this program is a toy but it
> > > > > > demonstrates that if one wanted to, they could have something custom
> > > > > > in hardware and/or kernel datapath.
> > > > >
> > > > > Jamal,
> > > > >
> > > > > Given how long P4 has been around it's surprising that the best
> > > > > publicly available code example is "the network calculator" toy.
> > > >
> > > > Come on Tom ;-> That was just an example of something "crazy" to
> > > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > > today. You are probably being facetious - There are some serious
> > > > publicly available projects out there, some of which I quote on the
> > > > cover letter (like DASH).
> > >
> > > Shameless plug. I have a more crazy example with bpf:
> > >
> > > https://github.com/fomichev/xdp-btc-miner
> > >
> >
> > Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> > to P4 we'd need the sha256 in h/w (which most of these vendors have
> > already). Is there any other acceleration would you need? Would have
> > been more fun if you invented you own headers too ;->
>
> Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))

This part is straightforward.

> is one thing. And the other is some way to compare that sha256 vs some
> hard-coded (difficulty) number (as a 256-byte uint).

The compiler may have issues with this comparison - will have to look
(I am pretty sure it's fixable though).


>  But I have no
> clue how well that maps into declarative p4 language. Most likely
> possible if you're saying that the calculator is possible?

The calculator basically is written as a set of match-action tables.
You parse your header, construct a key based on the operator field of
the header (e.g. "+"), and invoke an action which takes the operands
from the headers (e.g. "1" and "2"); the action returns the result
("3"). You stash the result in a new packet and send it back to the
source.

So my thinking is the computation you need would be modelled as an action.
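
Shaped as code, the toy looks roughly like this (plain C, made-up header
layout, just to show the match-action structure; the real thing is a P4
program with its own custom header):

#include <stdint.h>

struct calc_hdr {		/* hypothetical custom header */
	uint8_t  op;		/* '+', '-', '*': the table key */
	uint32_t operand_a;
	uint32_t operand_b;
	uint32_t result;	/* filled in by the action, echoed back */
};

typedef void (*calc_action)(struct calc_hdr *h);

static void act_add(struct calc_hdr *h) { h->result = h->operand_a + h->operand_b; }
static void act_sub(struct calc_hdr *h) { h->result = h->operand_a - h->operand_b; }
static void act_mul(struct calc_hdr *h) { h->result = h->operand_a * h->operand_b; }

/* The match-action table: key on the operator, run the bound action */
static const struct { uint8_t key; calc_action act; } calc_table[] = {
	{ '+', act_add }, { '-', act_sub }, { '*', act_mul },
};

static int calc_apply(struct calc_hdr *h)
{
	for (unsigned int i = 0; i < sizeof(calc_table) / sizeof(calc_table[0]); i++) {
		if (calc_table[i].key == h->op) {
			calc_table[i].act(h);	/* action computes the result */
			return 0;		/* caller reflects the packet */
		}
	}
	return -1;				/* table miss: default action */
}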

> I'm assuming that even sha256 can possibly be implemented in p4 without
> any extra support from the vendor? It's just a bunch of xors and
> rotations over a fix-sized input buffer.

True, and I think those would be fast. But if the h/w offers it as an
interface, why not use it?
It's not that you'd be running out of instruction space - and my memory
is hazy - but IIRC there is sha256 support in the kernel Crypto API -
does it not make sense to kfunc into that?
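
Very roughly something like this (not a real patch; the kfunc
registration/annotation boilerplate is omitted and the argument
conventions are approximate), assuming the lib/crypto sha256() helper:

#include <linux/types.h>
#include <linux/btf.h>
#include <crypto/sha2.h>

/* Expose SHA-256 to BPF programs as a kfunc wrapping lib/crypto */
__bpf_kfunc void bpf_sha256_digest(const u8 *data, u32 data__sz,
				   u8 *out, u32 out__sz)
{
	if (out__sz < SHA256_DIGEST_SIZE)
		return;
	sha256(data, data__sz, out);
}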

cheers,
jamal
Stanislav Fomichev March 4, 2024, 11:14 p.m. UTC | #35
On 03/04, Jamal Hadi Salim wrote:
> On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/04, Jamal Hadi Salim wrote:
> > > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 03/03, Jamal Hadi Salim wrote:
> > > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > > > >
> > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > > > discussion.
> > > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > > > process, any features added this way take a long time to make it to
> > > > > > > > > the end user.
> > > > > > > >
> > > > > > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > > > >
> > > > > > > > But that's different in two ways:
> > > > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > > > >     with the device
> > > > > > > >
> > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > > > but call them from cls_bpf.
> > > > > > > >
> > > > > > >
> > > > > > > This is not far off from what is envisioned today in the discussions.
> > > > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > > > Meaning a device that is capable of handling multiple programs can
> > > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > > > >
> > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > > > identity (or a signature of some kind should be generated by the
> > > > > > > compiler). So the upward propagation could be tied to discovering
> > > > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > > > program via those tuples via netlink (as we do currently).
> > > > > > >
> > > > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > > > "shipping" things or calling them for support) - so should be
> > > > > > > sufficient to just discover what is in the binary and send the correct
> > > > > > > control messages down.
> > > > > > >
> > > > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > > > the runtime API? I seem to recall they had some distinction...
> > > > > > >
> > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > > > is netlink and those could be written on top of netlink, there's no
> > > > > > > controversy there.
> > > > > > > So the starting point is defining the datapath using P4, generating
> > > > > > > the binary blob and whatever constraints needed using the vendor
> > > > > > > backend and for s/w equivalent generating the eBPF datapath.
> > > > > > >
> > > > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > > > >
> > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > > > protocols is that it's hard to make money on them :)
> > > > > > >
> > > > > > > That's a big motivation without a doubt, but also there are people
> > > > > > > that want to experiment with things. One of the craziest examples we
> > > > > > > have is someone who created a P4 program for "in network calculator",
> > > > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > > > an operator using custom headers, it does the math and responds with a
> > > > > > > result in a new header. By itself this program is a toy but it
> > > > > > > demonstrates that if one wanted to, they could have something custom
> > > > > > > in hardware and/or kernel datapath.
> > > > > >
> > > > > > Jamal,
> > > > > >
> > > > > > Given how long P4 has been around it's surprising that the best
> > > > > > publicly available code example is "the network calculator" toy.
> > > > >
> > > > > Come on Tom ;-> That was just an example of something "crazy" to
> > > > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > > > today. You are probably being facetious - There are some serious
> > > > > publicly available projects out there, some of which I quote on the
> > > > > cover letter (like DASH).
> > > >
> > > > Shameless plug. I have a more crazy example with bpf:
> > > >
> > > > https://github.com/fomichev/xdp-btc-miner
> > > >
> > >
> > > Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> > > to P4 we'd need the sha256 in h/w (which most of these vendors have
> > > already). Is there any other acceleration would you need? Would have
> > > been more fun if you invented you own headers too ;->
> >
> > Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))
> 
> This part is straight forward.
> 
> > is one thing. And the other is some way to compare that sha256 vs some
> > hard-coded (difficulty) number (as a 256-byte uint).
> 
> The compiler may have issues with this comparison - will have to look
> (I am pretty sure it's fixable though).
> 
> 
> >  But I have no
> > clue how well that maps into declarative p4 language. Most likely
> > possible if you're saying that the calculator is possible?
> 
> The calculator basically is written as a set of match-action tables.
> You parse your header, construct a key based on the operator field of
> the header (eg "+"),  invoke an action which takes the operands from
> the headers(eg "1" and "2"), the action returns you results(3"). You
> stash the result in a new packet and send it back to the source.
> 
> So my thinking is the computation you need would be modelled on an action.
> 
> > I'm assuming that even sha256 can possibly be implemented in p4 without
> > any extra support from the vendor? It's just a bunch of xors and
> > rotations over a fix-sized input buffer.

[..]

> True,  and I think those would be fast. But if the h/w offers it as an
> interface why not.
> It's not that you are running out of instruction space - and my memory
> is hazy - but iirc, there is sha256 support in the kernel Crypto API -
> does it not make sense to kfunc into that?

Oh yeah, that's definitely a better path if somebody were to do it
"properly". It's still fun, though, to see how far we can push
the bpf vm/verifier without using any extra helpers :-D
Stanislav Fomichev March 4, 2024, 11:24 p.m. UTC | #36
On 03/04, Tom Herbert wrote:
> On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/03, Tom Herbert wrote:
> > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > > This is configurability versus programmability. The table driven
> > > > > approach as input (configurability) might work fine for generic
> > > > > match-action tables up to the point that tables are expressive enough
> > > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > > But the problem we quickly hit that eBPF is not offloadable to network
> > > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > > the declarative representation that parsers in the devices could
> > > > > consume (they're not CPUs running eBPF).
> > > > >
> > > > > I think the key here is what we mean by kernel offload. When we do
> > > > > kernel offload, is it the kernel implementation or the kernel
> > > > > functionality that's being offloaded? If it's the latter then we have
> > > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > > synchronize with that offload device that precisely supports the
> > > > > kernel functionality we'd like to offload. This can be done if both
> > > > > the kernel bits and programmed offload are derived from the same
> > > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > > backend using independent tool chains and program download. At
> > > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > > parser to the device if it matches the hash to that reported by the
> > > > > device
> > > >
> > > > Good points. If I understand you correctly you're saying that parsers
> > > > are more complex than just a basic parsing tree a'la u32.
> > >
> > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > > isn't conducive to u32. We also want the advantages of compiler
> > > optimizations to unroll loops, squash nodes in the parse graph, etc.
> > >
> > > > Then we can take this argument further. P4 has grown to encompass a lot
> > > > of functionality of quite complex devices. How do we square that with
> > > > the kernel functionality offload model. If the entire device is modeled,
> > > > including f.e. TSO, an offload would mean that the user has to write
> > > > a TSO implementation which they then load into TC? That seems odd.
> > > >
> > > > IOW I don't quite know how to square in my head the "total
> > > > functionality" with being a TC-based "plugin".
> > >
> > > Hi Jakub,
> > >
> > > I believe the solution is to replace kernel code with eBPF in cases
> > > where we need programmability. This effectively means that we would
> > > ship eBPF code as part of the kernel. So in the case of TSO, the
> > > kernel would include a standard implementation in eBPF that could be
> > > compiled into the kernel by default. The restricted C source code is
> > > tagged with a hash, so if someone wants to offload TSO they could
> > > compile the source into their target and retain the hash. At runtime
> > > it's a matter of querying the driver to see if the device supports the
> > > TSO program the kernel is running by comparing hash values. Scaling
> > > this, a device could support a catalogue of programs: TSO, LRO,
> > > parser, IPtables, etc., If the kernel can match the hash of its eBPF
> > > code to one reported by the driver then it can assume functionality is
> > > offloadable. This is an elaboration of "device features", but instead
> > > of the device telling us they think they support an adequate GRO
> > > implementation by reporting NETIF_F_GRO, the device would tell the
> > > kernel that they not only support GRO but they provide identical
> > > functionality of the kernel GRO (which IMO is the first requirement of
> > > kernel offload).
> > >
> > > Even before considering hardware offload, I think this approach
> > > addresses a more fundamental problem to make the kernel programmable.
> > > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > > which could be controlled by TC. This allows local customization of
> > > kernel features, but also is the simplest way to "patch" the kernel
> > > with security and bug fixes (nobody is ever excited to do a kernel
> >
> > [..]
> >
> > > rebase in their datacenter!). Flow dissector is a prime candidate for
> > > this, and I am still planning to replace it with an all eBPF program
> > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
> >
> > So you're suggesting to bundle (and extend)
> > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> > similar lines here. We load this program manually right now, shipping
> > and autoloading with the kernel will be easer.
> 
> Hi Stanislav,
> 
> Yes, I envision that we would have a standard implementation of
> flow-dissector in eBPF that is shipped with the kernel and autoloaded.
> However, for the front end source I want to move away from imperative
> code. As I mentioned in the presentation flow_dissector.c is spaghetti
> code and has been prone to bugs over the years especially whenever
> someone adds support for a new fringe protocol (I take the liberty to
> call it spaghetti code since I'm partially responsible for creating
> this mess ;-) ).
> 
> The problem is that parsers are much better represented by a
> declarative rather than an imperative representation. To that end, we
> defined PANDA which allows constructing a parser (parse graph) in data
> structures in C. We use the "PANDA parser" to compile C to restricted
> C code which looks more like eBPF in imperative code. With this method
> we abstract out all the bookkeeping that was often the source of bugs
> (like pulling up skbufs, checking length limits, etc.). The other
> advantage is that we're able to find a lot more optimizations if we
> start with a right representation of the problem.
> 
> If you're interested, the video presentation on this is in
> https://www.youtube.com/watch?v=zVnmVDSEoXc.

Oh, yeah, I've seen this one. Agreed that the C implementation is not
pleasant and generating a parser from some declarative spec is a better
idea.

From my pov, the biggest win we get from making the bpf flow dissector
pluggable is the fact that we can now actually write some tests for it
(and, maybe, fuzz it?). We should also probably spend more time properly
defining the behavior of the existing C implementation. We've seen
some interesting bugs like this one:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7
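
e.g. a test can just push a canned packet through the program with
BPF_PROG_TEST_RUN and check the returned flow keys; roughly this (from
memory, so the details may be off):

#include <bpf/bpf.h>
#include <linux/bpf.h>

/* Run the flow dissector prog once over a raw packet and read back
 * the dissected struct bpf_flow_keys via ctx_out */
int run_dissector_once(int prog_fd, const void *pkt, unsigned int pkt_len,
		       struct bpf_flow_keys *keys_out)
{
	LIBBPF_OPTS(bpf_test_run_opts, opts,
		.data_in = pkt,
		.data_size_in = pkt_len,
		.ctx_out = keys_out,
		.ctx_size_out = sizeof(*keys_out),
	);

	return bpf_prog_test_run_opts(prog_fd, &opts);
}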
Tom Herbert March 4, 2024, 11:50 p.m. UTC | #37
On Mon, Mar 4, 2024 at 3:24 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/04, Tom Herbert wrote:
> > On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/03, Tom Herbert wrote:
> > > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > > > This is configurability versus programmability. The table driven
> > > > > > approach as input (configurability) might work fine for generic
> > > > > > match-action tables up to the point that tables are expressive enough
> > > > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > > > But the problem we quickly hit that eBPF is not offloadable to network
> > > > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > > > the declarative representation that parsers in the devices could
> > > > > > consume (they're not CPUs running eBPF).
> > > > > >
> > > > > > I think the key here is what we mean by kernel offload. When we do
> > > > > > kernel offload, is it the kernel implementation or the kernel
> > > > > > functionality that's being offloaded? If it's the latter then we have
> > > > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > > > synchronize with that offload device that precisely supports the
> > > > > > kernel functionality we'd like to offload. This can be done if both
> > > > > > the kernel bits and programmed offload are derived from the same
> > > > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > > > backend using independent tool chains and program download. At
> > > > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > > > parser to the device if it matches the hash to that reported by the
> > > > > > device
> > > > >
> > > > > Good points. If I understand you correctly, you're saying that parsers
> > > > > are more complex than just a basic parsing tree a la u32.
> > > >
> > > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > > > isn't conducive to u32. We also want the advantages of compiler
> > > > optimizations to unroll loops, squash nodes in the parse graph, etc.
> > > >
> > > > > Then we can take this argument further. P4 has grown to encompass a lot
> > > > > of functionality of quite complex devices. How do we square that with
> > > > > the kernel functionality offload model? If the entire device is modeled,
> > > > > including f.e. TSO, an offload would mean that the user has to write
> > > > > a TSO implementation which they then load into TC? That seems odd.
> > > > >
> > > > > IOW I don't quite know how to square in my head the "total
> > > > > functionality" with being a TC-based "plugin".
> > > >
> > > > Hi Jakub,
> > > >
> > > > I believe the solution is to replace kernel code with eBPF in cases
> > > > where we need programmability. This effectively means that we would
> > > > ship eBPF code as part of the kernel. So in the case of TSO, the
> > > > kernel would include a standard implementation in eBPF that could be
> > > > compiled into the kernel by default. The restricted C source code is
> > > > tagged with a hash, so if someone wants to offload TSO they could
> > > > compile the source into their target and retain the hash. At runtime
> > > > it's a matter of querying the driver to see if the device supports the
> > > > TSO program the kernel is running by comparing hash values. Scaling
> > > > this, a device could support a catalogue of programs: TSO, LRO,
> > > > parser, IPtables, etc. If the kernel can match the hash of its eBPF
> > > > code to one reported by the driver then it can assume functionality is
> > > > offloadable. This is an elaboration of "device features", but instead
> > > > of the device telling us it thinks it supports an adequate GRO
> > > > implementation by reporting NETIF_F_GRO, the device would tell the
> > > > kernel that it not only supports GRO but provides functionality
> > > > identical to the kernel's GRO (which IMO is the first requirement of
> > > > kernel offload).
> > > >
> > > > Even before considering hardware offload, I think this approach
> > > > addresses a more fundamental problem to make the kernel programmable.
> > > > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > > > which could be controlled by TC. This allows local customization of
> > > > kernel features, but is also the simplest way to "patch" the kernel
> > > > with security and bug fixes (nobody is ever excited to do a kernel
> > >
> > > [..]
> > >
> > > > rebase in their datacenter!). Flow dissector is a prime candidate for
> > > > this, and I am still planning to replace it with an all-eBPF program
> > > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
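
As a rough illustration of the catalogue idea (nothing below is an
existing kernel or driver interface; the structures and the query
helper are made up purely for the sketch), the negotiation could look
roughly like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one entry per kernel function shipped as eBPF. */
struct ebpf_func_desc {
	const char *name;          /* "tso", "gro", "flow_dissector", ... */
	uint8_t src_sha1[20];      /* hash of the restricted-C source */
};

/* Hypothetical: the device reports the hashes of the programs it
 * implements natively, e.g. via a driver op at probe time.
 */
struct offload_catalogue {
	const struct ebpf_func_desc *funcs;
	size_t num_funcs;
};

static bool device_offloads(const struct offload_catalogue *dev,
			    const struct ebpf_func_desc *kern)
{
	for (size_t i = 0; i < dev->num_funcs; i++) {
		if (!strcmp(dev->funcs[i].name, kern->name) &&
		    !memcmp(dev->funcs[i].src_sha1, kern->src_sha1, 20))
			return true;   /* same source, so same functionality */
	}
	return false;                  /* keep running the eBPF version in the kernel */
}

The point is only that feature negotiation becomes "do we share the
same source hash" rather than "do we trust a feature flag".
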
> > >
> > > So you're suggesting to bundle (and extend)
> > > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> > > similar lines here. We load this program manually right now; shipping
> > > and autoloading it with the kernel will be easier.
> >
> > Hi Stanislav,
> >
> > Yes, I envision that we would have a standard implementation of
> > flow-dissector in eBPF that is shipped with the kernel and autoloaded.
> > However, for the front end source I want to move away from imperative
> > code. As I mentioned in the presentation, flow_dissector.c is spaghetti
> > code and has been prone to bugs over the years, especially whenever
> > someone adds support for a new fringe protocol (I take the liberty to
> > call it spaghetti code since I'm partially responsible for creating
> > this mess ;-) ).
> >
> > The problem is that parsers are much better represented by a
> > declarative rather than an imperative representation. To that end, we
> > defined PANDA which allows constructing a parser (parse graph) in data
> > structures in C. We use the "PANDA parser" to compile this into
> > restricted C code, which looks more like imperative eBPF code. With
> > this method we abstract out all the bookkeeping that was often the
> > source of bugs (like pulling up skbs, checking length limits, etc.).
> > The other advantage is that we're able to find a lot more
> > optimizations if we start with the right representation of the problem.
> >
> > If you're interested, the video presentation on this is in
> > https://www.youtube.com/watch?v=zVnmVDSEoXc.
>
> Oh, yeah, I've seen this one. Agreed that the C implementation is not
> pleasant and generating a parser from some declarative spec is a better
> idea.
>
> From my pov, the biggest win we get from making the bpf flow dissector
> pluggable is the fact that we can now actually write some tests for it

Yes, extracting out functions from the kernel allows them to be
independently unit tested. It's an even bigger win if the same source
code is used for offloading the functionality as I described. We can
call this "Test once, run anywhere!"

Tom

> (and, maybe, fuzz it?). We should also probably spend more time properly
> defining the behavior of the existing C implementation. We've seen
> some interesting bugs like this one:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7
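
To make the testing point concrete, here is a minimal user-space sketch
of exercising a flow dissector program through BPF_PROG_TEST_RUN,
patterned on the existing kernel selftests (the object file name and
the program name "_dissect" are assumptions borrowed from
tools/testing/selftests/bpf/progs/bpf_flow.c):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/bpf.h>
#include <stdio.h>

int main(void)
{
	struct bpf_object *obj = bpf_object__open_file("bpf_flow.o", NULL);
	struct bpf_flow_keys keys = {};
	unsigned char pkt[64] = {0};  /* fill in a hand-built Eth/IPv4/TCP frame */

	if (!obj || bpf_object__load(obj))
		return 1;

	struct bpf_program *prog =
		bpf_object__find_program_by_name(obj, "_dissect");
	if (!prog)
		return 1;

	LIBBPF_OPTS(bpf_test_run_opts, opts,
		.data_in = pkt,
		.data_size_in = sizeof(pkt),
		.data_out = &keys,
		.data_size_out = sizeof(keys),
		.repeat = 1,
	);

	/* Run the dissector on the buffer; the resulting flow keys come
	 * back in data_out, so assertions (or a fuzzer) can live entirely
	 * in user space with no traffic generation.
	 */
	if (bpf_prog_test_run_opts(bpf_program__fd(prog), &opts))
		return 1;

	printf("nhoff=%d thoff=%d ip_proto=%d\n",
	       keys.nhoff, keys.thoff, keys.ip_proto);
	return 0;
}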