[net-next,v16,00/15] Introducing P4TC (series 1)

Message-ID: <20240410140141.495384-1-jhs@mojatatu.com>
From: Jamal Hadi Salim <jhs@mojatatu.com>
To: netdev@vger.kernel.org
Cc: deb.chatterjee@intel.com, anjali.singhai@intel.com, namrata.limaye@intel.com, tom@sipanda.io, mleitner@redhat.com, Mahesh.Shirshyad@amd.com, tomasz.osinski@intel.com, jiri@resnulli.us, xiyou.wangcong@gmail.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, vladbu@nvidia.com, horms@kernel.org, khalidm@nvidia.com, toke@redhat.com, victor@mojatatu.com, pctammela@mojatatu.com, Vipin.Jain@amd.com, dan.daly@intel.com, andy.fingerhut@gmail.com, chris.sommers@keysight.com, mattyk@nvidia.com, bpf@vger.kernel.org
Subject: [PATCH net-next v16 00/15] Introducing P4TC (series 1)
Date: Wed, 10 Apr 2024 10:01:26 -0400
Series: Introducing P4TC (series 1)
  [net-next,v16,00/15] Introducing P4TC (series 1)
  [net-next,v16,01/15] net: sched: act_api: Introduce P4 actions list
  [net-next,v16,02/15] net/sched: act_api: increase action kind string length
  [net-next,v16,03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions
  [net-next,v16,04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callba…
  [net-next,v16,05/15] net: sched: act_api: Add support for preallocated P4 action instances
  [net-next,v16,06/15] p4tc: add P4 data types
  [net-next,v16,07/15] p4tc: add template API
  [net-next,v16,08/15] p4tc: add template pipeline create, get, update, delete
  [net-next,v16,09/15] p4tc: add template action create, update, delete, get, flush and dump
  [net-next,v16,10/15] p4tc: add runtime action support
  [net-next,v16,11/15] p4tc: add template table create, update, delete, get, flush and dump
  [net-next,v16,12/15] p4tc: add runtime table entry create and update
  [net-next,v16,13/15] p4tc: add runtime table entry get, delete, flush and dump
  [net-next,v16,14/15] p4tc: add set of P4TC table kfuncs
  [net-next,v16,15/15] p4tc: add P4 classifier

Jamal Hadi Salim April 10, 2024, 2:01 p.m. UTC

This is the first patchset of two. In this patchset we are submitting 15
patches which cover the minimal viable P4 PNA architecture.
Please, if you want to discuss a slightly tangential subject like offload or
even your politics then start another thread with a different subject line.
The way you do it is to change the subject line to for example
"<Your New Subject Here> (WAS: <original subject line here>)".

In this cover letter I am restoring text I took out in V10 which stated "our
requirements".

The only change that v16 makes is to add a nack to patch 14 on kfuncs
from Daniel and John. We strongly disagree with the nack; unfortunately I
have to rehash what's already in the cover letter and has been discussed over
and over and over again:

1) P4TC uses the TC model - therefore the design is centred around TC filters,
actions etc. It means a unified TC control via netlink for s/w + h/w twins.
It means the P4 objects(tables, actions, externs, etc) and associated data
are owned by P4TC. None of the other "innovations" that are divorced from
TC such as tcx make any sense to solving the engineering problem at
stake. And therefore the argument that "tc actions and filters are a
mistake or inferior and you have to use what we innovated" is a
non-starter and both arrogant and condescending. We use eBPF as an infra
tool not as the answer looking for a question. Sorry.

2) We use kfuncs to access the P4 objects for the s/w datapath. AFAIK,
kfuncs contributions do not have to be sent to the ebpf mailing list
for review or approval. In fact, kfuncs can be implemented in a kernel
module and do not need to be upstreamed. But it is "encouraged to
upstream for sharing reasons". So somehow picking when you want to
move the goal posts for political nack purposes is abuse of power.
For our work there are certain features that need to be upstreamed so
the community can have full access to say the P4 PNA architecture and
not need to install oot kernel modules.
For this reason, we need to push the kfuncs as part of the series.
It does not make sense to make them oot.

3) Just a reminder: This code is entirely in the TC domain and does not
make any changes to ebpf code. So for people from the ebpf domain who
do not maintain the TC code to step in and nack TC patches is most
certainly overstepping.

__Description of these Patches__

These Patches are constrained entirely within the TC domain with very tiny
changes made to TC core code in patch 1-5. eBPF is used as an infrastructure
component for the software datapath and no changes are made to any eBPF code,
only kfuncs are introduced in patch 14.

Patch #1 adds infrastructure for per-netns P4 actions that can be created on
an as-needed basis for the P4 program requirement. This patch makes a small
incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have
no effect on the classical tc action (example patch#2 just increases the size
of the action names from 16->64B).
Patch 5 adds infrastructure support for preallocation of dynamic actions
needed for P4.

The core P4TC code implements several P4 objects.
1) Patch #6 introduces P4 data types which are consumed by the rest of the
code
2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD
commands for P4 pipelines.
4) Patch #9 introduces the action templates and associated CRUD commands.
5) Patch #10 introduces the action runtime infrastructure.
6) Patch #11 introduces the concept of P4 table templates and associated
CRUD commands for tables.
7) Patch #12 introduces runtime table entry infra and associated CU
commands.
8) Patch #13 introduces runtime table entry infra and associated RD
commands.
9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
10) Patch #15 introduces the TC classifier P4 used at runtime.

There are a few more patches not in this patchset that deal with externs,
test cases, etc.

What is P4?
-----------

The Programming Protocol-independent Packet Processors (P4) is an open
source, domain-specific programming language for specifying data plane
behavior.

The current P4 landscape includes an extensive range of deployments,
products, projects and services, etc[9][12]. Two major NIC vendors,
Intel[10] and AMD[11] currently offer P4-native NICs. P4 is currently
curated by the Linux Foundation[9].

A lot more on why P4 - see small treatise here:[4].

What is P4TC?
-------------

P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4
program and its associated objects and state are attached to a kernel
_netns_ structure.
IOW, if we had two programs across netns' or within a netns they have no
visibility to each other's objects (unlike for example TC actions whose
kinds are "global" in nature or eBPF maps vis-a-vis bpftool).

P4TC builds on top of many years of Linux TC experiences of a netlink
control path interface coupled with a software datapath with an equivalent
offloadable hardware datapath. In this patch series we are focussing only
on the s/w datapath. The s/w and h/w path equivalence that TC provides is
relevant for a primary use case of P4 where some (currently) large consumers
of NICs provide vendors their datapath specs in P4. In such a case one could
generate specified datapaths in s/w and test/validate the requirements
before hardware acquisition(example [12]).

Unlike other approaches such as TC Flower which require kernel and user
space changes when new datapath objects like packet headers are introduced
P4TC requires zero kernel or user space changes. We refer to this as:
_kernel and user space code change independence_.
Meaning:
A P4 program describes headers, how to parse, etc alongside prescribing
the datapath processing logic; the compiler uses the P4 program as input
and generates several artifacts which are then loaded into the kernel to
manifest the intended datapath. In addition to the generated datapath,
control path constructs are generated. The process is described further
below in "P4TC Workflow".

Some History
------------

There have been many discussions and meetings within the community since
about 2015 in regards to P4 over TC[2] and we are finally proving to the
naysayers that we do get stuff done!

A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

__P4TC Architecture__

The current architecture was described at netdevconf 0x17[14] and if you
prefer academic conference papers, a short paper is available here[15].

There are 4 parts:

1) A Template CRUD provisioning API for manifesting a P4 program and its
associated objects in the kernel. The template provisioning API uses
netlink. See patch in part 2.

2) A Runtime CRUD+ API code which is used for controlling the different
runtime behavior of the P4 objects. The runtime API uses netlink. See notes
further down. See patch descriptions...

3) P4 objects and their control interfaces: tables, actions, externs, etc.
Any object that requires control plane interaction resides in the TC domain
and is subject to the CRUD runtime API. The intended goal is to make use
of the tc semantics of skip_sw/hw to target P4 program objects either in s/w
or h/w.

4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
by a compiler based on the P4 spec. When accessing any P4 object that
requires control plane interfaces, the eBPF code accesses the P4TC side
from #3 above using kfuncs.

The generated eBPF code is derived from [13] with enhancements and fixes to
meet our requirements.

__P4TC Workflow__

The Development and instantiation workflow for P4TC is as follows:

A) A developer writes a P4 program, "myprog"

B) Compiles it using the P4C compiler[8]. The compiler generates 3
outputs:

a) A shell script which forms template definitions for the different P4
objects "myprog" utilizes (tables, externs, actions etc). See #1
above

b) The parser and the rest of the datapath are generated as eBPF and
need to be compiled into binaries. At the moment the parser and the
main control block are generated as separate eBPF programs but this
could change in the future (without affecting any kernel code).
See #4 above.

c) A json introspection file used for the control plane
(by iproute2/tc).

C) At this point the artifacts from #1,#4 could be handed to an operator
(the operator could be the same person as the developer from #A, #B).

i) For the eBPF part, either the operator is handed an ebpf binary or
source which they compile at this point into a binary.
The operator executes the shell script(s) to manifest the functional
"myprog" into the kernel.

ii) The operator instantiates "myprog" pipeline via the tc P4 filter
to ingress/egress (depending on P4 arch) of one or more netdevs/ports
(illustrated below as "block 22").

Example instantiation where the parser is a separate action:
"tc filter add block 22 ingress protocol all prio 10 \
p4 pname myprog \
action bpf obj $PARSER.o section p4tc/parse \
action bpf obj $PROGNAME.o section p4tc/main"

See individual patches in partc for more examples tc vs xdp etc. Also see
section on "challenges" (further below on this cover letter).

Once "myprog" P4 program is instantiated one can start performing operations
on table entries and/or actions at runtime as described below.

__P4TC Runtime Control Path__

The control interface builds on past tc experience and tries to get things
right from the beginning (example filtering is separated from depending
on existing object TLVs and made generic); also the code is written in
such a way it is mostly lockless.

The P4TC control interface, using netlink, provides what we call a CRUDPS
abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
Publish. From a high level PoV the following describes a conformant high
level API (both on netlink data model and code level):

Create(</path/to/object, DATA>+)
Read(</path/to/object>, [optional filter])
Update(</path/to/object, DATA>+)
Delete(</path/to/object>, [optional filter])
Subscribe(</path/to/object>, [optional filter])

Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object"
points to a table then a "Delete" implies "flush" and a "Read" implies dump
but if it points to an entry (by specifying a key) then "Delete" implies
deleting an entry and "Read" implies reading that single entry. It should
be noted that both "Delete" and "Read" take an optional filter parameter.
The filter can define further refinements to what the control plane wants
read or deleted.
"Subscribe" uses built in netlink event management. It, as well, takes a
filter which can further refine what events get generated to the control
plane (taken out of this patchset, to be re-added with consideration of
[16]).
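
The path-based semantics described above (a table path makes "Read" a dump
and "Delete" a flush, an entry path narrows both to one entry, and both take
an optional filter) can be modelled in a few lines of Python. This is only an
illustrative sketch of the semantics and not the netlink implementation: the
Table class, its method names and the dict-backed entry store are all
hypothetical.

```python
from typing import Callable, Dict, Optional

Entry = Dict[str, str]
Filter = Callable[[str, Entry], bool]


class Table:
    """Toy model of one P4 table (hypothetical, NOT the kernel structure)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def create(self, key: str, data: Entry) -> None:
        if key in self.entries:
            raise KeyError(f"entry {key} already exists")
        self.entries[key] = data

    def read(self, key: Optional[str] = None,
             flt: Optional[Filter] = None) -> Dict[str, Entry]:
        # No key: the path points at the table, so Read acts as "dump".
        # With a key: the path points at one entry, so Read acts as "get".
        selected = self.entries if key is None else {key: self.entries[key]}
        return {k: v for k, v in selected.items()
                if flt is None or flt(k, v)}

    def delete(self, key: Optional[str] = None,
               flt: Optional[Filter] = None) -> int:
        # No key: Delete acts as "flush", optionally narrowed by the filter.
        doomed = list(self.read(key, flt))
        for k in doomed:
            del self.entries[k]
        return len(doomed)
```

Calling read() or delete() with no key therefore covers what would otherwise
be separate "dump" and "flush" commands, which is the point of not treating
them as special.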

Let's show some runtime samples:

..create an entry, if we match ip address 10.0.1.2 send packet out eno1
tc p4ctrl create myprog/table/mytable \
dstAddr 10.0.1.2/32 action send_to_port param port eno1

..Batch create entries
tc p4ctrl create myprog/table/mytable \
entry dstAddr 10.1.1.2/32 action send_to_port param port eno1 \
entry dstAddr 10.1.10.2/32 action send_to_port param port eno10 \
entry dstAddr 10.0.2.2/32 action send_to_port param port eno2

..Get an entry (note "read" is interchangeably used as "get" which is a
common semantic in tc):
tc p4ctrl read myprog/table/mytable \
dstAddr 10.0.2.2/32

..dump mytable
tc p4ctrl read myprog/table/mytable

..dump mytable for all entries whose key fits within 10.1.0.0/16
tc p4ctrl read myprog/table/mytable \
filter key/myprog/mytable/dstAddr = 10.1.0.0/16

..dump all mytable entries which have an action send_to_port with param "eno1"
tc p4ctrl get myprog/table/mytable \
filter param/act/myprog/send_to_port/port = "eno1"

The filter expression is powerful, f.e you could say:

tc p4ctrl get myprog/table/mytable \
filter param/act/myprog/send_to_port/port = "eno1" && \
key/myprog/mytable/dstAddr = 10.1.0.0/16

It also works on built in metadata, example in the following case dumping
entries from mytable that have seen activity in the last 10 secs:
tc p4ctrl get myprog/table/mytable \
filter msecs_since < 10000
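
To make the meaning of the filter expressions above concrete, here is a small
Python model of how such predicates select entries. It is a sketch of the
semantics only: the entry dictionaries, the dump() helper and the predicate
names are hypothetical, and it makes no claim about how or where the real
filter is evaluated.

```python
import ipaddress


def key_within(entry_key: str, prefix: str) -> bool:
    """True if the entry's prefix falls inside the filter prefix."""
    return ipaddress.ip_network(entry_key).subnet_of(
        ipaddress.ip_network(prefix))


def dump(entries, *predicates):
    """Return the entries for which every predicate holds (logical AND)."""
    return [e for e in entries if all(p(e) for p in predicates)]


# Hypothetical table contents, loosely following the samples above.
entries = [
    {"dstAddr": "10.1.1.2/32", "port": "eno1", "msecs_since": 500},
    {"dstAddr": "10.1.10.2/32", "port": "eno10", "msecs_since": 20000},
    {"dstAddr": "10.0.2.2/32", "port": "eno2", "msecs_since": 100},
]


def in_10_1(e):   # key/myprog/mytable/dstAddr = 10.1.0.0/16
    return key_within(e["dstAddr"], "10.1.0.0/16")


def via_eno1(e):  # param/act/myprog/send_to_port/port = "eno1"
    return e["port"] == "eno1"


def recent(e):    # msecs_since < 10000
    return e["msecs_since"] < 10000
```

With these definitions, dump(entries, via_eno1, in_10_1) corresponds to the
"&&" example, and dump(entries, recent) to the activity-based one.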

Delete follows the same syntax as get/read, so for sake of brevity we won't
show more examples than how to flush mytable:

tc p4ctrl delete myprog/table/mytable

Mystery question: How do we achieve iproute2-kernel independence and
how does "tc p4ctrl" as a cli know how to program the kernel given an
arbitrary command line as shown above? Answer(s): It queries the
compiler generated json file in "P4TC Workflow" #B.c above. The json file
has enough details to figure out that we have a program called "myprog"
which has a table "mytable" that has a key name "dstAddr" which happens to
be type ipv4 address prefix. The json file also provides details to show
that the table "mytable" supports an action called "send_to_port" which
accepts a parameter "port" of type netdev (see the types patch for all
supported P4 data types).
All P4 components have names, IDs, and types - so this makes it very easy
to map into netlink.
Once user space tc/p4ctrl validates the human command input, it creates
standard binary netlink structures (TLVs etc) which are sent to the kernel.
See the runtime table entry patch for more details.
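
The lookup described above can be sketched as follows. Note that the JSON
layout here is invented purely for illustration and is not the schema the
compiler emits; the field names, the type strings "ipv4_prefix" and "dev",
and the resolve_create() helper are all assumptions. The sketch only shows
the idea: names on the command line are resolved to IDs and typed values,
which a real tool would then encode as netlink TLVs.

```python
import ipaddress
import json

# Hypothetical introspection content (NOT the real generated format).
INTROSPECTION = json.loads("""
{
  "pipeline": "myprog",
  "tables": [{
    "name": "mytable", "id": 1,
    "keys": [{"name": "dstAddr", "id": 1, "type": "ipv4_prefix"}],
    "actions": [{
      "name": "send_to_port", "id": 1,
      "params": [{"name": "port", "id": 1, "type": "dev"}]
    }]
  }]
}
""")


def find(items, name, what):
    """Look up an object by name, failing loudly on unknown names."""
    for item in items:
        if item["name"] == name:
            return item
    raise ValueError(f"unknown {what}: {name}")


def resolve_create(path, key_name, key_value, action_name,
                   param_name, param_value):
    """Validate a 'create' command and map its names to IDs."""
    prog, kind, table_name = path.split("/")
    if prog != INTROSPECTION["pipeline"] or kind != "table":
        raise ValueError(f"unknown path: {path}")
    table = find(INTROSPECTION["tables"], table_name, "table")
    key = find(table["keys"], key_name, "key")
    action = find(table["actions"], action_name, "action")
    param = find(action["params"], param_name, "param")
    if key["type"] == "ipv4_prefix":
        # Raises ValueError if the user typed a malformed prefix.
        key_value = ipaddress.ip_network(key_value)
    return {
        "table_id": table["id"],
        "key": (key["id"], key_value),
        "action_id": action["id"],
        "param": (param["id"], param["type"], param_value),
    }
```

Because everything the CLI needs comes from the introspection data, a new
program with new tables or keys needs a new JSON file but no change to this
code, which mirrors the independence argument made above.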

__P4TC Datapath__

The P4TC s/w datapath execution is generated as eBPF. Any objects that
require control interfacing reside in the "P4TC domain" and are controlled
via netlink as described above. Per packet execution and state and even
objects that do not require control interfacing (like the P4 parser) are
generated as eBPF.

A packet arriving on s/w ingress of any of the ports on block 22
(illustrated in section "P4TC Workflow" above) will first be exercised via
the (generated eBPF) parser component to extract the headers (the ip
destination address labeled "dstAddr" above in section "P4TC Runtime
Control Path"). The datapath then proceeds to use "dstAddr", table ID
and pipeline ID as a key to do a lookup in myprog's "mytable" which returns
the action params which are then used to execute the action in the eBPF
datapath (eventually sending out packets to eno1).
On a table miss, mytable's default miss action (not described) is executed.
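
The parse, lookup, action-or-miss sequence above can be illustrated with a
toy Python model. It is a sketch under several assumptions that are not taken
from the patches: the packet is a bare IPv4 header with no link-layer header,
the table uses longest-prefix match, and the miss action is called "drop".
The real datapath is generated eBPF calling into P4TC via kfuncs; none of
these class or function names exist there.

```python
import ipaddress


class LpmTable:
    """Toy longest-prefix-match table with a default miss action."""

    def __init__(self, default_action):
        self.entries = []  # list of (network, action name, params)
        self.default_action = default_action

    def add(self, prefix, action, **params):
        self.entries.append((ipaddress.ip_network(prefix), action, params))

    def lookup(self, addr):
        matches = [e for e in self.entries if addr in e[0]]
        if not matches:
            # Table miss: fall back to the default miss action.
            return self.default_action, {}
        # Hit: the most specific (longest) prefix wins.
        _, action, params = max(matches, key=lambda e: e[0].prefixlen)
        return action, params


def process(packet: bytes, table: LpmTable):
    """'Parser' plus lookup: extract dstAddr, then consult the table."""
    # In an IPv4 header the destination address occupies bytes 16..19.
    dst = ipaddress.ip_address(packet[16:20])
    return table.lookup(dst)
```

For example, after table.add("10.0.1.2/32", "send_to_port", port="eno1"), a
packet whose destination is 10.0.1.2 yields the send_to_port action with its
parameters, while any other destination yields the default miss action.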

__Testing__

Speaking of testing - we have 200-300 tdc test cases (which will be in the
second patchset).
These tests are run on our CICD system on pull requests and after commits
are approved. The CICD does a lot of other tests (more since v2, thanks to
Simon's input) including:
checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on
x86, ARM64 and emulated BE via qemu s390. We trigger performance
testing in the CICD to catch performance regressions (currently only on
the control path, but in the future for the datapath).
Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on
memory sanitizer but recently added support for concurrency sanitizer.
Before main releases we ensure each patch will compile on its own to help
in git bisect and run the xmas tree tool. We eventually put the code via
coverity.

In addition we are working on enabling a tool that will take a P4 program,
run it through the compiler, and generate permutations of traffic patterns
via symbolic execution that will test both positive and negative datapath
code paths. The test generator tool integration is still work in progress.
Also: We have other code that test parallelization etc which we are trying
to find a fit for in the kernel tree's testing infra.

__Restating Our Requirements__

This code is not intrusive at all because it only touches TC.
We would like to emphasize that we see eBPF as _infrastructure tooling
available to us and not the end goal_. Please help us with technical input
on for example how we can do better kfuncs, etc. If you want to critique,
then our requirements should be your guide and please be considerate that
this is about P4, not eBPF. IOW:
We would appreciate technical commentary instead of bikeshedding on how
_you_ would have implemented this probably with more eBPF or some other
clever tricks. It is sad to see there was zero input from anyone in the eBPF
world for 7 RFC postings (in a period of 9 months).
If I am ranting here it is because we have spent over a year now on this
topic - we have taken the initial input and have given you eBPF. So let's
make progress please.

The initial release was presented in October 2022[20] and RFC in January
2023 had a "scriptable" datapath (the idea built on the u32 classifier[17]
and pedit action[18] approach). Post RFC V1, we made changes to fit the
feedback to integrate eBPF to replace the "scriptable" software datapath.
On our part, the goal for the change was to meet folks in the middle as a
compromise.
No regrets on the journey after all the effort, because we ended up
getting XDP which was not in the original picture. Some of our efforts are
captured at [1][3] and in the patch history.

In this section we review the original scriptable version against the
current implementation which uses eBPF and in the process re-enumerate our
requirements.

To be very clear: Our intention for P4TC is to target _the TC crowd_.
Essentially developers and ops people already familiar and deploying TC
based infra.
More importantly the original intent for P4TC was to enable _ops folks_
more than devs (given code is being generated and doesn't need humans to
write it).

With TC, we gain the whole "familiar" package of match-action pipeline
abstraction++, meaning from the control plane(see discussion above) all
the way to the tooling infra, i.e iproute2/tc cli, netlink infra interface
(request/response, event subscribe/multicast-publish, congestion control
etc), s/w and h/w symbiosis, the autonomous kernel control, etc.
The main advantage over vendor specific implementations(which is the current
alternative) is: with P4TC we have a singular vendor-neutral interface via
the kernel using well understood mechanisms that have gained learnings from
deployment experience.

So let's list some of these requirements and compare whether moving to eBPF
affected us or gave us an advantage.

0) Understood Control Plane semantics

This requirement is unaffected.
The control plane remains as netlink and therefore we get the classical
multi-user CRUD+Publish/subscribe APIs built in.

1) Must support SW/HW equivalence

This requirement is unaffected. The control plane is netlink. Any semantics
to select between sw and hw via skip_sw/hw semantics is maintained.

2) Supporting expressibility of the universe set of P4 progs

It is a must to support 100% of all possible P4 programs. In the past the
eBPF verifier, for example in [13], had to be worked around and even then
there are cases where we couldn't avoid path explosion when branching is
involved and failed to run. So we were skeptical about using eBPF to begin
with.
Kfuncs changed our minds. Note, there are still challenges running all
potential P4 programs at the XDP level - but the pipeline could be split
between XDP and TC in such cases. The compiler can be told to generate
pieces that run on XDP and others on TC (see examples).
Summary: This requirement is unaffected.

3) Operational usability

By maintaining the TC control plane (even in presence of eBPF datapath)
runtime aspects remain unchanged. So for our target audience of folks
who have deployed tc, including offloads, the comfort zone is unchanged.

There is some loss in operational usability because we now have more knobs:
the extra compilation, loading and syncing of ebpf binaries, etc.
IOW, I can no longer just ship someone a shell script(ascii) in an email
and say "go run this and "myprog" will just work".

4) Operational and development Debuggability

If something goes wrong, the tc craftsperson is now required to have
additional knowledge of eBPF code and process.
Our intent is to compensate this challenge with debug tools that ease the
craftsperson's debugging.

5) Opportunity for rapid prototyping of new ideas

This is not exactly a requirement but something that became a useful
feature during the P4TC development phase. When the compiler was lagging
behind in features, we would often handcode the template scripts.
Then you would dump back the template from the kernel and do a diff to
ensure the kernel didn't get something wrong. Essentially, this was a nice
debug feature. During development, we wrote scripts that covered a range of
P4 architectures(PSA, V1, etc) which required no kernel code changes.

Over time the debug feature morphed into: a) start by handcoding scripts
then b) read it back and then c) generate the P4 code.
It means one could start with the template scripts outside of the
constraints of a P4 architecture spec(PNA/PSA) or even within a P4
architecture then test some ideas and eventually feed back the concepts to
the compiler authors or modify or create a new P4 architecture and share
with the P4 standards folks.

To summarize in presence of eBPF: The debugging idea is probably still
alive. One could dump, with proper tooling(bpftool for example), the
loaded eBPF code and be able to check for differences. But this is not the
interesting part.
The concept of going back from what's in the kernel to P4 is a lot more
difficult to implement mostly due to scoping of DSL vs general purpose. It
may be lost. We have been discussing ways to use BTF and embedding
annotations in the eBPF code and binary but more thought is required and we
welcome suggestions.

6) Supporting per namespace program

In P4TC every program and its associated objects have unique IDs which are
generated by the compiler. Multiple or the same P4 program(s) can run
independently in different namespaces alongside their appropriate state and
object instance parameterization (despite name or ID collision).
This requirement is still met (by virtue of keeping P4 program control
objects within the TC domain and attaching to a netns).
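
The per-namespace isolation in requirement 6 can be pictured with a minimal
Python model. This is a sketch of the scoping idea only; the Netns class, the
install() helper and the nested dict layout are hypothetical and do not
reflect the kernel data structures.

```python
class Netns:
    """Toy stand-in for a kernel netns holding its own pipeline registry."""

    def __init__(self, name):
        self.name = name
        self.pipelines = {}  # pipeline name -> pipeline state


def install(ns, pname, pipeline_id):
    """Attach a pipeline instance to one netns; names are unique per netns."""
    if pname in ns.pipelines:
        raise ValueError(f"{pname} already installed in {ns.name}")
    ns.pipelines[pname] = {"id": pipeline_id, "tables": {"mytable": {}}}
    return ns.pipelines[pname]


# The same program, with the same name and ID, in two namespaces.
ns1, ns2 = Netns("ns1"), Netns("ns2")
p1 = install(ns1, "myprog", 1)
p2 = install(ns2, "myprog", 1)

# State added in ns1 is invisible from ns2 despite the colliding name/ID.
p1["tables"]["mytable"]["10.0.1.2/32"] = ("send_to_port", "eno1")
```

Because each registry hangs off its own namespace object, the colliding name
and ID never meet, which is the property the requirement asks for.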

__References__

[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main
[13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
[14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
[15]https://dl.acm.org/doi/10.1145/3630047.3630193
[16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
[17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
[17.b]man tc-u32
[18]man tc-pedit
[19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
[20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
[20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html

--------
HISTORY
--------

Changes in Version 16
----------------------
1) Add Daniel's and John's Nack to patch 14

Changes in Version 15
----------------------
1) Add Alexei's Nack to patch 14

Changes in Version 14
----------------------
1) #UNDEF HWRITE/HREAD and remove unnecessary checks (Paolo)
2) Remove const cast added in v13 as a result of changes suggested
by Paolo (Marcelo)
3) Introduce type validate for s8 caught as a result of audit from #1
4) S/GFP_KERNEL/GFP_KERNEL_ACCOUNT for types and runtime objects (Paolo)
5) Syzkaller caught an invalid netlink attribute bug that has existed
since v5! As noted in patch0 we've been running syzkaller for months.
6) Add Marcelo's reviewed-by for patch 14 and Toke's ACK to the series.

Changes in Version 13
----------------------

1) Remove ops->print() from p4 types (Paolo).

2) Use mutex instead of rwlock for dynamic actions since rwlock is
discouraged these days(Paolo).

3) Constify action init_ops() ops parameter (Paolo).

4) Use struct sk_buff in kfunc instead of struct __sk_buff (Martin)
Use struct xdp_buff in kfunc instead of struct xdp_md (Martin)

5) Replace BTF_SET8_START with BTF_KFUNCS_START and replace
BTF_SET8_END with BTF_KFUNCS_END (Martin)

6) Add params__sz argument to all kfuncs to guard against future change
to parameter structures being passed between bpf and tc. For kfunc
xdp/bpf_p4tc_entry_create() we already had the max(5) allowed number of
parameters. To work around this we had to merge two structs together
in order to maintain the number of params to 5 (Martin).

7) Add more info on commit log to explain the relation between the kfuncs
and TC for patch #14 (Martin).

Changes in Version 12
----------------------

0) Introduce back 15 patches (v11 had 5)

1) From discussions with Daniel:
i) Remove the XDP programs association altogether. No refcounting. nothing.
ii) Remove prog type tc - everything is now an ebpf tc action.

2) s/PAD0/__pad0/g. Thanks to Marcelo.

3) Add extack to specify how many entries (N of M) specified in a batch for
any of requested Create/Update/Delete succeeded. Prior to this it would
only tell us the batch failed to complete without giving us details of
which of M failed. Added as a debug aid.

Changes in Version 11
----------------------
1) Split the series into two. Original patches 1-5 in this patchset. The rest
will go out after this is merged.

2) Change any references of IFNAMSIZ in the action code when referencing the
action name size to ACTNAMSIZ. Thanks to Marcelo.

Changes in Version 10
----------------------
1) A couple of patches from the earlier version were clean enough to submit,
so we did. This gave us room to split the two largest patches each into
two. Even though the split is not git-bisectable and really some of it didn't
make much sense (eg splitting a create, and update in one patch and delete and
get into another) we made sure each of the split patches compiled
independently. The idea is to reduce the number of lines of code to review
and when we get sufficient reviews we will put the splits together again.
(See patch #12 and #13 as well as patches #7 and #8).

2) Add more context in patch 0. Please READ!

3) Added dump/delete filters back to the code - we had taken them out in the
earlier patches to reduce the amount of code for review - but in retrospect
we feel they are important enough to push earlier rather than later.

Changes In version 9
---------------------

1) Remove the largest patch (externs) to ease review.

2) Break up action patches into two to ease review bringing down the patches
that need more scrutiny to 8 (the first 7 are almost trivial).

3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
to provide consistency(Jiri).

4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
by making them static. TBH, not sure if this is the right solution
but it makes sparse happy and hopefully someone will comment.

Changes In Version 8
---------------------

1) Fix all the patchwork warnings and improve our ci to catch them in the future

2) Reduce the number of patches to basic max(15) to ease review.

Changes In Version 7
-------------------------

0) First time removing the RFC tag!

1) Removed XDP cookie. It turns out as was pointed out by Toke(Thanks!) - that
using bpf links was sufficient to protect us from someone replacing or deleting
a eBPF program after it has been bound to a netdev.

2) Add some reviewed-bys from Vlad.

3) Small bug fixes from v6 based on testing for ebpf.

4) Added the counter extern as a sample extern. Illustrating this example because
it is slightly complex since it is possible to invoke it directly from
the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
It is not exactly the most efficient implementation (a reasonable counter impl
should be per-cpu).

Changes In RFC Version 6
-------------------------

1) Completed integration from scriptable view to eBPF. Completed integration
of externs.

2) Small bug fixes from v5 based on testing.

Changes In RFC Version 5
-------------------------

1) More integration from scriptable view to eBPF. Small bug fixes from last
integration.

2) More streamlining support of externs via kfunc (create-on-miss, etc)

3) eBPF linking for XDP.

There is more eBPF integration/streamlining coming (we are getting close to
conversion from scriptable domain).

Changes In RFC Version 4
-------------------------

1) More integration from scriptable to eBPF. Small bug fixes.

2) More streamlining support of externs via kfunc (one additional kfunc).

3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.

There is more eBPF integration coming. One thing we looked at that is not in
this patchset, but should be in the next, is the use of eBPF link in our loading
(see "challenge #1" further below).

Changes In RFC Version 3
-------------------------

These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that are used in V1 and 2 but no longer
used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:

1) Feedback we got in V2 was to try to stick to one of the two modes. In this
version we are taking one more step and going the path of mode2 only, whereas V2
had 2 modes.

2) The P4 Register extern is no longer standalone. Instead, as part of integrating
into eBPF we introduce another kfunc which encapsulates Register as part of the
extern interface.

3) We have improved our CI/CD to include tools pointed out to us by Simon. See
"Testing" further below. Thanks to Simon for that and other issues he caught.
Simon, we discussed issue [7] but decided to keep that log since we think
it is useful.

4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
re-discuss though; see: [5], [6].

5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.

6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
guaranteed that either A or B must exist; however, let's make smatch happy.
Thanks to Simon and Dan Carpenter.

Changes In RFC Version 2
-------------------------

Version 2 is the initial integration of the eBPF datapath.
We took into consideration suggestions provided to use eBPF and put effort into
analyzing eBPF as a datapath, which involved extensive testing.
We implemented 6 approaches with eBPF, ran performance analysis on each of the 6
vs the scriptable P4TC, and presented our results at the P4 2023 workshop in
Santa Clara [see: 1, 3]. We concluded that 2 of the approaches are sensible (4 if
you account for XDP or TC separately).

Conclusions from the exercise: We lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying the eBPF path please
scroll down to "Restating Our Requirements" and "Challenges".

This patch set presented two modes.
mode1: the parser is entirely based on eBPF - whereas the rest of the
SW datapath stays as _scriptable_ as in Version 1.
mode2: All of the kernel s/w datapath (including parser) is in eBPF.

The key ingredient for eBPF, that we did not have access to in the past, is
kfunc (it made a big difference for us to reconsider eBPF).

In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).

Jamal Hadi Salim (15):
net: sched: act_api: Introduce P4 actions list
net/sched: act_api: increase action kind string length
net/sched: act_api: Update tc_action_ops to account for P4 actions
net/sched: act_api: add struct p4tc_action_ops as a parameter to
lookup callback
net: sched: act_api: Add support for preallocated P4 action instances
p4tc: add P4 data types
p4tc: add template API
p4tc: add template pipeline create, get, update, delete
p4tc: add template action create, update, delete, get, flush and dump
p4tc: add runtime action support
p4tc: add template table create, update, delete, get, flush and dump
p4tc: add runtime table entry create and update
p4tc: add runtime table entry get, delete, flush and dump
p4tc: add set of P4TC table kfuncs
p4tc: add P4 classifier

include/linux/bitops.h | 1 +
include/net/act_api.h | 23 +-
include/net/p4tc.h | 714 +++++++
include/net/p4tc_types.h | 89 +
include/net/tc_act/p4tc.h | 79 +
include/uapi/linux/p4tc.h | 465 +++++
include/uapi/linux/pkt_cls.h | 15 +
include/uapi/linux/rtnetlink.h | 18 +
include/uapi/linux/tc_act/tc_p4.h | 11 +
net/sched/Kconfig | 23 +
net/sched/Makefile | 3 +
net/sched/act_api.c | 192 +-
net/sched/cls_api.c | 2 +-
net/sched/cls_p4.c | 305 +++
net/sched/p4tc/Makefile | 8 +
net/sched/p4tc/p4tc_action.c | 2419 +++++++++++++++++++++++
net/sched/p4tc/p4tc_bpf.c | 360 ++++
net/sched/p4tc/p4tc_filter.c | 1012 ++++++++++
net/sched/p4tc/p4tc_pipeline.c | 700 +++++++
net/sched/p4tc/p4tc_runtime_api.c | 145 ++
net/sched/p4tc/p4tc_table.c | 1820 +++++++++++++++++
net/sched/p4tc/p4tc_tbl_entry.c | 3071 +++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_tmpl_api.c | 440 +++++
net/sched/p4tc/p4tc_types.c | 1213 ++++++++++++
net/sched/p4tc/trace.c | 10 +
net/sched/p4tc/trace.h | 44 +
security/selinux/nlmsgtab.c | 10 +-
27 files changed, 13156 insertions(+), 36 deletions(-)
create mode 100644 include/net/p4tc.h
create mode 100644 include/net/p4tc_types.h
create mode 100644 include/net/tc_act/p4tc.h
create mode 100644 include/uapi/linux/p4tc.h
create mode 100644 include/uapi/linux/tc_act/tc_p4.h
create mode 100644 net/sched/cls_p4.c
create mode 100644 net/sched/p4tc/Makefile
create mode 100644 net/sched/p4tc/p4tc_action.c
create mode 100644 net/sched/p4tc/p4tc_bpf.c
create mode 100644 net/sched/p4tc/p4tc_filter.c
create mode 100644 net/sched/p4tc/p4tc_pipeline.c
create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
create mode 100644 net/sched/p4tc/p4tc_table.c
create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
create mode 100644 net/sched/p4tc/p4tc_types.c
create mode 100644 net/sched/p4tc/trace.c
create mode 100644 net/sched/p4tc/trace.h

Paolo Abeni April 11, 2024, 2:07 p.m. UTC | #1

On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> The only change that v16 makes is to add a nack to patch 14 on kfuncs
> from Daniel and John. We strongly disagree with the nack; unfortunately I
> have to rehash whats already in the cover letter and has been discussed over
> and over and over again:

I feel bad asking, but I have to, since all options I have here are
IMHO quite sub-optimal.

How bad would be dropping patch 14 and reworking the rest with
alternative s/w datapath? (I guess restoring it from oldest revision of
this series).

Paolo

Jamal Hadi Salim April 11, 2024, 4:24 p.m. UTC | #2

On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > have to rehash whats already in the cover letter and has been discussed over
> > and over and over again:
>
> I feel bad asking, but I have to, since all options I have here are
> IMHO quite sub-optimal.
>
> How bad would be dropping patch 14 and reworking the rest with
> alternative s/w datapath? (I guess restoring it from oldest revision of
> this series).

We want to keep using ebpf  for the s/w datapath if that is not clear by now.
I do not understand the obstructionism tbh. Are users allowed to use
kfuncs as part of infra or not? My understanding is yes.
This community is getting too political and my worry is that we have
corporatism creeping in like it is in standards bodies.
We started by not using ebpf. The same people who are objecting now
went up in arms and insisted we use ebpf. As a member of this
community, my motivation was to meet them in the middle by
compromising. We invested another year to move to that middle ground.
Now they are insisting we do not use ebpf because they dont like our
design or how we are using ebpf or maybe it's not a use case they have
any need for or some other politics. I lost track of the moving goal
posts. Open source is about solving your itch. This code is entirely
on TC, zero code changed in ebpf core. The new goalpost is based on
emotional outrage over use of functions. The whole thing is getting
extremely toxic.

cheers,
jamal

Jamal Hadi Salim April 19, 2024, 12:08 p.m. UTC | #3

On Thu, Apr 11, 2024 at 12:24 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > > have to rehash whats already in the cover letter and has been discussed over
> > > and over and over again:
> >
> > I feel bad asking, but I have to, since all options I have here are
> > IMHO quite sub-optimal.
> >
> > How bad would be dropping patch 14 and reworking the rest with
> > alternative s/w datapath? (I guess restoring it from oldest revision of
> > this series).
>
>
> We want to keep using ebpf  for the s/w datapath if that is not clear by now.
> I do not understand the obstructionism tbh. Are users allowed to use
> kfuncs as part of infra or not? My understanding is yes.
> This community is getting too political and my worry is that we have
> corporatism creeping in like it is in standards bodies.
> We started by not using ebpf. The same people who are objecting now
> went up in arms and insisted we use ebpf. As a member of this
> community, my motivation was to meet them in the middle by
> compromising. We invested another year to move to that middle ground.
> Now they are insisting we do not use ebpf because they dont like our
> design or how we are using ebpf or maybe it's not a use case they have
> any need for or some other politics. I lost track of the moving goal
> posts. Open source is about solving your itch. This code is entirely
> on TC, zero code changed in ebpf core. The new goalpost is based on
> emotional outrage over use of functions. The whole thing is getting
> extremely toxic.
>

Paolo,
Following up since no movement for a week now;->
I am going to give benefit of doubt that there was miscommunication or
misunderstanding for all the back and forth that has happened so far
with the nackers. I will provide a summary below on the main points
raised and then provide responses:

1) "Use maps"

It doesnt make sense for our requirement. The reason we are using TC
is because a) P4 has an excellent fit with TC match action paradigm b)
we are targeting both s/w and h/w and the TC model caters well for
this. The objects belong to TC, shared between s/w, h/w and control
plane (and netlink is the API). Maybe this diagram would help:
https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png

While the s/w part stands on its own accord (as elaborated many
times), for TC which has offloads, the s/w twin is introduced before
the h/w equivalent. This is what this series is doing.

2) "but ... it is not performant"
This has been brought up in regards to netlink and kfuncs. Performance
is a lower priority than P4 correctness and expressibility.
Netlink provides us the abstractions we need, it works with TC for
both s/w and h/w offload and has a lot of knowledge base for
expressing control plane APIs. We dont believe reinventing all that
makes sense.
Kfuncs are a means to an end - they provide us the gluing we need to
have an ebpf s/w datapath to the TC objects. Getting an extra
10-100Kpps is not a driving factor.

3) "but you did it wrong, here's how you do it..."

I gave up on responding to this - but do note this sentiment is a big
theme in the exchanges and consumed most of the electrons. We are
_never_ going to get any consensus with statements like "tc actions
are a mistake" or "use tcx".

4) "... drop the kfunc patch"

kfuncs essentially boil down to function calls. They don't require any
special handling by the eBPF verifier nor introduce new semantics to
eBPF. They are similar in nature to the already existing kfuncs
interacting with other kernel objects such as nf_conntrack.
The precedent (repeated in conferences and email threads multiple
times) is: kfuncs dont have to be sent to ebpf list or reviewed by
folks in the ebpf world. And we believe that rule applies to us as
well. Either kfuncs (and frankly ebpf) is infrastructure glue or it's
not.

Now for a little rant:

Open source is not a zero-sum game. Ebpf already coexists with
netfilter, tc, etc and various subsystems happily.
I hope our requirement is clear and i dont have to keep justifying why
P4 or relitigate over and over again why we need TC. Open source is
about scratching your itch and our itch is totally contained within
TC. I cant help but feel that this community is getting way too
pervasive with politics and obscure agendas. I understand agendas, I
just dont understand the zero-sum thinking.
My view is this series should still be applied with the nacks since it
sits entirely on its own silo within networking/TC (and has nothing to
do with ebpf).

cheers,
jamal

Alexei Starovoitov April 19, 2024, 2:23 p.m. UTC | #4

On Fri, Apr 19, 2024 at 5:08 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> My view is this series should still be applied with the nacks since it
> sits entirely on its own silo within networking/TC (and has nothing to
> do with ebpf).

My Nack applies to the whole set. The kernel doesn't need this anti-feature
for many reasons already explained.

Jamal Hadi Salim April 19, 2024, 2:33 p.m. UTC | #5

On Fri, Apr 19, 2024 at 10:23 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 5:08 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > My view is this series should still be applied with the nacks since it
> > sits entirely on its own silo within networking/TC (and has nothing to
> > do with ebpf).
>
> My Nack applies to the whole set. The kernel doesn't need this anti-feature
> for many reasons already explained.

Can you be more explicit? What else would you add to the list i posted above?

cheers,
jamal

Alexei Starovoitov April 19, 2024, 2:37 p.m. UTC | #6

On Fri, Apr 19, 2024 at 7:34 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Apr 19, 2024 at 10:23 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Apr 19, 2024 at 5:08 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > My view is this series should still be applied with the nacks since it
> > > sits entirely on its own silo within networking/TC (and has nothing to
> > > do with ebpf).
> >
> > My Nack applies to the whole set. The kernel doesn't need this anti-feature
> > for many reasons already explained.
>
> Can you be more explicit? What else would you add to the list i posted above?

Since you're refusing to work with us your only option
is to mention my Nack in the cover letter and send it
as a PR to Linus during the merge window.

Jamal Hadi Salim April 19, 2024, 2:45 p.m. UTC | #7

On Fri, Apr 19, 2024 at 10:37 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 7:34 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Fri, Apr 19, 2024 at 10:23 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Apr 19, 2024 at 5:08 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > My view is this series should still be applied with the nacks since it
> > > > sits entirely on its own silo within networking/TC (and has nothing to
> > > > do with ebpf).
> > >
> > > My Nack applies to the whole set. The kernel doesn't need this anti-feature
> > > for many reasons already explained.
> >
> > Can you be more explicit? What else would you add to the list i posted above?
>
> Since you're refusing to work with us your only option

Who is "us"? ebpf? I hope you are not speaking on behalf of the net subsystem.
You are entitled to your opinion (and aggression) - and there is a lot
of that with you, but this should be based on technical merit not your
emotions.
I summarized the reasons brought up by you and Cilium. Do you have
more to add to that list? If you do, please add to it.

> is to mention my Nack in the cover letter and send it
> as a PR to Linus during the merge window.

You dont get to decide that - I was talking to the networking people.

cheers,
jamal

Alexei Starovoitov April 19, 2024, 2:49 p.m. UTC | #8

On Fri, Apr 19, 2024 at 7:45 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> You dont get to decide that - I was talking to the networking people.

You think they want net-next PR to get derailed because of this?

Jamal Hadi Salim April 19, 2024, 2:55 p.m. UTC | #9

On Fri, Apr 19, 2024 at 10:49 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 7:45 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > You dont get to decide that - I was talking to the networking people.
>
> You think they want net-next PR to get derailed because of this?

Why would it be derailed?

cheers,
jamal

Paolo Abeni April 19, 2024, 5:20 p.m. UTC | #10

On Fri, 2024-04-19 at 08:08 -0400, Jamal Hadi Salim wrote:
> On Thu, Apr 11, 2024 at 12:24 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > 
> > On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > 
> > > On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > > > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > > > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > > > have to rehash whats already in the cover letter and has been discussed over
> > > > and over and over again:
> > > 
> > > I feel bad asking, but I have to, since all options I have here are
> > > IMHO quite sub-optimal.
> > > 
> > > How bad would be dropping patch 14 and reworking the rest with
> > > alternative s/w datapath? (I guess restoring it from oldest revision of
> > > this series).
> > 
> > 
> > We want to keep using ebpf  for the s/w datapath if that is not clear by now.
> > I do not understand the obstructionism tbh. Are users allowed to use
> > kfuncs as part of infra or not? My understanding is yes.
> > This community is getting too political and my worry is that we have
> > corporatism creeping in like it is in standards bodies.
> > We started by not using ebpf. The same people who are objecting now
> > went up in arms and insisted we use ebpf. As a member of this
> > community, my motivation was to meet them in the middle by
> > compromising. We invested another year to move to that middle ground.
> > Now they are insisting we do not use ebpf because they dont like our
> > design or how we are using ebpf or maybe it's not a use case they have
> > any need for or some other politics. I lost track of the moving goal
> > posts. Open source is about solving your itch. This code is entirely
> > on TC, zero code changed in ebpf core. The new goalpost is based on
> > emotional outrage over use of functions. The whole thing is getting
> > extremely toxic.
> > 
> 
> Paolo,
> Following up since no movement for a week now;->
> I am going to give benefit of doubt that there was miscommunication or
> misunderstanding for all the back and forth that has happened so far
> with the nackers. I will provide a summary below on the main points
> raised and then provide responses:
> 
> 1) "Use maps"
> 
> It doesnt make sense for our requirement. The reason we are using TC
> is because a) P4 has an excellent fit with TC match action paradigm b)
> we are targeting both s/w and h/w and the TC model caters well for
> this. The objects belong to TC, shared between s/w, h/w and control
> plane (and netlink is the API). Maybe this diagram would help:
> https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
> 
> While the s/w part stands on its own accord (as elaborated many
> times), for TC which has offloads, the s/w twin is introduced before
> the h/w equivalent. This is what this series is doing.
> 
> 2) "but ... it is not performant"
> This has been brought up in regards to netlink and kfuncs. Performance
> is a lower priority to P4 correctness and expressibility.
> Netlink provides us the abstractions we need, it works with TC for
> both s/w and h/w offload and has a lot of knowledge base for
> expressing control plane APIs. We dont believe reinventing all that
> makes sense.
> Kfuncs are a means to an end - they provide us the gluing we need to
> have an ebpf s/w datapath to the TC objects. Getting an extra
> 10-100Kpps is not a driving factor.
> 
> 3) "but you did it wrong, here's how you do it..."
> 
> I gave up on responding to this - but do note this sentiment is a big
> theme in the exchanges and consumed most of the electrons. We are
> _never_ going to get any consensus with statements like "tc actions
> are a mistake" or "use tcx".
> 
> 4) "... drop the kfunc patch"
> 
> kfuncs essentially boil down to function calls. They don't require any
> special handling by the eBPF verifier nor introduce new semantics to
> eBPF. They are similar in nature to the already existing kfuncs
> interacting with other kernel objects such as nf_conntrack.
> The precedence (repeated in conferences and email threads multiple
> times) is: kfuncs dont have to be sent to ebpf list or reviewed by
> folks in the ebpf world. And We believe that rule applies to us as
> well. Either kfuncs (and frankly ebpf) is infrastructure glue or it's
> not.
> 
> Now for a little rant:
> 
> Open source is not a zero-sum game. Ebpf already coexists with
> netfilter, tc, etc and various subsystems happily.
> I hope our requirement is clear and i dont have to keep justifying why
> P4 or relitigate over and over again why we need TC. Open source is
> about scratching your itch and our itch is totally contained within
> TC. I cant help but feel that this community is getting way too
> pervasive with politics and obscure agendas. I understand agendas, I
> just dont understand the zero-sum thinking.
> My view is this series should still be applied with the nacks since it
> sits entirely on its own silo within networking/TC (and has nothing to
> do with ebpf).

It's really hard for me - meaning I'll not do that - applying a series
that has been so fiercely nacked, especially given that the other
maintainers are not supporting it.
         
I really understand this is very bad for you.
         
Let me try to do an extreme attempt to find some middle ground between
this series and the bpf folks.

My understanding is that the most disliked item is the lifecycle for
the objects allocated via the kfunc(s). 

If I understand correctly, the hard requirement on bpf side is that any
kernel object allocated by kfunc must be released at program unload
time. p4tc postpones such allocation to recycle the structure.

While there are other arguments, my reading of the past few iterations
is that solving the above node should lift the nack, am I correct?

Could p4tc pre-allocate all the p4tc_table_entry_act_bpf_kern entries
and let p4a_runt_create_bpf() fail if the pool is empty? would that
satisfy the bpf requirement?

Otherwise could p4tc force free the p4tc_table_entry_act_bpf_kern at
unload time?

Thanks,

Paolo
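
For readers following the pre-allocation idea raised above, here is a minimal
userspace sketch of a fixed pool whose "get" fails once the pool is empty
instead of allocating. It is not the P4TC code: struct act_entry, struct
entry_pool, POOL_SIZE, pool_init(), pool_get() and pool_put() are all
hypothetical names used only for illustration, and a real kernel version
would also need locking or per-cpu handling.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4	/* hypothetical capacity, fixed when the pool is set up */

/* Stand-in for a per-entry action object; the fields are made up. */
struct act_entry {
	int act_id;
	int in_use;
};

/* Fixed pool: backing storage plus a stack of pointers to free slots. */
struct entry_pool {
	struct act_entry slots[POOL_SIZE];
	struct act_entry *free_list[POOL_SIZE];
	size_t nfree;
};

/* Pre-allocate everything up front: every slot starts on the free stack. */
void pool_init(struct entry_pool *p)
{
	size_t i;

	for (i = 0; i < POOL_SIZE; i++) {
		p->slots[i].act_id = 0;
		p->slots[i].in_use = 0;
		p->free_list[i] = &p->slots[i];
	}
	p->nfree = POOL_SIZE;
}

/* Take an entry; returns NULL rather than allocating when the pool is empty. */
struct act_entry *pool_get(struct entry_pool *p)
{
	struct act_entry *e;

	if (p->nfree == 0)
		return NULL;
	e = p->free_list[--p->nfree];
	e->in_use = 1;
	return e;
}

/* Hand an entry back so a later pool_get() can reuse it. */
void pool_put(struct entry_pool *p, struct act_entry *e)
{
	assert(e->in_use && p->nfree < POOL_SIZE);
	e->in_use = 0;
	p->free_list[p->nfree++] = e;
}
```

With this shape the datapath-side create path never allocates memory: it
either gets a recycled entry or sees NULL and reports failure, and the pool's
lifetime is tied to whoever called pool_init().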

Jamal Hadi Salim April 19, 2024, 6:01 p.m. UTC | #11

On Fri, Apr 19, 2024 at 1:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-04-19 at 08:08 -0400, Jamal Hadi Salim wrote:
> > On Thu, Apr 11, 2024 at 12:24 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > > > > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > > > > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > > > > have to rehash whats already in the cover letter and has been discussed over
> > > > > and over and over again:
> > > >
> > > > I feel bad asking, but I have to, since all options I have here are
> > > > IMHO quite sub-optimal.
> > > >
> > > > How bad would be dropping patch 14 and reworking the rest with
> > > > alternative s/w datapath? (I guess restoring it from oldest revision of
> > > > this series).
> > >
> > >
> > > We want to keep using ebpf  for the s/w datapath if that is not clear by now.
> > > I do not understand the obstructionism tbh. Are users allowed to use
> > > kfuncs as part of infra or not? My understanding is yes.
> > > This community is getting too political and my worry is that we have
> > > corporatism creeping in like it is in standards bodies.
> > > We started by not using ebpf. The same people who are objecting now
> > > went up in arms and insisted we use ebpf. As a member of this
> > > community, my motivation was to meet them in the middle by
> > > compromising. We invested another year to move to that middle ground.
> > > Now they are insisting we do not use ebpf because they dont like our
> > > design or how we are using ebpf or maybe it's not a use case they have
> > > any need for or some other politics. I lost track of the moving goal
> > > posts. Open source is about solving your itch. This code is entirely
> > > on TC, zero code changed in ebpf core. The new goalpost is based on
> > > emotional outrage over use of functions. The whole thing is getting
> > > extremely toxic.
> > >
> >
> > Paolo,
> > Following up since no movement for a week now;->
> > I am going to give benefit of doubt that there was miscommunication or
> > misunderstanding for all the back and forth that has happened so far
> > with the nackers. I will provide a summary below on the main points
> > raised and then provide responses:
> >
> > 1) "Use maps"
> >
> > It doesnt make sense for our requirement. The reason we are using TC
> > is because a) P4 has an excellent fit with TC match action paradigm b)
> > we are targeting both s/w and h/w and the TC model caters well for
> > this. The objects belong to TC, shared between s/w, h/w and control
> > plane (and netlink is the API). Maybe this diagram would help:
> > https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
> >
> > While the s/w part stands on its own accord (as elaborated many
> > times), for TC which has offloads, the s/w twin is introduced before
> > the h/w equivalent. This is what this series is doing.
> >
> > 2) "but ... it is not performant"
> > This has been brought up in regards to netlink and kfuncs. Performance
> > is a lower priority to P4 correctness and expressibility.
> > Netlink provides us the abstractions we need, it works with TC for
> > both s/w and h/w offload and has a lot of knowledge base for
> > expressing control plane APIs. We dont believe reinventing all that
> > makes sense.
> > Kfuncs are a means to an end - they provide us the gluing we need to
> > have an ebpf s/w datapath to the TC objects. Getting an extra
> > 10-100Kpps is not a driving factor.
> >
> > 3) "but you did it wrong, here's how you do it..."
> >
> > I gave up on responding to this - but do note this sentiment is a big
> > theme in the exchanges and consumed most of the electrons. We are
> > _never_ going to get any consensus with statements like "tc actions
> > are a mistake" or "use tcx".
> >
> > 4) "... drop the kfunc patch"
> >
> > kfuncs essentially boil down to function calls. They don't require any
> > special handling by the eBPF verifier nor introduce new semantics to
> > eBPF. They are similar in nature to the already existing kfuncs
> > interacting with other kernel objects such as nf_conntrack.
> > The precedence (repeated in conferences and email threads multiple
> > times) is: kfuncs dont have to be sent to ebpf list or reviewed by
> > folks in the ebpf world. And We believe that rule applies to us as
> > well. Either kfuncs (and frankly ebpf) is infrastructure glue or it's
> > not.
> >
> > Now for a little rant:
> >
> > Open source is not a zero-sum game. Ebpf already coexists with
> > netfilter, tc, etc and various subsystems happily.
> > I hope our requirement is clear and i dont have to keep justifying why
> > P4 or relitigate over and over again why we need TC. Open source is
> > about scratching your itch and our itch is totally contained within
> > TC. I cant help but feel that this community is getting way too
> > pervasive with politics and obscure agendas. I understand agendas, I
> > just dont understand the zero-sum thinking.
> > My view is this series should still be applied with the nacks since it
> > sits entirely on its own silo within networking/TC (and has nothing to
> > do with ebpf).
>
> It's really hard for me - meaning I'll not do that - applying a series
> that has been so fiercely nacked, especially given that the other
> maintainers are not supporting it.
>
> I really understand this is very bad for you.
>
> Let me try to do an extreme attempt to find some middle ground between
> this series and the bpf folks.
>
> My understanding is that the most disliked item is the lifecycle for
> the objects allocated via the kfunc(s).
>
> If I understand correctly, the hard requirement on bpf side is that any
> kernel object allocated by kfunc must be released at program unload
> time. p4tc postpone such allocation to recycle the structure.
>
> While there are other arguments, my reading of the past few iterations
> is that solving the above node should lift the nack, am I correct?
>
> Could p4tc pre-allocate all the p4tc_table_entry_act_bpf_kern entries
> and let p4a_runt_create_bpf() fail if the pool is empty? would that
> satisfy the bpf requirement?

Let me think about it and weigh the consequences.

> Otherwise could p4tc force free the p4tc_table_entry_act_bpf_kern at
> unload time?

This one wont work for us unfortunately. If we have entries added by
the control plane with skip_sw, the ebpf program being gone doesnt mean
they disappear.

cheers,
jamal

 there are use cases where no entry is loaded by the s/w datap

> Thanks,
>
> Paolo
>
>

Jamal Hadi Salim April 26, 2024, 5:12 p.m. UTC | #12

On Fri, Apr 19, 2024 at 2:01 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Apr 19, 2024 at 1:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Fri, 2024-04-19 at 08:08 -0400, Jamal Hadi Salim wrote:
> > > On Thu, Apr 11, 2024 at 12:24 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > > > > > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > > > > > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > > > > > have to rehash whats already in the cover letter and has been discussed over
> > > > > > and over and over again:
> > > > >
> > > > > I feel bad asking, but I have to, since all options I have here are
> > > > > IMHO quite sub-optimal.
> > > > >
> > > > > How bad would be dropping patch 14 and reworking the rest with
> > > > > alternative s/w datapath? (I guess restoring it from oldest revision of
> > > > > this series).
> > > >
> > > >
> > > > We want to keep using ebpf  for the s/w datapath if that is not clear by now.
> > > > I do not understand the obstructionism tbh. Are users allowed to use
> > > > kfuncs as part of infra or not? My understanding is yes.
> > > > This community is getting too political and my worry is that we have
> > > > corporatism creeping in like it is in standards bodies.
> > > > We started by not using ebpf. The same people who are objecting now
> > > > went up in arms and insisted we use ebpf. As a member of this
> > > > community, my motivation was to meet them in the middle by
> > > > compromising. We invested another year to move to that middle ground.
> > > > Now they are insisting we do not use ebpf because they dont like our
> > > > design or how we are using ebpf or maybe it's not a use case they have
> > > > any need for or some other politics. I lost track of the moving goal
> > > > posts. Open source is about solving your itch. This code is entirely
> > > > on TC, zero code changed in ebpf core. The new goalpost is based on
> > > > emotional outrage over use of functions. The whole thing is getting
> > > > extremely toxic.
> > > >
> > >
> > > Paolo,
> > > Following up since no movement for a week now;->
> > > I am going to give benefit of doubt that there was miscommunication or
> > > misunderstanding for all the back and forth that has happened so far
> > > with the nackers. I will provide a summary below on the main points
> > > raised and then provide responses:
> > >
> > > 1) "Use maps"
> > >
> > > It doesnt make sense for our requirement. The reason we are using TC
> > > is because a) P4 has an excellent fit with TC match action paradigm b)
> > > we are targeting both s/w and h/w and the TC model caters well for
> > > this. The objects belong to TC, shared between s/w, h/w and control
> > > plane (and netlink is the API). Maybe this diagram would help:
> > > https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
> > >
> > > While the s/w part stands on its own accord (as elaborated many
> > > times), for TC which has offloads, the s/w twin is introduced before
> > > the h/w equivalent. This is what this series is doing.
> > >
> > > 2) "but ... it is not performant"
> > > This has been brought up in regards to netlink and kfuncs. Performance
> > > is a lower priority to P4 correctness and expressibility.
> > > Netlink provides us the abstractions we need, it works with TC for
> > > both s/w and h/w offload and has a lot of knowledge base for
> > > expressing control plane APIs. We dont believe reinventing all that
> > > makes sense.
> > > Kfuncs are a means to an end - they provide us the gluing we need to
> > > have an ebpf s/w datapath to the TC objects. Getting an extra
> > > 10-100Kpps is not a driving factor.
> > >
> > > 3) "but you did it wrong, here's how you do it..."
> > >
> > > I gave up on responding to this - but do note this sentiment is a big
> > > theme in the exchanges and consumed most of the electrons. We are
> > > _never_ going to get any consensus with statements like "tc actions
> > > are a mistake" or "use tcx".
> > >
> > > 4) "... drop the kfunc patch"
> > >
> > > kfuncs essentially boil down to function calls. They don't require any
> > > special handling by the eBPF verifier nor introduce new semantics to
> > > eBPF. They are similar in nature to the already existing kfuncs
> > > interacting with other kernel objects such as nf_conntrack.
> > > The precedence (repeated in conferences and email threads multiple
> > > times) is: kfuncs dont have to be sent to ebpf list or reviewed by
> > > folks in the ebpf world. And We believe that rule applies to us as
> > > well. Either kfuncs (and frankly ebpf) is infrastructure glue or it's
> > > not.
> > >
> > > Now for a little rant:
> > >
> > > Open source is not a zero-sum game. Ebpf already coexists with
> > > netfilter, tc, etc and various subsystems happily.
> > > I hope our requirement is clear and i dont have to keep justifying why
> > > P4 or relitigate over and over again why we need TC. Open source is
> > > about scratching your itch and our itch is totally contained within
> > > TC. I cant help but feel that this community is getting way too
> > > pervasive with politics and obscure agendas. I understand agendas, I
> > > just dont understand the zero-sum thinking.
> > > My view is this series should still be applied with the nacks since it
> > > sits entirely on its own silo within networking/TC (and has nothing to
> > > do with ebpf).
> >
> > It's really hard for me - meaning I'll not do that - applying a series
> > that has been so fiercely nacked, especially given that the other
> > maintainers are not supporting it.
> >
> > I really understand this is very bad for you.
> >
> > Let me try to do an extreme attempt to find some middle ground between
> > this series and the bpf folks.
> >
> > My understanding is that the most disliked item is the lifecycle for
> > the objects allocated via the kfunc(s).
> >
> > If I understand correctly, the hard requirement on bpf side is that any
> > kernel object allocated by kfunc must be released at program unload
> > time. p4tc postpone such allocation to recycle the structure.
> >
> > While there are other arguments, my reading of the past few iterations
> > is that solving the above node should lift the nack, am I correct?
> >
> > Could p4tc pre-allocate all the p4tc_table_entry_act_bpf_kern entries
> > and let p4a_runt_create_bpf() fail if the pool is empty? would that
> > satisfy the bpf requirement?
>
> Let me think about it and weigh the consequences.
>

Sorry, was busy evaluating. Yes, we can enforce the memory allocation
constraints such that when the ebpf program is removed any entries
added by said ebpf program can be removed from the datapath.

> > Otherwise could p4tc force free the p4tc_table_entry_act_bpf_kern at
> > unload time?
>
> This one won't work for us unfortunately. If we have entries added by
> the control plane with skip_sw, just because the ebpf program is gone
> doesn't mean they disappear.

Just to clarify (the figure
https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
should help):
For P4 table objects, there are 3 types of entries:
1) created by the control path for the s/w datapath with skip_hw
2) created by the control path for the h/w datapath with skip_sw
3) dynamically created by the s/w datapath (ebpf), not far off from
   conntrack.
The only ones we can remove when the ebpf program goes away are those
of type #3.
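
The three entry types above can be sketched in plain C. This is only an illustration under stated assumptions: the names `entry_origin`, `table_entry` and `flush_on_prog_unload` are hypothetical and do not come from the P4TC patches, and the table is modelled as a flat array rather than the real TC objects. The point it shows is that a flush at program unload touches only the datapath-created entries and leaves both kinds of control-path entries in place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical origin tags; illustrative names, not taken from P4TC. */
enum entry_origin {
    ORIGIN_CTRL_SW,  /* type 1: control path, s/w datapath only (skip_hw) */
    ORIGIN_CTRL_HW,  /* type 2: control path, h/w datapath only (skip_sw) */
    ORIGIN_DATAPATH  /* type 3: created at runtime by the s/w datapath */
};

/* Simplified stand-in for a table entry. */
struct table_entry {
    int key;
    enum entry_origin origin;
    int live;        /* 1 while the entry is installed */
};

/*
 * Called when the s/w datapath program goes away: remove only the
 * entries that the datapath itself created. Control-path entries
 * (types 1 and 2) are left untouched. Returns the number removed.
 */
static size_t flush_on_prog_unload(struct table_entry *tbl, size_t n)
{
    size_t removed = 0;

    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].live && tbl[i].origin == ORIGIN_DATAPATH) {
            tbl[i].live = 0;
            removed++;
        }
    }
    return removed;
}
```

Calling `flush_on_prog_unload()` on a table holding one entry of each control-path type and two datapath-created entries would remove exactly the latter two; a second call would remove nothing.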

cheers,
jamal

Paolo Abeni April 26, 2024, 5:21 p.m. UTC | #13

On Fri, 2024-04-26 at 13:12 -0400, Jamal Hadi Salim wrote:
> Sorry, was busy evaluating. Yes, we can enforce the memory allocation
> constraints such that when the ebpf program is removed any entries
> added by said ebpf program can be removed from the datapath.

I suggested such changes based on my interpretation of this long
and complex discussion; I may have missed some or many relevant points.
@Alexei: could you please double check the above and eventually,
hopefully, confirm that such a change would lift your nacked-by?

Thanks!

Paolo

Alexei Starovoitov April 26, 2024, 5:43 p.m. UTC | #14

On Fri, Apr 26, 2024 at 10:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-04-26 at 13:12 -0400, Jamal Hadi Salim wrote:
> > Sorry, was busy evaluating. Yes, we can enforce the memory allocation
> > constraints such that when the ebpf program is removed any entries
> > added by said ebpf program can be removed from the datapath.
>
> I suggested such changes based on my interpretation of this long
> and complex discussion; I may have missed some or many relevant points.
> @Alexei: could you please double check the above and eventually,
> hopefully, confirm that such a change would lift your nacked-by?

No. The whole design is broken.
Remembering what was allocated by kfunc and freeing it later
is not fixing the design at all.
Sorry.

Jamal Hadi Salim April 26, 2024, 6:03 p.m. UTC | #15

On Fri, Apr 26, 2024 at 1:43 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 26, 2024 at 10:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Fri, 2024-04-26 at 13:12 -0400, Jamal Hadi Salim wrote:
> > > Sorry, was busy evaluating. Yes, we can enforce the memory allocation
> > > constraints such that when the ebpf program is removed any entries
> > > added by said ebpf program can be removed from the datapath.
> >
> > I suggested the such changes based on my interpretation of this long
> > and complex discussion, I can have missed some or many relevant points.
> > @Alexei: could you please double check the above and eventually,
> > hopefully, confirm that such change would lift your nacked-by?
>
> No. The whole design is broken.
> Remembering what was allocated by kfunc and freeing it later
> is not fixing the design at all.

Can you be a little less vague?
We are dealing with multiple domains here _including hw offloads_ and,
as mentioned a few times already, for that reason these objects
belong to the P4TC domain. If it wasn't clear, this diagram explains the
design:
https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
IOW, P4 objects (specifically, table entries in this discussion) may
be shared between s/w and/or h/w.
Note: there is no allocation done by the kfunc - it will just pick
from a fixed pool of pre-allocated entries. Where is the "design
broken" considering all this?
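
The "fixed pool of pre-allocated entries" idea can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: `entry_pool`, `pool_get`, `pool_put` and `POOL_SIZE` are hypothetical names, not the P4TC implementation, and locking is omitted. It shows the one property under discussion: taking an entry never allocates memory, and the request simply fails once the pool is exhausted.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* illustrative capacity, not a P4TC value */

/* Simplified stand-in for a pre-allocated entry. */
struct pool_entry {
    int in_use;
    int key;
};

/* All slots live inside the pool itself; nothing is allocated later. */
struct entry_pool {
    struct pool_entry slots[POOL_SIZE];
    size_t used;
};

static void pool_init(struct entry_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->slots[i].in_use = 0;
        p->slots[i].key = 0;
    }
    p->used = 0;
}

/* Hand out a free slot, or return NULL when the pool is empty. */
static struct pool_entry *pool_get(struct entry_pool *p, int key)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->slots[i].key = key;
            p->used++;
            return &p->slots[i];
        }
    }
    return NULL;      /* pool exhausted: the caller must handle failure */
}

/* Return a slot to the pool so it can be handed out again. */
static void pool_put(struct entry_pool *p, struct pool_entry *e)
{
    assert(p != NULL && e != NULL && e->in_use);
    e->in_use = 0;
    p->used--;
}
```

With this shape, the set of entries that can ever exist is bounded when the pool is created, which is the memory constraint the exchange above is about. A real implementation would also need concurrency control and a way to tie entries back to the program that created them.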

cheers,
jamal

Jamal Hadi Salim May 20, 2024, 3:34 p.m. UTC | #16

On Fri, Apr 26, 2024 at 2:03 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Apr 26, 2024 at 1:43 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Apr 26, 2024 at 10:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Fri, 2024-04-26 at 13:12 -0400, Jamal Hadi Salim wrote:
> > > > On Fri, Apr 19, 2024 at 2:01 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Fri, Apr 19, 2024 at 1:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, 2024-04-19 at 08:08 -0400, Jamal Hadi Salim wrote:
> > > > > > > On Thu, Apr 11, 2024 at 12:24 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2024 at 10:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2024-04-10 at 10:01 -0400, Jamal Hadi Salim wrote:
> > > > > > > > > > The only change that v16 makes is to add a nack to patch 14 on kfuncs
> > > > > > > > > > from Daniel and John. We strongly disagree with the nack; unfortunately I
> > > > > > > > > > have to rehash whats already in the cover letter and has been discussed over
> > > > > > > > > > and over and over again:
> > > > > > > > >
> > > > > > > > > I feel bad asking, but I have to, since all options I have here are
> > > > > > > > > IMHO quite sub-optimal.
> > > > > > > > >
> > > > > > > > > How bad would be dropping patch 14 and reworking the rest with
> > > > > > > > > alternative s/w datapath? (I guess restoring it from oldest revision of
> > > > > > > > > this series).
> > > > > > > >
> > > > > > > >
> > > > > > > > We want to keep using ebpf  for the s/w datapath if that is not clear by now.
> > > > > > > > I do not understand the obstructionism tbh. Are users allowed to use
> > > > > > > > kfuncs as part of infra or not? My understanding is yes.
> > > > > > > > This community is getting too political and my worry is that we have
> > > > > > > > corporatism creeping in like it is in standards bodies.
> > > > > > > > We started by not using ebpf. The same people who are objecting now
> > > > > > > > went up in arms and insisted we use ebpf. As a member of this
> > > > > > > > community, my motivation was to meet them in the middle by
> > > > > > > > compromising. We invested another year to move to that middle ground.
> > > > > > > > Now they are insisting we do not use ebpf because they dont like our
> > > > > > > > design or how we are using ebpf or maybe it's not a use case they have
> > > > > > > > any need for or some other politics. I lost track of the moving goal
> > > > > > > > posts. Open source is about solving your itch. This code is entirely
> > > > > > > > on TC, zero code changed in ebpf core. The new goalpost is based on
> > > > > > > > emotional outrage over use of functions. The whole thing is getting
> > > > > > > > extremely toxic.
> > > > > > > >
> > > > > > >
> > > > > > > Paolo,
> > > > > > > Following up since no movement for a week now;->
> > > > > > > I am going to give benefit of doubt that there was miscommunication or
> > > > > > > misunderstanding for all the back and forth that has happened so far
> > > > > > > with the nackers. I will provide a summary below on the main points
> > > > > > > raised and then provide responses:
> > > > > > >
> > > > > > > 1) "Use maps"
> > > > > > >
> > > > > > > It doesnt make sense for our requirement. The reason we are using TC
> > > > > > > is because a) P4 has an excellent fit with TC match action paradigm b)
> > > > > > > we are targeting both s/w and h/w and the TC model caters well for
> > > > > > > this. The objects belong to TC, shared between s/w, h/w and control
> > > > > > > plane (and netlink is the API). Maybe this diagram would help:
> > > > > > > https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
> > > > > > >
> > > > > > > While the s/w part stands on its own accord (as elaborated many
> > > > > > > times), for TC which has offloads, the s/w twin is introduced before
> > > > > > > the h/w equivalent. This is what this series is doing.
> > > > > > >
> > > > > > > 2) "but ... it is not performant"
> > > > > > > This has been brought up in regards to netlink and kfuncs. Performance
> > > > > > > is a lower priority to P4 correctness and expressibility.
> > > > > > > Netlink provides us the abstractions we need, it works with TC for
> > > > > > > both s/w and h/w offload and has a lot of knowledge base for
> > > > > > > expressing control plane APIs. We dont believe reinventing all that
> > > > > > > makes sense.
> > > > > > > Kfuncs are a means to an end - they provide us the gluing we need to
> > > > > > > have an ebpf s/w datapath to the TC objects. Getting an extra
> > > > > > > 10-100Kpps is not a driving factor.
> > > > > > >
> > > > > > > 3) "but you did it wrong, here's how you do it..."
> > > > > > >
> > > > > > > I gave up on responding to this - but do note this sentiment is a big
> > > > > > > theme in the exchanges and consumed most of the electrons. We are
> > > > > > > _never_ going to get any consensus with statements like "tc actions
> > > > > > > are a mistake" or "use tcx".
> > > > > > >
> > > > > > > 4) "... drop the kfunc patch"
> > > > > > >
> > > > > > > kfuncs essentially boil down to function calls. They don't require any
> > > > > > > special handling by the eBPF verifier nor introduce new semantics to
> > > > > > > eBPF. They are similar in nature to the already existing kfuncs
> > > > > > > interacting with other kernel objects such as nf_conntrack.
> > > > > > > The precedence (repeated in conferences and email threads multiple
> > > > > > > times) is: kfuncs dont have to be sent to ebpf list or reviewed by
> > > > > > > folks in the ebpf world. And We believe that rule applies to us as
> > > > > > > well. Either kfuncs (and frankly ebpf) is infrastructure glue or it's
> > > > > > > not.
> > > > > > >
> > > > > > > Now for a little rant:
> > > > > > >
> > > > > > > Open source is not a zero-sum game. Ebpf already coexists with
> > > > > > > netfilter, tc, etc and various subsystems happily.
> > > > > > > I hope our requirement is clear and i dont have to keep justifying why
> > > > > > > P4 or relitigate over and over again why we need TC. Open source is
> > > > > > > about scratching your itch and our itch is totally contained within
> > > > > > > TC. I cant help but feel that this community is getting way too
> > > > > > > pervasive with politics and obscure agendas. I understand agendas, I
> > > > > > > just dont understand the zero-sum thinking.
> > > > > > > My view is this series should still be applied with the nacks since it
> > > > > > > sits entirely on its own silo within networking/TC (and has nothing to
> > > > > > > do with ebpf).
> > > > > >
> > > > > > It's really hard for me - meaning I'll not do that - applying a series
> > > > > > that has been so fiercely nacked, especially given that the other
> > > > > > maintainers are not supporting it.
> > > > > >
> > > > > > I really understand this is very bad for you.
> > > > > >
> > > > > > Let me try to do an extreme attempt to find some middle ground between
> > > > > > this series and the bpf folks.
> > > > > >
> > > > > > My understanding is that the most disliked item is the lifecycle for
> > > > > > the objects allocated via the kfunc(s).
> > > > > >
> > > > > > If I understand correctly, the hard requirement on bpf side is that any
> > > > > > kernel object allocated by kfunc must be released at program unload
> > > > > > time. p4tc postpone such allocation to recycle the structure.
> > > > > >
> > > > > > While there are other arguments, my reading of the past few iterations
> > > > > > is that solving the above node should lift the nack, am I correct?
> > > > > >
> > > > > > Could p4tc pre-allocate all the p4tc_table_entry_act_bpf_kern entries
> > > > > > and let p4a_runt_create_bpf() fail if the pool is empty? would that
> > > > > > satisfy the bpf requirement?
> > > > >
> > > > > Let me think about it and weigh the consequences.
> > > > >
> > > >
> > > > Sorry, was busy evaluating. Yes, we can enforce the memory allocation
> > > > constraints such that when the ebpf program is removed any entries
> > > > added by said ebpf program can be removed from the datapath.
> > >
> > > I suggested the such changes based on my interpretation of this long
> > > and complex discussion, I can have missed some or many relevant points.
> > > @Alexei: could you please double check the above and eventually,
> > > hopefully, confirm that such change would lift your nacked-by?
> >
> > No. The whole design is broken.
> > Remembering what was allocated by kfunc and freeing it later
> > is not fixing the design at all.
>
> Can you be a little less vague?
> We are dealing with multiple domains here _including hw offloads_ and
> as mentioned already, a few times now, for that reason these objects
> belong to the P4TC domain. If it wasnt clear this diagram explains the
> design:
> https://github.com/p4tc-dev/docs/blob/main/images/why-p4tc/p4tc-runtime-pipeline.png
> IOW, P4 objects(to be specific table entries in this discussion) may
> be shared between s/w and/or h/w.
> Note: there is no allocation done by the kfunc - it will just pick
> from a fixed pool of pre-allocated entries. Where is the "design
> broken" considering all this?
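
A minimal sketch of the fixed-pool idea discussed above, for readers
following the thread. It is not code from the P4TC series: entry_pool,
pool_get and pool_release_owner are hypothetical names, and owner_id is
an assumed stand-in for whatever would identify the eBPF program. It
only illustrates taking entries from a pre-allocated pool, failing when
the pool is empty, and returning one owner's entries in a single pass.

```c
#include <stddef.h>

#define POOL_SIZE 4 /* hypothetical fixed capacity, chosen for illustration */

struct pool_entry {
	int in_use;
	int owner_id;     /* hypothetical: identifies who took the slot */
	unsigned int key; /* placeholder payload */
};

struct entry_pool {
	struct pool_entry slots[POOL_SIZE]; /* all storage exists up front */
};

/* Take a free slot. Returns NULL when the pool is empty; never allocates. */
static struct pool_entry *pool_get(struct entry_pool *p, int owner_id,
				   unsigned int key)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (!p->slots[i].in_use) {
			p->slots[i].in_use = 1;
			p->slots[i].owner_id = owner_id;
			p->slots[i].key = key;
			return &p->slots[i];
		}
	}
	return NULL;
}

/* Return every slot held by owner_id to the pool; returns how many. */
static int pool_release_owner(struct entry_pool *p, int owner_id)
{
	int released = 0;

	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (p->slots[i].in_use && p->slots[i].owner_id == owner_id) {
			p->slots[i].in_use = 0;
			released++;
		}
	}
	return released;
}
```

In this sketch the caller sees a NULL return once POOL_SIZE entries are
taken, and a single pool_release_owner() call frees everything one
owner holds. Whether that matches the aging/deletion policy P4TC needs
is exactly what the thread disputes.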

OK, not that I was expecting an answer, and I think I have waited long enough.

Frankly, my agreement to make the change and the time spent validating
it were just an attempt at a compromise (as we have done many, many
times) - but really that approach works against our requirement to
control the aging/deletion/replacement policy. I don't believe there's
any good faith from the nackers. For that reason that offer is off the
table.
It should be noted that the changes Alexei is objecting to are tamer
than, for example,
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_bpf.c#L318
We didn't see Alexei's nack on that code.

I am not asking to be given special treatment, but it is clear we have
a hole in the process currently. All I am asking for is fair
treatment. At this point, given that Paolo says the patches can't be
applied because of 3 cross-subsystem nacks, my suggestion on how we
resolve this is to appoint a third-party arbitrator. This person
cannot be part of the TC or eBPF collective and has to be agreed to by
both parties.

Hopefully this will introduce some new set of rules that will help the
maintainers resolve such issues should they surface in the future.

I will collect all the other issues raised and my responses and create
a web page so things don't get lost in the noise. I will post it then
and maybe send it to a wider audience.

cheers,
jamal



> cheers,
> jamal

Jamal Hadi Salim May 21, 2024, 12:35 p.m. UTC | #17

As stated a few times, we strongly disagree with the nature of the
Nacks from Alexei, Daniel and John. We don't think there are good
grounds for the Nacks.

A brief history on the P4TC patches:

We posted V1 in January 2023. The main objection then was that we
needed to use eBPF. After some discussion and investigation on our
part we found that using kfuncs would satisfy our goals as well as the
objections raised. We posted 28 RFC patches looking for feedback from
eBPF and other folks with V2 in May 2023 - these patches were not
ready but we were nevertheless soliciting feedback. By Version 7
in October 2023 we removed the RFC tag (meaning we were asking for
inclusion). In Version 8 we sent the first 15 patches as series
1 (following netdev rules that allow only 15 patches); 5 of these
patches are trivial tc core patches. From V8 up to V14 the
releases were mostly suggested changes (many thanks to folks who made
suggestions for technical changes) and at one point a bug fix
for an issue caught by our syzkaller instance.

When it seemed like Paolo was heading towards applying series 1 given
the feedback, Alexei nacked patch 14 when we released V14, see:
https://lore.kernel.org/bpf/20240404122338.372945-5-jhs@mojatatu.com/
V15's only change was adding Alexei's nack. V15 was followed by Daniel
and then John also nacking the same patch 14. V16's only change was to
add these extra Nacks.

At that point (v16) I asked for the series to be applied despite the
Nacks because, frankly, the Nacks have no merit. Paolo was not
comfortable applying patches with Nacks and tried to mediate. In his
mediation effort he asked if we could remove eBPF - and our answer was
no because after all that time we have become dependent on it and
frankly there was no technical reason not to use eBPF. Paolo then
asked if we could satisfy one of the points Alexei raised in terms of
clearing table entries when an eBPF program was unloaded. We spent a
week investigating and came to the conclusion that we could do it as a
compromise (even though it does not fit our requirements, and the
existing code that we copied from does exactly what Alexei is
objecting to). Alexei rejected this offer. This puts Paolo in a
difficult position because it is clear there is no compromise to be
had. I feel we are in uncharted territory.

Since we are in a quagmire, I am asking for a third-party mediator to
review the objections and validate whether they have merit.
I have created a web page to capture all the objections raised by the
3 gents over a period of time at:
https://github.com/p4tc-dev/pushback-patches
If any of the 3 people feel I have misrepresented their objections or
missed an important detail please let me know and I will fix the page.

cheers,
jamal

Jakub Kicinski May 22, 2024, 10:19 p.m. UTC | #18

Hi Jamal!

On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> At that point(v16) i asked for the series to be applied despite the
> Nacks because, frankly, the Nacks have no merit. Paolo was not
> comfortable applying patches with Nacks and tried to mediate. In his
> mediation effort he asked if we could remove eBPF - and our answer was
> no because after all that time we have become dependent on it and
> frankly there was no technical reason not to use eBPF.

I'm not fully clear on who you're appealing to, and I may be missing
some points. But maybe it will be more useful than hurtful if I clarify
my point of view.

AFAIU BPF folks disagree with the use of their subsystem, and they
point out that P4 pipelines can be implemented using BPF in the first
place.
To which you reply that you like (a highly dated type of) a netlink
interface, and (handwavey) ability to configure the data path SW or 
HW via the same interface.

AFAICT there's some but not very strong support for P4TC, and it
doesn't benefit or solve any problems of the broader networking stack
(e.g. expressing or configuring parser graphs in general)

So from my perspective, the submission is neither technically strong
enough, nor broadly useful enough to consider making questionable precedents
for, i.e. to override maintainers on how their subsystems are extended.

Jamal Hadi Salim May 22, 2024, 11:03 p.m. UTC | #19

On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Hi Jamal!
>
> On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > At that point(v16) i asked for the series to be applied despite the
> > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > comfortable applying patches with Nacks and tried to mediate. In his
> > mediation effort he asked if we could remove eBPF - and our answer was
> > no because after all that time we have become dependent on it and
> > frankly there was no technical reason not to use eBPF.
>
> I'm not fully clear on who you're appealing to, and I may be missing
> some points. But maybe it will be more useful than hurtful if I clarify
> my point of view.
>
> AFAIU BPF folks disagree with the use of their subsystem, and they
> point out that P4 pipelines can be implemented using BPF in the first
> place.
> To which you reply that you like (a highly dated type of) a netlink
> interface, and (handwavey) ability to configure the data path SW or
> HW via the same interface.

It's not what I "like"; rather, it is a requirement to support both
s/w and h/w offload. The TC model is the traditional approach to
deploy these models. I addressed the same comment you are making above
in #1a and #1b (https://github.com/p4tc-dev/pushback-patches).

OTOH, "BPF folks disagree with the use of their subsystem" is a
problematic statement. Is BPF infra for the kernel community, or is it
something where the ebpf folks can decide, at their whim, who may use
it and who may not? We are not changing any BPF code. And there's
already a case where the interfaces are used exactly as we used them,
in the conntrack code I pointed to in the page (we literally copied
that code). Why is it ok for conntrack code to use exactly the same
approach but not us?

> AFAICT there's some but not very strong support for P4TC,

I don't agree. Paolo asked this question and afaik Intel, AMD (both
build P4-native NICs) and the folks interested in the MS DASH project
responded saying they are in support. Look at who is being Cced. A lot
of these folks attend the biweekly discussion calls on P4TC. Sample:
https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/

> and it
> doesn't benefit or solve any problems of the broader networking stack
> (e.g. expressing or configuring parser graphs in general)
>

I am not sure where the parser thing comes from - the parser is
generated as eBPF.

> So from my perspective, the submission is neither technically strong
> enough, nor broadly useful enough to consider making questionable precedents
> for, i.e. to override maintainers on how their subsystems are extended.

I believe that, as a community, nobody should have the power to nack
things just because - as I stated in the page, not even Linus. That
code doesn't touch anything to do with eBPF maintainers (meaning things
they have to fix when an issue shows up), neither does it "extend", as
you state, any ebpf code, and it is all part of the networking
subsystem. Sure, anybody has the right to nack, but I contend that
nacks should be based on technical reasons. I have listed all the
objections in that page and how I have responded to them over time.
Someone needs to look at those objectively and say if they are valid.
If the argument made so far (by Paolo and now by you) is "we can't
override maintainers on how their subsystems are used", then we are in
uncharted territory; that's why I am asking for arbitration.

cheers,
jamal

Singhai, Anjali May 23, 2024, 12:30 a.m. UTC | #20

On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:

>> AFAICT there's some but not very strong support for P4TC,

On Wed, May 22, 2024 at 4:04 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> I dont agree. Paolo asked this question and afaik Intel, AMD (both
> build P4-native NICs) and the folks interested in the MS DASH project
> responded saying they are in support. Look at who is being Cced. A lot
> of these folks who attend biweekly discussion calls on P4TC. Sample:
> https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/

FWIW, Intel is in full support of P4TC as we have stated several times in the past.

Chris Sommers May 23, 2024, 12:44 a.m. UTC | #21

Apologies for resending as plain text, the first try was HTML and got rejected by bots.

> On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Hi Jamal!
> >
> > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > At that point(v16) i asked for the series to be applied despite the
> > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > comfortable applying patches with Nacks and tried to mediate. In his
> > > mediation effort he asked if we could remove eBPF - and our answer was
> > > no because after all that time we have become dependent on it and
> > > frankly there was no technical reason not to use eBPF.
> >
> > I'm not fully clear on who you're appealing to, and I may be missing
> > some points. But maybe it will be more useful than hurtful if I clarify
> > my point of view.
> >
> > AFAIU BPF folks disagree with the use of their subsystem, and they
> > point out that P4 pipelines can be implemented using BPF in the first
> > place.
> > To which you reply that you like (a highly dated type of) a netlink
> > interface, and (handwavey) ability to configure the data path SW or
> > HW via the same interface.
> 
> It's not what I "like" , rather it is a requirement to support both
> s/w and h/w offload. The TC model is the traditional approach to
> deploy these models. I addressed the same comment you are making above
> in #1a and #1b  (https://github.com/p4tc-dev/pushback-patches).
> 
> OTOH, "BPF folks disagree with the use of their subsystem" is a
> problematic statement. Is BPF infra for the kernel community or is it
> something the ebpf folks can decide, at their whim, to allow who they
> like to use or not. We are not changing any BPF code. And there's
> already a case where the interfaces are used exactly as we used them
> in the conntrack code i pointed to in the page (we literally copied
> that code). Why is it ok for conntrack code to use exactly the same
> approach but not us?
> 
> > AFAICT there's some but not very strong support for P4TC,
> 
> I dont agree. Paolo asked this question and afaik Intel, AMD (both
> build P4-native NICs) and the folks interested in the MS DASH project
> responded saying they are in support. Look at who is being Cced. A lot
> of these folks who attend biweekly discussion calls on P4TC. Sample:
> https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> 
+1
> > and it
> > doesn't benefit or solve any problems of the broader networking stack
> > (e.g. expressing or configuring parser graphs in general)
> >
> 

Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.

> I am not sure where the parser thing comes from - the parser is
> generated as eBPF.
> 
> > So from my perspective, the submission is neither technically strong
> > enough, nor broadly useful enough to consider making questionable precedents
> > for, i.e. to override maintainers on how their subsystems are extended.
I disagree vehemently on the "broadly useful enough" comment.
> 
> I believe as a community nobody should just have the power to nack
> things just because - as i stated in the page, not even Linus. That
> code doesnt touch anything to do with eBPF maintainers (meaning things
> they have to fix when an issue shows up) neither does it "extend" as
> you state any ebpf code and it is all part of the networking
> subsystem. Sure,  anybody has the right to nack but  I contend that
> nacks should be based on technical reasons. I have listed all the
> objections in that page and how i have responded to them over time.
> Someone needs to look at those objectively and say if they are valid.
> The arguement made so far(By Paolo and now by you)  is "we cant
> override maintainers on how their subsystems are used" then we are in
> uncharted territory, thats why i am asking for arbitration.
> 
> cheers,
> jamal 
Maintainers: I am perplexed and dismayed that this is getting so much pushback. None of the objections, regardless of their merits (or not), seem to outweigh the potential benefits to end-users. I am extremely interested in using P4TC; it adds a lot of value and reuses so much existing Linux infra. The custom extern model is compelling. The control plane CRUDXPS will tie nicely into P4Runtime and TDI. I have an application which needs to run purely in SW - no HW offload, so prior suggestions to wait for it to "approve" this are frustrating.  I could use this yesterday. Furthermore, as an active contributor to sonic-dash, where we model the pipeline in P4, I can state that P4TC could be a compelling alternative to bmv2, which is slow, long in the tooth and lacks PNA support.

I beseech the NACKers to take a deep breath, reevaluate any entrenched positions and consider how much goodness this will add, even if this is not your preference for implementing datapaths. It doesn't have to be. That can and should be decided by the larger community. This could open the door to thousands of creative developers who are comfortable in P4 but not adept in low-level networking code. P4 had a significant impact on democratizing network programming, and that was just on bmv2 and Tofino, which is EOL. Making performant and powerful P4TC ubiquitous on virtually any Linux server could have a similar effect, just like eBPF opened a lot of doors to non-kernel programmers to do interesting things. Be a part of that transformation!

Tom Herbert May 23, 2024, 12:54 a.m. UTC | #22

On Wed, May 22, 2024 at 5:09 PM Chris Sommers
<chris.sommers@keysight.com> wrote:
>
> > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > Hi Jamal!
> > >
> > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > At that point(v16) i asked for the series to be applied despite the
> > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > no because after all that time we have become dependent on it and
> > > > frankly there was no technical reason not to use eBPF.
> > >
> > > I'm not fully clear on who you're appealing to, and I may be missing
> > > some points. But maybe it will be more useful than hurtful if I clarify
> > > my point of view.
> > >
> > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > point out that P4 pipelines can be implemented using BPF in the first
> > > place.
> > > To which you reply that you like (a highly dated type of) a netlink
> > > interface, and (handwavey) ability to configure the data path SW or
> > > HW via the same interface.
> >
> > It's not what I "like" , rather it is a requirement to support both
> > s/w and h/w offload. The TC model is the traditional approach to
> > deploy these models. I addressed the same comment you are making above
> > in #1a and #1b  (https://github.com/p4tc-dev/pushback-patches).
> >
> > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > problematic statement. Is BPF infra for the kernel community or is it
> > something the ebpf folks can decide, at their whim, to allow who they
> > like to use or not. We are not changing any BPF code. And there's
> > already a case where the interfaces are used exactly as we used them
> > in the conntrack code i pointed to in the page (we literally copied
> > that code). Why is it ok for conntrack code to use exactly the same
> > approach but not us?
> >
> > > AFAICT there's some but not very strong support for P4TC,
> >
> > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > build P4-native NICs) and the folks interested in the MS DASH project
> > responded saying they are in support. Look at who is being Cced. A lot
> > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> >
> +1
> > > and it
> > > doesn't benefit or solve any problems of the broader networking stack
> > > (e.g. expressing or configuring parser graphs in general)
> > >
> >
>
> Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.

Chris,

When you say "it took mere seconds to compile and launch" are you
taking into account the ramp up time that it takes to learn P4 and
become proficient to do something interesting? Considering that P4
syntax is very different from the typical languages that networking
programmers are familiar with, this ramp up time is
non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
in Restricted C-- this makes it easy for many programmers since they
don't have to learn a completely new language and so the ramp up time
for the average networking programmer is much less for using eBPF.

This is really the fundamental problem with DSLs, they require
specialized skill sets in a programming language for a narrow use case
(and specialized compilers, tool chains, debugging, etc)-- this means
a DSL only makes sense if there is no other means to accomplish the
same effects using a commodity language with perhaps a specialized
library (it's not just in the networking realm, consider the
advantages of using CUDA-C instead of a DSL for GPUs). Personally, I
don't believe that P4 has yet been proven necessary for programming a
datapath-- for instance we can program a parser in declarative
representation in C,
https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf.

So unless P4 is proven necessary, then I'm doubtful it will ever be a
ubiquitous way to program the kernel-- it seems much more likely that
people will continue to use C and eBPF, and for those users that want
to use P4 they can use a P4->eBPF compiler.

Tom
>
> > I am not sure where the parser thing comes from - the parser is
> > generated as eBPF.
> >
> > > So from my perspective, the submission is neither technically strong
> > > enough, nor broadly useful enough to consider making questionable precedents
> > > for, i.e. to override maintainers on how their subsystems are extended.
> I disagree vehemently on the "broadly useful enough" comment.
> >
> > I believe as a community nobody should just have the power to nack
> > things just because - as i stated in the page, not even Linus. That
> > code doesnt touch anything to do with eBPF maintainers (meaning things
> > they have to fix when an issue shows up) neither does it "extend" as
> > you state any ebpf code and it is all part of the networking
> > subsystem. Sure,  anybody has the right to nack but  I contend that
> > nacks should be based on technical reasons. I have listed all the
> > objections in that page and how i have responded to them over time.
> > Someone needs to look at those objectively and say if they are valid.
> > The arguement made so far(By Paolo and now by you)  is "we cant
> > override maintainers on how their subsystems are used" then we are in
> > uncharted territory, thats why i am asking for arbitration.
> >
> > cheers,
> > jamal
> Maintainers: I am perplexed and dismayed that this is getting so much pushback. None of the objections, regardless of their merits (or not) seem to outweigh the potential benefits to end-users. I am extremely interested in using P4TC, it adds a lot of value and reuses so much existing Linux infra. The custom extern model is compelling. The control plane CRUDXPS will tie nicely into P4Runtime and TDI. I have an application which needs to run purely in SW - no HW offload, so prior suggestions to wait for it to "approve" this is frustrating.  I could use this yesterday. Furthermore, as an active contributor to sonic-dash, where we model the pipeline in P4, I can state that P4TC could be a compelling alternative to bmv2, which is slow, long in the tooth and lacks PNA support.
>
> I beseech the NACKers to take a deep breath, reevaluate any entrenched positions and consider how much goodness this will add, even if this is not your preference for implementing datapaths. It doesn't have to be. That can and should be decided by the larger community. This could open the door to thousands of creative developers who are comfortable in P4 but not adept in low-level networking code. P4 had a significant impact on democratizing network programming, and that was just on bmv2 and Tofino, which is EOL. Making performant and powerful P4TC ubiquitous on virtually any Linux server could have a similar effect, just like eBPF opened a lot of doors to non-kernel programmers to do interesting things. Be a part of that transformation!

Jamal Hadi Salim May 23, 2024, 1:13 a.m. UTC | #23

On Wed, May 22, 2024 at 8:54 PM Tom Herbert <tom@sipanda.io> wrote:
>
> On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> <chris.sommers@keysight.com> wrote:
> >
> > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > Hi Jamal!
> > > >
> > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > no because after all that time we have become dependent on it and
> > > > > frankly there was no technical reason not to use eBPF.
> > > >
> > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > my point of view.
> > > >
> > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > place.
> > > > To which you reply that you like (a highly dated type of) a netlink
> > > > interface, and (handwavey) ability to configure the data path SW or
> > > > HW via the same interface.
> > >
> > > It's not what I "like" , rather it is a requirement to support both
> > > s/w and h/w offload. The TC model is the traditional approach to
> > > deploy these models. I addressed the same comment you are making above
> > > in #1a and #1b (https://github.com/p4tc-dev/pushback-patches).
> > >
> > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > problematic statement. Is BPF infra for the kernel community or is it
> > > something the ebpf folks can decide, at their whim, to allow who they
> > > like to use or not. We are not changing any BPF code. And there's
> > > already a case where the interfaces are used exactly as we used them
> > > in the conntrack code i pointed to in the page (we literally copied
> > > that code). Why is it ok for conntrack code to use exactly the same
> > > approach but not us?
> > >
> > > > AFAICT there's some but not very strong support for P4TC,
> > >
> > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > build P4-native NICs) and the folks interested in the MS DASH project
> > > responded saying they are in support. Look at who is being Cced. A lot
> > > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > > https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> > >
> > +1
> > > > and it
> > > > doesn't benefit or solve any problems of the broader networking stack
> > > > (e.g. expressing or configuring parser graphs in general)
> > > >
> > >
> >
> > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
>
> Chris,
>
> When you say "it took mere seconds to compile and launch" are you
> taking into account the ramp up time that it takes to learn P4 and
> become proficient to do something interesting? Considering that P4
> syntax is very different from typical languages that networking
> programmers are typically familiar with, this ramp up time is
> non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> in Restricted C-- this makes it easy for many programmers since they
> don't have to learn a completely new language and so the ramp up time
> for the average networking programmer is much less for using eBPF.
>
> This is really the fundamental problem with DSLs, they require
> specialized skill sets in a programming language for a narrow use case
> (and specialized compilers, tool chains, debugging, etc)-- this means
> a DSL only makes sense if there is no other means to accomplish the
> same effects using a commodity language with perhaps a specialized
> library (it's not just in the networking realm, consider the
> advantages of using CUDA-C instead of a DSL for GPUs). Personally, I
> don't believe that P4 has yet been proven necessary for programming a
> datapath-- for instance we can program a parser in declarative
> representation in C,
> https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf.
>
> So unless P4 is proven necessary, then I'm doubtful it will ever be a
> ubiquitous way to program the kernel-- it seems much more likely that
> people will continue to use C and eBPF, and for those users that want
> to use P4 they can use P4->eBPF compiler.
>

Tom,
I cant stop the distraction of this thread becoming a discussion on
the merits of DSL vs a lower level language (and I know you are not a
P4 fan) but please change the subject so we don't lose the main focus
which is a discussion on the patches. I have done it for you. Chris if
you wish to respond please respond under the new thread subject.

cheers,
jamal

Chris Sommers May 23, 2024, 2:29 a.m. UTC | #24

> On Wed, May 22, 2024 at 8:54 PM Tom Herbert <mailto:tom@sipanda.io> wrote:
> >
> > On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> > <mailto:chris.sommers@keysight.com> wrote:
> > >
> > > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <mailto:kuba@kernel.org> wrote:
> > > > >
> > > > > Hi Jamal!
> > > > >
> > > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > > no because after all that time we have become dependent on it and
> > > > > > frankly there was no technical reason not to use eBPF.
> > > > >
> > > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > > my point of view.
> > > > >
> > > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > > place.
> > > > > To which you reply that you like (a highly dated type of) a netlink
> > > > > interface, and (handwavey) ability to configure the data path SW or
> > > > > HW via the same interface.
> > > >
> > > > It's not what I "like" , rather it is a requirement to support both
> > > > s/w and h/w offload. The TC model is the traditional approach to
> > > > deploy these models. I addressed the same comment you are making above
> > > > > in #1a and #1b (https://github.com/p4tc-dev/pushback-patches).
> > > > >
> > > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > > problematic statement. Is BPF infra for the kernel community or is it
> > > > something the ebpf folks can decide, at their whim, to allow who they
> > > > like to use or not. We are not changing any BPF code. And there's
> > > > already a case where the interfaces are used exactly as we used them
> > > > in the conntrack code i pointed to in the page (we literally copied
> > > > that code). Why is it ok for conntrack code to use exactly the same
> > > > approach but not us?
> > > >
> > > > > AFAICT there's some but not very strong support for P4TC,
> > > >
> > > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > > build P4-native NICs) and the folks interested in the MS DASH project
> > > > responded saying they are in support. Look at who is being Cced. A lot
> > > > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > > > > https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> > > > >
> > > +1
> > > > > and it
> > > > > doesn't benefit or solve any problems of the broader networking stack
> > > > > (e.g. expressing or configuring parser graphs in general)
> > > > >
> > > >
> > >
> > > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
> >
> > Chris,
> >
> > When you say "it took mere seconds to compile and launch" are you
> > taking into account the ramp up time that it takes to learn P4 and
> > become proficient to do something interesting? 

Hi Tom, thanks for the dialog. To answer your question, it took seconds to compile and deploy, not to learn P4. Adding the parsing for several headers took minutes. If you want to compare learning curves, learning to write P4 code and letting the framework handle all the painful low-level Linux details is way easier than trying to learn how to write C code for Linux networking. It’s not even close. I’ve written C for 40 years, P4 for 7 years, and dabbled in eBPF, so I can attest to the ease of learning and using P4. I’ve onboarded and mentored engineers who barely knew C to develop complex networking products using P4, and built the automation APIs (REST, gRPC) to manage them. One person can develop an entire commercial product by themselves in months. P4 has expanded the reach of programmers such that both HW and SW engineers can easily learn P4 and become pretty adept at it. I would not expect even experienced C programmers to be able to master Linux internals very quickly. Writing a P4-TC program and injecting it via tc was like magic the first time.

> > Considering that P4
> > syntax is very different from typical languages that networking
> > programmers are typically familiar with, this ramp up time is
> > non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> > in Restricted C-- this makes it easy for many programmers since they
> > don't have to learn a completely new language and so the ramp up time
> > for the average networking programmer is much less for using eBPF.

I think your statement about “typical network programmers” overlooks the fact that since P4 was introduced, it has been used in many universities to teach networking and has possibly enabled a whole new breed of “network engineers” who can solve real problems without even knowing C programming. Without P4 they might never have gone this route. A class in network stack programming using C would have so many prerequisites to even get to parsing, compared to P4, where it could be demonstrated in one lesson. These “networking programmers” are not typical by your standards, but there are many such. They have just as much claim to the title “network programmer” as a C programmer. Similarly, an assembly language programmer is no less than a C or Python programmer. People writing P4 are usually focused on applications, and it is very useful and productive for that. Why should someone have to learn low-level C or eBPF to solve their problem?

> >
> > This is really the fundamental problem with DSLs, they require
> > specialized skill sets in a programming language for a narrow use case
> > (and specialized compilers, tool chains, debugging, etc)-- this means
> > a DSL only makes sense if there is no other means to accomplish the
> > same effects using a commodity language with perhaps a specialized
> > library (it's not just in the networking realm, consider the
> > advantages of using CUDA-C instead of a DSL for GPUs).

A pretty strong opinion, but DSLs arise to fill a need and P4 did so. It's still going strong.

> > Personally, I
> > don't believe that P4 has yet been proven necessary for programming a
> > datapath-- for instance we can program a parser in declarative
> > representation in C,
> > https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf.

CPL (slide 11) looks like a DSL wrapped in JSON to me. “Solution: Common Parser Language (CPL); Parser representation in declarative .json” So I am confused. It is either a new language a.k.a. DSL, or it's not. Nothing against it, I'm sure it is great, but let's call it what it is.
We already have parser representations in declarative P4. And it's used and known worldwide. And has a respectable specification, many users and working groups. And it's formally provable (https://github.com/verified-network-toolchain/petr4)

> >
> > So unless P4 is proven necessary, then I'm doubtful it will ever be a
> > ubiquitous way to program the kernel-- it seems much more likely that
> > people will continue to use C and eBPF, and for those users that want
> > to use P4 they can use P4->eBPF compiler.

“ubiquitous way to program the kernel” – is not my goal. I don’t even want to know about the kernel when I am writing p4 - it's just a means to an end. I want to manipulate packets on a Linux host. P4DPDK, P4-eBPF, P4-TC – all let me do that. I LOVE the fact that P4-TC would be available in every Linux distro once upstreamed. It would solve so many deployment issues, benefit from regression testing, etc. So much goodness.

" and for those users that want to use P4 they can use P4->eBPF compiler." -I'd really like to choose for myself and not have someone make that choice for me. P4-TC checks all the boxes for me.

Thanks for the point of view, it's healthy to debate.
Cheers,
Chris

> >
> 
> Tom,
> I cant stop the distraction of this thread becoming a discussion on
> the merits of DSL vs a lower level language (and I know you are not a
> P4 fan) but please change the subject so we don't lose the main focus
> which is a discussion on the patches. I have done it for you. Chris if
> you wish to respond please respond under the new thread subject.
> 
> cheers,
> jamal

Tom Herbert May 23, 2024, 3:34 a.m. UTC | #25

On Wed, May 22, 2024 at 7:30 PM Chris Sommers
<chris.sommers@keysight.com> wrote:
>
> > On Wed, May 22, 2024 at 8:54 PM Tom Herbert <mailto:tom@sipanda.io> wrote:
> > >
> > > On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> > > <mailto:chris.sommers@keysight.com> wrote:
> > > >
> > > > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <mailto:kuba@kernel.org> wrote:
> > > > > >
> > > > > > Hi Jamal!
> > > > > >
> > > > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > > > no because after all that time we have become dependent on it and
> > > > > > > frankly there was no technical reason not to use eBPF.
> > > > > >
> > > > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > > > my point of view.
> > > > > >
> > > > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > > > place.
> > > > > > To which you reply that you like (a highly dated type of) a netlink
> > > > > > interface, and (handwavey) ability to configure the data path SW or
> > > > > > HW via the same interface.
> > > > >
> > > > > It's not what I "like" , rather it is a requirement to support both
> > > > > s/w and h/w offload. The TC model is the traditional approach to
> > > > > deploy these models. I addressed the same comment you are making above
> > > > > > in #1a and #1b (https://github.com/p4tc-dev/pushback-patches).
> > > > > >
> > > > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > > > problematic statement. Is BPF infra for the kernel community or is it
> > > > > something the ebpf folks can decide, at their whim, to allow who they
> > > > > like to use or not. We are not changing any BPF code. And there's
> > > > > already a case where the interfaces are used exactly as we used them
> > > > > in the conntrack code i pointed to in the page (we literally copied
> > > > > that code). Why is it ok for conntrack code to use exactly the same
> > > > > approach but not us?
> > > > >
> > > > > > AFAICT there's some but not very strong support for P4TC,
> > > > >
> > > > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > > > build P4-native NICs) and the folks interested in the MS DASH project
> > > > > responded saying they are in support. Look at who is being Cced. A lot
> > > > > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > > > > > https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> > > > > >
> > > > +1
> > > > > > and it
> > > > > > doesn't benefit or solve any problems of the broader networking stack
> > > > > > (e.g. expressing or configuring parser graphs in general)
> > > > > >
> > > > >
> > > >
> > > > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
> > >
> > > Chris,
> > >
> > > When you say "it took mere seconds to compile and launch" are you
> > > taking into account the ramp up time that it takes to learn P4 and
> > > become proficient to do something interesting?
>
> Hi Tom, thanks for the dialog. To answer your question, it took seconds to compile and deploy, not to learn P4. Adding the parsing for several headers took minutes. If you want to compare learning curves, learning to write P4 code and letting the framework handle all the painful low-level Linux details is way easier than trying to learn how to write C code for Linux networking. It’s not even close. I’ve written C for 40 years, P4 for 7 years, and dabbled in eBPF, so I can attest to the ease of learning and using P4. I’ve onboarded and mentored engineers who barely knew C to develop complex networking products using P4, and built the automation APIs (REST, gRPC) to manage them. One person can develop an entire commercial product by themselves in months. P4 has expanded the reach of programmers such that both HW and SW engineers can easily learn P4 and become pretty adept at it. I would not expect even experienced C programmers to be able to master Linux internals very quickly. Writing a P4-TC program and injecting it via tc was like magic the first time.
>
> > > Considering that P4
> > > syntax is very different from typical languages that networking
> > > programmers are typically familiar with, this ramp up time is
> > > non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> > > in Restricted C-- this makes it easy for many programmers since they
> > > don't have to learn a completely new language and so the ramp up time
> > > for the average networking programmer is much less for using eBPF.
>
> I think your statement about “typical network programmers” overlooks the fact that since P4 was introduced, it has been used in many universities to teach networking and has possibly enabled a whole new breed of “network engineers” who can solve real problems without even knowing C programming. Without P4 they might never have gone this route. A class in network stack programming using C would have so many prerequisites to even get to parsing, compared to P4, where it could be demonstrated in one lesson. These “networking programmers” are not typical by your standards, but there are many such. They have just as much claim to the title “network programmer” as a C programmer. Similarly, an assembly language programmer is no less than a C or Python programmer. People writing P4 are usually focused on applications, and it is very useful and productive for that. Why should someone have to learn low-level C or eBPF to solve their problem?

Hi Chris,

You're comparing learning a completely new language versus programming
in a subset of an established language; they're really not comparable.
When one programs in Restricted-C they just need to understand what
features of C are supported.

>
> > >
> > > This is really the fundamental problem with DSLs, they require
> > > specialized skill sets in a programming language for a narrow use case
> > > (and specialized compilers, tool chains, debugging, etc)-- this means
> > > a DSL only makes sense if there is no other means to accomplish the
> > > same effects using a commodity language with perhaps a specialized
> > > library (it's not just in the networking realm, consider the
> > > advantages of using CUDA-C instead of a DSL for GPUs).
>
> A pretty strong opinion, but DSLs arise to fill a need and P4 did so. It's still going strong.
>
> > > Personally, I
> > > don't believe that P4 has yet been proven necessary for programming a
> > > datapath-- for instance we can program a parser in declarative
> > > representation in C,
> > > https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf.
>
> CPL (slide 11) looks like a DSL wrapped in JSON to me. “Solution: Common Parser Language (CPL); Parser representation in declarative .json” So I am confused. It is either a new language a.k.a. DSL, or it's not. Nothing against it, I'm sure it is great, but let's call it what it is.

Correct, it's not a new language. We've since renamed it Common Parser
Representation.

> We already have parser representations in declarative P4. And it's used and known worldwide. And has a respectable specification, many users and working groups. And it's formally provable (https://github.com/verified-network-toolchain/petr4)
>
> > >
> > > So unless P4 is proven necessary, then I'm doubtful it will ever be a
> > > ubiquitous way to program the kernel-- it seems much more likely that
> > > people will continue to use C and eBPF, and for those users that want
> > > to use P4 they can use P4->eBPF compiler.
>
> “ubiquitous way to program the kernel” – is not my goal. I don’t even want to know about the kernel when I am writing p4 - it's just a means to an end. I want to manipulate packets on a Linux host. P4DPDK, P4-eBPF, P4-TC – all let me do that. I LOVE the fact that P4-TC would be available in every Linux distro once upstreamed. It would solve so many deployment issues, benefit from regression testing, etc. So much goodness
>
> " and for those users that want to use P4 they can use P4->eBPF compiler." -I'd really like to choose for myself and not have someone make that choice for me. P4-TC checks all the boxes for me.

Sure, but this is a lot of kernel code and that will require support
and maintenance. It needs to be justified, and the fact that someone
wants it just to have a choice is, frankly, not much of a
justification. I think a justification needs to start with "Why isn't
P4->eBPF sufficient?" (the question has been raised several times, but
it still doesn't seem like there's a strong answer).

Tom
>
> Thanks for the point of view, it's healthy to debate.
> Cheers,
> Chris
>
> > >
> >
> > Tom,
> > I cant stop the distraction of this thread becoming a discussion on
> > the merits of DSL vs a lower level language (and I know you are not a
> > P4 fan) but please change the subject so we don't lose the main focus
> > which is a discussion on the patches. I have done it for you. Chris if
> > you wish to respond please respond under the new thread subject.
> >
> > cheers,
> > jamal
>

Tom Herbert May 24, 2024, 4:50 p.m. UTC | #26

Hi Chris,

P4 was created to support programming the hardware data path in high
end routers, but P4-TC would enable the use of P4 across all Linux
devices. Since this is potentially a lot of code going into the kernel
to support it, I believe it's entirely fair for us to evaluate and
give feedback on the P4 language and its suitability for the broader
user community including environments where there will never be a need
for P4 hardware. Note that I am questioning the design decisions of P4
in the context of supporting a DSL in the kernel via P4-TC; if the
P4->eBPF compiler is used then these concerns are less pertinent.
Nevertheless, I would suggest that the P4 folks take the points being
raised as constructive feedback on the language.

I took a cursory look at several P4 programs including tutorials,
switch code, firewalls, etc. I have particular interest in variable
length headers, so I'll use
https://github.com/jafingerhut/p4-guide/blob/master/checksum/checksum-ipv4-with-options.p4
as a reference.

The first thing I noticed about P4 is that almost everything is
expressed as a bit field, like bit<8> and bit<32>. I suppose this
arises from the fact that P4 was originally intended to run in non-CPU
hardware where there's no inherent unit of data like bytes. But CPUs
don't work that way; CPUs work with ordinal types of bytes, half words,
words, double words, etc. (__u8, __u16, __u32, __u64). That means that
all mainstream computer languages fundamentally operate on ordinal
types even if the variable types are explicitly declared. Someone
programming in P4 needs to map ordinal types to bit fields, so
if they want a __u32 they need to use a bit<32> in P4 (except they're
not exactly equivalent: a __u32 in C is guaranteed to be byte aligned
and I'm assuming in P4 bit<32> is not guaranteed to be byte aligned,
which seems like it might be susceptible to programming errors). I'd
also point out that networking protocols are also defined using
ordinal type fields; there are some exceptions, but for the most part
protocol fields try to be in units of bytes (or octets if you want to
be old school!). I believe life would be easier for the programmer if
they could just define variables and fields with ordinal types. The
fix here seems simple enough: just add typedefs to P4 like "typedef
bit<32> __u32".
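To make the alignment point concrete, here is a minimal C sketch of what reading a field at an arbitrary bit offset involves, compared with a plain byte-aligned __u32 load. The helper name get_bits() is hypothetical and is not taken from P4, P4TC, or the kernel; it only assumes, as the paragraph above does, that a bit<W> field may start on any bit boundary.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read `width` bits (1..32) starting at bit offset
 * `bit_off` in `buf`, most significant bit first (network bit order).
 * A byte-aligned __u32 needs none of this; a field that may start on
 * any bit boundary does.
 */
uint32_t get_bits(const uint8_t *buf, size_t bit_off, unsigned int width)
{
    uint32_t v = 0;
    unsigned int i;

    for (i = 0; i < width; i++) {
        size_t b = bit_off + i;

        /* Pull bit b out of its byte and shift it into the result. */
        v = (v << 1) | ((buf[b >> 3] >> (7 - (b & 7))) & 1u);
    }
    return v;
}
```

With the first four bytes of an IPv4 header in buf, get_bits(buf, 0, 4) yields the version and get_bits(buf, 4, 4) yields the IHL, which is roughly what 4-bit version and ihl header fields express declaratively.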

In the IP header definition there's "varbit<320>  options;". It took
me several seconds to decode this and realize this is space for forty
bytes of IP options (i.e. 8 * 40 == 320). I suppose this follows the
design of using bit fields for everything, but I think this is more
than just an annoyance like the bit fields for ordinal types are.
First off, it's not very readable. I've never heard anyone say that
there's 320 bits of IP options, or seen an RFC specify that. Likewise,
the standard Ethernet MTU is 1500 bytes, not 12,000 bits which would
seem to be how that would be expressed in P4. So this seems very
unreadable to me and potentially prone to errors. The fix for this
also seems easy, why not just add varbyte to P4 so we can do
varbyte<40>, varbyte<87>, varbyte<123>, etc.?
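For reference, the 320 figure falls out of the IHL field: IHL is 4 bits wide and counts 32-bit words, so the header is at most 15 * 4 = 60 bytes, of which 20 are fixed, leaving 40 bytes (320 bits) of options. A small C sketch of that arithmetic follows; the function and macro names are illustrative only and do not come from the kernel or from P4.

```c
#include <stdint.h>

#define IPV4_FIXED_HDR_BYTES 20u  /* header without options (IHL == 5) */

/*
 * Illustrative helper: given the first byte of an IPv4 header
 * (version in the high nibble, IHL in the low nibble), return the
 * length of the options in bytes, or -1 if IHL is below the minimum.
 */
int ipv4_options_bytes(uint8_t version_ihl)
{
    /* IHL counts 32-bit words, so multiply by 4 to get bytes. */
    unsigned int hdr_bytes = (version_ihl & 0x0fu) * 4u;

    if (hdr_bytes < IPV4_FIXED_HDR_BYTES)
        return -1;
    return (int)(hdr_bytes - IPV4_FIXED_HDR_BYTES);
}
```

The maximum return value is 40, reached when IHL is 15, and 40 * 8 gives the 320 used in the varbit declaration.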

The next thing I notice about the P4 programs I surveyed is that all
of them seem to define the protocol headers within the program. Every
program seems to have "header ethernet_t" and "header ipv4_t" and
other protocols that are used and protocol constants like Ethertypes
also seem to be spelled out in each program. Sometimes these are in
include files within the program. What I don't see is that P4 has a
standard set of include files for defining protocol headers. For
instance, in Linux C we would just do "#include <linux/if_ether.h>"
and "#include <linux/ip.h>" to get the definitions of the Ethernet
header and IPv4 header. In fact, if someone were to submit a patch to
Netdev that included its own definition of Ethernet or an IP header
structure they would almost certainly get pushback. It's a fundamental
programming principle, not just in networking but pretty much
everywhere, to not continuously redefine common and standard
constructs-- just put common things in header files that can be shared
by multiple programs (to do otherwise substantially increases the
possibility of errors, bloats code, and reduces readability).

Marshalling up common definitions into header files that are common in
the P4 development environment seems simple enough (maybe it's already
done?), but I would also point out that Linux has include files that
describe protocol formats and header structures for almost every
protocol under the sun, and they are well tested. It would be great if
we could somehow leverage that work. For instance, in the P4
samples I looked at srcAddr and dstAddr are defined for IP addresses,
but in linux/ip.h saddr and daddr are the respective field
names. Why not just base the P4 definition on the Linux one? Then when
someone is porting code from Linux to P4 they can use the same field
names-- this makes things a lot easier on the programmer! I'll also
mention that we wrote a little Python script to generate P4 header and
constant definitions from Linux headers. It almost worked, the snag we
hit was that P4 has some limits on nesting structures and unions so we
couldn't translate some of the C structures to P4 (if you're
interested I can provide the details on the problem we hit).

The IPv4 header checksum code was a real head scratcher for me. Do we
really need to state each field in the IP header just to compute the
checksum? (and not just do this once, but twice :-( ). See code below
for verifyChecksum and updateChecksum.

In C, verifying and setting the IP header checksum is really easy:

/* ihl counts 32-bit words, so the header length in bytes is ihl << 2 */
if (checksum(iphdr, 0, iphdr->ihl << 2))
    goto bad_csum;

iphdr->check = 0;
iphdr->check = checksum(iphdr, 0, iphdr->ihl << 2);

Relative to the C code, the P4 code seems very convoluted to me and
prone to errors. What if someone accidentally omits a field? What if
fields become slightly out of order? Also, no one would ever describe
the IPv4 checksum as taking the checksum over the IHL, diffserv,
totalLen, ... That is *way* too complicated for an algorithm that is
really simple-- from RFC791: "The checksum field is the 16 bit one's
complement of the one's complement sum of all 16 bit words in the
header.". Reverse engineering the design, the clue seems to be
HashAlgorithm.csum16. Maybe in P4 the IP checksum is just considered
another form of hash, and I suspect the input to hash computation is
specified as a sort of data structure to make things generic (for
instance, how we create a substructure in flow keys in flow_dissector
to compute a SipHash over the TCP and UDP tuple). But, the IPv4
checksum isn't just another hash-- on a host, we need to compute the
checksum for *every* IPv4 packet. This has to be fast and simple; we
can do this in five instructions or fewer. So even if the
code below is correct, I have to wonder how easy it is to emit an
efficient executable. Would a compiler easily realize that all the
fields in the pseudo structure are contiguous without holes such that
it can emit those five instructions?
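
To make the RFC791 wording concrete, here is a minimal, portable C
sketch of the one's complement sum over the 16 bit words of the header.
The names ip_checksum() and ip_checksum_ok() are made up for
illustration; this is not the kernel's implementation (the kernel uses
optimized per-architecture helpers), and it assumes the caller passes
the header as raw bytes in network order with an even length (ihl * 4).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's complement of the one's complement sum of all 16 bit words in
 * the header (RFC791). `len` is the header length in bytes, assumed
 * even. The result is returned as a host-order integer whose high
 * byte is the first checksum byte on the wire.
 */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    /* Add up the header as big-endian 16 bit words. */
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}

/*
 * Verification: summing a header that already carries a correct
 * checksum yields 0xffff, so the complemented result is 0.
 */
static int ip_checksum_ok(const uint8_t *hdr, size_t len)
{
    return ip_checksum(hdr, len) == 0;
}
```

To set the checksum, zero the two checksum bytes (offsets 10 and 11),
call ip_checksum() over ihl * 4 bytes, and store the result high byte
first. Note that nothing here names an individual header field; the
routine only needs a pointer and a length.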

I don't know how prevalent this method of listing all the fields in a
data structure as arguments to a function is in P4, but, by almost any
objective measure, I have to say that the code below is bad and
bloated. Maybe there's a better way to do it in P4, but if there's not
then this is a deficiency in the P4 language.

Tom

control verifyChecksum(inout headers hdr,
                       inout metadata meta)
{
    apply {
        // There is code similar to this in Github repo p4lang/p4c in
        // file testdata/p4_16_samples/flowlet_switching-bmv2.p4
        // However in that file it is only for a fixed length IPv4
        // header with no options.
        verify_checksum(true,
            { hdr.ipv4.version,
                hdr.ipv4.ihl,
                hdr.ipv4.diffserv,
                hdr.ipv4.totalLen,
                hdr.ipv4.identification,
                hdr.ipv4.flags,
                hdr.ipv4.fragOffset,
                hdr.ipv4.ttl,
                hdr.ipv4.protocol,
                hdr.ipv4.srcAddr,
                hdr.ipv4.dstAddr
#ifdef ALLOW_IPV4_OPTIONS
                , hdr.ipv4.options
#endif /* ALLOW_IPV4_OPTIONS */
            },
            hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
    }
}

control updateChecksum(inout headers hdr,
                       inout metadata meta)
{
    apply {
        update_checksum(true,
            { hdr.ipv4.version,
                hdr.ipv4.ihl,
                hdr.ipv4.diffserv,
                hdr.ipv4.totalLen,
                hdr.ipv4.identification,
                hdr.ipv4.flags,
                hdr.ipv4.fragOffset,
                hdr.ipv4.ttl,
                hdr.ipv4.protocol,
                hdr.ipv4.srcAddr,
                hdr.ipv4.dstAddr
#ifdef ALLOW_IPV4_OPTIONS
                , hdr.ipv4.options
#endif /* ALLOW_IPV4_OPTIONS */
            },
            hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
    }
}

On Wed, May 22, 2024 at 8:34 PM Tom Herbert <tom@sipanda.io> wrote:
>
> On Wed, May 22, 2024 at 7:30 PM Chris Sommers
> <chris.sommers@keysight.com> wrote:
> >
> > > On Wed, May 22, 2024 at 8:54 PM Tom Herbert <mailto:tom@sipanda.io> wrote:
> > > >
> > > > On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> > > > <mailto:chris.sommers@keysight.com> wrote:
> > > > >
> > > > > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <mailto:kuba@kernel.org> wrote:
> > > > > > >
> > > > > > > Hi Jamal!
> > > > > > >
> > > > > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > > > > no because after all that time we have become dependent on it and
> > > > > > > > frankly there was no technical reason not to use eBPF.
> > > > > > >
> > > > > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > > > > my point of view.
> > > > > > >
> > > > > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > > > > place.
> > > > > > > To which you reply that you like (a highly dated type of) a netlink
> > > > > > > interface, and (handwavey) ability to configure the data path SW or
> > > > > > > HW via the same interface.
> > > > > >
> > > > > > It's not what I "like" , rather it is a requirement to support both
> > > > > > s/w and h/w offload. The TC model is the traditional approach to
> > > > > > deploy these models. I addressed the same comment you are making above
> > > > > > in #1a and #1b  (https://urldefense.com/v3/__https://github.com/p4tc-dev/pushback-patches__;!!I5pVk4LIGAfnvw!kaZ6EmPxEqGLG8JMw-_L0BgYq48Pe25wj6pHMF6BVei5WsRgwMeLQupmvgvLyN-LgXacKBzzs0-w2zKP2A$).
> > > > > >
> > > > > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > > > > problematic statement. Is BPF infra for the kernel community or is it
> > > > > > something the ebpf folks can decide, at their whim, to allow who they
> > > > > > like to use or not. We are not changing any BPF code. And there's
> > > > > > already a case where the interfaces are used exactly as we used them
> > > > > > in the conntrack code i pointed to in the page (we literally copied
> > > > > > that code). Why is it ok for conntrack code to use exactly the same
> > > > > > approach but not us?
> > > > > >
> > > > > > > AFAICT there's some but not very strong support for P4TC,
> > > > > >
> > > > > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > > > > build P4-native NICs) and the folks interested in the MS DASH project
> > > > > > responded saying they are in support. Look at who is being Cced. A lot
> > > > > > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > > > > > https://urldefense.com/v3/__https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/__;!!I5pVk4LIGAfnvw!kaZ6EmPxEqGLG8JMw-_L0BgYq48Pe25wj6pHMF6BVei5WsRgwMeLQupmvgvLyN-LgXacKBzzs09TFzoQBw$
> > > > > >
> > > > > +1
> > > > > > > and it
> > > > > > > doesn't benefit or solve any problems of the broader networking stack
> > > > > > > (e.g. expressing or configuring parser graphs in general)
> > > > > > >
> > > > > >
> > > > >
> > > > > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
> > > >
> > > > Chris,
> > > >
> > > > When you say "it took mere seconds to compile and launch" are you
> > > > taking into account the ramp up time that it takes to learn P4 and
> > > > become proficient to do something interesting?
> >
> > Hi Tom, thanks for the dialog. To answer your question, it took seconds to compile and deploy, not learn P4. Adding the parsing for several headers took minutes. If you want to compare learning curve, learning to write P4 code and let the framework handle all the painful low-level Linux details is way easier than trying to learn how to write c code for Linux networking. It’s not even close. I’ve written C for 40 years, P4 for 7 years, and dabbled in eBPF so I can attest to the ease of learning and using P4. I’ve onboarded and mentored engineers who barely knew C, to develop complex networking products using P4, and built the automation APIs (REST, gRPC) to manage them. One person can develop an entire commercial product by themselves in months. P4 has expanded the reach of programmers such that both HW and SW engineers can easily learn P4 and become pretty adept at it. I would not expect even experienced c programmers to be able to master Linux internals very quickly. Writing a P4-TC program and injecting it via tc was like magic the first time.
> >
> > > > Considering that P4
> > > > syntax is very different from typical languages that networking
> > > > programmers are typically familiar with, this ramp up time is
> > > > non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> > > > in Restricted C-- this makes it easy for many programmers since they
> > > > don't have to learn a completely new language and so the ramp up time
> > > > for the average networking programmer is much less for using eBPF.
> >
> > I think your statement about “typical network programmers” overlooks the fact that since P4 was introduced, it has been taught in many universities to teach networking and possibly enabled a whole new breed of “network engineers” who can solve real problems without even knowing C programming. Without P4 they might never have gone this route. A class in network stack programming using c would have so many prerequisites to even get to parsing, compared to P4, where it could be demonstrated in one lesson. These “networking programmers” are not typical by your standards, but there are many such. They have just as much claim to the title "network programmer” as a C programmer. Similarly, an assembly language programmer is no less than a C or Python programmer. People writing P4 are usually focused on applications, and it is very useful and productive for that. Why should someone have to learn low-level C or eBPF to solve their problem?
>
> Hi Chris,
>
> You're comparing learning a completely new language versus programming
> in a subset of an established language, they're really not comparable.
> When one programs in Restricted-C they just need to understand what
> features of C are supported.
>
> >
> > > >
> > > > This is really the fundamental problem with DSLs, they require
> > > > specialized skill sets in a programming language for a narrow use case
> > > > (and specialized compilers, tool chains, debugging, etc)-- this means
> > > > a DSL only makes sense if there is no other means to accomplish the
> > > > same effects using a commodity language with perhaps a specialized
> > > > library (it's not just in the networking realm, consider the
> > > > advantages of using CUDA-C instead of a DSL for GPUs).
> >
> > A pretty strong opinion, but DSLs arise to fill a need and P4 did so. It's still going strong.
> >
> > > > Personally, I
> > > > don't believe that P4 has yet been proven necessary for programming a
> > > > datapath-- for instance we can program a parser in declarative
> > > > representation in C,
> > > > https://urldefense.com/v3/__https://netdevconf.info/0x16/papers/11/High*20Performance*20Programmable*20Parsers.pdf__;JSUl!!I5pVk4LIGAfnvw!m9zrSDvddfzSt_sMBjOEvqw31RzAwWlEDM4ah5IJ2kqsmq6XtPIVJd-1_ZoGWBXKLyda77RYLvGR83Ginw$.
> >
> > CPL (slide11) looks like a DSL wrapped in JSON to me. “Solution: Common Parser Language (CPL); Parser representation in declarative .json” So I am confused. It is either a new language a.k.a. DSL, or it's not. Nothing against it, I'm sure it is great, but let's call it what it is.
>
> Correct, it's not a new language. We've since renamed it Common Parser
> Representation.
>
> > We already have parser representations in declarative p4. And it's used and known worldwide. And has a respectable specification, many users and working groups. And it's formally provable (https://github.com/verified-network-toolchain/petr4).
> >
> > > >
> > > > So unless P4 is proven necessary, then I'm doubtful it will ever be a
> > > > ubiquitous way to program the kernel-- it seems much more likely that
> > > > people will continue to use C and eBPF, and for those users that want
> > > > to use P4 they can use P4->eBPF compiler.
> >
> > “ubiquitous way to program the kernel” – is not my goal. I don’t even want to know about the kernel when I am writing p4 - it's just a means to an end. I want to manipulate packets on a Linux host. P4DPDK, P4-eBPF, P4-TC – all let me do that. I LOVE the fact that P4-TC would be available in every Linux distro once upstreamed. It would solve so many deployment issues, benefit from regression testing, etc. So much goodness
> >
> > " and for those users that want to use P4 they can use P4->eBPF compiler." -I'd really like to choose for myself and not have someone make that choice for me. P4-TC checks all the boxes for me.
>
> Sure, but this is a lot of kernel code and that will require support
> and maintenance. It needs to be justified, and the fact that someone
> wants it just to have a choice is, frankly, not much of a
> justification. I think a justification needs to start with "Why isn't
> P4->eBPF sufficient?" (the question has been raised several times, but
> it still doesn't seem like there's a strong answer).
>
> Tom
> >
> > Thanks for the point of view, it's healthy to debate.
> > Cheers,
> > Chris
> >
> > > >
> > >
> > > Tom,
> > > I can't stop the distraction of this thread becoming a discussion on
> > > the merits of DSL vs a lower level language (and I know you are not a
> > > P4 fan) but please change the subject so we don't lose the main focus
> > > which is a discussion on the patches. I have done it for you. Chris if
> > > you wish to respond please respond under the new thread subject.
> > >
> > > cheers,
> > > jamal
> >

Jamal Hadi Salim May 24, 2024, 6:45 p.m. UTC | #27

On Fri, May 24, 2024 at 12:50 PM Tom Herbert <tom@sipanda.io> wrote:
>
> Hi Chris,
>
> P4 was created to support programming the hardware data path in high
> end routers, but P4-TC would enable the use of P4 across all Linux
> devices. Since this is potentially a lot of code going into the kernel
> to support it, I believe it's entirely fair for us to evaluate and
> give feedback on the P4 language and its suitability for the broader
> user community including environments where there will never be a need
> for P4 hardware. Note that I am questioning the design decisions of P4
> in the context of supporting a DSL in the kernel via P4-TC, if the
> P4->eBPF compiler is used then these concerns are less pertinent.
> Nevertheless, I would suggest that the P4 folks take the points being
> raised as constructive feedback on the language.
>

A lot of misleading info there. The P4 PNA architecture is for end
hosts not routers. For some NIC vendors you can go as far as writing
hardware GRO or TSO offload or variations of your liking using P4
(certainly not a middle feature). That notwithstanding the idea of
offloading match-action via TC is not new and has been widely
used/adopted for end hosts.

Tom, you want to perhaps disclose that you have a competing product?
That will help provide better context on your angle.
TBH, I am confused by what your end game is - is your view that a
crusade against P4 will make you sell more of your product? I have 3
NICs here with me (from 2 vendors) that are P4 programmable. You can
be as negative as you want about P4 but you are not going to make it
go away, sorry.

I will let Chris or whoever else on Cc respond to the P4 bits if they
wish because there's a misunderstanding there as well.

cheers,
jamal


> I took a cursory look at several P4 programs including tutorials,
> switch code, firewalls, etc. I have particular interest in variable
> length headers, so I'll use
> https://github.com/jafingerhut/p4-guide/blob/master/checksum/checksum-ipv4-with-options.p4
> as a reference.
>
> The first thing I noticed about P4 is that almost everything is
> expressed as a bit field. Like bit<8> and bit<32>. I suppose this
> arises from the fact that P4 was originally intended to run in non-CPU
> hardware where there's no inherent unit of data like bytes. But, CPUs
> don't work that way; CPUs work with ordinal types of bytes, half words,
> words, double words, etc. (__u8, __u16, __u32, __u64). That means that
> all mainstream computer languages fundamentally operate on ordinal
> types even if the variable types are explicitly declared. Someone
> programming in P4 needs to map ordinal types to bit fields in P4, so
> if they want a __u32 they need to use a bit<32> in P4 (except they're
> not exactly equivalent, a __u32 in C is guaranteed to be byte aligned
> and I'm assuming in P4 bit<32> is not guaranteed to be byte aligned--
> this seems like it might be susceptible to programming errors). I'd
> also point out that networking protocols are also defined using
> ordinal type fields, there are some exceptions, but for the most part
> protocol fields try to be in units of bytes (or octets if you want to
> be old school!). I believe life would be easier for the programmer if
> they could just define variables and fields with ordinal types, the
> fix here seems simple enough: just add typedefs to P4 like "typedef
> bit<32> __u32".
>
> In the IP header definition there's "varbit<320>  options;". It took
> me several seconds to decode this and realize this is space for forty
> bytes of IP options (i.e. 8 * 40 == 320). I suppose this follows the
> design of using bit fields for everything, but I think this is more
> than just an annoyance like the bit fields for ordinal types are.
> First off, it's not very readable. I've never heard anyone say that
> there's 320 bits of IP options, or seen an RFC specify that. Likewise,
> the standard Ethernet MTU is 1500 bytes, not 12,000 bits which would
> seem to be how that would be expressed in P4. So this seems very
> unreadable to me and potentially prone to errors. The fix for this
> also seems easy, why not just add varbyte to P4 so we can do
> varbyte<40>, varbyte<87>, varbyte<123>, etc.?
>
> The next thing I notice about the P4 programs I surveyed is that all
> of them seem to define the protocol headers within the program. Every
> program seems to have "header ethernet_t" and "header ipv4_t" and
> other protocols that are used and protocol constants like Ethertypes
> also seem to be spelled out in each program. Sometimes these are in
> include files within the program. What I don't see is that P4 has a
> standard set of include files for defining protocol headers. For
> instance, in Linux C we would just do "#include <linux/if_ether.h>"
> and "#include <linux/ip.h>" to get the definitions of the Ethernet
> header and IPv4 header. In fact, if someone were to submit a patch to
> Netdev that included its own definition of Ethernet or an IP header
> structure they would almost certainly get pushback. It's a fundamental
> programming principle, not just in networking but pretty much
> everywhere, to not continuously redefine common and standard
> constructs-- just put common things in header files that can be shared
> by multiple programs (to do otherwise substantially increases the
> possibility of errors, bloats code, and reduces readability).
>
> Marshalling up common definitions into header files that are common in
> the P4 development environment seems simple enough (maybe it's already
> done?), but I would also point out that Linux has include files that
> describe protocol formats and header structures for almost every
> protocol under the sun that are well tested. It would be great if
> we could somehow leverage that work. For instance, in the P4
> samples I looked at srcAddr and dstAddr are defined for IP addresses,
> but in linux/ip.h saddr and daddr are the respective field
> names. Why not just base the P4 definition on the Linux one? Then when
> someone is porting code from Linux to P4 they can use the same field
> names-- this makes things a lot easier on the programmer! I'll also
> mention that we wrote a little Python script to generate P4 header and
> constant definitions from Linux headers. It almost worked, the snag we
> hit was that P4 has some limits on nesting structures and unions so we
> couldn't translate some of the C structures to P4 (if you're
> interested I can provide the details on the problem we hit).
>
> The IPv4 header checksum code was a real head scratcher for me. Do we
> really need to state each field in the IP header just to compute the
> checksum? (and not just do this once, but twice :-( ). See code below
> for verifyChecksum and updateChecksum.
>
> In C, verifying and setting the IP header checksum is really easy:
>
> if (checksum(iphdr, 0, iphdr->ihl << 2))
>     goto bad_csum;
>
> iphdr->check = 0;
> iphdr->check = checksum(iphdr, 0, iphdr->ihl << 2);
>
> Relative to the C code, the P4 code seems very convoluted to me and
> prone to errors. What if someone accidentally omits a field? What if
> fields become slightly out of order? Also, no one would ever describe
> the IPv4 checksum as taking the checksum over the IHL, diffserv,
> totalLen, ... That is *way* too complicated for an algorithm that is
> really simple-- from RFC791: "The checksum field is the 16 bit one's
> complement of the one's complement sum of all 16 bit words in the
> header.". Reverse engineering the design, the clue seems to be
> HashAlgorithm.csum16. Maybe in P4 the IP checksum is just considered
> another form of hash, and I suspect the input to hash computation is
> specified as a sort of data structure to make things generic (for
> instance, how we create a substructure in flow keys in flow_dissector
> to compute a SipHash over the TCP and UDP tuple). But, the IPv4
> checksum isn't just another hash-- on a host, we need to compute the
> checksum for *every* IPv4 packet. This has to be fast and simple; we
> can do this in five instructions or fewer. So even if the
> code below is correct, I have to wonder how easy it is to emit an
> efficient executable. Would a compiler easily realize that all the
> fields in the pseudo structure are contiguous without holes such that
> it can emit those five instructions?
>
> I don't know how prevalent this method of listing all the fields in a
> data structure as arguments to a function is in P4, but, by almost any
> objective measure, I have to say that the code below is bad and
> bloated. Maybe there's a better way to do it in P4, but if there's not
> then this is a deficiency in the P4 language.
>
> Tom
>
> control verifyChecksum(inout headers hdr,
>                        inout metadata meta)
> {
>     apply {
>         // There is code similar to this in Github repo p4lang/p4c in
>         // file testdata/p4_16_samples/flowlet_switching-bmv2.p4
>         // However in that file it is only for a fixed length IPv4
>         // header with no options.
>         verify_checksum(true,
>             { hdr.ipv4.version,
>                 hdr.ipv4.ihl,
>                 hdr.ipv4.diffserv,
>                 hdr.ipv4.totalLen,
>                 hdr.ipv4.identification,
>                 hdr.ipv4.flags,
>                 hdr.ipv4.fragOffset,
>                 hdr.ipv4.ttl,
>                 hdr.ipv4.protocol,
>                 hdr.ipv4.srcAddr,
>                 hdr.ipv4.dstAddr
> #ifdef ALLOW_IPV4_OPTIONS
>                 , hdr.ipv4.options
> #endif /* ALLOW_IPV4_OPTIONS */
>             },
>             hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
>     }
> }
>
> control updateChecksum(inout headers hdr,
>                        inout metadata meta)
> {
>     apply {
>         update_checksum(true,
>             { hdr.ipv4.version,
>                 hdr.ipv4.ihl,
>                 hdr.ipv4.diffserv,
>                 hdr.ipv4.totalLen,
>                 hdr.ipv4.identification,
>                 hdr.ipv4.flags,
>                 hdr.ipv4.fragOffset,
>                 hdr.ipv4.ttl,
>                 hdr.ipv4.protocol,
>                 hdr.ipv4.srcAddr,
>                 hdr.ipv4.dstAddr
> #ifdef ALLOW_IPV4_OPTIONS
>                 , hdr.ipv4.options
> #endif /* ALLOW_IPV4_OPTIONS */
>             },
>             hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
>     }
> }
>
> On Wed, May 22, 2024 at 8:34 PM Tom Herbert <tom@sipanda.io> wrote:
> >
> > On Wed, May 22, 2024 at 7:30 PM Chris Sommers
> > <chris.sommers@keysight.com> wrote:
> > >
> > > > On Wed, May 22, 2024 at 8:54 PM Tom Herbert <mailto:tom@sipanda.io> wrote:
> > > > >
> > > > > On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> > > > > <mailto:chris.sommers@keysight.com> wrote:
> > > > > >
> > > > > > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <mailto:kuba@kernel.org> wrote:
> > > > > > > >
> > > > > > > > Hi Jamal!
> > > > > > > >
> > > > > > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > > > > > no because after all that time we have become dependent on it and
> > > > > > > > > frankly there was no technical reason not to use eBPF.
> > > > > > > >
> > > > > > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > > > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > > > > > my point of view.
> > > > > > > >
> > > > > > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > > > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > > > > > place.
> > > > > > > > To which you reply that you like (a highly dated type of) a netlink
> > > > > > > > interface, and (handwavey) ability to configure the data path SW or
> > > > > > > > HW via the same interface.
> > > > > > >
> > > > > > > It's not what I "like" , rather it is a requirement to support both
> > > > > > > s/w and h/w offload. The TC model is the traditional approach to
> > > > > > > deploy these models. I addressed the same comment you are making above
> > > > > > > in #1a and #1b  (https://urldefense.com/v3/__https://github.com/p4tc-dev/pushback-patches__;!!I5pVk4LIGAfnvw!kaZ6EmPxEqGLG8JMw-_L0BgYq48Pe25wj6pHMF6BVei5WsRgwMeLQupmvgvLyN-LgXacKBzzs0-w2zKP2A$).
> > > > > > >
> > > > > > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > > > > > problematic statement. Is BPF infra for the kernel community or is it
> > > > > > > something the ebpf folks can decide, at their whim, to allow who they
> > > > > > > like to use or not. We are not changing any BPF code. And there's
> > > > > > > already a case where the interfaces are used exactly as we used them
> > > > > > > in the conntrack code i pointed to in the page (we literally copied
> > > > > > > that code). Why is it ok for conntrack code to use exactly the same
> > > > > > > approach but not us?
> > > > > > >
> > > > > > > > AFAICT there's some but not very strong support for P4TC,
> > > > > > >
> > > > > > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > > > > > build P4-native NICs) and the folks interested in the MS DASH project
> > > > > > > responded saying they are in support. Look at who is being Cced. A lot
> > > > > > > of these folks who attend biweekly discussion calls on P4TC. Sample:
> > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/__;!!I5pVk4LIGAfnvw!kaZ6EmPxEqGLG8JMw-_L0BgYq48Pe25wj6pHMF6BVei5WsRgwMeLQupmvgvLyN-LgXacKBzzs09TFzoQBw$
> > > > > > >
> > > > > > +1
> > > > > > > > and it
> > > > > > > > doesn't benefit or solve any problems of the broader networking stack
> > > > > > > > (e.g. expressing or configuring parser graphs in general)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
> > > > >
> > > > > Chris,
> > > > >
> > > > > When you say "it took mere seconds to compile and launch" are you
> > > > > taking into account the ramp up time that it takes to learn P4 and
> > > > > become proficient to do something interesting?
> > >
> > > Hi Tom, thanks for the dialog. To answer your question, it took seconds to compile and deploy, not learn P4. Adding the parsing for several headers took minutes. If you want to compare learning curve, learning to write P4 code and let the framework handle all the painful low-level Linux details is way easier than trying to learn how to write c code for Linux networking. It’s not even close. I’ve written C for 40 years, P4 for 7 years, and dabbled in eBPF so I can attest to the ease of learning and using P4. I’ve onboarded and mentored engineers who barely knew C, to develop complex networking products using P4, and built the automation APIs (REST, gRPC) to manage them. One person can develop an entire commercial product by themselves in months. P4 has expanded the reach of programmers such that both HW and SW engineers can easily learn P4 and become pretty adept at it. I would not expect even experienced c programmers to be able to master Linux internals very quickly. Writing a P4-TC program and injecting it via tc was like magic the first time.
> > >
> > > > > Considering that P4
> > > > > syntax is very different from typical languages that networking
> > > > > programmers are typically familiar with, this ramp up time is
> > > > > non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> > > > > in Restricted C-- this makes it easy for many programmers since they
> > > > > don't have to learn a completely new language and so the ramp up time
> > > > > for the average networking programmer is much less for using eBPF.
> > >
> > > I think your statement about “typical network programmers” overlooks the fact that since P4 was introduced, it has been taught in many universities to teach networking and possibly enabled a whole new breed of “network engineers” who can solve real problems without even knowing C programming. Without P4 they might never have gone this route. A class in network stack programming using c would have so many prerequisites to even get to parsing, compared to P4, where it could be demonstrated in one lesson. These “networking programmers” are not typical by your standards, but there are many such. They have just as much claim to the title "network programmer” as a C programmer. Similarly, an assembly language programmer is no less than a C or Python programmer. People writing P4 are usually focused on applications, and it is very useful and productive for that. Why should someone have to learn low-level C or eBPF to solve their problem?
> >
> > Hi Chris,
> >
> > You're comparing learning a completely new language versus programming
> > in a subset of an established language, they're really not comparable.
> > When one programs in Restricted-C they just need to understand what
> > features of C are supported.
> >
> > >
> > > > >
> > > > > This is really the fundamental problem with DSLs, they require
> > > > > specialized skill sets in a programming language for a narrow use case
> > > > > (and specialized compilers, tool chains, debugging, etc)-- this means
> > > > > a DSL only makes sense if there is no other means to accomplish the
> > > > > same effects using a commodity language with perhaps a specialized
> > > > > library (it's not just in the networking realm, consider the
> > > > > advantages of using CUDA-C instead of a DSL for GPUs).
> > >
> > > A pretty strong opinion, but DSLs arise to fill a need and P4 did so. It's still going strong.
> > >
> > > > > Personally, I
> > > > > don't believe that P4 has yet been proven necessary for programming a
> > > > > datapath-- for instance we can program a parser in declarative
> > > > > representation in C,
> > > > > https://urldefense.com/v3/__https://netdevconf.info/0x16/papers/11/High*20Performance*20Programmable*20Parsers.pdf__;JSUl!!I5pVk4LIGAfnvw!m9zrSDvddfzSt_sMBjOEvqw31RzAwWlEDM4ah5IJ2kqsmq6XtPIVJd-1_ZoGWBXKLyda77RYLvGR83Ginw$.
> > >
> > > CPL (slide11) looks like a DSL wrapped in JSON to me. “Solution: Common Parser Language (CPL); Parser representation in declarative .json” So I am confused. It is either a new language a.k.a. DSL, or it's not. Nothing against it, I'm sure it is great, but let's call it what it is.
> >
> > Correct, it's not a new language. We've since renamed it Common Parser
> > Representation.
> >
> > > We already have parser representations in declarative P4. It is used and known worldwide, and has a respectable specification, many users, and working groups. It is also formally provable (https://github.com/verified-network-toolchain/petr4)
> > >
> > > > >
> > > > > So unless P4 is proven necessary, then I'm doubtful it will ever be a
> > > > > ubiquitous way to program the kernel-- it seems much more likely that
> > > > > people will continue to use C and eBPF, and for those users that want
> > > > > to use P4 they can use P4->eBPF compiler.
> > >
> > > “ubiquitous way to program the kernel” – is not my goal. I don’t even want to know about the kernel when I am writing p4 - it's just a means to an end. I want to manipulate packets on a Linux host. P4DPDK, P4-eBPF, P4-TC – all let me do that. I LOVE the fact that P4-TC would be available in every Linux distro once upstreamed. It would solve so many deployment issues, benefit from regression testing, etc. So much goodness
> > >
> > > " and for those users that want to use P4 they can use P4->eBPF compiler." -I'd really like to choose for myself and not have someone make that choice for me. P4-TC checks all the boxes for me.
> >
> > Sure, but this is a lot of kernel code and that will require support
> > and maintenance. It needs to be justified, and the fact that someone
> > wants it just to have a choice is, frankly, not much of a
> > justification. I think a justification needs to start with "Why isn't
> > P4->eBPF sufficient?" (the question has been raised several times, but
> > it still doesn't seem like there's a strong answer).
> >
> > Tom
> > >
> > > Thanks for the point of view, it's healthy to debate.
> > > Cheers,
> > > Chris
> > >
> > > > >
> > > >
> > > > Tom,
> > > > I can't stop the distraction of this thread becoming a discussion on
> > > > the merits of DSL vs a lower level language (and I know you are not a
> > > > P4 fan) but please change the subject so we don't lose the main focus
> > > > which is a discussion on the patches. I have done it for you. Chris, if
> > > > you wish to respond please respond under the new thread subject.
> > > >
> > > > cheers,
> > > > jamal
> > >

Chris Sommers May 24, 2024, 10:36 p.m. UTC | #28

Oops, resending as plaintext. Sigh...
> On Fri, May 24, 2024 at 12:50 PM Tom Herbert <tom@sipanda.io> wrote:
> >
> > Hi Chris,
> >
> > P4 was created to support programming the hardware data path in high
> > end routers, but P4-TC would enable the use of P4 across all Linux
> > devices. Since this is potentially a lot of code going into the kernel
> > to support it, I believe it's entirely fair for us to evaluate and
> > give feedback on the P4 language and its suitability for the broader
> > user community including environments where there will never be a need
> > for P4 hardware. Note that I am questioning the design decisions of P4
> > in the context of supporting a DSL in the kernel via P4-TC, if the
> > P4->eBPF compiler is used then these concerns are less pertinent.
> > Nevertheless, I would suggest that the P4 folks take the points being
> > raised as constructive feedback on the language.
> >
Hi Tom,
RE: Your observations and feedback on P4 language and prevalent coding practices, the most constructive approach would be to attend P4 working group meetings where your opinions and ideas will be respectfully considered and your offer to help gratefully accepted. You can also file issues or pull requests on GitHub. The Language and Architecture working groups would probably be the best places to participate. We are an open-minded and welcoming group of volunteers from industry and academia who are always looking for new members. It sounds like you have lots of relevant experience and a different point of view which could add hybrid vigor.
Chris Sommers
Distinguished SW Engineer
> 
> A lot of misleading info there. The P4 PNA architecture is for end
> hosts not routers. For some NIC vendors you can go as far as writing
> hardware GRO or TSO offload or variations of your liking using P4
> (certainly not a middle feature). That notwithstanding, the idea of
> offloading match-action via TC is not new and has been widely
> used/adopted for end hosts.
> 
> Tom, you want to perhaps disclose that you have a competing product?
> That will help provide better context on your angle.
> TBH, I am confused by what your end game is - is your view that a
> crusade against P4 will make you sell more of your product? I have 3
> NICs here with me (from 2 vendors) that are P4 programmable. You can
> be as negative as you want about P4 but you are not going to make it
> go away, sorry.
> 
> I will let Chris or whoever else on Cc respond to the P4 bits if they
> wish because there's a misunderstanding there as well.
> 
> cheers,
> jamal
> 
> 
> > I took a cursory look at several P4 programs including tutorials,
> > switch code, firewalls, etc. I have particular interest in variable
> > length headers, so I'll use
> > https://github.com/jafingerhut/p4-guide/blob/master/checksum/checksum-ipv4-with-options.p4
> > as a reference.
> >
> > The first thing I noticed about P4 is that almost everything is
> > expressed as a bit field. Like bit<8> and bit<32>. I suppose this
> > arises from the fact that P4 was originally intended to run in non-CPU
> > hardware where there's no inherent unit of data like bytes. But, CPUs
> > don't work that way; CPUs work on ordinal types of bytes, half words,
> > words, double words, etc. (__u8, __u16, __u32, __u64). That means that
> > all mainstream computer languages fundamentally operate on ordinal
> > types even if the variable types aren't explicitly declared. Someone
> > programming in P4 needs to map ordinal types to bit fields in P4, so
> > if they want a __u32 they need to use a bit<32> in P4 (except they're
> > not exactly equivalent, a __u32 in C is guaranteed to be byte aligned
> > and I'm assuming in P4 bit<32> is not guaranteed to be byte aligned--
> > this seems like it might be susceptible to programming errors). I'd
> > also point out that networking protocols are also defined using
> > ordinal type fields; there are some exceptions, but for the most part
> > protocol fields try to be in units of bytes (or octets if you want to
> > be old school!). I believe life would be easier for the programmer if
> > they could just define variables and fields with ordinal types. The
> > fix here seems simple enough: just add typedefs to P4 like "typedef
> > bit<32> __u32".
> >
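To make the "some exceptions" concrete, here is a minimal sketch in C of how sub-byte IPv4 fields are usually pulled out of byte-aligned ordinal types with shifts and masks. The helper names are hypothetical (they come from neither P4 nor the kernel), and the 16-bit flags/fragment word is assumed to be in host byte order already.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Upper 4 bits of the first IPv4 header byte: the version (bit<4> in P4). */
static inline uint8_t ipv4_version(uint8_t first_byte)
{
    return first_byte >> 4;
}

/* Lower 4 bits of the first IPv4 header byte: the IHL (bit<4> in P4). */
static inline uint8_t ipv4_ihl(uint8_t first_byte)
{
    return first_byte & 0x0f;
}

/* Top 3 bits of the host-order flags/fragment word (bit<3> in P4). */
static inline uint8_t ipv4_flags(uint16_t frag_word)
{
    return (uint8_t)(frag_word >> 13);
}

/* Low 13 bits of the host-order flags/fragment word (bit<13> in P4). */
static inline uint16_t ipv4_frag_offset(uint16_t frag_word)
{
    return frag_word & 0x1fff;
}
```

The storage stays in __u8/__u16-sized units; only the accessors know about the 4-, 3- and 13-bit widths that a P4 header declares directly.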
> > In the IP header definition there's "varbit<320>  options;". It took
> > me several seconds to decode this and realize this is space for forty
> > bytes of IP options (i.e. 8 * 40 == 320). I suppose this follows the
> > design of using bit fields for everything, but I think this is more
> > than just an annoyance like the bit fields for ordinal types are.
> > First off, it's not very readable. I've never heard anyone say that
> > there's 320 bits of IP options, or seen an RFC specify that. Likewise,
> > the standard Ethernet MTU is 1500 bytes, not 12,000 bits which would
> > seem to be how that would be expressed in P4. So this seems very
> > unreadable to me and potentially prone to errors. The fix for this
> > also seems easy, why not just add varbyte to P4 so we can do
> > varbyte<40>, varbyte<87>, varbyte<123>, etc.?
> >
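The 8 * 40 == 320 arithmetic can be sketched in C as follows. This is only an illustration under the RFC 791 header layout (IHL is a 4-bit count of 32-bit words, minimum 5); the constant and function names are hypothetical.

```c
#include <stdint.h>

enum {
    IPV4_MIN_IHL = 5,   /* 20-byte fixed header, counted in 32-bit words */
    IPV4_MAX_IHL = 15,  /* 4-bit field, so at most 15 words = 60 bytes */
};

/* Number of option bytes implied by an IHL value, or -1 if IHL is invalid. */
static inline int ipv4_options_bytes(uint8_t ihl)
{
    if (ihl < IPV4_MIN_IHL || ihl > IPV4_MAX_IHL)
        return -1;
    return (ihl - IPV4_MIN_IHL) * 4;
}

/* The same length in bits, which is the unit varbit<N> is declared in. */
static inline int ipv4_options_bits(uint8_t ihl)
{
    int bytes = ipv4_options_bytes(ihl);

    return bytes < 0 ? -1 : bytes * 8;
}
```

With the maximum IHL of 15 this gives 40 option bytes, i.e. the 320 bits in the varbit<320> declaration.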
> > The next thing I notice about the P4 programs I surveyed is that all
> > of them seem to define the protocol headers within the program. Every
> > program seems to have "header ethernet_t" and "header ipv4_t" and
> > other protocols that are used and protocol constants like Ethertypes
> > also seem to be spelled out in each program. Sometimes these are in
> > include files within the program. What I don't see is that P4 has a
> > standard set of include files for defining protocol headers. For
> > instance, in Linux C we would just do "#include <linux/if_ether.h>"
> > and "#include <linux/ip.h>" to get the definitions of the Ethernet
> > header and IPv4 header. In fact, if someone were to submit a patch to
> > Netdev that included its own definition of Ethernet or an IP header
> > structure they would almost certainly get pushback. It's a fundamental
> > programming principle, not just in networking but pretty much
> > everywhere, to not continuously redefine common and standard
> > constructs-- just put common things in header files that can be shared
> > by multiple programs (to do otherwise substantially increases the
> > possibility of errors, bloats code, and reduces readability).
> >
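For comparison, a minimal sketch of what reusing the shared Linux definitions looks like in C. It assumes a Linux build environment with the UAPI headers installed; the helper name is hypothetical.

```c
#include <stddef.h>
#include <linux/if_ether.h>   /* struct ethhdr, ETH_P_IP, ETH_HLEN */
#include <linux/ip.h>         /* struct iphdr with saddr/daddr fields */

/* Hypothetical helper: smallest frame that holds Ethernet + option-less IPv4. */
static inline size_t eth_ipv4_min_len(void)
{
    return sizeof(struct ethhdr) + sizeof(struct iphdr);
}
```

Nothing about the Ethernet or IPv4 layout is restated in the program itself; the header sizes and the ETH_P_IP Ethertype all come from the include files.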
> > Marshalling up common definitions into header files that are common in
> > the P4 development environment seems simple enough (maybe it's already
> > done?), but I would also point out that Linux has include files that
> > describe protocol formats and header structures for almost every
> > protocol under the sun that are well tested. It would be great if
> > we could somehow leverage that work. For instance, in the P4
> > samples I looked at srcAddr and dstAddr are defined for IP addresses,
> > but in linux/ip.h saddr and daddr are the respective field
> > names. Why not just base the P4 definition on the Linux one? Then when
> > someone is porting code from Linux to P4 they can use the same field
> > names-- this makes things a lot easier on the programmer! I'll also
> > mention that we wrote a little Python script to generate P4 header and
> > constant definitions from Linux headers. It almost worked, the snag we
> > hit was that P4 has some limits on nesting structures and unions so we
> > couldn't translate some of the C structures to P4 (if you're
> > interested I can provide the details on the problem we hit).
> >
> > The IPv4 header checksum code was a real head scratcher for me. Do we
> > really need to state each field in the IP header just to compute the
> > checksum? (and not just do this once, but twice :-( ). See code below
> > for verifyChecksum and updateChecksum.
> >
> > In C, verifying and setting the IP header checksum is really easy:
> >
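> > /* ihl counts 32-bit words, so the header length in bytes is ihl << 2 */
> > if (checksum(iphdr, 0, iphdr->ihl << 2))
> >     goto bad_csum;
> >
> > iphdr->check = 0;  /* checksum field must be zero while summing */
> > iphdr->check = checksum(iphdr, 0, iphdr->ihl << 2);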
> >
> > Relative to the C code, the P4 code seems very convoluted to me and
> > prone to errors. What if someone accidentally omits a field? What if
> > fields become slightly out of order? Also, no one would ever describe
> > the IPv4 checksum as taking the checksum over the IHL, diffserv,
> > totalLen, ... That is *way* too complicated for an algorithm that is
> > really simple-- from RFC791: "The checksum field is the 16 bit one's
> > complement of the one's complement sum of all 16 bit words in the
> > header.". Reverse engineering the design, the clue seems to be
> > HashAlgorithm.csum16. Maybe in P4 the IP checksum is just considered
> > another form of hash, and I suspect the input to hash computation is
> > specified as sort of data structure to make things generic (for
> > instance, how we create a substructure in flow keys in flow_dissector
> > to compute a SipHash over the TCP and UDP tuple). But, the IPv4
> > checksum isn't just another hash-- on a host, we need to compute the
> > checksum for *every* IPv4 packet. This has to be fast and simple, we
> > can do this in as few as five instructions. So even if the
> > code below is correct, I have to wonder how easy it is to emit an
> > efficient executable. Would a compiler easily realize that all the
> > fields in the pseudo structure are contiguous without holes such that
> > it can emit those five instructions?
> >
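As a reference point for the RFC 791 wording quoted above, a portable sketch of the checksum in C might look like the following. It is a plain byte-wise version written for clarity, not the optimized kernel routine; the function names are hypothetical, and the header is assumed to be a raw byte buffer in network order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's complement of the one's-complement sum of all 16-bit big-endian
 * words in data[0..len). An odd trailing byte is padded with zero.
 */
static uint16_t inet_csum16(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}

/* A header is valid when the sum over it, checksum included, folds to 0. */
static int ipv4_csum_ok(const uint8_t *hdr)
{
    size_t hlen = (size_t)(hdr[0] & 0x0f) * 4;  /* IHL is in 32-bit words */

    return hlen >= 20 && inet_csum16(hdr, hlen) == 0;
}

/* Zero the checksum field (bytes 10 and 11), then store the new value. */
static void ipv4_csum_set(uint8_t *hdr)
{
    size_t hlen = (size_t)(hdr[0] & 0x0f) * 4;
    uint16_t csum;

    hdr[10] = 0;
    hdr[11] = 0;
    csum = inet_csum16(hdr, hlen);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

Because the length comes from the IHL field, the same code covers headers with and without options, and no individual field is ever named.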
> > I don't know how prevalent this method of listing all the fields in a
> > data structure as arguments to a function is in P4, but, by almost any
> > objective measure, I have to say that the code below is bad and
> > bloated. Maybe there's a better way to do it in P4, but if there's not
> > then this is a deficiency in the P4 language.
> >
> > Tom
> >
> > control verifyChecksum(inout headers hdr,
> >                        inout metadata meta)
> > {
> >     apply {
> >         // There is code similar to this in Github repo p4lang/p4c in
> >         // file testdata/p4_16_samples/flowlet_switching-bmv2.p4
> >         // However in that file it is only for a fixed length IPv4
> >         // header with no options.
> >         verify_checksum(true,
> >             { hdr.ipv4.version,
> >                 hdr.ipv4.ihl,
> >                 hdr.ipv4.diffserv,
> >                 hdr.ipv4.totalLen,
> >                 hdr.ipv4.identification,
> >                 hdr.ipv4.flags,
> >                 hdr.ipv4.fragOffset,
> >                 hdr.ipv4.ttl,
> >                 hdr.ipv4.protocol,
> >                 hdr.ipv4.srcAddr,
> >                 hdr.ipv4.dstAddr
> > #ifdef ALLOW_IPV4_OPTIONS
> >                 , hdr.ipv4.options
> > #endif /* ALLOW_IPV4_OPTIONS */
> >             },
> >             hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
> >     }
> > }
> >
> > control updateChecksum(inout headers hdr,
> >                        inout metadata meta)
> > {
> >     apply {
> >         update_checksum(true,
> >             { hdr.ipv4.version,
> >                 hdr.ipv4.ihl,
> >                 hdr.ipv4.diffserv,
> >                 hdr.ipv4.totalLen,
> >                 hdr.ipv4.identification,
> >                 hdr.ipv4.flags,
> >                 hdr.ipv4.fragOffset,
> >                 hdr.ipv4.ttl,
> >                 hdr.ipv4.protocol,
> >                 hdr.ipv4.srcAddr,
> >                 hdr.ipv4.dstAddr
> > #ifdef ALLOW_IPV4_OPTIONS
> >                 , hdr.ipv4.options
> > #endif /* ALLOW_IPV4_OPTIONS */
> >             },
> >             hdr.ipv4.hdrChecksum, HashAlgorithm.csum16);
> >     }
> > }
> >
> > On Wed, May 22, 2024 at 8:34 PM Tom Herbert <tom@sipanda.io> wrote:
> > >
> > > On Wed, May 22, 2024 at 7:30 PM Chris Sommers
> > > <chris.sommers@keysight.com> wrote:
> > > >
> > > > > On Wed, May 22, 2024 at 8:54 PM Tom Herbert <mailto:tom@sipanda.io> wrote:
> > > > > >
> > > > > > On Wed, May 22, 2024 at 5:09 PM Chris Sommers
> > > > > > <mailto:chris.sommers@keysight.com> wrote:
> > > > > > >
> > > > > > > > On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <mailto:kuba@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > Hi Jamal!
> > > > > > > > >
> > > > > > > > > On Tue, 21 May 2024 08:35:07 -0400 Jamal Hadi Salim wrote:
> > > > > > > > > > At that point(v16) i asked for the series to be applied despite the
> > > > > > > > > > Nacks because, frankly, the Nacks have no merit. Paolo was not
> > > > > > > > > > comfortable applying patches with Nacks and tried to mediate. In his
> > > > > > > > > > mediation effort he asked if we could remove eBPF - and our answer was
> > > > > > > > > > no because after all that time we have become dependent on it and
> > > > > > > > > > frankly there was no technical reason not to use eBPF.
> > > > > > > > >
> > > > > > > > > I'm not fully clear on who you're appealing to, and I may be missing
> > > > > > > > > some points. But maybe it will be more useful than hurtful if I clarify
> > > > > > > > > my point of view.
> > > > > > > > >
> > > > > > > > > AFAIU BPF folks disagree with the use of their subsystem, and they
> > > > > > > > > point out that P4 pipelines can be implemented using BPF in the first
> > > > > > > > > place.
> > > > > > > > > To which you reply that you like (a highly dated type of) a netlink
> > > > > > > > > interface, and (handwavey) ability to configure the data path SW or
> > > > > > > > > HW via the same interface.
> > > > > > > >
> > > > > > > > It's not what I "like" , rather it is a requirement to support both
> > > > > > > > s/w and h/w offload. The TC model is the traditional approach to
> > > > > > > > deploy these models. I addressed the same comment you are making above
> > > > > > > > > in #1a and #1b  (https://github.com/p4tc-dev/pushback-patches).
> > > > > > > > >
> > > > > > > > OTOH, "BPF folks disagree with the use of their subsystem" is a
> > > > > > > > problematic statement. Is BPF infra for the kernel community or is it
> > > > > > > > something the ebpf folks can decide, at their whim, to allow who they
> > > > > > > > like to use or not. We are not changing any BPF code. And there's
> > > > > > > > already a case where the interfaces are used exactly as we used them
> > > > > > > > in the conntrack code i pointed to in the page (we literally copied
> > > > > > > > that code). Why is it ok for conntrack code to use exactly the same
> > > > > > > > approach but not us?
> > > > > > > >
> > > > > > > > > AFAICT there's some but not very strong support for P4TC,
> > > > > > > >
> > > > > > > > I dont agree. Paolo asked this question and afaik Intel, AMD (both
> > > > > > > > build P4-native NICs) and the folks interested in the MS DASH project
> > > > > > > > > responded saying they are in support. Look at who is being Cced. A lot
> > > > > > > > > of these folks attend biweekly discussion calls on P4TC. Sample:
> > > > > > > > > https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> > > > > > > >
> > > > > > > +1
> > > > > > > > > and it
> > > > > > > > > doesn't benefit or solve any problems of the broader networking stack
> > > > > > > > > (e.g. expressing or configuring parser graphs in general)
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Huh? As a DSL, P4 has already been proven to be an extremely effective and popular way to express parse graphs, stack manipulation, and stateful programming. Yesterday, I used the P4TC dev branch to implement something in one sitting, which includes parsing RoCEv2 network stacks. I just cut and pasted P4 code originally written for a P4 ASIC into a working P4TC example to add functionality. It took mere seconds to compile and launch it, and a few minutes to test it. I know of no other workflow which provides such quick turnaround and is so accessible. I'd like it to be as ubiquitous as eBPF itself.
> > > > > >
> > > > > > Chris,
> > > > > >
> > > > > > When you say "it took mere seconds to compile and launch" are you
> > > > > > taking into account the ramp up time that it takes to learn P4 and
> > > > > > become proficient to do something interesting?
> > > >
> > > > Hi Tom, thanks for the dialog. To answer your question, it took seconds to compile and deploy, not learn P4. Adding the parsing for several headers took minutes. If you want to compare learning curve, learning to write P4 code and let the framework handle all the painful low-level Linux details is way easier than trying to learn how to write c code for Linux networking. It’s not even close. I’ve written C for 40 years, P4 for 7 years, and dabbled in eBPF so I can attest to the ease of learning and using P4. I’ve onboarded and mentored engineers who barely knew C, to develop complex networking products using P4, and built the automation APIs (REST, gRPC) to manage them. One person can develop an entire commercial product by themselves in months. P4 has expanded the reach of programmers such that both HW and SW engineers can easily learn P4 and become pretty adept at it. I would not expect even experienced c programmers to be able to master Linux internals very quickly. Writing a P4-TC program and injecting it via tc was like magic the first time.
> > > >
> > > > > > Considering that P4
> > > > > > syntax is very different from the typical languages that networking
> > > > > > programmers are typically familiar with, this ramp up time is
> > > > > > non-zero. OTOH, eBPF is ubiquitous because it's primarily programmed
> > > > > > in Restricted C-- this makes it easy for many programmers since they
> > > > > > don't have to learn a completely new language and so the ramp up time
> > > > > > for the average networking programmer is much less for using eBPF.
> > > >
> > > > I think your statement about “typical network programmers” overlooks the fact that since P4 was introduced, it has been taught in many universities to teach networking and possibly enabled a whole new breed of “network engineers” who can solve real problems without even knowing C programming. Without P4 they might never have gone this route. A class in network stack programming using c would have so many prerequisites to even get to parsing, compared to P4, where it could be demonstrated in one lesson. These “networking programmers” are not typical by your standards, but there are many such. They have just as much claim to the title "network programmer” as a C programmer. Similarly, an assembly language programmer is no less than a C or Python programmer. People writing P4 are usually focused on applications, and it is very useful and productive for that. Why should someone have to learn low-level C or eBPF to solve their problem?
> > >
> > > Hi Chris,
> > >
> > > You're comparing learning a completely new language versus programming
> > > in a subset of an established language, they're really not comparable.
> > > When one programs in Restricted-C they just need to understand what
> > > features of C are supported.
> > >
> > > >
> > > > > >
> > > > > > This is really the fundamental problem with DSLs, they require
> > > > > > specialized skill sets in a programming language for a narrow use case
> > > > > > (and specialized compilers, tool chains, debugging, etc)-- this means
> > > > > > a DSL only makes sense if there is no other means to accomplish the
> > > > > > same effects using a commodity language with perhaps a specialized
> > > > > > library (it's not just in the networking realm, consider the
> > > > > > advantages of using CUDA-C instead of a DSL for GPUs).
> > > >
> > > > A pretty strong opinion, but DSLs arise to fill a need and P4 did so. It's still going strong.
> > > >
> > > > > > Personally, I
> > > > > > don't believe that P4 has yet been proven necessary for programming a
> > > > > > datapath-- for instance we can program a parser in declarative
> > > > > > representation in C,
> > > > > > https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf.
> > > >
> > > > CPL (slide11) looks like a DSL wrapped in JSON to me. “Solution: Common Parser Language (CPL); Parser representation in declarative .json” So I am confused. It is either a new language a.k.a. DSL, or it's not. Nothing against it, I'm sure it is great, but let's call it what it is.
> > >
> > > Correct, it's not a new language. We've since renamed it Common Parser
> > > Representation.
> > >
> > > > We already have parser representations in declarative P4. It is used and known worldwide, and has a respectable specification, many users, and working groups. It is also formally provable (https://github.com/verified-network-toolchain/petr4)
> > > >
> > > > > >
> > > > > > So unless P4 is proven necessary, then I'm doubtful it will ever be a
> > > > > > ubiquitous way to program the kernel-- it seems much more likely that
> > > > > > people will continue to use C and eBPF, and for those users that want
> > > > > > to use P4 they can use P4->eBPF compiler.
> > > >
> > > > “ubiquitous way to program the kernel” – is not my goal. I don’t even want to know about the kernel when I am writing p4 - it's just a means to an end. I want to manipulate packets on a Linux host. P4DPDK, P4-eBPF, P4-TC – all let me do that. I LOVE the fact that P4-TC would be available in every Linux distro once upstreamed. It would solve so many deployment issues, benefit from regression testing, etc. So much goodness
> > > >
> > > > " and for those users that want to use P4 they can use P4->eBPF compiler." -I'd really like to choose for myself and not have someone make that choice for me. P4-TC checks all the boxes for me.
> > >
> > > Sure, but this is a lot of kernel code and that will require support
> > > and maintenance. It needs to be justified, and the fact that someone
> > > wants it just to have a choice is, frankly, not much of a
> > > justification. I think a justification needs to start with "Why isn't
> > > P4->eBPF sufficient?" (the question has been raised several times, but
> > > it still doesn't seem like there's a strong answer).
> > >
> > > Tom
> > > >
> > > > Thanks for the point of view, it's healthy to debate.
> > > > Cheers,
> > > > Chris
> > > >
> > > > > >
> > > > >
> > > > > Tom,
> > > > > I can't stop the distraction of this thread becoming a discussion on
> > > > > the merits of DSL vs a lower level language (and I know you are not a
> > > > > P4 fan) but please change the subject so we don't lose the main focus
> > > > > which is a discussion on the patches. I have done it for you. Chris, if
> > > > > you wish to respond please respond under the new thread subject.
> > > > >
> > > > > cheers,
> > > > > jamal
> > > >

Jain, Vipin May 25, 2024, 4:43 p.m. UTC | #29

[AMD Official Use Only - AMD Internal Distribution Only]

My apologies, earlier email used html and was blocked by the list...
My response at the bottom as "VJ>"

John Fastabend May 28, 2024, 8:17 p.m. UTC | #30

Jain, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> My apologies, earlier email used html and was blocked by the list...
> My response at the bottom as "VJ>"
> 
> ________________________________________
> From: Jain, Vipin <Vipin.Jain@amd.com>
> Sent: Friday, May 24, 2024 2:28 PM
> To: Singhai, Anjali <anjali.singhai@intel.com>; Hadi Salim, Jamal <jhs@mojatatu.com>; Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>; Alexei Starovoitov <alexei.starovoitov@gmail.com>; Network Development <netdev@vger.kernel.org>; Chatterjee, Deb <deb.chatterjee@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; tom Herbert <tom@sipanda.io>; Marcelo Ricardo Leitner <mleitner@redhat.com>; Shirshyad, Mahesh <Mahesh.Shirshyad@amd.com>; Osinski, Tomasz <tomasz.osinski@intel.com>; Jiri Pirko <jiri@resnulli.us>; Cong Wang <xiyou.wangcong@gmail.com>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Vlad Buslov <vladbu@nvidia.com>; Simon Horman <horms@kernel.org>; Khalid Manaa <khalidm@nvidia.com>; Toke Høiland-Jørgensen <toke@redhat.com>; Victor Nogueira <victor@mojatatu.com>; Tammela, Pedro <pctammela@mojatatu.com>; Daly, Dan <dan.daly@intel.com>; Andy Fingerhut <andy.fingerhut@gmail.com>; Sommers, Chris <chris.sommers@keysight.com>; Matty Kadosh <mattyk@nvidia.com>; bpf <bpf@vger.kernel.org>; lwn@lwn.net <lwn@lwn.net>
> Subject: Re: On the NACKs on P4TC patches
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> I can ascertain (from AMD) that we have stated interest in, and are in full support of P4TC.
> 
> Happy to elaborate more if needed.
> 
> Thank you,
> Vipin Jain
> Sr Fellow Engineer, AMD
> ________________________________________
> From: Singhai, Anjali <anjali.singhai@intel.com>
> Sent: Wednesday, May 22, 2024 5:30 PM
> To: Hadi Salim, Jamal <jhs@mojatatu.com>; Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>; Alexei Starovoitov <alexei.starovoitov@gmail.com>; Network Development <netdev@vger.kernel.org>; Chatterjee, Deb <deb.chatterjee@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; tom Herbert <tom@sipanda.io>; Marcelo Ricardo Leitner <mleitner@redhat.com>; Shirshyad, Mahesh <Mahesh.Shirshyad@amd.com>; Osinski, Tomasz <tomasz.osinski@intel.com>; Jiri Pirko <jiri@resnulli.us>; Cong Wang <xiyou.wangcong@gmail.com>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Vlad Buslov <vladbu@nvidia.com>; Simon Horman <horms@kernel.org>; Khalid Manaa <khalidm@nvidia.com>; Toke Høiland-Jørgensen <toke@redhat.com>; Victor Nogueira <victor@mojatatu.com>; Tammela, Pedro <pctammela@mojatatu.com>; Jain, Vipin <Vipin.Jain@amd.com>; Daly, Dan <dan.daly@intel.com>; Andy Fingerhut <andy.fingerhut@gmail.com>; Sommers, Chris <chris.sommers@keysight.com>; Matty Kadosh <mattyk@nvidia.com>; bpf <bpf@vger.kernel.org>; lwn@lwn.net <lwn@lwn.net>
> Subject: RE: On the NACKs on P4TC patches
> 
> 
> On Wed, May 22, 2024 at 6:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> 
> >> AFAICT there's some but not very strong support for P4TC,
> 
> On Wed, May 22, 2024 at 4:04 PM Jamal Hadi Salim <jhs@mojatatu.com > wrote:
> >I don't agree. Paolo asked this question and afaik Intel, AMD (both build P4-native NICs) and the folks interested in the MS DASH project responded saying they are in support. Look at who is being Cced. A lot of these folks attend biweekly discussion calls on P4TC. Sample:
> >https://lore.kernel.org/netdev/IA0PR17MB7070B51A955FB8595FFBA5FB965E2@IA0PR17MB7070.namprd17.prod.outlook.com/
> 
> FWIW, Intel is in full support of P4TC as we have stated several times in the past.

> VJ> I can ascertain (from AMD) that we have stated interest in, and are in full support of P4TC. Happy to elaborate more if needed.
> VJ> Thanks, Vipin

Anjali and Vipin, is your support for HW support of P4 or for a Linux SW
implementation of P4? If it's for HW support, what drivers would we want to
support? Can you describe how to program these devices?

At the moment there hasn't been any movement on Linux hardware P4 support side
as far as I can tell. Yes there are some SDKs and build kits floating around for
FPGAs. For example maybe start with what drivers in kernel tree run the DPUs that
have this support? I think this would be a productive direction to go if we in
fact have hardware support in the works.

If you want a SW implementation in Linux, my opinion is still that pushing
a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4
onto hardware blocks is a fundamentally different architecture from mapping
P4 onto general purpose CPU and registers. My opinion -- to handle this you
need a per architecture backend/JIT to compile the P4 to native instructions.
This will give you the most flexibility to define new constructs, best
performance, and lowest overhead runtime. We have a P4 BPF backend already
and JITs for most architectures, so I don't see the need for P4TC in this
context.

If the end goal is a hardware offload control plane I'm skeptical we
even need something specific just for SW datapath. I would propose
a devlink or new infra to program the device directly vs overhead and
complexity of abstracting through 'tc'. If you want to emulate your
device use BPF or user space datapath.

.John

Singhai, Anjali May 28, 2024, 10:17 p.m. UTC | #31

>From: John Fastabend <john.fastabend@gmail.com> 
>Sent: Tuesday, May 28, 2024 1:17 PM

>Jain, Vipin wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>> 
>> My apologies, earlier email used html and was blocked by the list...
>> My response at the bottom as "VJ>"
>>
>> ________________________________________

>Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?

>At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.

>If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
>P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
>This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.

>If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.

>.John

John,
Let me start by saying production hardware exists. I think Jamal posted some links, but I can point you to our hardware.
The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm, so that's why we chose TC.
These devices are programmed using the TC/netlink interface, i.e. the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
One big requirement is that we want to avoid the flower trap - we don't want to be changing kernel/user/driver code every time we add new datapaths.
We feel the P4TC approach is the path to add Linux kernel support.

The s/w path is needed as well for several reasons.
We need the same P4 program to run either in software or hardware, or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path, as is done today in flower or u32. Also, it is now common in the P4 community that people define their datapath using their program and write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing, as you said, but also for the split/exception path. Chris can probably add more comments on the software datapath.
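
A minimal sketch of the split/exception path described above, under stated assumptions: `Entry`, `Datapath`, and `process` are hypothetical names, and the hardware and software tables are plain dicts. It only models the placement semantics of skip_sw/skip_hw (an entry is left out of the table it skips); it is not how tc or any driver implements them.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    key: str
    action: str
    skip_sw: bool = False  # do not install in the software table
    skip_hw: bool = False  # do not install in the hardware table


@dataclass
class Datapath:
    hw: dict = field(default_factory=dict)  # stand-in for the offloaded table
    sw: dict = field(default_factory=dict)  # stand-in for the kernel table

    def add(self, e: Entry) -> None:
        if e.skip_sw and e.skip_hw:
            raise ValueError("entry must live in at least one table")
        if not e.skip_hw:
            self.hw[e.key] = e.action
        if not e.skip_sw:
            self.sw[e.key] = e.action

    def process(self, key: str):
        # Hardware sees the packet first; a miss is punted to software,
        # which acts as the exception path.
        if key in self.hw:
            return ("hw", self.hw[key])
        if key in self.sw:
            return ("sw", self.sw[key])
        return ("sw", "default")
```

With one entry added using `skip_sw=True` and another using `skip_hw=True`, the first is handled entirely in the hardware stand-in and the second only after a hardware miss, which is the split mode; an entry with neither flag lives in both tables.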

Anjali

Tom Herbert May 28, 2024, 11:01 p.m. UTC | #32

On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
<anjali.singhai@intel.com> wrote:
>
> John,
> Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> We feel P4TC approach is the path to add Linux kernel support.
>
> The s/w path is needed as well for several reasons.
> We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.

Hi Anjali,

Are there any use cases of P4-TC that don't involve P4 hardware? If
someone wanted to write one-off datapath code for their deployment and
they didn't have P4 hardware, would you suggest that they write their
code in P4-TC? The reason I ask is because I'm concerned about the
performance of P4-TC. Like John said, this is mapping code that is
intended to run in specialized hardware into a CPU, and it's also
interpreted execution in TC. The performance numbers in
https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
seem to show that P4-TC has about half the performance of XDP. Even
with a lot of work, it's going to be difficult to substantially close
that gap.

The risk if we allow this into the kernel is that a vendor might be
tempted to point to P4-TC performance as a baseline to justify to
customers that they need to buy specialized hardware to get
performance, whereas if XDP was used maybe they don't need the
performance and cost of hardware. Note, this scenario already happened
once before: when DPDK joined the LF they made bogus claims that they
got 100x performance over the kernel -- had they put at least the
slightest effort into tuning the kernel that would have dropped the
delta by an order of magnitude, and since then we've pretty much
closed the gap (actually, this is precisely what motivated the
creation of XDP so I guess that story had a happy ending!). There are
circumstances where hardware offload may be warranted, but it needs to
be honestly justified by comparing it to an optimized software
solution -- so in the case of P4, it should be compared to well-written
XDP code for instance, not P4-TC.

Tom

>
>
> Anjali
>

Chris Sommers May 28, 2024, 11:43 p.m. UTC | #33

> On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
> <anjali.singhai@intel.com> wrote:
> >
> > John,
> > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> > We feel P4TC approach is the path to add Linux kernel support.
> >
> > The s/w path is needed as well for several reasons.
> > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.

Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
> 
> Hi Anjali,
> 
> Are there any use cases of P4-TC that don't involve P4 hardware? If
> someone wanted to write one off datapath code for their deployment and
> they didn't have P4 hardware would you suggest that they write they're
> code in P4-TC? The reason I ask is because I'm concerned about the
> performance of P4-TC. Like John said, this is mapping code that is
> intended to run in specialized hardware into a CPU, and it's also
> interpreted execution in TC. The performance numbers in
> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> seem to show that P4-TC has about half the performance of XDP. Even
> with a lot of work, it's going to be difficult to substantially close
> that gap.

AFAIK P4-TC can emit XDP or eBPF code depending upon the situation; someone more knowledgeable should chime in.
However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
If we held back every innovation because it was slower than something else, progress would suffer.
> 
> The risk if we allow this into the kernel is that a vendor might be
> tempted to point to P4-TC performance as a baseline to justify to
> customers that they need to buy specialized hardware to get
> performance, whereas if XDP was used maybe they don't need the
> performance and cost of hardware.

I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.

> Note, this scenario already happened
> once before, when the DPDK joined LF they made bogus claims that they
> got a 100x performance over the kernel-- had they put at least the
> slightest effort into tuning the kernel that would have dropped the
> delta by an order of magnitude, and since then we've pretty much
> closed the gap (actually, this is precisely what motivated the
> creation of XDP so I guess that story had a happy ending!) . There are
> circumstances where hardware offload may be warranted, but it needs to
> be honestly justified by comparing it to an optimized software
> solution-- so in the case of P4, it should be compared to well written
> XDP code for instance, not P4-TC.

I strongly disagree that "it needs to be honestly justified by comparing it to an optimized software solution."
Says who? This is no more factual than saying "C or golang need to be judged by comparing them to assembly language."
Today the gap between C and assembly is small, but way back in my career, C was way slower.
Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.

Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
is very compelling. Eventual HW offload using the same code will be game-changing. BMv2 is a big C++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
BMv2 has no HW offload path, and it's not really fast, so it remains mainly a research tool and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
> 
> Tom
> 
> >
> >
> > Anjali
>

John Fastabend May 28, 2024, 11:45 p.m. UTC | #34

Singhai, Anjali wrote:
> John,                                                                            
> Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.

Maybe more directly: what Linux drivers support this? That would be
a good first place to start IMO. Similarly, what AMD hardware
driver supports this? If I have two drivers from two vendors
with P4 support, this is great.

For Intel I assume this is idpf?

To be concrete, can we start with Linux driver A and P4 program
P? Modprobe driver A and push P4 program P so that it does
something very simple, and drop a CIDR/port range into a table.
Perhaps this is so obvious in your community; the trouble is that in
the context of a Linux driver it's not immediately obvious to me,
and I suspect it's not obvious to many others.

I really think walking through the key steps here would
help.

 1. $ p4IntelCompiler p4-dos.p4 -o myp4
 2. $ modprobe idpf
 3. $ ping -I eth0 10.0.0.1 # good
 4. $ p4Load p4-dos.p4
 5. -- load cidr into the hardware somehow -- p4rt-ctrl?
 6. $ ping -I eth0 10.0.0.1 # dropped
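
To make the intended effect of steps 5 and 6 concrete, here is a minimal sketch of a table holding CIDR/port-range entries whose hit action is drop. `DropTable` and `verdict` are hypothetical names chosen for illustration; nothing here reflects an actual idpf, IPDK, or P4Runtime interface.

```python
import ipaddress


class DropTable:
    """Toy match-action table: a hit on (CIDR, port range) means drop."""

    def __init__(self):
        self.entries = []  # list of (network, port_lo, port_hi)

    def add(self, cidr, port_lo, port_hi):
        self.entries.append((ipaddress.ip_network(cidr), port_lo, port_hi))

    def verdict(self, dst_ip, dst_port):
        addr = ipaddress.ip_address(dst_ip)
        for net, lo, hi in self.entries:
            if addr in net and lo <= dst_port <= hi:
                return "drop"
        return "pass"  # table miss: default action
```

Before any entry is added every packet passes (step 3); after `add("10.0.0.0/24", 0, 65535)` traffic to 10.0.0.1 is dropped (step 6). The open questions in this mail are about how that `add` call reaches the device, not about the match logic itself.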

This is an honest attempt to help, fwiw. Questions would be:

For compilation, do we need an artifact from Intel? It seems
so from the docs, but maybe it's a typo, not sure. I'm not overly stuck
on it, but it's worth mentioning if folks try to follow your docs.

For 2, I assume this is just a normal everyday module load, nothing
to see. Does it pop something up in /proc or in firmware or...?
How do I know it's P4 ready?

For 4, how does this actually work? Is it a file in a directory
that the driver pushes into firmware? How does the firmware know
I've done this? Does the Linux driver already support this?

For 5 (most interesting), how does this work today? How are
you currently talking to the driver/firmware to insert rules
and discover the tables? And does the idpf driver do this
already? Some side channel I guess? Is this p4rt-ctrl?

I've seen docs for the above in IPDK, but they are a bit hard
to follow if I'm honest.

I assume IPDK is the source folks point to when we mention there
is hardware somewhere. Also it seems there is IPDK BPF support
as well, which is interesting.

And do you know how the DPDK implementation works? Can we
learn from them? Is it just on top of the Flow API, which I suspect
we could easily use in devlink or some other *link?

> The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.

I think many first-order and important points have been skipped. How do you
program the device? Is it a firmware blob, a set of firmware commands,
something that comes to you on the device so only the vendor sees it? Maybe
I can infer this from some docs and some examples (by the way, I ran
through some of your DPU docs and such), but it's unclear how these
map onto Linux networking. Jiri started into this earlier and was
cut off because p4tc was not for hardware offload. Now it apparently is.

P4 is a good DSL for this, sure, and it has a runtime already specified,
which is great.

This is not a qdisc/tc; it's an entire hardware pipeline. I don't see
the reason to put it in TC at all.

> We feel P4TC approach is the path to add Linux kernel support.                   

I disagree with your implementation, not your goal of supporting
flexible hardware.

>                                                                                  
> The s/w path is needed as well for several reasons.                              
> We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.

None of the above requires P4TC. For different architectures you
build optimal backend compilers. You have a Xilinx backend,
an Intel backend, and a Linux CPU-based backend. I see no
reason to constrain the software case to map to a pipeline
model, for example. Software running on a CPU has very different
characteristics from something running on a TOR or FPGA.
Trying to push all these into one backend "model" will result
in a suboptimal result for every target. At the end of the
day, my $.02: P4 is a DSL and it needs a target-dependent compiler
in front of it. If I want to optimize my software pipeline, the
compiler should compress tables as much as possible and
search for an O(1) lookup even if getting that key is somewhat
expensive. Conversely, a TCAM changes the game. An FPGA is
going to be flexible and make lots of tradeoffs here, of which
I'm not an expert. Also, by avoiding loading the DSL into the kernel
you leave room for others to build new/better/worse DSLs as they
please.
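
As a rough illustration of why the same table may want a different layout per target, the sketch below implements one IPv4 longest-prefix-match table two ways: a TCAM-style ordered list of (value, mask) rules, and a software-style set of exact-match hash tables, one per prefix length. Both class names are hypothetical, and the software variant is only one possible layout: each probe is an O(1) dict lookup, but the number of probes is bounded by the number of distinct prefix lengths rather than being a single lookup.

```python
import ipaddress


class TernaryTable:
    """TCAM-style: (value, mask) rules, longest prefix first, first match wins."""

    def __init__(self):
        self.rules = []  # (value, mask, prefixlen, action)

    def add(self, cidr, action):
        net = ipaddress.ip_network(cidr)
        self.rules.append(
            (int(net.network_address), int(net.netmask), net.prefixlen, action)
        )
        self.rules.sort(key=lambda r: -r[2])

    def lookup(self, ip):
        addr = int(ipaddress.ip_address(ip))
        for value, mask, _plen, action in self.rules:
            if (addr & mask) == value:
                return action
        return None


class HashedLpm:
    """Software-style: one exact-match dict per prefix length (IPv4 only)."""

    def __init__(self):
        self.by_len = {}  # prefixlen -> {masked address: action}

    def add(self, cidr, action):
        net = ipaddress.ip_network(cidr)
        table = self.by_len.setdefault(net.prefixlen, {})
        table[int(net.network_address)] = action

    def lookup(self, ip):
        addr = int(ipaddress.ip_address(ip))
        for plen in sorted(self.by_len, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hit = self.by_len[plen].get(addr & mask)
            if hit is not None:
                return hit
        return None
```

Both classes return the same action for the same rules; what differs is the cost model. A real TCAM evaluates every rule in parallel, so the linear scan here is expensive only because it is emulated on a CPU, which is the point being made about per-target backends.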

The P4 community writes control applications on top of the
runtime spec, right? p4rt-ctl being the thing I found. This
should abstract the endpoint away to work with hardware or
software or FPGA or anything else.

.John

Jain, Vipin May 29, 2024, 1:44 a.m. UTC | #35

[Public]

Inline as <VJ2>... (was html, sorry)

Tom Herbert May 29, 2024, 1:55 a.m. UTC | #36

> None of above requires P4TC. For different architectures you
> build optimal backend compilers. You have a Xilenx backend,
> an Intel backend, and a Linux CPU based backend. I see no
> reason to constrain the software case to map to a pipeline
> model for example. Software running on a CPU has very different
> characteristics from something running on a TOR, or FPGA.
> Trying to push all these into one backend "model" will result
> in suboptimal result for every target. At the end of the
> day my .02$, P4 is a DSL it needs a target dependent compiler
> in front of it. I want to optimize my software pipeline the
> compiler should compress tables as much as possible and
> search for a O(1) lookup even if getting that key is somewhat
> expensive. Conversely a TCAM changes the game. An FPGA is
> going to be flexible and make lots of tradeoffs here of which
> I'm not an expert. Also by avoiding loading the DSL into the kernel
> you leave room for others to build new/better/worse DSLs as they
> please.
>

I think the general ask here is to define an Intermediate
Representation that describes a programmed data path where it's a
combination of declarative and imperative elements (parsers and table
descriptions are better in declarative representation, functional
logic seems more imperative). We also want references to accelerators
with dynamic runtime binding to hardware (there are some interesting
tricks we can do in the loader for a CPU target-- will talk about at
Netdev). With a good IR we can decouple the frontend from the backend
target which enables mixing and matching programming languages with
arbitrary HW or SW targets. So a good IR potentially enables a lot of
flexibility and freedom on both sides of the equation.

An IR also facilitates reasonable kernel offload via signing images
with a hash of the IR. So for instance, a frontend compiler could
compile a P4 program into the IR. That code could then be compiled
into a SW target, say eBPF, and maybe P4 hardware. Each image has the
hash of the IR. At runtime, the eBPF code could be loaded into the
kernel. The hardware image can be loaded into the device using a side
band mechanism. To offload, we would query the device-- if the hash
reported by the device matches the hash in the eBPF then we know that
the offload is viable. No jits, no pushing firmware bits through the
kernel, no need for device capabilities flags, and avoids the pitfalls
of TC flower.
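
A minimal sketch of the hash check described above, with loudly stated assumptions: SHA-256 is an arbitrary choice (the text only says "a hash"), `build_image` is a stand-in for a real backend compiler, and the device's reported hash is passed in as a plain string rather than queried from hardware.

```python
import hashlib


def ir_digest(ir_bytes: bytes) -> str:
    """Hash of the IR; SHA-256 is an assumption, any agreed hash would do."""
    return hashlib.sha256(ir_bytes).hexdigest()


def build_image(ir_bytes: bytes, target: str) -> dict:
    """Stand-in for a backend: the output image carries the hash of its IR."""
    return {
        "target": target,
        "ir_hash": ir_digest(ir_bytes),
        "code": b"<target-specific code would go here>",
    }


def offload_viable(sw_image: dict, device_reported_hash: str) -> bool:
    """Offload is viable only if both images came from the same IR."""
    return sw_image["ir_hash"] == device_reported_hash
```

Compiling one IR for both an eBPF target and a hardware target yields two images with the same `ir_hash`, so the comparison succeeds; if the device was loaded with an image built from a different IR, the hashes differ and the kernel keeps running the software path.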

There is one challenge here in how to deal with offloads that are
already integrated into the kernel. I think GRO is a great example.
GRO has been especially elusive as an offload since it requires a
device to autonomously parse packets on input.  We really want a GRO
offload that parses the same exact protocols the kernel does
(including encapsulations), but also implements the exact same logic
in timers and pushing reassembled segments. So this needs to be
programmable. The problem with the technique I described is that GRO
is integrated into the kernel so we have no basis for a hash. I think
the answer here is to start replacing fixed kernel C code with eBPF
even in the critical path (we already talked about replacing flow
dissector with eBPF).

Anyway, we have been working on this. There's Common Parser
Representation in json (formerly known as CPL, which we talked about at
Netdev). For execution logic, LLVM IR seems fine (btw, MLIR is really
useful!). We're just starting to look at tables (probably
also json). If there's interest I could share more...

Tom

Jamal Hadi Salim May 29, 2024, 11:10 a.m. UTC | #37

Not sure why my email was tagged as html and blocked, but here goes again:

On Tue, May 28, 2024 at 7:43 PM Chris Sommers
<chris.sommers@keysight.com> wrote:
>
> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
> > <anjali.singhai@intel.com> wrote:
> > >
> > > John,
> > > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> > > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> > > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> > > We feel P4TC approach is the path to add Linux kernel support.
> > >
> > > The s/w path is needed as well for several reasons.
> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
>
> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
> >
> > Hi Anjali,
> >
> > Are there any use cases of P4-TC that don't involve P4 hardware? If
> > someone wanted to write one off datapath code for their deployment and
> > they didn't have P4 hardware would you suggest that they write they're
> > code in P4-TC? The reason I ask is because I'm concerned about the
> > performance of P4-TC. Like John said, this is mapping code that is
> > intended to run in specialized hardware into a CPU, and it's also
> > interpreted execution in TC. The performance numbers in
> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > seem to show that P4-TC has about half the performance of XDP. Even
> > with a lot of work, it's going to be difficult to substantially close
> > that gap.
>
> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
> If we held back every innovation because it was slower than something else, progress would suffer.

Yes, XDP can be emitted based on compiler options (and was a
motivating factor in considering use of eBPF). Tom's comment above
seems to treat the fact that XDP tends to be faster than TC with
eBPF as a fault of P4TC.
In any case this statement falls under:
https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant

On Tom's theory that the vendors are going to push inferior s/w for
the sake of selling h/w - I would argue that we are not in the 90s
anymore and I don't believe there's any vendor conspiracy here
;-> A single port can do 100s of Gbps, and of course if you want to do
high speed you need to offload; no general purpose CPU will save you.
And really the argument that "offload=evil" holds no water anymore.

cheers,
jamal
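
[Illustration, not part of the original message.] The skip_sw/skip_hw
split and exception path described in the quoted text above can be
modelled roughly as follows. This is a minimal conceptual sketch in
Python, not P4TC or kernel code: the SplitTable class, its method names
and the string actions are all hypothetical, and the only behaviour
assumed is that an entry flagged skip_sw lives only in hardware, one
flagged skip_hw lives only in software, and an unflagged entry lives in
both.

```python
class SplitTable:
    """Toy model of one table backed by a hardware and a software copy."""

    def __init__(self):
        self.hw = {}  # stands in for entries offloaded to the device
        self.sw = {}  # stands in for the software datapath

    def add(self, key, action, skip_sw=False, skip_hw=False):
        # An entry has to be installed in at least one datapath.
        if skip_sw and skip_hw:
            raise ValueError("entry must live in at least one datapath")
        if not skip_hw:
            self.hw[key] = action
        if not skip_sw:
            self.sw[key] = action

    def process(self, key):
        # Hardware sees the packet first; a hardware miss is punted to
        # software, which acts as the exception path.
        if key in self.hw:
            return ("hw", self.hw[key])
        if key in self.sw:
            return ("sw", self.sw[key])
        return ("sw", "miss")


t = SplitTable()
t.add("10.0.0.1", "drop", skip_sw=True)     # hardware only
t.add("10.0.0.2", "forward", skip_hw=True)  # software only
t.add("10.0.0.3", "forward")                # both datapaths
```

Here t.process("10.0.0.2") is handled by the software copy because the
hardware copy has no such entry, which is the exception-path case.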

Jamal Hadi Salim May 29, 2024, 11:21 a.m. UTC | #38

On Tue, May 28, 2024 at 7:45 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Singhai, Anjali wrote:
> > >From: John Fastabend <john.fastabend@gmail.com>
> > >Sent: Tuesday, May 28, 2024 1:17 PM
> >
> > >Jain, Vipin wrote:
> > >> [AMD Official Use Only - AMD Internal Distribution Only]
> > >>
> > >> My apologies, earlier email used html and was blocked by the list...
> > >> My response at the bottom as "VJ>"
> > >>
> > >> ________________________________________
> >
> > >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?
> >
> > >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
> >
> > >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
> > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.
> >
> > >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
> >
> > >.John
> >
> >
> > John,
> > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
>
> Maybe more direct what Linux drivers support this? That would be
> a good first place to start IMO. Similarly what AMD hardware
> driver supports this. If I have two drivers from two vendors
> with P4 support this is great.
>
> For Intel I assume this is idpf?
>
> To be concrete can we start with Linux driver A and P4 program
> P. Modprobe driver A and push P4 program P so that it does
> something very simple, and drop a CIDR/Port range into a table.
> Perhaps this is so obvious in your community the trouble is in
> the context of a Linux driver its not immediately obvious to me
> and I would suspect its not obvious to many others.
>
> I really think walking through the key steps here would
> really help?
>
>  1. $ p4IntelCompiler p4-dos.p4 -o myp4
>  2. $ modprobe idpf
>  3. $ ping -i eth0 10.0.0.1 // good
>  4. $ p4Load p4-dos.p4
>  5. -- load cidr into the hardware somehow -- p4rt-ctrl?
>  6. $ ping -i eth0 10.0.0.1 // dropped
>
> This is an honest attempt to help fwiw. Questions would be.
>
> For compilation do we need an artifact from Intel it seems
> so from docs. But maybe a typo not sure. I'm not overly stuck
> on it but worth mentioning if folks try to follow your docs.
>
> For 2 I assume this is just normal every day module load nothing
> to see. Does it pop something up in /proc or in firmware or...?
> How do I know its P4 ready?
>
> For 4. How does this actually work? Is it a file in a directory
> the driver pushes into firmware? How does the firmware know
> I've done this? Does the Linux driver already support this?
>
> For 5 (most interesting) how does this work today. How are
> you currently talking to the driver/firmware to insert rules
> and discover the tables? And does the idpf driver do this
> already? Some side channel I guess? This is p4rt-ctrl?
>
> I've seen docs for above in ipdk, but they are a bit hard
> to follow if I'm honest.
>
> I assume IPDK is the source folks talk to when we mention there
> is hardware somewhere. Also it seems there is an IPDK BPF support
> as well which is interesting.
>
> And do you know how the DPDK implementation works? Can we
> learn from them is it just on top of Flow API which we
> could easily use in devlink or some other *link I suspect.
>
> > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
>
> I think many 1st order and important points have been skipped. How do you
> program the device is it a firmware blob, a set of firmware commands,
> something that comes to you on device so only vendor sees this? Maybe
> I can infer this from some docs and some examples (by the way I ran
> through some of your DPU docs and such) but its unclear how these
> map onto Linux networking. Jiri started into this earlier and was
> cut off because p4tc was not for hardware offload. Now it is apparently.
>
> P4 is a good DSL for this sure and it has a runtime already specified
> which is great.
>
> This is not a qdisc/tc its an entire hardware pipeline I don't see
> the reason to put it in TC at all.
>
> > We feel P4TC approach is the path to add Linux kernel support.
>
> I disagree with your implementation not your goals to support
> flexible hardware.
>
> >
> > The s/w path is needed as well for several reasons.
> > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
>
> None of above requires P4TC. For different architectures you
> build optimal backend compilers. You have a Xilenx backend,
> an Intel backend, and a Linux CPU based backend. I see no
> reason to constrain the software case to map to a pipeline
> model for example. Software running on a CPU has very different
> characteristics from something running on a TOR, or FPGA.
> Trying to push all these into one backend "model" will result
> in suboptimal result for every target. At the end of the
> day my .02$, P4 is a DSL it needs a target dependent compiler
> in front of it. I want to optimize my software pipeline the
> compiler should compress tables as much as possible and
> search for a O(1) lookup even if getting that key is somewhat
> expensive. Conversely a TCAM changes the game. An FPGA is
> going to be flexible and make lots of tradeoffs here of which
> I'm not an expert. Also by avoiding loading the DSL into the kernel
> you leave room for others to build new/better/worse DSLs as they
> please.
>
> The P4 community writes control applicatoins on top of the
> runtime spec right? p4rt-ctl being the thing I found. This
> should abstract the endpoint away to work with hardware or
> software or FPGA or anything else.
>

For the record, _every single patchset we have posted_ specified our
requirements as being s/w + h/w. A simpler version of the requirements
is listed here:
https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#summary-of-our-requirements

John's content variant above is described in:
https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#summary-of-our-requirements
According to him we should not bother with the kernel at all. It's
what is commonly referred to as Monday-morning quarterbacking or
armchair lawyering: "let's just do it my way and it will all be great".
It's 90% of these discussions and one of the reasons I put up that
page.

cheers,
jamal
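
[Illustration, not part of the original message.] John's walkthrough
above asks for something very simple: a CIDR/port range dropped into a
table. He also argues that a software backend should aim for O(1)
lookups. The sketch below shows one way such a table could look in
software, assuming IPv4 only and one hash map per prefix length. It is
not P4TC, idpf or p4rt-ctl code; DropTable, add_entry and lookup are
hypothetical names chosen for the example.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class DropTable:
    """Toy match-action table: IPv4 prefix match plus a port range."""

    # One dict per prefix length, so each probe is a single hash lookup.
    by_prefixlen: dict = field(default_factory=dict)

    def add_entry(self, cidr, port_lo, port_hi, action="drop"):
        net = ipaddress.ip_network(cidr)
        bucket = self.by_prefixlen.setdefault(net.prefixlen, {})
        key = int(net.network_address)
        bucket.setdefault(key, []).append((port_lo, port_hi, action))

    def lookup(self, dst_ip, dst_port):
        addr = int(ipaddress.ip_address(dst_ip))
        # Probe the longest prefixes first and return the first entry
        # whose prefix and port range both match.
        for plen in sorted(self.by_prefixlen, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            entries = self.by_prefixlen[plen].get(addr & mask, ())
            for lo, hi, action in entries:
                if lo <= dst_port <= hi:
                    return action
        return "pass"  # default action on a table miss


table = DropTable()
table.add_entry("10.0.0.0/24", 0, 1023)
```

With that single entry, table.lookup("10.0.0.1", 80) returns "drop",
while a port outside the range or an address outside the prefix falls
through to the default "pass". A TCAM would do the same match in one
step, which is the contrast John draws between targets.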

Jamal Hadi Salim May 29, 2024, 11:22 a.m. UTC | #39

On Wed, May 29, 2024 at 7:21 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, May 28, 2024 at 7:45 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Singhai, Anjali wrote:
> > > >From: John Fastabend <john.fastabend@gmail.com>
> > > >Sent: Tuesday, May 28, 2024 1:17 PM
> > >
> > > >Jain, Vipin wrote:
> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
> > > >>
> > > >> My apologies, earlier email used html and was blocked by the list...
> > > >> My response at the bottom as "VJ>"
> > > >>
> > > >> ________________________________________
> > >
> > > >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?
> > >
> > > >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
> > >
> > > >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
> > > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> > > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.
> > >
> > > >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
> > >
> > > >.John
> > >
> > >
> > > John,
> > > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> >
> > Maybe more direct what Linux drivers support this? That would be
> > a good first place to start IMO. Similarly what AMD hardware
> > driver supports this. If I have two drivers from two vendors
> > with P4 support this is great.
> >
> > For Intel I assume this is idpf?
> >
> > To be concrete can we start with Linux driver A and P4 program
> > P. Modprobe driver A and push P4 program P so that it does
> > something very simple, and drop a CIDR/Port range into a table.
> > Perhaps this is so obvious in your community the trouble is in
> > the context of a Linux driver its not immediately obvious to me
> > and I would suspect its not obvious to many others.
> >
> > I really think walking through the key steps here would
> > really help?
> >
> >  1. $ p4IntelCompiler p4-dos.p4 -o myp4
> >  2. $ modprobe idpf
> >  3. $ ping -i eth0 10.0.0.1 // good
> >  4. $ p4Load p4-dos.p4
> >  5. -- load cidr into the hardware somehow -- p4rt-ctrl?
> >  6. $ ping -i eth0 10.0.0.1 // dropped
> >
> > This is an honest attempt to help fwiw. Questions would be.
> >
> > For compilation do we need an artifact from Intel it seems
> > so from docs. But maybe a typo not sure. I'm not overly stuck
> > on it but worth mentioning if folks try to follow your docs.
> >
> > For 2 I assume this is just normal every day module load nothing
> > to see. Does it pop something up in /proc or in firmware or...?
> > How do I know its P4 ready?
> >
> > For 4. How does this actually work? Is it a file in a directory
> > the driver pushes into firmware? How does the firmware know
> > I've done this? Does the Linux driver already support this?
> >
> > For 5 (most interesting) how does this work today. How are
> > you currently talking to the driver/firmware to insert rules
> > and discover the tables? And does the idpf driver do this
> > already? Some side channel I guess? This is p4rt-ctrl?
> >
> > I've seen docs for above in ipdk, but they are a bit hard
> > to follow if I'm honest.
> >
> > I assume IPDK is the source folks talk to when we mention there
> > is hardware somewhere. Also it seems there is an IPDK BPF support
> > as well which is interesting.
> >
> > And do you know how the DPDK implementation works? Can we
> > learn from them is it just on top of Flow API which we
> > could easily use in devlink or some other *link I suspect.
> >
> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> > > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> > > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> >
> > I think many 1st order and important points have been skipped. How do you
> > program the device is it a firmware blob, a set of firmware commands,
> > something that comes to you on device so only vendor sees this? Maybe
> > I can infer this from some docs and some examples (by the way I ran
> > through some of your DPU docs and such) but its unclear how these
> > map onto Linux networking. Jiri started into this earlier and was
> > cut off because p4tc was not for hardware offload. Now it is apparently.
> >
> > P4 is a good DSL for this sure and it has a runtime already specified
> > which is great.
> >
> > This is not a qdisc/tc its an entire hardware pipeline I don't see
> > the reason to put it in TC at all.
> >
> > > We feel P4TC approach is the path to add Linux kernel support.
> >
> > I disagree with your implementation not your goals to support
> > flexible hardware.
> >
> > >
> > > The s/w path is needed as well for several reasons.
> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
> >
> > None of above requires P4TC. For different architectures you
> > build optimal backend compilers. You have a Xilenx backend,
> > an Intel backend, and a Linux CPU based backend. I see no
> > reason to constrain the software case to map to a pipeline
> > model for example. Software running on a CPU has very different
> > characteristics from something running on a TOR, or FPGA.
> > Trying to push all these into one backend "model" will result
> > in suboptimal result for every target. At the end of the
> > day my .02$, P4 is a DSL it needs a target dependent compiler
> > in front of it. I want to optimize my software pipeline the
> > compiler should compress tables as much as possible and
> > search for a O(1) lookup even if getting that key is somewhat
> > expensive. Conversely a TCAM changes the game. An FPGA is
> > going to be flexible and make lots of tradeoffs here of which
> > I'm not an expert. Also by avoiding loading the DSL into the kernel
> > you leave room for others to build new/better/worse DSLs as they
> > please.
> >
> > The P4 community writes control applicatoins on top of the
> > runtime spec right? p4rt-ctl being the thing I found. This
> > should abstract the endpoint away to work with hardware or
> > software or FPGA or anything else.
> >
>
> For the record, _every single patchset we have posted_ specified our
> requirements as being s/w + h/w. A simpler version of the requirements
> is listed here:
> https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#summary-of-our-requirements
>
> John's content variant above is described in:
> https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#summary-of-our-requirements
Correction: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#3-comment-but-you-did-it-wrong-heres-how-you-do-it

cheers,
jamal
> According to him we should not bother with the kernel at all. It's
> what is commonly referred to as a monday-morning quarterbacking or
> arm-chair lawyering "lets just do it my way and it will all be great".
> It's 90% of these discussions and one of the reasons I put up that
> page.
>
> cheers,
> jamal

Tom Herbert May 29, 2024, 2:45 p.m. UTC | #40

On Wed, May 29, 2024 at 4:01 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>
>
> On Tue, May 28, 2024 at 7:43 PM Chris Sommers <chris.sommers@keysight.com> wrote:
>>
>> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
>> > <anjali.singhai@intel.com> wrote:
>> > >
>> > > >From: John Fastabend <john.fastabend@gmail.com>
>> > > >Sent: Tuesday, May 28, 2024 1:17 PM
>> > >
>> > > >Jain, Vipin wrote:
>> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
>> > > >>
>> > > >> My apologies, earlier email used html and was blocked by the list...
>> > > >> My response at the bottom as "VJ>"
>> > > >>
>> > > >> ________________________________________
>> > >
>> > > >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?
>> > >
>> > > >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
>> > >
>> > > >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
>> > > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
>> > > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.
>> > >
>> > > >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
>> > >
>> > > >.John
>> > >
>> > >
>> > > John,
>> > > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
>> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
>> > > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
>> > > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
>> > > We feel P4TC approach is the path to add Linux kernel support.
>> > >
>> > > The s/w path is needed as well for several reasons.
>> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
>>
>> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
>> >
>> > Hi Anjali,
>> >
>> > Are there any use cases of P4-TC that don't involve P4 hardware? If
>> > someone wanted to write one off datapath code for their deployment and
>> > they didn't have P4 hardware would you suggest that they write they're
>> > code in P4-TC? The reason I ask is because I'm concerned about the
>> > performance of P4-TC. Like John said, this is mapping code that is
>> > intended to run in specialized hardware into a CPU, and it's also
>> > interpreted execution in TC. The performance numbers in
>> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> > seem to show that P4-TC has about half the performance of XDP. Even
>> > with a lot of work, it's going to be difficult to substantially close
>> > that gap.
>>
>> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
>> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
>> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
>> If we held back every innovation because it was slower than something else, progress would suffer.
>> >
>
>
> Yes, XDP can be emitted based on compiler options (and was a motivation factor in considering use of eBPF). Tom's comment above seems to confuse the fact that XDP tends to be faster than TC with eBPF as the fault of P4TC.
> In any case this statement falls under: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant

Jamal,

From that: "My response has always consistently been: performance is a
lower priority to P4 correctness and expressibility." That might be
true for P4, but not for the kernel. CPU performance is important, and
your statement below that justifies offloads on the basis that "no
general purpose CPU will save you" confirms that. Please be more
upfront about what the performance is like, including performance
numbers in the cover letter for the next patch set. This is the best
way to avoid confusion and rampant speculation, and if performance
isn't stellar, being open about it in the community is the best way to
figure out how to improve it.
>
> On Tom's theory that the vendors are going to push inferior s/w for the sake of selling h/w: we are not in the 90s anymore and there's no vendor conspiracy theory here: a single port can do 100s of Gbps, and of course if you want to do high speed you need to offload, no general purpose CPU will save you.

Let's not pretend that offloads are a magic bullet that just makes
everything better; if that were true then we'd all be using TOE by
now! There are a myriad of factors to consider in deciding whether
offloading is worth it. What is "high speed": is this small packets or
big packets, are we terminating TCP, are we doing some sort of
fast/slow path split which might work great in the lab but on the
Internet can become a DoS vector? What's the application? Are we just
trying to offload parts of the datapath, TCP, RDMA, memcached, ML
reduce operations? Are we trying to do line rate encryption,
compression, trying to do a billion PCB lookups a second? Are we
taking into account continuing advancements in the CPU that have in
the past made offloads obsolete (for instance, AES instructions pretty
much obsoleted initial attempts to offload IPsec)? How simple is the
programming model, how debuggable is it, what's the TCO?

I do believe offload is part of the solution. And the good news is
that programmable devices facilitate that. IMO, our challenge is to
create a facility in the kernel to support offloads in a much better
way (I don't believe there's disagreement with these points).

Tom





>
> cheers,
> jamal
>
>>
>> > The risk if we allow this into the kernel is that a vendor might be
>> > tempted to point to P4-TC performance as a baseline to justify to
>> > customers that they need to buy specialized hardware to get
>> > performance, whereas if XDP was used maybe they don't need the
>> > performance and cost of hardware.
>>
>> I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.
>>
>> > Note, this scenario already happened
>> > once before, when the DPDK joined LF they made bogus claims that they
>> > got a 100x performance over the kernel-- had they put at least the
>> > slightest effort into tuning the kernel that would have dropped the
>> > delta by an order of magnitude, and since then we've pretty much
>> > closed the gap (actually, this is precisely what motivated the
>> > creation of XDP so I guess that story had a happy ending!) . There are
>> > circumstances where hardware offload may be warranted, but it needs to
>> > be honestly justified by comparing it to an optimized software
>> > solution-- so in the case of P4, it should be compared to well written
>> > XDP code for instance, not P4-TC.
>>
>> I strongly disagree that it "it needs to be honestly justified by comparing it to an optimized software solution."
>> Says who? This is no more factual than saying "C or golang need to be judged by comparing it to assembly language."
>> Today the gap between C and assembly is small, but way back in my career, C was way slower.
>> Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
>> P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.
>>
>> Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
>> There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
>> I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
>> is very compelling. Eventual HW offload using the same code will be game-changing. Bmv2 is a big c++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
>> There is no HW offload path, and it's not really fast, so it remains mainly a researchy-thing and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
>> >
>> > Tom
>> >
>> > >
>> > >
>> > > Anjali
>> >

Jamal Hadi Salim May 30, 2024, 4:59 p.m. UTC | #41

On Wed, May 29, 2024 at 10:46 AM Tom Herbert <tom@sipanda.io> wrote:
>
> On Wed, May 29, 2024 at 4:01 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> >
> >
> > On Tue, May 28, 2024 at 7:43 PM Chris Sommers <chris.sommers@keysight.com> wrote:
> >>
> >> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
> >> > <anjali.singhai@intel.com> wrote:
> >> > >
> >> > > >From: John Fastabend <john.fastabend@gmail.com>
> >> > > >Sent: Tuesday, May 28, 2024 1:17 PM
> >> > >
> >> > > >Jain, Vipin wrote:
> >> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
> >> > > >>
> >> > > >> My apologies, earlier email used html and was blocked by the list...
> >> > > >> My response at the bottom as "VJ>"
> >> > > >>
> >> > > >> ________________________________________
> >> > >
> >> > > >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?
> >> > >
> >> > > >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
> >> > >
> >> > > >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
> >> > > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> >> > > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.
> >> > >
> >> > > >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
> >> > >
> >> > > >.John
> >> > >
> >> > >
> >> > > John,
> >> > > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> >> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> >> > > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> >> > > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> >> > > We feel P4TC approach is the path to add Linux kernel support.
> >> > >
> >> > > The s/w path is needed as well for several reasons.
> >> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
> >>
> >> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
> >> >
> >> > Hi Anjali,
> >> >
> >> > Are there any use cases of P4-TC that don't involve P4 hardware? If
> >> > someone wanted to write one off datapath code for their deployment and
> >> > they didn't have P4 hardware would you suggest that they write they're
> >> > code in P4-TC? The reason I ask is because I'm concerned about the
> >> > performance of P4-TC. Like John said, this is mapping code that is
> >> > intended to run in specialized hardware into a CPU, and it's also
> >> > interpreted execution in TC. The performance numbers in
> >> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >> > seem to show that P4-TC has about half the performance of XDP. Even
> >> > with a lot of work, it's going to be difficult to substantially close
> >> > that gap.
> >>
> >> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
> >> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
> >> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
> >> If we held back every innovation because it was slower than something else, progress would suffer.
> >> >
> >
> >
> > Yes, XDP can be emitted based on compiler options (and was a motivation factor in considering use of eBPF). Tom's comment above seems to confuse the fact that XDP tends to be faster than TC with eBPF as the fault of P4TC.
> > In any case this statement falls under: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant
>
> Jamal,
>
> From that: "My response has always consistently been: performance is a
> lower priority to P4 correctness and expressibility." That might be
> true for P4, but not for the kernel. CPU performance is important, and
> your statement below that justifies offloads on the basis that "no
> general purpose CPU will save you" confirms that. Please be more
> upfront about what  the performance is like including performance
> numbers in the cover letter for the next patch set. This is the best
> way to avoid confusion and rampant speculation, and if performance
> isn't stellar being open about it in the community is the best way to
> figure out how to improve it.

I believe you are misreading those graphs or maybe you are mixing it
with the original u32/pedit script approach? The tests are run at TC
and XDP layers. Pay particular attention to the results of the
handcoded/tuned eBPF datapath at TC and at XDP compared to analogous
ones generated by the compiler. You will notice +/-5% or so
differences. That is with the current compiler generated code. We are
looking to improve that - but do note that is generated code, nothing
to do with the kernel. As the P4 program becomes more complex (many
tables, longer keys, more entries, more complex actions) then we
become compute bound, so no difference really.

Now having said that: yes - s/w performance is certainly _not our
highest priority feature_. That is not saying we don't care, but as
the text said: if I am getting 2Mpps using handcoding vs 1.84Mpps using
generated code (per those graphs) and I can generate code and execute
it in 5 minutes (Chris, who is knowledgeable in P4, was able to do it in
less time), then _I pick the code generation any day of the week_.
Tooling, tooling, tooling.
To re-iterate, the most important requirement is the abstraction, meaning:
I can take the same P4 program I am running in s/w, generate the
AMD or Intel offload equivalent using a different backend, and get
several magnitude improvements in performance because it is now
running in h/w. I still get to use the same application controlling
s/w and/or hardware, etc.
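
For reference, the size of the gap between the two throughput figures
quoted above can be checked with a few lines of Python. This is only a
sketch of the arithmetic; the function name is made up for
illustration, and the only inputs are the 2Mpps and 1.84Mpps figures
from this message.

```python
def relative_gap_percent(baseline_mpps: float, measured_mpps: float) -> float:
    """Return how far `measured_mpps` falls short of `baseline_mpps`, in percent."""
    return 100.0 * (baseline_mpps - measured_mpps) / baseline_mpps


# Figures quoted in the message: handcoded vs compiler-generated datapath.
handcoded_mpps = 2.0
generated_mpps = 1.84

gap = relative_gap_percent(handcoded_mpps, generated_mpps)
print(f"generated code is {gap:.1f}% slower than handcoded")
```

For this particular pair of numbers the shortfall works out to about
8% of the handcoded rate.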

TBH, I am indifferent and could add some numbers, but that misses the
emphasis of what we are trying to achieve. The cover letter is already
half a novel - with the short attention span most people have, it will
just muddy the waters.

> >
> > On Tom's theory that the vendors are going to push inferior s/w for the sake of selling h/w: we are not in the 90s anymore and there's no vendor conspiracy theory here: a single port can do 100s of Gbps, and of course if you want to do high speed you need to offload, no general purpose CPU will save you.
>
> Let's not pretend that offloads are a magic bullet that just makes
> everything better, if that were true then we'd all be using TOE by
> now! There are a myriad of factors to consider whether offloading is
> worth it. What is "high speed", is this small packets or big packets,
> are we terminating TCP, are we doing some sort of fast/slow path split
> which might work great in the lab but on the Internet can become a DOS
> vector? What's the application? Are we just trying to offload parts of
> the datapath, TCP, RDMA, memcached, ML reduce operations? Are we
> trying to do line rate encryption, compression, trying to do a billion
> PCB lookups a second? Are we taking into account continuing
> advancements in the CPU that have in the past made offloads obsolete
> (for instance, AES instructions pretty much obsoleted initial attempts
> to obsolete IPsec)? How simple is the programming model, how
> debuggable is it, what's the TCO?
>
> I do believe offload is part of the solution. And the good news is
> that programmable devices facilitate that. IMO, our challenge is to
> create a facility in the kernel to kernel offloads in a much better
> way (I don't believe there's disagreement with these points).
>

This is about a MAT (match-action table) model whose offloads are
covered via TC; it is well understood and very specific.
We are not trying to solve "the world of offloads", which includes
TOEs. P4-aware NICs are on the market and afaik those ASICs are not
solving TOE. I thought you understood the scope, but if not, start by
reading this: https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

cheers,
jamal

> Tom
>
>
>
>
>
> >
> > cheers,
> > jamal
> >
> >>
> >> > The risk if we allow this into the kernel is that a vendor might be
> >> > tempted to point to P4-TC performance as a baseline to justify to
> >> > customers that they need to buy specialized hardware to get
> >> > performance, whereas if XDP was used maybe they don't need the
> >> > performance and cost of hardware.
> >>
> >> I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.
> >>
> >> > Note, this scenario already happened
> >> > once before, when the DPDK joined LF they made bogus claims that they
> >> > got a 100x performance over the kernel-- had they put at least the
> >> > slightest effort into tuning the kernel that would have dropped the
> >> > delta by an order of magnitude, and since then we've pretty much
> >> > closed the gap (actually, this is precisely what motivated the
> >> > creation of XDP so I guess that story had a happy ending!) . There are
> >> > circumstances where hardware offload may be warranted, but it needs to
> >> > be honestly justified by comparing it to an optimized software
> >> > solution-- so in the case of P4, it should be compared to well written
> >> > XDP code for instance, not P4-TC.
> >>
> >> I strongly disagree that it "it needs to be honestly justified by comparing it to an optimized software solution."
> >> Says who? This is no more factual than saying "C or golang need to be judged by comparing it to assembly language."
> >> Today the gap between C and assembly is small, but way back in my career, C was way slower.
> >> Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
> >> P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.
> >>
> >> Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
> >> There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
> >> I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
> >> is very compelling. Eventual HW offload using the same code will be game-changing. Bmv2 is a big c++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
> >> There is no HW offload path, and it's not really fast, so it remains mainly a researchy-thing and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
> >> >
> >> > Tom
> >> >
> >> > >
> >> > >
> >> > > Anjali
> >> >

Tom Herbert May 30, 2024, 6:16 p.m. UTC | #42

On Thu, May 30, 2024 at 9:59 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, May 29, 2024 at 10:46 AM Tom Herbert <tom@sipanda.io> wrote:
> >
> > On Wed, May 29, 2024 at 4:01 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > >
> > >
> > > On Tue, May 28, 2024 at 7:43 PM Chris Sommers <chris.sommers@keysight.com> wrote:
> > >>
> > >> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
> > >> > <anjali.singhai@intel.com> wrote:
> > >> > >
> > >> > > >From: John Fastabend <john.fastabend@gmail.com>
> > >> > > >Sent: Tuesday, May 28, 2024 1:17 PM
> > >> > >
> > >> > > >Jain, Vipin wrote:
> > >> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
> > >> > > >>
> > >> > > >> My apologies, earlier email used html and was blocked by the list...
> > >> > > >> My response at the bottom as "VJ>"
> > >> > > >>
> > >> > > >> ________________________________________
> > >> > >
> > >> > > >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program >these devices?
> > >> > >
> > >> > > >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example >maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
> > >> > >
> > >> > > >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally >different architecture from mapping
> > >> > > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> > >> > > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't >see the need for P4TC in this context.
> > >> > >
> > >> > > >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly >vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
> > >> > >
> > >> > > >.John
> > >> > >
> > >> > >
> > >> > > John,
> > >> > > Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> > >> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> > >> > > These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> > >> > > One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> > >> > > We feel P4TC approach is the path to add Linux kernel support.
> > >> > >
> > >> > > The s/w path is needed as well for several reasons.
> > >> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
> > >>
> > >> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
> > >> >
> > >> > Hi Anjali,
> > >> >
> > >> > Are there any use cases of P4-TC that don't involve P4 hardware? If
> > >> > someone wanted to write one off datapath code for their deployment and
> > >> > they didn't have P4 hardware would you suggest that they write they're
> > >> > code in P4-TC? The reason I ask is because I'm concerned about the
> > >> > performance of P4-TC. Like John said, this is mapping code that is
> > >> > intended to run in specialized hardware into a CPU, and it's also
> > >> > interpreted execution in TC. The performance numbers in
> > >> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > >> > seem to show that P4-TC has about half the performance of XDP. Even
> > >> > with a lot of work, it's going to be difficult to substantially close
> > >> > that gap.
> > >>
> > >> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
> > >> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
> > >> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
> > >> If we held back every innovation because it was slower than something else, progress would suffer.
> > >> >
> > >
> > >
> > > Yes, XDP can be emitted based on compiler options (and was a motivation factor in considering use of eBPF). Tom's comment above seems to confuse the fact that XDP tends to be faster than TC with eBPF as the fault of P4TC.
> > > In any case this statement falls under: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant
> >
> > Jamal,
> >
> > From that: "My response has always consistently been: performance is a
> > lower priority to P4 correctness and expressibility." That might be
> > true for P4, but not for the kernel. CPU performance is important, and
> > your statement below that justifies offloads on the basis that "no
> > general purpose CPU will save you" confirms that. Please be more
> > upfront about what  the performance is like including performance
> > numbers in the cover letter for the next patch set. This is the best
> > way to avoid confusion and rampant speculation, and if performance
> > isn't stellar being open about it in the community is the best way to
> > figure out how to improve it.
>
> I believe you are misreading those graphs or maybe you are mixing it
> with the original u32/pedit script approach? The tests are run at TC
> and XDP layers. Pay particular attention to the results of the
> handcoded/tuned eBPF datapath at TC and at XDP compared to analogous
> ones generated by the compiler. You will notice +/-5% or so
> differences. That is with the current compiler generated code. We are
> looking to improve that - but do note that is generated code, nothing
> to do with the kernel. As the P4 program becomes more complex (many
> tables, longer keys, more entries, more complex actions) then we
> become compute bound, so no difference really.
>
> Now having said that: yes - s/w performance is certainly _not our
> highest priority feature_ and that is not saying we dont care but as
> the text said If i am getting 2Mpps using handcoding vs 1.84Mpps using
> generated code(per those graphs) and i can generate code and execute
> it in 5 minutes (Chris who is knowledgeable in P4 was able to do it in
> less time), then _i pick the code generation any day of the week_.
> Tooling, tooling, tooling.
> To re-iterate, the most important requirement is the abstraction, meaning:
> I can take the same P4 program I am running in s/w and generate using
> a different backend for AMD or Intel offload equivalent and get
> several magnitude improvements in performance because it is now
> running in h/w. I still get to use the same application controlling
> either s/w and/or hardware, etc

Jamal,

I believe you're making contradictory points here. On one hand you're
saying that performance isn't a high priority and that it's enough to
get the abstraction right. On the other hand you seem to be making the
argument that we need hardware offload because performance of software
in a CPU is so bad. I can't reconcile these statements.

Also, when you claim that hardware is going to deliver "several
magnitude improvements in performance" over an implementation that has
not been optimized for performance in a CPU, then you are heading down
the path of justifying hardware offload on the basis that it performs
better than baseline software which has not been at all optimized.
IMO, that is not a valid justification and I believe it would be a
disservice to our users if they buy into hardware where a software
solution would have been sufficient had someone put in the effort to
optimize it.

>
> TBH, I am indifferent and could add some numbers but it is missing the
> emphasis of what we are trying to achieve, the cover letter is already
> half a novel - with the short attention span most people have it will
> be just muddying the waters.

This is putting code in the kernel that runs in the Linux networking
data path. It shouldn't be any surprise that we're asking for some
quantification and analysis of performance in the patch description.

Tom

>
> > >
> > > On Tom's theory that the vendors are going to push inferior s/w for the sake of selling h/w: we are not in the 90s anymore and there's no vendor conspiracy theory here: a single port can do 100s of Gbps, and of course if you want to do high speed you need to offload, no general purpose CPU will save you.
> >
> > Let's not pretend that offloads are a magic bullet that just makes
> > everything better, if that were true then we'd all be using TOE by
> > now! There are a myriad of factors to consider whether offloading is
> > worth it. What is "high speed", is this small packets or big packets,
> > are we terminating TCP, are we doing some sort of fast/slow path split
> > which might work great in the lab but on the Internet can become a DOS
> > vector? What's the application? Are we just trying to offload parts of
> > the datapath, TCP, RDMA, memcached, ML reduce operations? Are we
> > trying to do line rate encryption, compression, trying to do a billion
> > PCB lookups a second? Are we taking into account continuing
> > advancements in the CPU that have in the past made offloads obsolete
> > (for instance, AES instructions pretty much obsoleted initial attempts
> > to obsolete IPsec)? How simple is the programming model, how
> > debuggable is it, what's the TCO?
> >
> > I do believe offload is part of the solution. And the good news is
> > that programmable devices facilitate that. IMO, our challenge is to
> > create a facility in the kernel to kernel offloads in a much better
> > way (I don't believe there's disagreement with these points).
> >
>
> This is about a MAT(match-action table) model whose offloads are
> covered via TC and is well understood and is very specific.
> We are not trying to solve "the world of offloads" which includes
> TOEs. P4 aware NICs are in the market and afaik those ASICs are not
> solving TOE. I thought you understand the scope but if not start by
> reading this: https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
>
> cheers,
> jamal
>
> > Tom
> >
> >
> >
> >
> >
> > >
> > > cheers,
> > > jamal
> > >
> > >>
> > >> > The risk if we allow this into the kernel is that a vendor might be
> > >> > tempted to point to P4-TC performance as a baseline to justify to
> > >> > customers that they need to buy specialized hardware to get
> > >> > performance, whereas if XDP was used maybe they don't need the
> > >> > performance and cost of hardware.
> > >>
> > >> I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.
> > >>
> > >> > Note, this scenario already happened
> > >> > once before, when the DPDK joined LF they made bogus claims that they
> > >> > got a 100x performance over the kernel-- had they put at least the
> > >> > slightest effort into tuning the kernel that would have dropped the
> > >> > delta by an order of magnitude, and since then we've pretty much
> > >> > closed the gap (actually, this is precisely what motivated the
> > >> > creation of XDP so I guess that story had a happy ending!) . There are
> > >> > circumstances where hardware offload may be warranted, but it needs to
> > >> > be honestly justified by comparing it to an optimized software
> > >> > solution-- so in the case of P4, it should be compared to well written
> > >> > XDP code for instance, not P4-TC.
> > >>
> > >> I strongly disagree that it "it needs to be honestly justified by comparing it to an optimized software solution."
> > >> Says who? This is no more factual than saying "C or golang need to be judged by comparing it to assembly language."
> > >> Today the gap between C and assembly is small, but way back in my career, C was way slower.
> > >> Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
> > >> P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.
> > >>
> > >> Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
> > >> There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
> > >> I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
> > >> is very compelling. Eventual HW offload using the same code will be game-changing. Bmv2 is a big c++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
> > >> There is no HW offload path, and it's not really fast, so it remains mainly a researchy-thing and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
> > >> >
> > >> > Tom
> > >> >
> > >> > >
> > >> > >
> > >> > > Anjali
> > >> >

Jakub Kicinski June 11, 2024, 2:21 p.m. UTC | #43

Since the inevitable LWN article has been written, let me put more
detail into what I already mentioned here:

https://lore.kernel.org/all/20240301090020.7c9ebc1d@kernel.org/

for the benefit of non-networking people.

On Wed, 10 Apr 2024 10:01:26 -0400 Jamal Hadi Salim wrote:
> P4TC builds on top of many years of Linux TC experiences of a netlink
> control path interface coupled with a software datapath with an equivalent
> offloadable hardware datapath.

The point of having SW datapath is to provide a blueprint for the
behavior. This is completely moot for P4 which comes as a standard.

Besides, we already have 5 (or more) flow offloads; we don't need
a 6th, completely disconnected from the existing ones, leaving
users guessing which one to use and how they interact.

In my opinion, a reasonable way to implement a programmable parser for
Linux is:

 1. User writes their parser in whatever DSL they want
 2. User compiles the parser in user space
   2.1 Compiler embeds a representation of the graph in the blob
 3. User puts the blob in /lib/firmware
 4. devlink dev $dev reload action parser-fetch $filename
 5. devlink loads the file, parses it to extract the representation
    from 2.1, and passes the blob to the driver
   5.1 driver/fw reinitializes the HW parser
   5.2 user can inspect the graph by dumping the common representation
       from 2.1 (via something like devlink dpipe, perhaps)
 6. The parser tables are annotated with Linux offload targets (routes,
    classic ntuple, nftables, flower etc.) with some tables being left
    as "raw"* (* better name would be great)
 7. ethtool ntuple is extended to support insertion of arbitrary rules
    into the "raw" tables
 8. The other tables can only be inserted into using the subsystem they
    are annotated for

This builds on how some devices _already_ operate. It gives the benefits
of expressing parser information and the ability to insert rules for
uncommon protocols even to devices which are not programmable.
And it uses ethtool ntuple, which SW people actually want to use.
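
The insertion rules in steps 6-8 above can be modelled in a few lines
of Python. This is only a toy sketch of the proposed policy, not kernel,
devlink or ethtool code: the class, the function, the table names and
the rule strings are all hypothetical, and "raw" is the placeholder
name the list itself flags as needing a better one.

```python
from dataclasses import dataclass, field

RAW = "raw"  # placeholder annotation for tables with no Linux offload target


@dataclass
class ParserTable:
    """One table exposed by the loaded parser blob (step 6)."""
    name: str
    target: str  # annotated offload target ("routes", "flower", ...) or RAW
    rules: list = field(default_factory=list)


def insert_rule(table: ParserTable, subsystem: str, rule: str) -> None:
    """Apply steps 7 and 8: "raw" tables take rules through ntuple,
    every other table only through the subsystem it is annotated for."""
    allowed = "ntuple" if table.target == RAW else table.target
    if subsystem != allowed:
        raise PermissionError(
            f"table {table.name!r} only accepts rules from {allowed!r}"
        )
    table.rules.append(rule)


# Hypothetical tables, as they might be annotated after a parser reload.
tables = {
    "l3_fwd": ParserTable("l3_fwd", target="routes"),
    "acl": ParserTable("acl", target="flower"),
    "custom_proto": ParserTable("custom_proto", target=RAW),
}

insert_rule(tables["acl"], "flower", "drop tcp dport 23")
insert_rule(tables["custom_proto"], "ntuple", "match 0x88b5 at offset 12")
```

In this model an attempt to insert into "l3_fwd" through ntuple raises
PermissionError, which is the separation step 8 asks for.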

Before the tin foil hats gather - we have no use for any of this at
Meta, I'm not trying to twist the design to fit the use cases of big
bad hyperscalers.

Jamal Hadi Salim June 11, 2024, 3:10 p.m. UTC | #44

On Tue, Jun 11, 2024 at 10:21 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Since the inevitable LWN article has been written, let me put more
> detail into what I already mentioned here:
>
> https://lore.kernel.org/all/20240301090020.7c9ebc1d@kernel.org/
>
> for the benefit of non-networking people.
>
> On Wed, 10 Apr 2024 10:01:26 -0400 Jamal Hadi Salim wrote:
> > P4TC builds on top of many years of Linux TC experiences of a netlink
> > control path interface coupled with a software datapath with an equivalent
> > offloadable hardware datapath.
>
> The point of having SW datapath is to provide a blueprint for the
> behavior. This is completely moot for P4 which comes as a standard.
>
> Besides we already have 5 (or more) flow offloads, we don't need
> a 6th, completely disconnected from the existing ones. Leaving
> users guessing which one to use, and how they interact.
>
> In my opinion, reasonable way to implement programmable parser for

You have mentioned "parser" before - are you referring to the DDP
patches earlier from Intel?
In P4 the parser is just one of the objects.

> Linux is:
>
>  1. User writes their parser in whatever DSL they want
>  2. User compiles the parser in user space
>    2.1 Compiler embeds a representation of the graph in the blob
>  3. User puts the blob in /lib/firmware
>  4. devlink dev $dev reload action parser-fetch $filename
>  5. devlink loads the file, parses it to extract the representation
>     from 2.1, and passes the blob to the driver
>    5.1 driver/fw reinitializes the HW parser
>    5.2 user can inspect the graph by dumping the common representation
>        from 2.1 (via something like devlink dpipe, perhaps)
>  6. The parser tables are annotated with Linux offload targets (routes,
>     classic ntuple, nftables, flower etc.) with some tables being left
>     as "raw"* (* better name would be great)
>  7. ethtool ntuple is extended to support insertion of arbitrary rules
>     into the "raw" tables
>  8. The other tables can only be inserted into using the subsystem they
>     are annotated for
>
> This builds on how some devices _already_ operate. Gives the benefits
> of expressing parser information and ability to insert rules for
> uncommon protocols also for devices which are not programmable.
> And it uses ethtool ntuple, which SW people actually want to use.
>
> Before the tin foil hats gather - we have no use for any of this at
> Meta, I'm not trying to twist the design to fit the use cases of big
> bad hyperscalers.

The scope is much bigger than just parsers though, it is about P4 in
which the parser is but one object.
Limiting what we can do just to fit a narrow definition of "offload"
is not the right direction.
P4 is well understood; hardware exists for P4, and P4 is used to
specify hardware specs and is deployed (see Vipin's comment).


cheers,
jamal
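
For reference, steps 6 to 8 of the proposal quoted in the message above
(tables annotated with an offload target, with "raw" tables open to
arbitrary rules through an extended ethtool ntuple) can be modelled
roughly as below. This is only a sketch under stated assumptions: the
names ParserTable, insert_rule and RAW are hypothetical and do not
correspond to any existing devlink, ethtool or driver API.

```python
from dataclasses import dataclass, field

# Placeholder label for unannotated tables; the proposal itself says a
# better name than "raw" would be welcome.
RAW = "raw"


@dataclass
class ParserTable:
    """Hypothetical model of one table exposed by a loaded parser blob."""
    name: str
    annotation: str  # e.g. "routes", "ntuple", "nftables", "flower", or RAW
    rules: list = field(default_factory=list)


def insert_rule(table, subsystem, rule):
    """Accept a rule only from the subsystem the table is annotated for.

    Step 7: "raw" tables take arbitrary rules via (extended) ntuple.
    Step 8: every other table is writable only by its annotated subsystem.
    """
    allowed = "ntuple" if table.annotation == RAW else table.annotation
    if subsystem != allowed:
        raise PermissionError(
            f"table {table.name!r} only accepts rules from {allowed!r}"
        )
    table.rules.append(rule)
```

In this model a rule for an uncommon protocol would go into a RAW table
through "ntuple", while an attempt to write a flower-annotated table
from "ntuple" is rejected.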

Jakub Kicinski June 11, 2024, 3:33 p.m. UTC | #45

On Tue, 11 Jun 2024 11:10:35 -0400 Jamal Hadi Salim wrote:
> > Before the tin foil hats gather - we have no use for any of this at
> > Meta, I'm not trying to twist the design to fit the use cases of big
> > bad hyperscalers.  
> 
> The scope is much bigger than just parsers though, it is about P4 in
> which the parser is but one object.

For me it's very much not "about P4". I don't care what DSL the user
prefers or whether the device the offload targets is built by a P4
vendor.

> Limiting what we can do just to fit a narrow definition of "offload"
> is not the right direction.

This is how Linux development works. You implement a small, useful
slice that helps the overall project. Then you implement the next, and
another.

On the technical level, putting the code into devlink rather than TC
does not impose any meaningful limitations. But I really don't want
you to lift and shift the entire pile of code at once.

> P4 is well understood, hardware exists for P4 and is used to specify
> hardware specs and is deployed(See Vipin's comment).

"Hardware exists for P4" is about as meaningful as "hardware exists
for C++".

Jamal Hadi Salim June 11, 2024, 3:53 p.m. UTC | #46

On Tue, Jun 11, 2024 at 11:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jun 2024 11:10:35 -0400 Jamal Hadi Salim wrote:
> > > Before the tin foil hats gather - we have no use for any of this at
> > > Meta, I'm not trying to twist the design to fit the use cases of big
> > > bad hyperscalers.
> >
> > The scope is much bigger than just parsers though, it is about P4 in
> > which the parser is but one object.
>
> For me it's very much not "about P4". I don't care what DSL user prefers
> and whether the device the offloads targets is built by a P4 vendor.
>

I think it is an important detail though.
You wouldn't say PSP shouldn't start small by first taking care of TLS
or IPSec because it is not the target.

> > Limiting what we can do just to fit a narrow definition of "offload"
> > is not the right direction.
>
> This is how Linux development works. You implement small, useful slice
> which helps the overall project. Then you implement the next, and
> another.
>
> On the technical level, putting the code into devlink rather than TC
> does not impose any meaningful limitations. But I really don't want
> you to lift and shift the entire pile of code at once.
>

Yes, the binary blob is going via devlink or some other scheme.

> > P4 is well understood, hardware exists for P4 and is used to specify
> > hardware specs and is deployed(See Vipin's comment).
>
> "Hardware exists for P4" is about as meaningful as "hardware exists
> for C++".

We'll have to agree to disagree. Take a look at this for example.
https://www.servethehome.com/pensando-distributed-services-architecture-smartnic/

cheers,
jamal

Tom Herbert June 11, 2024, 4:34 p.m. UTC | #47

On Tue, Jun 11, 2024 at 8:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Jun 11, 2024 at 11:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 11 Jun 2024 11:10:35 -0400 Jamal Hadi Salim wrote:
> > > > Before the tin foil hats gather - we have no use for any of this at
> > > > Meta, I'm not trying to twist the design to fit the use cases of big
> > > > bad hyperscalers.
> > >
> > > The scope is much bigger than just parsers though, it is about P4 in
> > > which the parser is but one object.
> >
> > For me it's very much not "about P4". I don't care what DSL user prefers
> > and whether the device the offloads targets is built by a P4 vendor.
> >
>
> I think it is an important detail though.
> You wouldnt say PSP shouldnt start small by first taking care of TLS
> or IPSec because it is not the target.
>
> > > Limiting what we can do just to fit a narrow definition of "offload"
> > > is not the right direction.

Jamal,

I think you might be missing Jakub's point. His plan wouldn't narrow
the definition of "offload"; it would actually increase the
applicability and use cases of offload. The best way to do an offload
is to allow flexibility on both sides of the equation: let the user
write their data path code in whatever language they want, and allow
them to offload to arbitrary software or programmable hardware targets.

For example, if a user already has P4 hardware for their high-end
server, then by all means they should write their datapath in P4. But
there might also be a user who wants to offload TCP keepalive to a
lower-powered CPU on a smartphone; in this case a simple C program,
maybe running in eBPF on the CPU, should do the trick. Forcing them to
write their program in P4 or, even worse, forcing them to put P4
hardware into their smartphone is not good. We should be able to
define a common offload infrastructure that is both language and
target agnostic and that handles both of these use cases and
everything in between. P4 could certainly be one option for both
programming language and offload target, but it shouldn't be the only
option.

Tom

> >
> > This is how Linux development works. You implement small, useful slice
> > which helps the overall project. Then you implement the next, and
> > another.
> >
> > On the technical level, putting the code into devlink rather than TC
> > does not impose any meaningful limitations. But I really don't want
> > you to lift and shift the entire pile of code at once.
> >
>
> Yes, the binary blob is going via devlink or some other scheme.
>
> > > P4 is well understood, hardware exists for P4 and is used to specify
> > > hardware specs and is deployed(See Vipin's comment).
> >
> > "Hardware exists for P4" is about as meaningful as "hardware exists
> > for C++".
>
> We'll have to agree to disagree. Take a look at this for example.
> https://www.servethehome.com/pensando-distributed-services-architecture-smartnic/
>
> cheers,
> jamal

John Fastabend June 11, 2024, 5:21 p.m. UTC | #48

Tom Herbert wrote:
> On Tue, Jun 11, 2024 at 8:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Tue, Jun 11, 2024 at 11:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Tue, 11 Jun 2024 11:10:35 -0400 Jamal Hadi Salim wrote:
> > > > > Before the tin foil hats gather - we have no use for any of this at
> > > > > Meta, I'm not trying to twist the design to fit the use cases of big
> > > > > bad hyperscalers.
> > > >
> > > > The scope is much bigger than just parsers though, it is about P4 in
> > > > which the parser is but one object.
> > >
> > > For me it's very much not "about P4". I don't care what DSL user prefers
> > > and whether the device the offloads targets is built by a P4 vendor.
> > >
> >
> > I think it is an important detail though.
> > You wouldnt say PSP shouldnt start small by first taking care of TLS
> > or IPSec because it is not the target.
> >
> > > > Limiting what we can do just to fit a narrow definition of "offload"
> > > > is not the right direction.
> 
> Jamal,
> 
> I think you might be missing Jakub's point. His plan wouldn't narrow
> the definition of "offload", but actually would increase applicability
> and use cases of offload. The best way to do an offload is allow
> flexibility on both sides of the equation: Let the user write their
> data path code in whatever language they want, and allow them offload
> to arbitrary software or programmable hardware targets.

+1.
 
> 
> For example, if a user already has P4 hardware for their high end
> server then by all means they should write their datapath in P4. But,
> there might also be a user that wants to offload TCP keepalive to a
> lower powered CPU on a Smartphone; in this case a simple C program
> maybe running in eBPF on the CPU should do the trick-- forcing them to
> write their program in P4 or even worse force them to put P4 hardware
> into their smartphone is not good. We should be able to define a
> common offload infrastructure to be both language and target agnostic
> that would handle both these use cases of offload and everything in
> between. P4 could certainly be one option for both programming
> language and offload target, but it shouldn't be the only option.

Agreed, a major benefit of the proposal here is that it doesn't dictate
the language. My DSL preference is P4, but there is no need to push
that here.

> 
> Tom

My $.02: Jakub's proposal is a very pragmatic way to get support for
P4-enabled hardware, and I'm all for it. I can't actually think of
anything on the P4 hardware side that couldn't go through the table
notion in (7). We might want bulk updates and the like at some point,
but starting with the basics should be good enough.

> 
> > >
> > > This is how Linux development works. You implement small, useful slice
> > > which helps the overall project. Then you implement the next, and
> > > another.

+1.

> > >
> > > On the technical level, putting the code into devlink rather than TC
> > > does not impose any meaningful limitations. But I really don't want
> > > you to lift and shift the entire pile of code at once.
> > >

devlink or an improved n_tuple (n_table?) mechanism would be great.
Happy to help here.

> >
> > Yes, the binary blob is going via devlink or some other scheme.
> >
> > > > P4 is well understood, hardware exists for P4 and is used to specify
> > > > hardware specs and is deployed(See Vipin's comment).
> > >
> > > "Hardware exists for P4" is about as meaningful as "hardware exists
> > > for C++".
> >
> > We'll have to agree to disagree. Take a look at this for example.
> > https://www.servethehome.com/pensando-distributed-services-architecture-smartnic/
> >
> > cheers,
> > jamal
>

Jakub Kicinski June 11, 2024, 5:53 p.m. UTC | #49

On Tue, 11 Jun 2024 11:53:28 -0400 Jamal Hadi Salim wrote:
> > For me it's very much not "about P4". I don't care what DSL user prefers
> > and whether the device the offloads targets is built by a P4 vendor.
> 
> I think it is an important detail though.
> You wouldnt say PSP shouldnt start small by first taking care of TLS
> or IPSec because it is not the target.

I really don't see any parallel with PSP. And it _is_ small, 4kLoC.

First you complain that the community is "political" and doesn't give
you technical feedback, and then when you get technical feedback you
attack the work of the maintainer helping you.

Do you not see how these kinds of retaliatory responses are exactly
the reason why people were afraid to give you clear feedback earlier?
Maybe one of the upcoming conferences should give out mirrors instead
of t-shirts as swag.

Jamal Hadi Salim June 11, 2024, 7:13 p.m. UTC | #50

On Tue, Jun 11, 2024 at 1:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jun 2024 11:53:28 -0400 Jamal Hadi Salim wrote:
> > > For me it's very much not "about P4". I don't care what DSL user prefers
> > > and whether the device the offloads targets is built by a P4 vendor.
> >
> > I think it is an important detail though.
> > You wouldnt say PSP shouldnt start small by first taking care of TLS
> > or IPSec because it is not the target.
>
> I really don't see any parallel with PSP. And it _is_ small, 4kLoC.
>
> First you complain that community is "political" and doesn't give you
> technical feedback, and then when you get technical feedback you attack
> the work of the maintainer helping you.
>

You made a proposal saying it was a "start small" approach. I
responded saying that it doesn't really cover our requirements and
pointed to sample h/w to show why. I only used PSP to illustrate why
"start small" doesn't work for what we are targeting. I was not in any
way attacking your work.

We are not trying to cover the whole world of offloads. It is a very
specific niche, P4, which uses the existing tc model because that's
how match-action tables are offloaded today. The actions and tables
are dynamically defined by the user's P4 program, whereas in flower
they are hardcoded in the kernel. I don't see any other way to achieve
these goals with flower or other existing approaches. Flower, for
example, could be written as a single P4 program, and the goal here is
to support a wider range of programs without making kernel changes.

cheers,
jamal
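
The contrast drawn in the message above, tables and actions defined by
the user's program rather than hardcoded, can be illustrated with a
small model. This is a hypothetical sketch and not the P4TC
implementation: ActionTemplate, TableTemplate and add_entry are
made-up names, loosely following the template/runtime split in the
series' patch titles.

```python
from dataclasses import dataclass, field


@dataclass
class ActionTemplate:
    """An action declared by the user's program, not by the kernel."""
    name: str
    params: tuple  # names of the parameters the action takes


@dataclass
class TableTemplate:
    """A match-action table whose key and actions are defined at load time."""
    name: str
    key_fields: tuple  # names of the fields that make up the match key
    actions: dict      # action name -> ActionTemplate
    entries: list = field(default_factory=list)

    def add_entry(self, key, action, params):
        """Runtime step: validate an entry against the template, then store it."""
        if set(key) != set(self.key_fields):
            raise ValueError(f"key must provide exactly {self.key_fields}")
        template = self.actions.get(action)
        if template is None:
            raise ValueError(f"unknown action {action!r} for table {self.name!r}")
        if set(params) != set(template.params):
            raise ValueError(f"action {action!r} takes {template.params}")
        self.entries.append((key, action, params))
```

Here a new program only supplies different templates; the validation
code itself stays the same, which is the property being argued for
when the message says "without making kernel changes".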
