[net-next,v8,00/15] Introducing P4TC

Message ID 20231116145948.203001-1-jhs@mojatatu.com (mailing list archive)

Message

Jamal Hadi Salim Nov. 16, 2023, 2:59 p.m. UTC
We are seeking community feedback on P4TC patches.

We have reduced the number of commits in this patchset, leaving out all the
test cases and secondary patches, in order to ease review.

We feel we have completed the migration from the V1 scriptable version to eBPF
and that now is a good time to remove the RFC tag.

Changes In RFC Version 2
-------------------------

Version 2 is the initial integration of the eBPF datapath.
We took into consideration the suggestions to use eBPF and put effort into
analyzing eBPF as the datapath, which involved extensive testing.
We implemented 6 approaches with eBPF, ran performance analysis on each of the
6 vs the scriptable P4TC, and presented our results at the P4 2023 workshop in
Santa Clara [see: 1, 3]. We concluded that 2 of the approaches are sensible (4
if you count XDP and TC separately).

Conclusions from the exercise: We lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying the eBPF path please
scroll down to "Restating Our Requirements" and "Challenges".

This patch set presented two modes.
mode1: the parser is entirely based on eBPF - whereas the rest of the
SW datapath stays as _scriptable_ as in Version 1.
mode2: All of the kernel s/w datapath (including parser) is in eBPF.

The key ingredient for eBPF that we did not have access to in the past is
kfuncs (they made a big difference in our decision to reconsider eBPF).

In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).

Changes In RFC Version 3
-------------------------

These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that are used in V1 and 2 but no longer
used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:

1) Feedback we got in V2 was to stick to one of the two modes. In this version
we take that step and go solely with mode2, whereas V2 had both modes.

2) The P4 Register extern is no longer standalone. Instead, as part of integrating
into eBPF we introduce another kfunc which encapsulates Register as part of the
extern interface.

3) We have improved our CICD to include tools pointed out to us by Simon. See
   "Testing" further below. Thanks to Simon for that and the other issues he
   caught. Simon, we discussed issue [7] but decided to keep that log since we
   think it is useful.

4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
   re-discuss though; see: [5], [6].

5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.

6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
   guaranteed that either A or B must exist; however, let's make smatch happy.
   Thanks to Simon and Dan Carpenter.

Changes In RFC Version 4
-------------------------

1) More integration from scriptable to eBPF. Small bug fixes.

2) More streamlining support of externs via kfunc (one additional kfunc).

3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.

There is more eBPF integration coming. One thing we looked at, which is not in
this patchset but should be in the next, is the use of eBPF links in our
loading (see "challenge #1" further below).

Changes In RFC Version 5
-------------------------

1) More integration from scriptable view to eBPF. Small bug fixes from last
   integration.

2) More streamlining support of externs via kfunc (create-on-miss, etc)

3) eBPF linking for XDP.

There is more eBPF integration/streamlining coming (we are getting close to
completing the conversion from the scriptable domain).

Changes In RFC Version 6
-------------------------

1) Completed integration from the scriptable view to eBPF. Completed the
   integration of externs.

2) Small bug fixes from v5 based on testing.

Changes In Version 7
-------------------------

0) First time removing the RFC tag!

1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!),
that using bpf links is sufficient to protect us from someone replacing or
deleting an eBPF program after it has been bound to a netdev.

2) Added some Reviewed-by tags from Vlad.

3) Small bug fixes from v6 based on testing for ebpf.

4) Added the counter extern as a sample extern. We illustrate this example
   because it is slightly complex: it can be invoked directly from the P4TC
   domain (in the case of direct counters) or from eBPF (indirect counters).
   It is not the most efficient implementation (a reasonable counter
   implementation would be per-CPU).
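
   For reference, the sketch below shows what a generic per-CPU counter looks
   like in eBPF using a BPF_MAP_TYPE_PERCPU_ARRAY. This is only an
   illustration of the "per-CPU" point above, not the extern implementation in
   this series; the map and function names are made up for the example.

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct counter_val {
              __u64 pkts;
              __u64 bytes;
      };

      struct {
              __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
              __uint(max_entries, 1);
              __type(key, __u32);
              __type(value, struct counter_val);
      } sample_counter SEC(".maps");

      /* Each CPU updates its own slot, so no atomics are needed on the fast
       * path; a reader sums the per-CPU values when collecting the counter.
       */
      static __always_inline void count_packet(__u32 idx, __u64 len)
      {
              struct counter_val *c;

              c = bpf_map_lookup_elem(&sample_counter, &idx);
              if (c) {
                      c->pkts++;
                      c->bytes += len;
              }
      }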

Changes In Version 8
---------------------
1) Fixed all the patchwork warnings and improved our CI to catch them in the
   future.

2) Reduced the number of patches to a maximum of 15 to ease review.

What is P4?
-----------

The Programming Protocol-independent Packet Processors (P4) is an open source,
domain-specific programming language for specifying data plane behavior.

The P4 ecosystem includes an extensive range of deployments, products, projects
and services, etc[9][10][11][12].

__What is P4TC?__

P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
run independently in different namespaces alongside their appropriate state. The
implementation builds on top of many years of Linux TC experiences.
On why P4 - see small treatise here:[4].

There have been many discussions and meetings since about 2015 in regard to
P4 over TC [2], and we are finally proving to the naysayers that we do get
stuff done!

A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

**In this patch series we focus on s/w datapath only**.

__P4TC Workflow__

These patches enable kernel and user space code change _independence_ for any
new P4 program that describes a new datapath. The workflow is as follows:

  1) A developer writes a P4 program, "myprog"

  2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
     a) shell script(s) which form template definitions for the different P4
     objects "myprog" utilizes (tables, externs, actions etc).
     b) the parser and the rest of the datapath are generated
     in eBPF and need to be compiled into binaries.
     c) A json introspection file used for the control plane (by iproute2/tc).

  3) The developer (or operator) executes the shell script(s) to manifest the
     functional "myprog" into the kernel.

  4) The developer (or operator) instantiates "myprog" via the tc P4 filter
     to ingress/egress (depending on P4 arch) of one or more netdevs/ports.

     Example1: parser is an action:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        action bpf obj $PARSER.o section parser/tc-ingress \
        action bpf obj $PROGNAME.o section p4prog/tc"

     Example2: parser explicitly bound and rest of dpath as an action:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        prog tc obj $PARSER.o section parser/tc-ingress \
        action bpf obj $PROGNAME.o section p4prog/tc"

     Example3: parser is at XDP, rest of dpath as an action:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        prog type xdp obj $PARSER.o section parser/xdp-ingress \
	pinned_link /path/to/xdp-prog-link \
        action bpf obj $PROGNAME.o section p4prog/tc"

     Example4: parser+prog at XDP:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        prog type xdp obj $PROGNAME.o section p4prog/xdp \
	pinned_link /path/to/xdp-prog-link"

    See the individual patches for more examples (tc vs xdp, etc). Also see the
    section on "challenges" (in this cover letter).

Once "myprog" P4 program is instantiated one can start updating table entries
that are associated with myprog's table named "mytable". Example:

  tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
    action send_to_port param port eno1

A packet arriving on ingress of any of the ports on block 22 will first be
run through the (eBPF) parser to find the headers and extract the IP
destination address.
The rest of the eBPF datapath uses the resulting dstAddr as a key to do a
lookup in myprog's mytable, which returns the action params that are then used
to execute the action in the eBPF datapath (eventually sending packets out to
eno1). On a table miss, mytable's default miss action is executed.
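
To make that flow concrete, here is a minimal sketch of what the
compiler-generated eBPF datapath for this example could look like. The kfunc
name and signature, struct layouts and pipeline/table IDs below are
illustrative assumptions, not the exact interface introduced in patch #13.

   #include <linux/bpf.h>
   #include <linux/if_ether.h>
   #include <linux/ip.h>
   #include <linux/pkt_cls.h>
   #include <bpf/bpf_helpers.h>
   #include <bpf/bpf_endian.h>

   struct mytable_key {                  /* key built by the parser */
           __u32 keysz;
           __u32 dstAddr;
   } __attribute__((packed));

   struct mytable_act_params {           /* action params returned on a hit */
           __u32 act_id;                 /* e.g. send_to_port */
           __u32 port_ifindex;           /* the "port" action parameter */
   } __attribute__((packed));

   /* Assumed kfunc wrapping the P4TC table lookup (see patch #13) */
   extern struct mytable_act_params *
   bpf_p4tc_tbl_read(struct __sk_buff *skb, __u32 pipeid, __u32 tblid,
                     void *key, __u32 key_sz) __ksym;

   SEC("p4prog/tc")
   int myprog_tc(struct __sk_buff *skb)
   {
           void *data = (void *)(long)skb->data;
           void *data_end = (void *)(long)skb->data_end;
           struct ethhdr *eth = data;
           struct iphdr *iph;
           struct mytable_key key = {};
           struct mytable_act_params *act;

           if ((void *)(eth + 1) > data_end)
                   return TC_ACT_SHOT;
           if (eth->h_proto != bpf_htons(ETH_P_IP))
                   return TC_ACT_OK;
           iph = (void *)(eth + 1);
           if ((void *)(iph + 1) > data_end)
                   return TC_ACT_SHOT;

           key.keysz = 32;
           key.dstAddr = iph->daddr;

           /* lookup in myprog/mytable; NULL means a miss */
           act = bpf_p4tc_tbl_read(skb, 1, 1, &key, sizeof(key));
           if (!act)
                   return TC_ACT_OK;     /* default miss action */

           /* execute send_to_port with the returned param */
           return bpf_redirect(act->port_ifindex, 0);
   }

   char _license[] SEC("license") = "GPL";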

__Description of Patches__

P4TC is designed to have no impact on the core code for other users
of TC. IOW, you can compile it out, but even if it is compiled in and you don't
use it there should be no impact on your performance.

We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
actions that can be created on the fly based on the P4 program requirements.
This patch makes a small incision into act_api which shouldn't affect the
performance (or functionality) of the existing actions. Patches 2-4 and 6-7 are
minimalist enablers for P4TC and have no effect on the classical tc actions.
Patch 5 adds infrastructure support for preallocation of dynamic actions.

The core P4TC code implements several P4 objects.

1) Patch #8 introduces P4 data types which are consumed by the rest of the code
2) Patch #9 introduces the concept of templating pipelines, i.e. CRUD commands
   for P4 pipelines.
3) Patch #10 introduces the concept of action templates and associated
   CRUD commands.
4) Patch #11 introduces the concept of P4 table templates and associated
   CRUD commands for tables.
5) Patch #12 introduces table entries and associated CRUD commands.
6) Patch #13 introduces interaction of eBPF to P4TC tables via kfunc.
7) Patch #14 introduces the TC classifier P4 used at runtime.
8) Patch #15 introduces extern interfacing (both template and runtime).

__Testing__

Speaking of testing - we have ~300 tdc test cases. This number is growing as
we adjust to accommodate eBPF.
These tests are run on our CICD system on pull requests and after commits are
approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
input) including:
checkpatch, sparse, smatch, coccinelle, 32-bit and 64-bit builds tested on
X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
CICD to catch performance regressions (currently only on the control path, but
in the future for the datapath as well).
Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
memory sanitizer but recently added support for the concurrency sanitizer.
Before main releases we ensure each patch will compile on its own to help
git bisect, and we run the xmas tree tool. We eventually run the code through
Coverity.

In addition we are working on a tool that will take a P4 program, run it
through the compiler, and generate permutations of traffic patterns via
symbolic execution that will test both positive and negative datapath code
paths. The test generator tool is still a work in progress and will be
generated by the P4 compiler.
Note: We have other code that tests parallelization, etc., which we are trying
to find a fit for in the kernel tree's testing infra.

__Restating Our Requirements__

The initial release made in January/2023 had a "scriptable" datapath (think u32
classifier and pedit action). In this section we review the scriptable version
against the current implementation we are pushing upstream which uses eBPF.

Our intention is to target the TC crowd.
Essentially developers and ops people deploying TC based infra.
More importantly the original intent for P4TC was to enable _ops folks_ more than
devs (given code is being generated and doesn't need humans to write it).

With TC, we get the whole "familiar" package of match-action pipeline
abstraction++, meaning from the control plane all the way to the tooling
infra, i.e. the iproute2/tc CLI, netlink infra (request/resp, event
subscribe/multicast-publish, congestion control, etc.), s/w and h/w symbiosis,
the autonomous kernel control, etc.
The main advantage is that we have a singular vendor-neutral interface via the
kernel using well-understood mechanisms based on deployment experience (and
at least this part doesn't need retraining).

1) Supporting expressibility of the universe set of P4 progs

It is a must to support 100% of all possible P4 programs. In the past the eBPF
verifier had to be worked around, and even then there were cases where we
couldn't avoid path explosion when branching is involved. Kfunc-ing solves
these issues for us. Note, there are still challenges running all potential P4
programs at the XDP level - the solution to that is to have the compiler
generate XDP-based code only if it is possible to map it to that layer.

2) Support for P4 HW and SW equivalence.

This feature continues to work even in the presence of eBPF as the s/w
datapath. There are cases of square-hole-round-peg scenarios but
those are implementation issues we can live with.

3) Operational usability

By maintaining the TC control plane (even in the presence of an eBPF datapath)
the runtime aspects remain unchanged. So for our target audience of folks
who have deployed tc, including offloads, the comfort zone is unchanged.
There is also the comfort zone of continuing to use the tried-and-true netlink
interfacing.

There is some loss in operational usability because we now have more knobs:
the extra compilation, loading and syncing of ebpf binaries, etc.
IOW, I can no longer just ship someone a shell script in an email to
say go run this and "myprog" will just work.

4) Operational and development Debuggability

If something goes wrong, the tc craftsperson is now required to have additional
knowledge of eBPF code and processes. This applies to both the operational
person as well as someone who wrote a driver. We don't believe this is
solvable.

5) Opportunity for rapid prototyping of new ideas

During the P4TC development phase something that came naturally was to often
handcode the template scripts because the compiler backend (which is P4 arch
specific) wasn't ready to generate certain things. Then you would read back the
template and diff to ensure the kernel didn't get something wrong. So this
started as a debug feature. During development, we wrote scripts that
covered a range of P4 architectures (PSA, V1, etc.) which required no kernel
code changes.

Over time the debug feature morphed into: a) start by handcoding scripts, then
b) read them back, and then c) generate the P4 code.
It means one could start with the template scripts outside the constraints
of a P4 architecture spec (PNA/PSA), or even within a P4 architecture, then
test some ideas and eventually feed the concepts back to the compiler authors,
or modify or create a new P4 architecture and share it with the P4 standards
folks.

To summarize, in the presence of eBPF: The debugging idea is probably still
alive. One could dump, with proper tooling (bpftool for example), the loaded
eBPF code and be able to check for differences. But this is not the
interesting part.
The concept of going back from what's in the kernel to P4 is a lot more
difficult to implement, mostly due to the scoping of a DSL vs a general-purpose
language. It may be lost. We have been thinking of ways to use BTF and to
embed annotations in the eBPF code and binary, but more thought is required
and we welcome suggestions.

6) Supporting per-namespace programs

This requirement is still met (by virtue of keeping P4 control objects within the
TC domain).

__Challenges__

1) The concept of a tc block is _very tedious_ to implement in XDP. It would be
   nice if we could use the concept there as well, since we expect P4 to work
   with many ports. It will likely require some core patches to fix this.

2) Right now we are using the "packed" construct to enforce alignment in kfunc
   data exchange (see the sketch after this list); but we're wondering if there
   is potential to use BTF to understand parameters and their offsets and
   encode this information at the compiler level.

3) At the moment we are creating a static buffer of 128B to retrieve the action
   parameters. If you have a lot of table entries and individual (non-shared)
   action instances with actions that require very little (or no) param space,
   a lot of memory is wasted. There may also be cases where 128B is not
   enough (likely this is something we can teach the P4C compiler). If we can
   have dynamic pointers instead of fixed-length kfunc parameterization then
   this issue is resolvable.

4) See "Restating Our Requirements" #5.
   We would really appreciate ideas/suggestions, etc.
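
For illustration of challenges 2 and 3, below is a sketch of the kind of
packed layout currently used to carry action parameters across the kfunc
boundary. The struct and field names are assumptions for the example; only the
128B figure and the use of "packed" come from the text above.

   /* Packed so the eBPF side and the P4TC side agree on member offsets
    * without relying on natural alignment (challenge 2).
    */
   #define P4TC_MAX_PARAM_DATA_SIZE 128     /* static buffer, challenge 3 */

   struct p4tc_act_bpf_params_example {
           __u32 act_id;                            /* which P4 action to run */
           __u8  params[P4TC_MAX_PARAM_DATA_SIZE];  /* always 128B, even when
                                                     * the action needs little
                                                     * or no parameter space */
   } __attribute__((packed));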

__References__

[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main

Jamal Hadi Salim (15):
  net: sched: act_api: Introduce dynamic actions list
  net/sched: act_api: increase action kind string length
  net/sched: act_api: Update tc_action_ops to account for dynamic
    actions
  net/sched: act_api: add struct p4tc_action_ops as a parameter to
    lookup callback
  net: sched: act_api: Add support for preallocated dynamic action
    instances
  net: introduce rcu_replace_pointer_rtnl
  rtnl: add helper to check if group has listeners
  p4tc: add P4 data types
  p4tc: add template pipeline create, get, update, delete
  p4tc: add action template create, update, delete, get, flush and dump
  p4tc: add template table create, update, delete, get, flush and dump
  p4tc: add runtime table entry create, update, get, delete, flush and
    dump
  p4tc: add set of P4TC table kfuncs
  p4tc: add P4 classifier
  p4tc: Add P4 extern interface

 include/linux/bitops.h            |    1 +
 include/linux/rtnetlink.h         |   19 +
 include/net/act_api.h             |   22 +-
 include/net/p4tc.h                |  744 ++++++++
 include/net/p4tc_ext_api.h        |  199 ++
 include/net/p4tc_types.h          |   88 +
 include/net/tc_act/p4tc.h         |   52 +
 include/uapi/linux/p4tc.h         |  406 ++++
 include/uapi/linux/p4tc_ext.h     |   36 +
 include/uapi/linux/pkt_cls.h      |   19 +
 include/uapi/linux/rtnetlink.h    |   18 +
 net/sched/Kconfig                 |   23 +
 net/sched/Makefile                |    3 +
 net/sched/act_api.c               |  195 +-
 net/sched/cls_api.c               |    2 +-
 net/sched/cls_p4.c                |  447 +++++
 net/sched/p4tc/Makefile           |    8 +
 net/sched/p4tc/p4tc_action.c      | 2308 +++++++++++++++++++++++
 net/sched/p4tc/p4tc_bpf.c         |  414 +++++
 net/sched/p4tc/p4tc_ext.c         | 2204 ++++++++++++++++++++++
 net/sched/p4tc/p4tc_pipeline.c    |  707 +++++++
 net/sched/p4tc/p4tc_runtime_api.c |  153 ++
 net/sched/p4tc/p4tc_table.c       | 1634 ++++++++++++++++
 net/sched/p4tc/p4tc_tbl_entry.c   | 2870 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c    |  611 ++++++
 net/sched/p4tc/p4tc_tmpl_ext.c    | 2221 ++++++++++++++++++++++
 net/sched/p4tc/p4tc_types.c       | 1247 +++++++++++++
 net/sched/p4tc/trace.c            |   10 +
 net/sched/p4tc/trace.h            |   44 +
 security/selinux/nlmsgtab.c       |   10 +-
 30 files changed, 16676 insertions(+), 39 deletions(-)
 create mode 100644 include/net/p4tc.h
 create mode 100644 include/net/p4tc_ext_api.h
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 include/uapi/linux/p4tc_ext.h
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_action.c
 create mode 100644 net/sched/p4tc/p4tc_bpf.c
 create mode 100644 net/sched/p4tc/p4tc_ext.c
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c
 create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
 create mode 100644 net/sched/p4tc/p4tc_table.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_ext.c
 create mode 100644 net/sched/p4tc/p4tc_types.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

Comments

John Fastabend Nov. 17, 2023, 6:27 a.m. UTC | #1
Jamal Hadi Salim wrote:
> We are seeking community feedback on P4TC patches.
> 

[...]

> 
> What is P4?
> -----------

I read the cover letter; here is my high-level takeaway.

A p4c-bpf backend exists and I don't see why we wouldn't use that as a starting
point. At least the cover letter needs to explain why this path is not taken.
From the cover letter there appear to be bpf pieces and non-bpf pieces, but
I don't see any reason not to just land it all in BPF. Support exists and if
it's missing some smaller things add them and everyone gets them vs a niche P4
backend.

Without hardware support for any of this it's impossible to understand how 'tc'
would work as a hardware offload interface for a p4 device, so we need hardware
support to evaluate. For example I'm not even sure how you would take a BPF
parser into hardware on most network devices that aren't processor based.

P4 has a P4Runtime; I think most folks would prefer a P4 UI vs typing in 'tc'
commands, so arguing that the 'tc' UI is nice is not going to be very
compelling. Best we can say is it works well enough and we use it.

more commentary below.

> 
> The Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
> 
> The P4 ecosystem includes an extensive range of deployments, products, projects
> and services, etc[9][10][11][12].
> 
> __What is P4TC?__
> 
> P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> run independently in different namespaces alongside their appropriate state. The
> implementation builds on top of many years of Linux TC experiences.
> On why P4 - see small treatise here:[4].
> 
> There have been many discussions and meetings since about 2015 in regards to
> P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> done!
> 
> A lot more of the P4TC motivation is captured at:
> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> 
> **In this patch series we focus on s/w datapath only**.

I don't see the value in adding 16676 lines of code for s/w only datapath
of something we already can do with p4c-ebpf backend. Or one of the other
backends already there. Namely take P4 programs and run them on CPUs in Linux.

Also I suspect a pipelined datapath is going to be slower than an O(1) lookup
datapath so I'm guessing it's slower than most datapaths we have already.

What do we gain here over existing p4c-ebpf?

> 
> __P4TC Workflow__
> 
> These patches enable kernel and user space code change _independence_ for any
> new P4 program that describes a new datapath. The workflow is as follows:
> 
>   1) A developer writes a P4 program, "myprog"
> 
>   2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
>      a) shell script(s) which form template definitions for the different P4
>      objects "myprog" utilizes (tables, externs, actions etc).

This is odd to me. I think passing around shell scripts as a program is not
very usable. Why not just an object file.

>      b) the parser and the rest of the datapath are generated
>      in eBPF and need to be compiled into binaries.
>      c) A json introspection file used for the control plane (by iproute2/tc).

Why split up the eBPF and control plane like this? eBPF has a control plane
just use the existing one?

> 
>   3) The developer (or operator) executes the shell script(s) to manifest the
>      functional "myprog" into the kernel.
> 
>   4) The developer (or operator) instantiates "myprog" via the tc P4 filter
>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> 
>      Example1: parser is an action:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         action bpf obj $PARSER.o section parser/tc-ingress \
>         action bpf obj $PROGNAME.o section p4prog/tc"
> 
>      Example2: parser explicitly bound and rest of dpath as an action:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         prog tc obj $PARSER.o section parser/tc-ingress \
>         action bpf obj $PROGNAME.o section p4prog/tc"
> 
>      Example3: parser is at XDP, rest of dpath as an action:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         prog type xdp obj $PARSER.o section parser/xdp-ingress \
> 	pinned_link /path/to/xdp-prog-link \
>         action bpf obj $PROGNAME.o section p4prog/tc"
> 
>      Example4: parser+prog at XDP:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         prog type xdp obj $PROGNAME.o section p4prog/xdp \
> 	pinned_link /path/to/xdp-prog-link"
> 
>     see individual patches for more examples tc vs xdp etc. Also see section on
>     "challenges" (on this cover letter).
> 
> Once "myprog" P4 program is instantiated one can start updating table entries
> that are associated with myprog's table named "mytable". Example:
> 
>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>     action send_to_port param port eno1

As a UI the above is entirely cryptic to most folks, I bet.

myprog table is a BPF map? If so then I don't see any need for this, just
interact with it like a BPF map. I suspect it's some other object, but
I don't see any rationale for that.

> 
> A packet arriving on ingress of any of the ports on block 22 will first be
> exercised via the (eBPF) parser to find the headers pointing to the ip
> destination address.
> The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> myprog's mytable which returns the action params which are then used to execute
> the action in the eBPF datapath (eventually sending out packets to eno1).
> On a table miss, mytable's default miss action is executed.

This chunk looks like a standard BPF program. Parse pkt, look up an action,
do the action.

> 
> __Description of Patches__
> 
> P4TC is designed to have no impact on the core code for other users
> of TC. IOW, you can compile it out but even if it compiled in and you dont use
> it there should be no impact on your performance.
> 
> We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> actions that can be created on "the fly" based on the P4 program requirement.

the common pattern in bpf for this is to use a tail call map and populate
it at runtime and/or just compile your program with the actions. Here
the actions came from the p4 back up at step 1 so no reason we can't
just compile them with p4c.

> This patch makes a small incision into act_api which shouldn't affect the
> performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> minimalist enablers for P4TC and have no effect the classical tc action.
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
> 
> The core P4TC code implements several P4 objects.

[...]

> 
> __Restating Our Requirements__
> 
> The initial release made in January/2023 had a "scriptable" datapath (think u32
> classifier and pedit action). In this section we review the scriptable version
> against the current implementation we are pushing upstream which uses eBPF.
> 
> Our intention is to target the TC crowd.
> Essentially developers and ops people deploying TC based infra.
> More importantly the original intent for P4TC was to enable _ops folks_ more than
> devs (given code is being generated and doesn't need humans to write it).

I don't follow. Humans wrote the P4.

I think the intent should be to enable P4 to run on Linux, ideally efficiently.
If the _ops_ folks are writing P4, great; as long as we give them an efficient
way to run their P4 I don't think they care about what executes it.

> 
> With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> meaning from the control plane all the way to the tooling infra, i.e
> iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> etc.
> The main advantage is that we have a singular vendor-neutral interface via the
> kernel using well understood mechanisms based on deployment experience (and
> at least this part doesnt need retraining).

A seamless P4 experience would be great. That looks like a tooling problem
at the p4c-backend and p4c-frontend level. Rather than a bunch of 'tc' glue
I would aim for,

  $ p4c-* myprog.p4
  $ p4cRun ./myprog

And maybe some options like,

  $ p4cRun -i eth0 ./myprog

Then use the p4runtime to interface with the system. If you don't like the
runtime then it should be brought up in that working group.

> 
> 1) Supporting expressibility of the universe set of P4 progs
> 
> It is a must to support 100% of all possible P4 programs. In the past the eBPF
> verifier had to be worked around and even then there are cases where we couldnt
> avoid path explosion when branching is involved. Kfunc-ing solves these issues
> for us. Note, there are still challenges running all potential P4 programs at
> the XDP level - the solution to that is to have the compiler generate XDP based
> code only if it possible to map it to that layer.

Examples and we can fix it.

> 
> 2) Support for P4 HW and SW equivalence.
> 
> This feature continues to work even in the presence of eBPF as the s/w
> datapath. There are cases of square-hole-round-peg scenarios but
> those are implementation issues we can live with.

But no hw support.

> 
> 3) Operational usability
> 
> By maintaining the TC control plane (even in presence of eBPF datapath)
> runtime aspects remain unchanged. So for our target audience of folks
> who have deployed tc including offloads - the comfort zone is unchanged.
> There is also the comfort zone of continuing to use the true-and-tried netlink
> interfacing.

The P4 control plane should be P4Runtime.

> 
> There is some loss in operational usability because we now have more knobs:
> the extra compilation, loading and syncing of ebpf binaries, etc.
> IOW, I can no longer just ship someone a shell script in an email to
> say go run this and "myprog" will just work.
> 
> 4) Operational and development Debuggability
> 
> If something goes wrong, the tc craftsperson is now required to have additional
> knowledge of eBPF code and process. This applies to both the operational person
> as well as someone who wrote a driver. We dont believe this is solvable.
> 
> 5) Opportunity for rapid prototyping of new ideas

[...]

> 6) Supporting per namespace program
> 
> This requirement is still met (by virtue of keeping P4 control objects within the
> TC domain).

BPF can also be network namespaced; I'm not sure I understand the comment.

> 
> __Challenges__
> 
> 1) Concept of tc block in XDP is _very tedious_ to implement. It would be nice
>    if we can use concept there as well, since we expect P4 to work with many
>    ports. It will likely require some core patches to fix this.
> 
> 2) Right now we are using "packed" construct to enforce alignment in kfunc data
>    exchange; but we're wondering if there is potential to use BTF to understand
>    parameters and their offsets and encode this information at the compiler
>    level.
> 
> 3) At the moment we are creating a static buffer of 128B to retrieve the action
>    parameters. If you have a lot of table entries and individual(non-shared)
>    action instances with actions that require very little (or no) param space
>    a lot of memory is wasted. There may also be cases where 128B may not be
>    enough; (likely this is something we can teach the P4C compiler). If we can
>    have dynamic pointers instead for kfunc fixed length parameterization then
>    this issue is resolvable.
> 
> 4) See "Restating Our Requirements" #5.
>    We would really appreciate ideas/suggestions, etc.
> 
> __References__

Thanks,
John
Jamal Hadi Salim Nov. 17, 2023, 12:49 p.m. UTC | #2
On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > We are seeking community feedback on P4TC patches.
> >
>
> [...]
>
> >
> > What is P4?
> > -----------
>
> I read the cover letter here is my high level takeaway.
>

At least you read the cover letter this time ;->

> P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> point.

Are you familiar with P4 architectures? That code was for PSA (which
is essentially for switches); we are doing PNA (which is more NIC
oriented).
And yes, we used that code as a starting point and made the necessary
changes needed to conform to PNA. We made it actually work better by
using kfuncs.

> At least the cover letter needs to explain why this path is not taken.

I thought we had a reference to that backend - but will add it for the
next update.

> From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> I don't see any reason not to just land it all in BPF. Support exists and if
> its missing some smaller things add them and everyone gets them vs niche P4
> backend.

Ok, I thought you said you read the cover letter. Reasons are well
stated, primarily that we need to make sure all P4 programs work.

>
> Without hardware support for any of this its impossible to understand how 'tc'
> would work as a hardware offload interface for a p4 device so we need hardware
> support to evaluate. For example I'm not even sure how you would take a BPF
> parser into hardware on most network devices that aren't processor based.
>

P4 has nothing to do with parsers in hardware. Where did you get this
requirement from?

> P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> commands so arguing for 'tc' UI is nice is not going to be very compelling.
> Best we can say is it works well enough and we use it.


The control plane interface is netlink. This part is not negotiable.
You can write whatever you want on top of it (for example P4Runtime
using netlink as its southbound interface). We feel that tc - a well
understood utility - is one we should make publicly available for the
rest of the world to use. For example we have rust code that runs on
top of netlink to do performance testing.

> more commentary below.
>
> >
> > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > domain-specific programming language for specifying data plane behavior.
> >
> > The P4 ecosystem includes an extensive range of deployments, products, projects
> > and services, etc[9][10][11][12].
> >
> > __What is P4TC?__
> >
> > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > run independently in different namespaces alongside their appropriate state. The
> > implementation builds on top of many years of Linux TC experiences.
> > On why P4 - see small treatise here:[4].
> >
> > There have been many discussions and meetings since about 2015 in regards to
> > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > done!
> >
> > A lot more of the P4TC motivation is captured at:
> > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> >
> > **In this patch series we focus on s/w datapath only**.
>
> I don't see the value in adding 16676 lines of code for s/w only datapath
> of something we already can do with p4c-ebpf backend.

Please please stop this entitlement politics (which I frankly think
you guys have been getting away with for a few years now).
This code does not touch any core code - you guys constantly push code
that touches core code and it is not unusual we have to pick up the
pieces after but now you are going to call me out for the number of
lines of code? Is it ok for you to write lines of code in the kernel
but not me? Judge the technical work then we can have a meaningful
discussion.

TBH, I am trying very hard to see if I should respond to any more
comments from you. I was very happy with our original scriptable
approach and you came out and banged on the table that you want ebpf.
We spent 10 months of multiple people working on this code to make it
ebpf friendly and now you want more (actually I am not sure what the
hell you want).

> Or one of the other
> backends already there. Namely take P4 programs and run them on CPUs in Linux.
>
> Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> datapath so I'm guessing its slower than most datapaths we have already.
>
> What do we gain here over existing p4c-ebpf?
>

see above.

> >
> > __P4TC Workflow__
> >
> > These patches enable kernel and user space code change _independence_ for any
> > new P4 program that describes a new datapath. The workflow is as follows:
> >
> >   1) A developer writes a P4 program, "myprog"
> >
> >   2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> >      a) shell script(s) which form template definitions for the different P4
> >      objects "myprog" utilizes (tables, externs, actions etc).
>
> This is odd to me. I think packing around shell scrips as a program is not
> very usable. Why not just an object file.
>
> >      b) the parser and the rest of the datapath are generated
> >      in eBPF and need to be compiled into binaries.
> >      c) A json introspection file used for the control plane (by iproute2/tc).
>
> Why split up the eBPF and control plane like this? eBPF has a control plane
> just use the existing one?
>

The cover letter clearly states that we are using netlink as the
control api. Does eBPF support netlink?

> >
> >   3) The developer (or operator) executes the shell script(s) to manifest the
> >      functional "myprog" into the kernel.
> >
> >   4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> >
> >      Example1: parser is an action:
> >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> >         action bpf obj $PARSER.o section parser/tc-ingress \
> >         action bpf obj $PROGNAME.o section p4prog/tc"
> >
> >      Example2: parser explicitly bound and rest of dpath as an action:
> >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> >         prog tc obj $PARSER.o section parser/tc-ingress \
> >         action bpf obj $PROGNAME.o section p4prog/tc"
> >
> >      Example3: parser is at XDP, rest of dpath as an action:
> >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> >         prog type xdp obj $PARSER.o section parser/xdp-ingress \
> >       pinned_link /path/to/xdp-prog-link \
> >         action bpf obj $PROGNAME.o section p4prog/tc"
> >
> >      Example4: parser+prog at XDP:
> >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> >         prog type xdp obj $PROGNAME.o section p4prog/xdp \
> >       pinned_link /path/to/xdp-prog-link"
> >
> >     see individual patches for more examples tc vs xdp etc. Also see section on
> >     "challenges" (on this cover letter).
> >
> > Once "myprog" P4 program is instantiated one can start updating table entries
> > that are associated with myprog's table named "mytable". Example:
> >
> >   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >     action send_to_port param port eno1
>
> As a UI above is entirely cryptic to most folks I bet.
>

But ebpf is not?

> myprog table is a BPF map? If so then I don't see any need for this just
> interact with it like a BPF map. I suspect its some other object, but
> I don't see any ratoinal for that.

All the P4 objects sit in the TC domain. The datapath program is ebpf.
Control is via netlink.


> >
> > A packet arriving on ingress of any of the ports on block 22 will first be
> > exercised via the (eBPF) parser to find the headers pointing to the ip
> > destination address.
> > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > myprog's mytable which returns the action params which are then used to execute
> > the action in the eBPF datapath (eventually sending out packets to eno1).
> > On a table miss, mytable's default miss action is executed.
>
> This chunk looks like standard BPF program. Parse pkt, lookup an action,
> do the action.
>

Yes, the ebpf datapath does the parsing, and then interacts with
kfuncs to the tc world before it (the ebpf datapath) executes the
action.
Note: ebpf did not invent any of that (parse, lookup, action). It has
existed in tc for 20 years before ebpf existed.

> > __Description of Patches__
> >
> > P4TC is designed to have no impact on the core code for other users
> > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > it there should be no impact on your performance.
> >
> > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > actions that can be created on "the fly" based on the P4 program requirement.
>
> the common pattern in bpf for this is to use a tail call map and populate
> it at runtime and/or just compile your program with the actions. Here
> the actions came from the p4 back up at step 1 so no reason we can't
> just compile them with p4c.
>
> > This patch makes a small incision into act_api which shouldn't affect the
> > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > minimalist enablers for P4TC and have no effect the classical tc action.
> > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> >
> > The core P4TC code implements several P4 objects.
>
> [...]
>
> >
> > __Restating Our Requirements__
> >
> > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > classifier and pedit action). In this section we review the scriptable version
> > against the current implementation we are pushing upstream which uses eBPF.
> >
> > Our intention is to target the TC crowd.
> > Essentially developers and ops people deploying TC based infra.
> > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > devs (given code is being generated and doesn't need humans to write it).
>
> I don't follow. humans wrote the p4.
>

But not the ebpf code, that is compiler generated. P4 is a higher
level domain-specific language and ebpf is just one backend (other
s/w variants include DPDK, Rust, C, etc.)

> I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> If the _ops folks are writing P4 great as long as we give them an efficient
> way to run their p4 I don't think they care about what executes it.
>
> >
> > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > meaning from the control plane all the way to the tooling infra, i.e
> > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > etc.
> > The main advantage is that we have a singular vendor-neutral interface via the
> > kernel using well understood mechanisms based on deployment experience (and
> > at least this part doesnt need retraining).
>
> A seemless p4 experience would be great. That looks like a tooling problem
> at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> I would aim for,
>
>   $ p4c-* myprog.p4
>   $ p4cRun ./myprog
>
> And maybe some options like,
>
>   $ p4cRun -i eth0 ./myprog

Armchair lawyering and classical ML bikeshedding

> Then use the p4runtime to interface with the system. If you don't like the
> runtime then it should be brought up in that working group.
>
> >
> > 1) Supporting expressibility of the universe set of P4 progs
> >
> > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > verifier had to be worked around and even then there are cases where we couldnt
> > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > for us. Note, there are still challenges running all potential P4 programs at
> > the XDP level - the solution to that is to have the compiler generate XDP based
> > code only if it possible to map it to that layer.
>
> Examples and we can fix it.

Right. Let me wait for you to fix something 5 years from now. I would
never have used eBPF at all but the kfunc is what changed my mind.

> >
> > 2) Support for P4 HW and SW equivalence.
> >
> > This feature continues to work even in the presence of eBPF as the s/w
> > datapath. There are cases of square-hole-round-peg scenarios but
> > those are implementation issues we can live with.
>
> But no hw support.
>

This patchset has nothing to do with offload (did you read the cover
letter?). All the above is saying is that by virtue of using TC we have a
path to a proven offload approach.


> >
> > 3) Operational usability
> >
> > By maintaining the TC control plane (even in presence of eBPF datapath)
> > runtime aspects remain unchanged. So for our target audience of folks
> > who have deployed tc including offloads - the comfort zone is unchanged.
> > There is also the comfort zone of continuing to use the true-and-tried netlink
> > interfacing.
>
> The P4 control plane should be P4Runtime.
>

And be my guest and write it on top of netlink.

cheers,
jamal

> >
> > There is some loss in operational usability because we now have more knobs:
> > the extra compilation, loading and syncing of ebpf binaries, etc.
> > IOW, I can no longer just ship someone a shell script in an email to
> > say go run this and "myprog" will just work.
> >
> > 4) Operational and development Debuggability
> >
> > If something goes wrong, the tc craftsperson is now required to have additional
> > knowledge of eBPF code and process. This applies to both the operational person
> > as well as someone who wrote a driver. We dont believe this is solvable.
> >
> > 5) Opportunity for rapid prototyping of new ideas
>
> [...]
>
> > 6) Supporting per namespace program
> >
> > This requirement is still met (by virtue of keeping P4 control objects within the
> > TC domain).
>
> BPF can also be network namespaced I'm not sure I understand comment.
>
> >
> > __Challenges__
> >
> > 1) Concept of tc block in XDP is _very tedious_ to implement. It would be nice
> >    if we can use concept there as well, since we expect P4 to work with many
> >    ports. It will likely require some core patches to fix this.
> >
> > 2) Right now we are using "packed" construct to enforce alignment in kfunc data
> >    exchange; but we're wondering if there is potential to use BTF to understand
> >    parameters and their offsets and encode this information at the compiler
> >    level.
> >
> > 3) At the moment we are creating a static buffer of 128B to retrieve the action
> >    parameters. If you have a lot of table entries and individual(non-shared)
> >    action instances with actions that require very little (or no) param space
> >    a lot of memory is wasted. There may also be cases where 128B may not be
> >    enough; (likely this is something we can teach the P4C compiler). If we can
> >    have dynamic pointers instead for kfunc fixed length parameterization then
> >    this issue is resolvable.
> >
> > 4) See "Restating Our Requirements" #5.
> >    We would really appreciate ideas/suggestions, etc.
> >
> > __References__
>
> Thanks,
> John
John Fastabend Nov. 17, 2023, 6:37 p.m. UTC | #3
Jamal Hadi Salim wrote:
> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Jamal Hadi Salim wrote:
> > > We are seeking community feedback on P4TC patches.
> > >
> >
> > [...]
> >
> > >
> > > What is P4?
> > > -----------
> >
> > I read the cover letter here is my high level takeaway.
> >
> 
> At least you read the cover letter this time ;->

I read it last time as well. About midway down I tried to
list the points (1-6) more concisely if folks want to get to the
meat of my argument quickly.

> > P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> > point.
> 
> Are you familiar with P4 architectures? That code was for PSA (which
> is essentially for switches) we are doing PNA (which is more nic
> oriented).

Yes. But for folks that are not familiar, PSA is a switch architecture; it
looks roughly like this,

   parser -> ingress -> deparser -> pkt replication -> parser
                                                        -> egress
                                                           -> deparser
                                                             -> queueing

The gist is ingress/egress blocks hold your p4 logic (match action
tables usually) to xfrm headers, counters, registers, and so on. You
get one on ingress and one on egress to build your logic up.

And PNA is a roughly like this,

   ingress -> parser -> control -> deparser -> accelerators -> host | network 

Accelerators are externs, more or less defined outside P4. Control has
all your metrics, header transforms, registers, and so on. And the parser,
well, it parses headers. The deparser is something we don't typically think
about much on the sw side but it serializes the object back into a packet.
That is a rough couple-line explanation.

You can also define whatever architecture you like and there are some
ways to do that. But if you want to be a PSA or PNA you define those
blocks in your P4. The key idea is to have architectures that map
to a large set of different vendor hardware. Clearly sw and FPGAs
can build mostly any architecture needed.

As an editorial comment P4 is very much a hardware centric view of
the world when looking at P4 architectures. SW never needed these
because we mostly have general purpose CPUs.

> And yes, we used that code as a starting point and made the necessary
> changes needed to conform to PNA. We made it actually work better by
> using kfuncs.

Better performance? More of the P4 DSL program space implemented? The kfuncs
added are equivalent to map ops already in BPF but over 'tc' map types.
Or did I miss some kfuncs.

The p4c-ebpf backend already supports two models; we could have added
the PNA model to it as well. It's actually simpler than the PSA model
in many ways, at least it's fewer blocks. I think all this infrastructure
here could be unnecessary with updates to p4c-ebpf.

> 
> > At least the cover letter needs to explain why this path is not taken.
> 
> I thought we had a reference to that backend - but will add it for the
> next update.
> 
> > From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> > I don't see any reason not to just land it all in BPF. Support exists and if
> > its missing some smaller things add them and everyone gets them vs niche P4
> > backend.
> 
> Ok, i thought you said you read the cover letter. Reasons are well
> stated, primarily that we need to make sure all P4 programs work.

I don't think that is a very strong argument to use/build a p4c-tc
architecture and implementation instead of p4c-ebpf. I can't think
of any reason p4c-ebpf can't support all programs other than perhaps
it's missing a few items. From a design side though it should be
capable of any PSA, PNA, and many more architectures you come up
with.

And I'm genuinely curious what is missing, so a list would be nice.
The missing block perhaps is a performant software TCAM, but I'm
not fully convinced that software should even bother to try and
duplicate a TCAM. If you need a performant TCAM buy hw with a TCAM;
emulating one is always going to be slower. Have Intel/AMD/ARM
glue a TCAM to the core if it's so useful.

To be clear p4c-tc is only targeting PNA programs not all P4 space.

> 
> >
> > Without hardware support for any of this its impossible to understand how 'tc'
> > would work as a hardware offload interface for a p4 device so we need hardware
> > support to evaluate. For example I'm not even sure how you would take a BPF
> > parser into hardware on most network devices that aren't processor based.
> >
> 
> P4 has nothing to do with parsers in hardware. Where did you get this
> requirement from?

P4 is/was primarily developed as a DSL to program hardware. We've
never figured out how to do a native Linux P4 controller for hardware.
There are a couple of blockers for that in my opinion. First, no one
has ever opened up the hardware to an OSS solution. Two, it's
never been entirely clear what the big win for enough people would be.
So we get targeted offloads: timestamp, vxlan, tso, ktls, I even
heard quic offload yesterday. And it's easy enough to just program
the hardware directly from user space.

So yes I think P4 has a lot to do with hardware; it's probably
fair to say this p4c-tc thing isn't hardware. But, I think it's
very limiting and the value of any p4 implementation in the kernel
would be its ability to use hardware.

I'm not even convinced P4 is a good DSL for SW implementations.
I don't think its obvious how hw P4 and sw datapaths integrate
effectively. My opinion is p4c-tc is not moving us forward
here.

> 
> > P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> > commands so arguing for 'tc' UI is nice is not going to be very compelling.
> > Best we can say is it works well enough and we use it.
> 
> 
> The control plane interface is netlink. This part is not negotiable.
> You can write whatever you want on top of it(for example P4runtime
> using netlink as its southbound interface). We feel that tc - a well

Sure we need a low level interface for p4runtime to use and I
agree we don't need all blocks done at once.

> understood utility - is one we should make publicly available for the
> rest of the world to use. For example we have rust code that runs on
> top of netlink to do performance testing.

If updates/lookups from userspace are a performance vector you
care about, I can't see how netlink is more efficient than an
mmapped bpf map. If you have data share it, but it seems
highly unlikely.

The argument I'm trying to make is netlink vs bpf maps vs
some other goo shouldn't matter to users because we should
build them higher level tooling to interact with the p4
objects. Then it comes down to performance in my opinion.
And if map updates matter I suspect netlink is relatively
slow.

> 
> > more commentary below.
> >
> > >
> > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > domain-specific programming language for specifying data plane behavior.
> > >
> > > The P4 ecosystem includes an extensive range of deployments, products, projects
> > > and services, etc[9][10][11][12].
> > >
> > > __What is P4TC?__
> > >
> > > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > > run independently in different namespaces alongside their appropriate state. The
> > > implementation builds on top of many years of Linux TC experiences.
> > > On why P4 - see small treatise here:[4].
> > >
> > > There have been many discussions and meetings since about 2015 in regards to
> > > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > > done!
> > >
> > > A lot more of the P4TC motivation is captured at:
> > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > >
> > > **In this patch series we focus on s/w datapath only**.
> >
> > I don't see the value in adding 16676 lines of code for s/w only datapath
> > of something we already can do with p4c-ebpf backend.
> 
> Please please stop this entitlement politics (which i frankly think
> you guys have been getting away with for a few years now).

I'm allowed to disagree with your architecture and propose what I think
is a better way to translate P4 into software.

Its common to argue against adding new code if it duplicates functionality
we already support.

> This code does not touch any core code - you guys constantly push code
> that touches core code and it is not unusual we have to pick up the
> pieces after but now you are going to call me out for the number of
> lines of code? Is it ok for you to write lines of code in the kernel
> but not me? Judge the technical work then we can have a meaningful
> discussion.

I think I'm judging the technical work here. Bullet points.

1. The p4c-tc implementation looks like it should be slower in
   terms of pkts/sec than a bpf implementation. Meaning
   I suspect a pipeline and objects laid out like this will lose
   to a BPF program with a parser and a single lookup (see the
   sketch after this list). The p4c-ebpf compiler should look to
   create optimized eBPF code, not some emulated switch topology.

2. The p4c-tc control plane looks slower than a directly mmapped bpf
   map: doing a simple update vs a netlink msg. The argument
   that BPF can't do CRUD (which we had offlist) seems incorrect
   to me. Correct me if I'm wrong with details about why.

3. I don't see why ebpf cannot support all P4 programs. That
   the DSL compiler side doesn't support the nic architecture
   side indicates to me that fixing the compiler is the direction,
   not pushing on the kernel.

4. Working in the BPF framework will benefit more folks than a tc
   framework. I just don't see a large user base of P4 software
   running on Linux. It doesn't mean we can't have it in Linux,
   but it is worth considering. We have lots of niche stuff in the
   kernel, but usually the niche thing doesn't have another
   more common way to run it.

5. The win for P4 is not a sw implementation. It's about getting
   programmable hardware, and this doesn't advance that goal
   in any meaningful way as far as I can see.

6. By pushing the P4 model so low in the stack of tooling
   you lose the ability for the compiler to do interesting things:
   combining match-action tables, converting them to
   switch statements or jumps, finding inverse operations
   and removing them. I still think there is lots of unexplored
   work on compiling P4 that has not been done.
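
The kind of program point 1 has in mind is roughly the sketch below. This
is an illustration only, not output of p4c-ebpf or anything from this
series, and the map layout and names are assumptions: parse just enough,
do one lookup, then act.

    /* Sketch of "parser + single lookup" at XDP: parse eth/IPv4, one LPM
     * lookup on the destination, act on the result. Illustrative only. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct lpm_key {
            __u32 prefixlen;
            __u32 daddr;
    };

    struct fwd_act {
            __u32 egress_ifindex;   /* 0 means drop */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_LPM_TRIE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __uint(max_entries, 1024);
            __type(key, struct lpm_key);
            __type(value, struct fwd_act);
    } fwd_tbl SEC(".maps");

    SEC("xdp")
    int xdp_fwd(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;
            struct iphdr *iph = data + sizeof(*eth);
            struct lpm_key key = { .prefixlen = 32 };
            struct fwd_act *act;

            /* the "parser": just enough validation to build the lookup key */
            if ((void *)(iph + 1) > data_end)
                    return XDP_PASS;
            if (eth->h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;
            key.daddr = iph->daddr;

            /* single table lookup, then the action */
            act = bpf_map_lookup_elem(&fwd_tbl, &key);
            if (!act || !act->egress_ifindex)
                    return XDP_DROP;
            return bpf_redirect(act->egress_ifindex, 0);
    }

    char _license[] SEC("license") = "GPL";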

> 
> TBH, I am trying very hard to see if i should respond to any more
> comments from you. I was very happy with our original scriptable
> approach and you came out and banged on the table that you want ebpf.
> We spent 10 months of multiple people working on this code to make it
> ebpf friendly and now you want more (actually i am not sure what the
> hell you want).

I've made the above arguments on early versions of the code,
and when we talked, and even offered them in the p4 working group.
It shouldn't be surprising that I've not changed my opinion.

It's an argument against duplicating existing functionality with
something that is slower and doesn't give us HW P4 support. The
bullets above.


> 
> > Or one of the other
> > backends already there. Namely take P4 programs and run them on CPUs in Linux.
> >
> > Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> > datapath so I'm guessing its slower than most datapaths we have already.
> >
> > What do we gain here over existing p4c-ebpf?
> >
> 
> see above.

We are talking past each other because here I argue it looks like a slow
datapath and you say 'see above', but what above was I meant to see?
That it doesn't have PNA support? Compared to PSA, doing PNA support
should be straightforward.

I disagree that software should try to emulate hardware too closely.
They are fundamentally different platforms. One has CAMs, TCAMs,
and LPMs and obscure instruction sets to make all this work. The other
is working on a general-purpose CPU. I think slamming a hardware
architecture into software, with emulated TCAMs and whatnot,
will be a losing performance proposition. Experience shows you can
either go the SIMD direction and parallelize everything with those
instructions, or you reduce the datapath to a single lookup (or a
minimal set of them). Find a counter-example.

> 
> > >
> > > __P4TC Workflow__
> > >
> > > These patches enable kernel and user space code change _independence_ for any
> > > new P4 program that describes a new datapath. The workflow is as follows:
> > >
> > >   1) A developer writes a P4 program, "myprog"
> > >
> > >   2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > >      a) shell script(s) which form template definitions for the different P4
> > >      objects "myprog" utilizes (tables, externs, actions etc).
> >
> > This is odd to me. I think packing around shell scrips as a program is not
> > very usable. Why not just an object file.
> >
> > >      b) the parser and the rest of the datapath are generated
> > >      in eBPF and need to be compiled into binaries.
> > >      c) A json introspection file used for the control plane (by iproute2/tc).
> >
> > Why split up the eBPF and control plane like this? eBPF has a control plane
> > just use the existing one?
> >
> 
> The cover letter clearly states that we are using netlink as the
> control api. Does eBPF support netlink?

But why? The statement is there but no rationale is given. "People are
used to it" was maybe stated, but my argument is users of P4 shouldn't
be crafting netlink messages; they need tooling whether it's netlink or BPF
or some new thing. So pick the most efficient tool for the job. Why
is netlink the most efficient option here?

> 
> > >
> > >   3) The developer (or operator) executes the shell script(s) to manifest the
> > >      functional "myprog" into the kernel.
> > >
> > >   4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> > >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> > >
> > >      Example1: parser is an action:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         action bpf obj $PARSER.o section parser/tc-ingress \
> > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > >      Example2: parser explicitly bound and rest of dpath as an action:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         prog tc obj $PARSER.o section parser/tc-ingress \
> > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > >      Example3: parser is at XDP, rest of dpath as an action:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         prog type xdp obj $PARSER.o section parser/xdp-ingress \
> > >       pinned_link /path/to/xdp-prog-link \
> > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > >      Example4: parser+prog at XDP:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         prog type xdp obj $PROGNAME.o section p4prog/xdp \
> > >       pinned_link /path/to/xdp-prog-link"
> > >
> > >     see individual patches for more examples tc vs xdp etc. Also see section on
> > >     "challenges" (on this cover letter).
> > >
> > > Once "myprog" P4 program is instantiated one can start updating table entries
> > > that are associated with myprog's table named "mytable". Example:
> > >
> > >   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> > >     action send_to_port param port eno1
> >
> > As a UI above is entirely cryptic to most folks I bet.
> >
> 
> But ebpf is not?

We don't need everything out of the gate, but my point is that the UI
should be abstracted away from the P4 programmer and operator at
this level. My observation that 'tc' is cryptic was just an off-hand
comment; I don't think it's relevant to the overall argument for or against.
What we should understand is how to map p4runtime, or at least an
operator-friendly UI, onto the semantics.

> 
> > myprog table is a BPF map? If so then I don't see any need for this just
> > interact with it like a BPF map. I suspect its some other object, but
> > I don't see any ratoinal for that.
> 
> All the P4 objects sit in the TC domain. The datapath program is ebpf.
> Control is via netlink.

I'm missing something fundamental. What do we gain from this TC domain?
There are some TC maps for LPM and TCAMs; we have LPM already in BPF,
and the TCAM you have could easily be added if you want to. Then the entire
program runs to completion. Surely this is more performant. Throw in
XDP and the redirect never leaves the NIC, no skb, etc.

From the architecture side I don't think we need kernel objects
for pipelines and some P4 notion of match-action tables; those
can all be mapped into the BPF program. The packet never leaves
XDP. Performance is good on the datapath and performance is good
on the map update side. It looks like noise to me teaching the kernel
about P4 objects and types. More importantly, you are constraining
the optimizations the compiler can make. Perhaps the compiler
wants no map at all and implements a table as a switch stmt, for
example. Maybe the compiler can find inverse operations and
fastpaths to short circuit. By forcing the model so low in
the stack you remove this ability.
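
As a toy illustration of the switch-statement point above (hypothetical
compiler output, not anything p4c emits today): a small, constant
match-action table need not become a map at all.

    /* Hypothetical lowering of a tiny exact-match table into
     * straight-line code instead of a map lookup; the action ids
     * and names here are made up. */
    enum { ACT_DROP, ACT_FORWARD_FAST, ACT_TRAP_TO_CPU };

    static inline int classify_dport(unsigned short dport)
    {
            switch (dport) {
            case 53:                        /* entry: dns   -> trap  */
                    return ACT_TRAP_TO_CPU;
            case 443:                       /* entry: https -> fast  */
                    return ACT_FORWARD_FAST;
            default:                        /* table miss action     */
                    return ACT_DROP;
            }
    }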

> 
> 
> > >
> > > A packet arriving on ingress of any of the ports on block 22 will first be
> > > exercised via the (eBPF) parser to find the headers pointing to the ip
> > > destination address.
> > > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > > myprog's mytable which returns the action params which are then used to execute
> > > the action in the eBPF datapath (eventually sending out packets to eno1).
> > > On a table miss, mytable's default miss action is executed.
> >
> > This chunk looks like standard BPF program. Parse pkt, lookup an action,
> > do the action.
> >
> 
> Yes, the ebpf datapath does the parsing, and then interacts with
> kfuncs to the tc world before it (the ebpf datapath) executes the
> action.
> Note: ebpf did not invent any of that (parse, lookup, action). It has
> existed in tc for 20 years before ebpf existed.

It's not about who invented what. All this goes way back.

My point is the 'tc' world here looks unnecessary. It can be managed
from outside the kernel entirely.

> 
> > > __Description of Patches__
> > >
> > > P4TC is designed to have no impact on the core code for other users
> > > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > > it there should be no impact on your performance.
> > >
> > > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > > actions that can be created on "the fly" based on the P4 program requirement.
> >
> > the common pattern in bpf for this is to use a tail call map and populate
> > it at runtime and/or just compile your program with the actions. Here
> > the actions came from the p4 back up at step 1 so no reason we can't
> > just compile them with p4c.
> >
> > > This patch makes a small incision into act_api which shouldn't affect the
> > > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > > minimalist enablers for P4TC and have no effect the classical tc action.
> > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > >
> > > The core P4TC code implements several P4 objects.
> >
> > [...]
> >
> > >
> > > __Restating Our Requirements__
> > >
> > > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > > classifier and pedit action). In this section we review the scriptable version
> > > against the current implementation we are pushing upstream which uses eBPF.
> > >
> > > Our intention is to target the TC crowd.
> > > Essentially developers and ops people deploying TC based infra.
> > > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > > devs (given code is being generated and doesn't need humans to write it).
> >
> > I don't follow. humans wrote the p4.
> >
> 
> But not the ebpf code, that is compiler generated. P4 is a higher
> level Domain specific language and ebpf is just one backend (others
> s/w variants include DPDK, Rust, C, etc)

Yes. I still don't follow. Of course ebpf is just one backend.

> 
> > I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> > If the _ops folks are writing P4 great as long as we give them an efficient
> > way to run their p4 I don't think they care about what executes it.
> >
> > >
> > > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > > meaning from the control plane all the way to the tooling infra, i.e
> > > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > > etc.
> > > The main advantage is that we have a singular vendor-neutral interface via the
> > > kernel using well understood mechanisms based on deployment experience (and
> > > at least this part doesnt need retraining).
> >
> > A seemless p4 experience would be great. That looks like a tooling problem
> > at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> > I would aim for,
> >
> >   $ p4c-* myprog.p4
> >   $ p4cRun ./myprog
> >
> > And maybe some options like,
> >
> >   $ p4cRun -i eth0 ./myprog
> 
> Armchair lawyering and classical ML bikesheding

It was just an example of what I think the end goal should be.

> 
> > Then use the p4runtime to interface with the system. If you don't like the
> > runtime then it should be brought up in that working group.
> >
> > >
> > > 1) Supporting expressibility of the universe set of P4 progs
> > >
> > > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > > verifier had to be worked around and even then there are cases where we couldnt
> > > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > > for us. Note, there are still challenges running all potential P4 programs at
> > > the XDP level - the solution to that is to have the compiler generate XDP based
> > > code only if it possible to map it to that layer.
> >
> > Examples and we can fix it.
> 
> Right. Let me wait for you to fix something 5 years from now. I would
> never have used eBPF at all but the kfunc is what changed my mind.
> 
> > >
> > > 2) Support for P4 HW and SW equivalence.
> > >
> > > This feature continues to work even in the presence of eBPF as the s/w
> > > datapath. There are cases of square-hole-round-peg scenarios but
> > > those are implementation issues we can live with.
> >
> > But no hw support.
> >
> 
> This patcheset has nothing to do with offload (you read the cover
> letter?). All above is saying is that by virtue of using TC we have a
> path to a proven offload approach.

I'm arguing P4 is in large part about programmable HW. If we merge
P4 into the kernel all the way down to the P4 types and don't
consider how it works with hardware, that is a non-starter for me.

> 
> 
> > >
> > > 3) Operational usability
> > >
> > > By maintaining the TC control plane (even in presence of eBPF datapath)
> > > runtime aspects remain unchanged. So for our target audience of folks
> > > who have deployed tc including offloads - the comfort zone is unchanged.
> > > There is also the comfort zone of continuing to use the true-and-tried netlink
> > > interfacing.
> >
> > The P4 control plane should be P4Runtime.
> >
> 
> And be my guest and write it on top of netlink.

But I would prefer it were a BPF map; I gave my reasons above.

> 
> cheers,
> jamal
Jamal Hadi Salim Nov. 17, 2023, 8:46 p.m. UTC | #4
On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Jamal Hadi Salim wrote:
> > > > We are seeking community feedback on P4TC patches.
> > > >
> > >
> > > [...]
> > >
> > > >
> > > > What is P4?
> > > > -----------
> > >
> > > I read the cover letter here is my high level takeaway.
> > >
> >
> > At least you read the cover letter this time ;->
>
> I read it last time as well. About mid way down I tried to
> list the points (1-5) more concisely if folks want to get to the
> meat of my argument quickly.

You wrote an essay - I will just jump to your points further down the
text below and try to summarize them...

> > > P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> > > point.
> >
> > Are you familiar with P4 architectures? That code was for PSA (which
> > is essentially for switches) we are doing PNA (which is more nic
> > oriented).
>
> Yes. But for folks that are not PSA is a switch architecture it
> looks roughly like this,
>
>    parser -> ingress -> deparser -> pkt replication -> parser
>                                                         -> egress
>                                                            -> deparser
>                                                              -> queueing
>
> The gist is ingress/egress blocks hold your p4 logic (match action
> tables usually) to xfrm headers, counters, registers, and so on. You
> get one on ingress and one on egress to build your logic up.
>
> And PNA is a roughly like this,
>
>    ingress -> parser -> control -> deparser -> accelerators -> host | network
>
> Accelerators are externs, more or less defined outside P4. Control has
> all your metrics, header transforms, registers, and so on. And the parser,
> well, it parses headers. Deparser is something we don't typically think
> about much on sw side but it serializes the object back into a packet.
> That is a rough couple line explanation.
>
> You can also define whatever architecture you like and there are some
> ways to do that. But if you want to be a PSA or PNA you define those
> blocks in your P4. The key idea is to have architectures that map
> to a large set of different vendor hardware. Clearly sw and FPGAs
> can build mostly any architecture needed.
>
> As an editorial comment, P4 is very much a hardware-centric view of
> the world when looking at P4 architectures. SW never needed these
> because we mostly have general-purpose CPUs.
>
> > And yes, we used that code as a starting point and made the necessary
> > changes needed to conform to PNA. We made it actually work better by
> > using kfuncs.
>
> Better performance? More P4 DSL program space implemented? The kfuncs
> added are equivalent to map ops already in BPF but over 'tc' map types.
> Or did I miss some kfuncs?
>
> The p4c-ebpf backend already supports two models; we could have added
> the PNA model to it as well. It's actually simpler than the PSA model
> in many ways, at least it's fewer blocks. I think all this infrastructure
> here could be unnecessary with updates to p4c-ebpf.
>
> >
> > > At least the cover letter needs to explain why this path is not taken.
> >
> > I thought we had a reference to that backend - but will add it for the
> > next update.
> >
> > > From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> > > I don't see any reason not to just land it all in BPF. Support exists and if
> > > its missing some smaller things add them and everyone gets them vs niche P4
> > > backend.
> >
> > Ok, i thought you said you read the cover letter. Reasons are well
> > stated, primarily that we need to make sure all P4 programs work.
>
> I don't think that is a very strong argument to use/build a p4c-tc
> architecture and implementation instead of p4c-ebpf. I can't think
> of any reason p4c-ebpf can't support all programs other than perhaps
> it's missing a few items. From a design side, though, it should be
> capable of any PSA, PNA, and many more architectures you come up
> with.
>
> And I'm genuinely curious what is missing, so a list would be nice.
> The missing block perhaps is a performant software TCAM, but I'm
> not fully convinced that software should even bother to try and
> duplicate a TCAM. If you need a performant TCAM, buy hw with a TCAM;
> emulating one is always going to be slower. Have Intel/AMD/ARM
> glue a TCAM to the core if it's so useful.
>
> To be clear p4c-tc is only targeting PNA programs not all P4 space.
>
> >
> > >
> > > Without hardware support for any of this its impossible to understand how 'tc'
> > > would work as a hardware offload interface for a p4 device so we need hardware
> > > support to evaluate. For example I'm not even sure how you would take a BPF
> > > parser into hardware on most network devices that aren't processor based.
> > >
> >
> > P4 has nothing to do with parsers in hardware. Where did you get this
> > requirement from?
>
> P4 is/was primarily developed as a DSL to program hardware. We've
> never figured out how to do a native Linux P4 controller for hardware.
> There are a couple blockers for that in my opinion. First no one
> has ever opened up the hardware to an OSS solution. Two its
> never been entirely clear what the big win for enough people would be.
> So we get targetted offloads, timestamp, vxlan, tso, ktls, even
> heard quic offload yesterday. And its easy enough to just program
> the hardware directly from user space.
>
> So yes I think P4 has a lot to do with hardware, its probably
> fair to say this p4c-tc thing isn't hardware. But, I think its
> very limiting and the value of any p4 implementation in kernel
> would be its ability to use hardware.
>
> I'm not even convinced P4 is a good DSL for SW implementations.
> I don't think its obvious how hw P4 and sw datapaths integrate
> effectively. My opinion is p4c-tc is not moving us forward
> here.
>
> >
> > > P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> > > commands so arguing for 'tc' UI is nice is not going to be very compelling.
> > > Best we can say is it works well enough and we use it.
> >
> >
> > The control plane interface is netlink. This part is not negotiable.
> > You can write whatever you want on top of it(for example P4runtime
> > using netlink as its southbound interface). We feel that tc - a well
>
> Sure we need a low level interface for p4runtime to use and I
> agree we don't need all blocks done at once.
>
> > understood utility - is one we should make publicly available for the
> > rest of the world to use. For example we have rust code that runs on
> > top of netlink to do performance testing.
>
> If updates/lookups from userspace is a performance vector you
> care about I can't see how netlink is more efficient than a
> mmapped bpf map. If you have data share it, but it seems
> highly unlikely.
>
> The argument I'm trying to make is netlink vs bpf maps vs
> some other goo shouldn't matter to users because we should
> build them higher level tooling to interact with the p4
> objects. Then it comes down to performance in my opinion.
> And if map updates matter I suspect netlink is relatively
> slow.
>
> >
> > > more commentary below.
> > >
> > > >
> > > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > > domain-specific programming language for specifying data plane behavior.
> > > >
> > > > The P4 ecosystem includes an extensive range of deployments, products, projects
> > > > and services, etc[9][10][11][12].
> > > >
> > > > __What is P4TC?__
> > > >
> > > > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > > > run independently in different namespaces alongside their appropriate state. The
> > > > implementation builds on top of many years of Linux TC experiences.
> > > > On why P4 - see small treatise here:[4].
> > > >
> > > > There have been many discussions and meetings since about 2015 in regards to
> > > > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > > > done!
> > > >
> > > > A lot more of the P4TC motivation is captured at:
> > > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > > >
> > > > **In this patch series we focus on s/w datapath only**.
> > >
> > > I don't see the value in adding 16676 lines of code for s/w only datapath
> > > of something we already can do with p4c-ebpf backend.
> >
> > Please please stop this entitlement politics (which i frankly think
> > you guys have been getting away with for a few years now).
>
> I'm allowed to disagree with your architecture and propose what I think
> is a betteer way to translate P4 into software.
>
> Its common to argue against adding new code if it duplicates functionality
> we already support.
>
> > This code does not touch any core code - you guys constantly push code
> > that touches core code and it is not unusual we have to pick up the
> > pieces after but now you are going to call me out for the number of
> > lines of code? Is it ok for you to write lines of code in the kernel
> > but not me? Judge the technical work then we can have a meaningful
> > discussion.
>
> I think I'm judging the technical work here. Bullet points.
>
> 1. The p4c-tc implementation looks like it should be slower in
>    terms of pkts/sec than a bpf implementation. Meaning
>    I suspect a pipeline and objects laid out like this will lose
>    to a BPF program with a parser and a single lookup. The p4c-ebpf
>    compiler should look to create optimized eBPF code, not some
>    emulated switch topology.
>

The parser is ebpf-based. The other objects which require control
plane interaction are not - those interact via netlink.
We published perf data a while back - presented at the P4 workshop
back in April (it was in the cover letter):
https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
But do note: the correct abstraction is the first priority.
Optimization is something we can teach the compiler over time. But
even with the minimalist code generation you can see that our approach
always beats ebpf in LPM and ternary. The other ones I am pretty sure
we can optimize over time.
Your view of "single lookup" is true for simple programs, but if you
have 10 tables trying to model a 5G function then it doesn't make sense
(and I think the data we published was clear that you gain no
advantage using ebpf - as a matter of fact there was no perf
difference between XDP and tc in such cases).

> 2. The p4c-tc control plane looks slower than a directly mmapped bpf
>    map: doing a simple update vs a netlink msg. The argument
>    that BPF can't do CRUD (which we had offlist) seems incorrect
>    to me. Correct me if I'm wrong with details about why.
>

So let me see....
you want me to replace netlink and all its features and rewrite it
using the ebpf system calls? Congestion control, event handling,
arbitrary message crafting, etc., and the years of work that went into
netlink? NO to the HELL.
I should note that there was an interesting talk at netdevconf 0x17
where the speaker showed the challenges of dealing with ebpf on "day
two" - slides or videos are not up yet, but the link is:
https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
The point the speaker was making is that it's always easy to whip up an ebpf
program that can slice and dice packets and maybe even flash LEDs, but
the real work and challenge is in the control plane. I agree with the
speaker based on my experiences. This discussion of replacing netlink
with ebpf system calls is absolutely a non-starter. Let's just end the
discussion and agree to disagree if you are going to keep insisting on
that.

> 3. I don't see why ebpf cannot support all P4 programs. That
>    the DSL compiler side doesn't support the nic architecture
>    side indicates to me that fixing the compiler is the direction,
>    not pushing on the kernel.
>

Wrestling with the verifier, different versions of toolchains, etc.
This is not just a problem we are facing; just about everyone out there
that tries to do something serious with ebpf eventually hits these
issues. Kfuncs really opened the door for us (I think they improved
the usability of ebpf by probably orders of magnitude). Without kfuncs I
would not have even considered ebpf - and did I say I was fine with the
u32 and pedit approach we had.

> 4. Working in the BPF framework will benefit more folks than a tc
>    framework. I just don't see a large user base of P4 software
>    running on Linux. It doesn't mean we can't have it in Linux,
>    but it is worth considering. We have lots of niche stuff in the
>    kernel, but usually the niche thing doesn't have another
>    more common way to run it.
>

To each their itch - that's what open source is about. This is our
itch. You don't have to like it or use it. There are a lot of things
I don't like in the kernel and would never use. Saying you don't see a
"large user base of P4 software on Linux" is handwaving at best. Under
what metric do you reach such a conclusion? The fact that I can
describe something in a _simple_ high-level language like P4 and get
low-level ebpf for free is of great value. I don't need to go and look
for an ebpf expert to hand-code things for me.

> 5. The win for P4 is not a sw implementation. It's about getting
>    programmable hardware, and this doesn't advance that goal
>    in any meaningful way as far as I can see.

And all the s/w incarnations of P4 out there would disagree with you.
The fact that P4 has use in h/w doesn't disqualify it from being useful
in s/w.

> 6. By pushing the P4 model so low in the stack of tooling
>    you lose the ability for the compiler to do interesting things:
>    combining match-action tables, converting them to
>    switch statements or jumps, finding inverse operations
>    and removing them. I still think there is lots of unexplored
>    work on compiling P4 that has not been done.
>

And that can be done over time, unless you are saying it is impossible.
ebpf != P4; they are two different levels of expression. eBPF is just
a tool to get us there and nothing more.

cheers,
jamal

> >
> > TBH, I am trying very hard to see if i should respond to any more
> > comments from you. I was very happy with our original scriptable
> > approach and you came out and banged on the table that you want ebpf.
> > We spent 10 months of multiple people working on this code to make it
> > ebpf friendly and now you want more (actually i am not sure what the
> > hell you want).
>
> I've made the above arguments on early versions of the code,
> and when we talked, and even offered it in p4 working group.
> It shouldn't be surprising I've not changed my opinion.
>
> Its a argument against duplicating existing functionality with
> something that is slower and doesn't give us HW P4 support. The
> bullets above.
>
>
> >
> > > Or one of the other
> > > backends already there. Namely take P4 programs and run them on CPUs in Linux.
> > >
> > > Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> > > datapath so I'm guessing its slower than most datapaths we have already.
> > >
> > > What do we gain here over existing p4c-ebpf?
> > >
> >
> > see above.
>
> We are talking past eachother becaus here I argue it looks like a slow
> datapath and you say 'see above' but what above was I meant to see?
> That it doesn't have PNA support? Compared to PSA doing a PNA support
> should be straightforward.
>
> I disagree that software should try to emulate hardware to closely.
> They are fundamentally different platforms. One has CAMs, TCAMs,
> and LPMs and obscure instruction sets to make all this work. The other
> is working on a general purpose CPU. I think slamming a hardware
> architecture into software with emulated TCAMs and what not,
> will be a losing performance proposition. Experience shows you can
> either go SIMD direction and parrallize everything with these instructions
> or you reduce the datapath to a single (or minimal set) of lookups.
> Find a counter-example.
>
> >
> > > >
> > > > __P4TC Workflow__
> > > >
> > > > These patches enable kernel and user space code change _independence_ for any
> > > > new P4 program that describes a new datapath. The workflow is as follows:
> > > >
> > > >   1) A developer writes a P4 program, "myprog"
> > > >
> > > >   2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > > >      a) shell script(s) which form template definitions for the different P4
> > > >      objects "myprog" utilizes (tables, externs, actions etc).
> > >
> > > This is odd to me. I think packing around shell scrips as a program is not
> > > very usable. Why not just an object file.
> > >
> > > >      b) the parser and the rest of the datapath are generated
> > > >      in eBPF and need to be compiled into binaries.
> > > >      c) A json introspection file used for the control plane (by iproute2/tc).
> > >
> > > Why split up the eBPF and control plane like this? eBPF has a control plane
> > > just use the existing one?
> > >
> >
> > The cover letter clearly states that we are using netlink as the
> > control api. Does eBPF support netlink?
>
> But why? The statement is there but no rational is given. People are
> used to it was maybe stated, but my argument is users of P4 shouldn't
> be crafting netlink messages they need tooling if its netlink or BPF
> or some new thing. So pick the most efficient tool for the job. Why
> is netlink the most efficient option here.
>
> >
> > > >
> > > >   3) The developer (or operator) executes the shell script(s) to manifest the
> > > >      functional "myprog" into the kernel.
> > > >
> > > >   4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> > > >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> > > >
> > > >      Example1: parser is an action:
> > > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > >         action bpf obj $PARSER.o section parser/tc-ingress \
> > > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > >      Example2: parser explicitly bound and rest of dpath as an action:
> > > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > >         prog tc obj $PARSER.o section parser/tc-ingress \
> > > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > >      Example3: parser is at XDP, rest of dpath as an action:
> > > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > >         prog type xdp obj $PARSER.o section parser/xdp-ingress \
> > > >       pinned_link /path/to/xdp-prog-link \
> > > >         action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > >      Example4: parser+prog at XDP:
> > > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > >         prog type xdp obj $PROGNAME.o section p4prog/xdp \
> > > >       pinned_link /path/to/xdp-prog-link"
> > > >
> > > >     see individual patches for more examples tc vs xdp etc. Also see section on
> > > >     "challenges" (on this cover letter).
> > > >
> > > > Once "myprog" P4 program is instantiated one can start updating table entries
> > > > that are associated with myprog's table named "mytable". Example:
> > > >
> > > >   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> > > >     action send_to_port param port eno1
> > >
> > > As a UI above is entirely cryptic to most folks I bet.
> > >
> >
> > But ebpf is not?
>
> We don't need everything out the gate but my point is that the UI
> should be abstracted away from the P4 programmer and operator at
> this level. My observation that 'tc' is cryptic was just an off-hand
> comment I don't think its relevant to the overall argument for or against,
> what we should understand is how to map p4runtime or at least a
> operator friendly UI onto the semantics.
>
> >
> > > myprog table is a BPF map? If so then I don't see any need for this just
> > > interact with it like a BPF map. I suspect its some other object, but
> > > I don't see any ratoinal for that.
> >
> > All the P4 objects sit in the TC domain. The datapath program is ebpf.
> > Control is via netlink.
>
> I'm missing something fundamental. What do we gain from this TC domain.
> There are some TC maps for LPM and TCAMs we have LPM already in BPF
> and TCAM you have could easily be added if you want to. Then entire
> program runs to completion. Surely this is more performant. Throw in
> XDP and the redirect never leaves the NIC, no skb, etc.
>
> From the architecture side I don't think we need kernel objects
> for pipelines and some P4 notion of match action tables those
> can all be mapped into the BPF program. The packet never leaves
> XDP. Performance is good on datapath and performance is good
> on map update side. It looks like noise to me teaching the kernel
> about P4 objects and types. More importantly you are constraining
> the optimizations the compiler can make. Perhaps the compiler
> wants no map at all and implements it as a switch stmt for
> example. Maybe the compiler can find inverse operations and
> fastpaths to short circuit. By forcing the model so low in
> the stack you remove this ability.
>
> >
> >
> > > >
> > > > A packet arriving on ingress of any of the ports on block 22 will first be
> > > > exercised via the (eBPF) parser to find the headers pointing to the ip
> > > > destination address.
> > > > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > > > myprog's mytable which returns the action params which are then used to execute
> > > > the action in the eBPF datapath (eventually sending out packets to eno1).
> > > > On a table miss, mytable's default miss action is executed.
> > >
> > > This chunk looks like standard BPF program. Parse pkt, lookup an action,
> > > do the action.
> > >
> >
> > Yes, the ebpf datapath does the parsing, and then interacts with
> > kfuncs to the tc world before it (the ebpf datapath) executes the
> > action.
> > Note: ebpf did not invent any of that (parse, lookup, action). It has
> > existed in tc for 20 years before ebpf existed.
>
> Its not about who invented what. All this goes way back.
>
> My point is the 'tc' world here looks unnecessary. It can be managed
> from outside the kernel entirely.
>
> >
> > > > __Description of Patches__
> > > >
> > > > P4TC is designed to have no impact on the core code for other users
> > > > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > > > it there should be no impact on your performance.
> > > >
> > > > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > > > actions that can be created on "the fly" based on the P4 program requirement.
> > >
> > > the common pattern in bpf for this is to use a tail call map and populate
> > > it at runtime and/or just compile your program with the actions. Here
> > > the actions came from the p4 back up at step 1 so no reason we can't
> > > just compile them with p4c.
> > >
> > > > This patch makes a small incision into act_api which shouldn't affect the
> > > > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > > > minimalist enablers for P4TC and have no effect the classical tc action.
> > > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > > >
> > > > The core P4TC code implements several P4 objects.
> > >
> > > [...]
> > >
> > > >
> > > > __Restating Our Requirements__
> > > >
> > > > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > > > classifier and pedit action). In this section we review the scriptable version
> > > > against the current implementation we are pushing upstream which uses eBPF.
> > > >
> > > > Our intention is to target the TC crowd.
> > > > Essentially developers and ops people deploying TC based infra.
> > > > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > > > devs (given code is being generated and doesn't need humans to write it).
> > >
> > > I don't follow. humans wrote the p4.
> > >
> >
> > But not the ebpf code, that is compiler generated. P4 is a higher
> > level Domain specific language and ebpf is just one backend (others
> > s/w variants include DPDK, Rust, C, etc)
>
> Yes. I still don't follow. Of course ebpf is just one backend.
>
> >
> > > I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> > > If the _ops folks are writing P4 great as long as we give them an efficient
> > > way to run their p4 I don't think they care about what executes it.
> > >
> > > >
> > > > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > > > meaning from the control plane all the way to the tooling infra, i.e
> > > > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > > > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > > > etc.
> > > > The main advantage is that we have a singular vendor-neutral interface via the
> > > > kernel using well understood mechanisms based on deployment experience (and
> > > > at least this part doesnt need retraining).
> > >
> > > A seemless p4 experience would be great. That looks like a tooling problem
> > > at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> > > I would aim for,
> > >
> > >   $ p4c-* myprog.p4
> > >   $ p4cRun ./myprog
> > >
> > > And maybe some options like,
> > >
> > >   $ p4cRun -i eth0 ./myprog
> >
> > Armchair lawyering and classical ML bikesheding
>
> It was just an example of what I think the end goal should be.
>
> >
> > > Then use the p4runtime to interface with the system. If you don't like the
> > > runtime then it should be brought up in that working group.
> > >
> > > >
> > > > 1) Supporting expressibility of the universe set of P4 progs
> > > >
> > > > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > > > verifier had to be worked around and even then there are cases where we couldnt
> > > > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > > > for us. Note, there are still challenges running all potential P4 programs at
> > > > the XDP level - the solution to that is to have the compiler generate XDP based
> > > > code only if it possible to map it to that layer.
> > >
> > > Examples and we can fix it.
> >
> > Right. Let me wait for you to fix something 5 years from now. I would
> > never have used eBPF at all but the kfunc is what changed my mind.
> >
> > > >
> > > > 2) Support for P4 HW and SW equivalence.
> > > >
> > > > This feature continues to work even in the presence of eBPF as the s/w
> > > > datapath. There are cases of square-hole-round-peg scenarios but
> > > > those are implementation issues we can live with.
> > >
> > > But no hw support.
> > >
> >
> > This patcheset has nothing to do with offload (you read the cover
> > letter?). All above is saying is that by virtue of using TC we have a
> > path to a proven offload approach.
>
> I'm arguing P4 is in a big part about programmable HW. If we merge
> a P4 into the kernel all the way down to the p4 types and don't
> consider how it works with hardware that is a non starter for me.
>
> >
> >
> > > >
> > > > 3) Operational usability
> > > >
> > > > By maintaining the TC control plane (even in presence of eBPF datapath)
> > > > runtime aspects remain unchanged. So for our target audience of folks
> > > > who have deployed tc including offloads - the comfort zone is unchanged.
> > > > There is also the comfort zone of continuing to use the true-and-tried netlink
> > > > interfacing.
> > >
> > > The P4 control plane should be P4Runtime.
> > >
> >
> > And be my guest and write it on top of netlink.
>
> But I would prefer it was a BPF map and gave my reasons above.
>
> >
> > cheers,
> > jamal
Jiri Pirko Nov. 20, 2023, 9:39 a.m. UTC | #5
Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>
>> Jamal Hadi Salim wrote:
>> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>> > >
>> > > Jamal Hadi Salim wrote:

[...]


>>
>> I think I'm judging the technical work here. Bullet points.
>>
>> 1. p4c-tc implementation looks like it should be slower than a
>>    in terms of pkts/sec than a bpf implementation. Meaning
>>    I suspect pipeline and objects laid out like this will lose
>>    to a BPF program with an parser and single lookup. The p4c-ebpf
>>    compiler should look to create optimized EBPF code not some
>>    emulated switch topology.
>>
>
>The parser is ebpf based. The other objects which require control
>plane interaction are not - those interact via netlink.
>We published perf data a while back - presented at the P4 workshop
>back in April (was in the cover letter)
>https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>But do note: the correct abstraction is the first priority.
>Optimization is something we can teach the compiler over time. But
>even with the minimalist code generation you can see that our approach
>always beats ebpf in LPM and ternary. The other ones I am pretty sure

Any idea why? Perhaps the existing eBPF maps are not that suitable for
these kinds of lookups? I mean, in theory, eBPF should always be faster.


>we can optimize over time.
>Your view of "single lookup" is true for simple programs but if you
>have 10 tables trying to model a 5G function then it doesnt make sense
>(and i think the data we published was clear that you gain no
>advantage using ebpf - as a matter of fact there was no perf
>difference between XDP and tc in such cases).
>
>> 2. p4c-tc control plan looks slower than a directly mmaped bpf
>>    map. Doing a simple update vs a netlink msg. The argument
>>    that BPF can't do CRUD (which we had offlist) seems incorrect
>>    to me. Correct me if I'm wrong with details about why.
>>
>
>So let me see....
>you want me to replace netlink and all its features and rewrite it
>using the ebpf system calls? Congestion control, event handling,
>arbitrary message crafting, etc and the years of work that went into
>netlink? NO to the HELL.

Wait, I don't think John suggests anything like that. He just suggests
having the tables as eBPF maps. Honestly, I don't understand the
fixation on netlink. Its socket messaging, memcpies, processing
overhead, etc. can't keep up with mmapped memory access at scale. Measure
that and I bet you'll get drastically different results.

I mean, netlink is good for a lot of things, but that does not mean it is a
universal answer to userspace<->kernel data passing.


>I should note: that there was an interesting talk at netdevconf 0x17
>where the speaker showed the challenges of dealing with ebpf on "day
>two" - slides or videos are not up yet, but link is:
>https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>The point the speaker was making is it's always easy to whip an ebpf
>program that can slice and dice packets and maybe even flush LEDs but
>the real work and challenge is in the control plane. I agree with the
>speaker based on my experiences. This discussion of replacing netlink
>with ebpf system calls is absolutely a non-starter. Let's just end the
>discussion and agree to disagree if you are going to keep insisting on
>that.


[...]
Jamal Hadi Salim Nov. 20, 2023, 2:23 p.m. UTC | #6
On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >>
> >> Jamal Hadi Salim wrote:
> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >> > >
> >> > > Jamal Hadi Salim wrote:
>
> [...]
>
>
> >>
> >> I think I'm judging the technical work here. Bullet points.
> >>
> >> 1. p4c-tc implementation looks like it should be slower than a
> >>    in terms of pkts/sec than a bpf implementation. Meaning
> >>    I suspect pipeline and objects laid out like this will lose
> >>    to a BPF program with an parser and single lookup. The p4c-ebpf
> >>    compiler should look to create optimized EBPF code not some
> >>    emulated switch topology.
> >>
> >
> >The parser is ebpf based. The other objects which require control
> >plane interaction are not - those interact via netlink.
> >We published perf data a while back - presented at the P4 workshop
> >back in April (was in the cover letter)
> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >But do note: the correct abstraction is the first priority.
> >Optimization is something we can teach the compiler over time. But
> >even with the minimalist code generation you can see that our approach
> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
>
> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> this kinds of lookups? I mean in theory, eBPF should be always faster.

We didn't look closely; however, that is not the point - the point is
that the perf difference, if there is one, is not big, with the big win
being proper P4 abstraction. For LPM, our algorithmic approach is for sure
better. For ternary, the compute intensity in looping is better done in
C. And for exact match, I believe ebpf uses better hashing.
Again, that is not the point we were trying to validate in those experiments.

On your point of "maps are not that suitable": P4 tables tend to have
very specific attributes (e.g. associated meters, counters,
default hit and miss actions, etc).
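
To give a feel for what "very specific attributes" means, a P4 table is
conceptually closer to the sketch below than to a bare key->value map.
All field names here are illustrative, not the structures used in this
series:

    /* Illustrative only: the kind of state a P4 table carries around
     * beyond the entries themselves. Not the actual P4TC structures. */
    struct p4_table_attrs {
            unsigned int max_entries;
            unsigned int key_size;
            unsigned int default_hit_act;    /* action run on a hit with no entry action */
            unsigned int default_miss_act;   /* action run on a table miss               */
            void        *direct_counters;    /* per-entry counter instances              */
            void        *direct_meters;      /* per-entry meter instances                */
            unsigned int permissions;        /* who may create/update/delete entries     */
    };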

> >we can optimize over time.
> >Your view of "single lookup" is true for simple programs but if you
> >have 10 tables trying to model a 5G function then it doesnt make sense
> >(and i think the data we published was clear that you gain no
> >advantage using ebpf - as a matter of fact there was no perf
> >difference between XDP and tc in such cases).
> >
> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >>    map. Doing a simple update vs a netlink msg. The argument
> >>    that BPF can't do CRUD (which we had offlist) seems incorrect
> >>    to me. Correct me if I'm wrong with details about why.
> >>
> >
> >So let me see....
> >you want me to replace netlink and all its features and rewrite it
> >using the ebpf system calls? Congestion control, event handling,
> >arbitrary message crafting, etc and the years of work that went into
> >netlink? NO to the HELL.
>
> Wait, I don't think John suggests anything like that. He just suggests
> to have the tables as eBPF maps.

What's the difference? Unless maps can do netlink.

> Honestly, I don't understand the
> fixation on netlink. Its socket messaging, memcpies, processing
> overhead, etc can't keep up with mmaped memory access at scale. Measure
> that and I bet you'll get drastically different results.
>
> I mean, netlink is good for a lot of things, but does not mean it is an
> universal answer to userspace<->kernel data passing.

Here's a small sample of our requirements that are satisfied by
netlink for the P4 object hierarchy[1]:
1. Msg construction/parsing
2. Multi-user request/response messaging
3. Multi-user event subscribe/publish messaging

I don't think I need to provide an explanation on the differences here
vis-a-vis what ebpf system calls provide vs what netlink provides and
how netlink is a clear fit. If it is not clear I can give more of a
breakdown. And of course there's more, but the above is a good sample.
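
To illustrate what item 3 buys us - the multicast publish/subscribe piece -
the generic netlink-socket pattern looks roughly like the sketch below. It
uses rtnetlink link notifications as a stand-in, since the actual P4TC
family, groups and attributes are defined by the patches themselves; treat
the specifics as illustrative only:

    /* Sketch of netlink event subscription (multi-user publish/subscribe).
     * Binds to rtnetlink link events as an example; a P4TC control plane
     * would bind to its own multicast group(s) instead. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
            char buf[8192];
            struct sockaddr_nl sa = {
                    .nl_family = AF_NETLINK,
                    .nl_groups = RTMGRP_LINK,   /* subscribe to link events */
            };
            int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

            if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                    return 1;

            for (;;) {
                    /* every subscriber bound to the group gets its own copy
                     * of each event the kernel publishes - no polling */
                    int len = recv(fd, buf, sizeof(buf), 0);
                    struct nlmsghdr *nh;

                    if (len <= 0)
                            break;
                    for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
                         nh = NLMSG_NEXT(nh, len))
                            if (nh->nlmsg_type == RTM_NEWLINK)
                                    printf("link event received\n");
            }
            close(fd);
            return 0;
    }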

The part that is taken for granted is the control plane code and
interaction, which is an extremely important detail. P4 abstraction
requires hierarchies with different compiler-generated encoded path
ids, etc. This ID mapping gets exacerbated by having multitudes of P4
programs which have different requirements. Netlink is a natural fit
for this P4 abstraction. Not to mention the netlink/tc path (and in
particular the ID mapping) provides a conduit for offload when that is
needed.
eBPF is just a tool - and the objects are intended to be generic - and
I don't see how any of this could be achieved without retooling to make
it more specific to P4.

cheers,
jamal



>
> >I should note: that there was an interesting talk at netdevconf 0x17
> >where the speaker showed the challenges of dealing with ebpf on "day
> >two" - slides or videos are not up yet, but link is:
> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >The point the speaker was making is it's always easy to whip an ebpf
> >program that can slice and dice packets and maybe even flush LEDs but
> >the real work and challenge is in the control plane. I agree with the
> >speaker based on my experiences. This discussion of replacing netlink
> >with ebpf system calls is absolutely a non-starter. Let's just end the
> >discussion and agree to disagree if you are going to keep insisting on
> >that.
>
>
> [...]
Jiri Pirko Nov. 20, 2023, 6:10 p.m. UTC | #7
Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>> >>
>> >> Jamal Hadi Salim wrote:
>> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>> >> > >
>> >> > > Jamal Hadi Salim wrote:
>>
>> [...]
>>
>>
>> >>
>> >> I think I'm judging the technical work here. Bullet points.
>> >>
>> >> 1. p4c-tc implementation looks like it should be slower than a
>> >>    in terms of pkts/sec than a bpf implementation. Meaning
>> >>    I suspect pipeline and objects laid out like this will lose
>> >>    to a BPF program with an parser and single lookup. The p4c-ebpf
>> >>    compiler should look to create optimized EBPF code not some
>> >>    emulated switch topology.
>> >>
>> >
>> >The parser is ebpf based. The other objects which require control
>> >plane interaction are not - those interact via netlink.
>> >We published perf data a while back - presented at the P4 workshop
>> >back in April (was in the cover letter)
>> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> >But do note: the correct abstraction is the first priority.
>> >Optimization is something we can teach the compiler over time. But
>> >even with the minimalist code generation you can see that our approach
>> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
>>
>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
>> this kinds of lookups? I mean in theory, eBPF should be always faster.
>
>We didnt look closely; however, that is not the point - the point is
>the perf difference if there is one, is not big with the big win being
>proper P4 abstraction. For LPM for sure our algorithmic approach is
>better. For ternary the compute intensity in looping is better done in
>C. And for exact i believe that ebpf uses better hashing.
>Again, that is not the point we were trying to validate in those experiments..
>
>On your point of "maps are not that suitable" P4 tables tend to have
>very specific attributes (examples associated meters, counters,
>default hit and miss actions, etc).
>
>> >we can optimize over time.
>> >Your view of "single lookup" is true for simple programs but if you
>> >have 10 tables trying to model a 5G function then it doesnt make sense
>> >(and i think the data we published was clear that you gain no
>> >advantage using ebpf - as a matter of fact there was no perf
>> >difference between XDP and tc in such cases).
>> >
>> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
>> >>    map. Doing a simple update vs a netlink msg. The argument
>> >>    that BPF can't do CRUD (which we had offlist) seems incorrect
>> >>    to me. Correct me if I'm wrong with details about why.
>> >>
>> >
>> >So let me see....
>> >you want me to replace netlink and all its features and rewrite it
>> >using the ebpf system calls? Congestion control, event handling,
>> >arbitrary message crafting, etc and the years of work that went into
>> >netlink? NO to the HELL.
>>
>> Wait, I don't think John suggests anything like that. He just suggests
>> to have the tables as eBPF maps.
>
>What's the difference? Unless maps can do netlink.
>
>> Honestly, I don't understand the
>> fixation on netlink. Its socket messaging, memcpies, processing
>> overhead, etc can't keep up with mmaped memory access at scale. Measure
>> that and I bet you'll get drastically different results.
>>
>> I mean, netlink is good for a lot of things, but does not mean it is an
>> universal answer to userspace<->kernel data passing.
>
>Here's a small sample of our requirements that are satisfied by
>netlink for P4 object hierarchy[1]:
>1. Msg construction/parsing
>2. Multi-user request/response messaging

What is actually a use case for having multiple users program the p4 pipeline
in parallel?

>3. Multi-user event subscribe/publish messaging

Same here. What is the usecase for multiple users receiving p4 events?


>
>I dont think i need to provide an explanation on the differences here
>visavis what ebpf system calls provide vs what netlink provides and
>how netlink is a clear fit. If it is not clear i can give more

It is not :/


>breakdown. And of course there's more but above is a good sample.
>
>The part that is taken for granted is the control plane code and
>interaction which is an extremely important detail. P4 Abstraction
>requires hierarchies with different compiler generated encoded path
>ids etc. This ID mapping gets exacerbated by having multitudes of  P4

Why does the actual eBPF map not serve the same purpose as an ID?
ID:map
1:1
?


>programs which have different requirements. Netlink is a natural fit
>for this P4 abstraction. Not to mention the netlink/tc path (and in
>particular the ID mapping) provides a conduit for offload when that is
>needed.
>eBPF is just a tool - and the objects are intended to be generic - and
>i dont see how any of this could be achieved without retooling to make
>it more specific to P4.
>
>cheers,
>jamal
>
>
>
>>
>> >I should note: that there was an interesting talk at netdevconf 0x17
>> >where the speaker showed the challenges of dealing with ebpf on "day
>> >two" - slides or videos are not up yet, but link is:
>> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>> >The point the speaker was making is it's always easy to whip an ebpf
>> >program that can slice and dice packets and maybe even flush LEDs but
>> >the real work and challenge is in the control plane. I agree with the
>> >speaker based on my experiences. This discussion of replacing netlink
>> >with ebpf system calls is absolutely a non-starter. Let's just end the
>> >discussion and agree to disagree if you are going to keep insisting on
>> >that.
>>
>>
>> [...]
Jamal Hadi Salim Nov. 20, 2023, 7:56 p.m. UTC | #8
On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >> >>
> >> >> Jamal Hadi Salim wrote:
> >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >> >> > >
> >> >> > > Jamal Hadi Salim wrote:
> >>
> >> [...]
> >>
> >>
> >> >>
> >> >> I think I'm judging the technical work here. Bullet points.
> >> >>
> >> >> 1. p4c-tc implementation looks like it should be slower than a
> >> >>    in terms of pkts/sec than a bpf implementation. Meaning
> >> >>    I suspect pipeline and objects laid out like this will lose
> >> >>    to a BPF program with an parser and single lookup. The p4c-ebpf
> >> >>    compiler should look to create optimized EBPF code not some
> >> >>    emulated switch topology.
> >> >>
> >> >
> >> >The parser is ebpf based. The other objects which require control
> >> >plane interaction are not - those interact via netlink.
> >> >We published perf data a while back - presented at the P4 workshop
> >> >back in April (was in the cover letter)
> >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >> >But do note: the correct abstraction is the first priority.
> >> >Optimization is something we can teach the compiler over time. But
> >> >even with the minimalist code generation you can see that our approach
> >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> >>
> >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> >
> >We didnt look closely; however, that is not the point - the point is
> >the perf difference if there is one, is not big with the big win being
> >proper P4 abstraction. For LPM for sure our algorithmic approach is
> >better. For ternary the compute intensity in looping is better done in
> >C. And for exact i believe that ebpf uses better hashing.
> >Again, that is not the point we were trying to validate in those experiments..
> >
> >On your point of "maps are not that suitable" P4 tables tend to have
> >very specific attributes (examples associated meters, counters,
> >default hit and miss actions, etc).
> >
> >> >we can optimize over time.
> >> >Your view of "single lookup" is true for simple programs but if you
> >> >have 10 tables trying to model a 5G function then it doesnt make sense
> >> >(and i think the data we published was clear that you gain no
> >> >advantage using ebpf - as a matter of fact there was no perf
> >> >difference between XDP and tc in such cases).
> >> >
> >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >> >>    map. Doing a simple update vs a netlink msg. The argument
> >> >>    that BPF can't do CRUD (which we had offlist) seems incorrect
> >> >>    to me. Correct me if I'm wrong with details about why.
> >> >>
> >> >
> >> >So let me see....
> >> >you want me to replace netlink and all its features and rewrite it
> >> >using the ebpf system calls? Congestion control, event handling,
> >> >arbitrary message crafting, etc and the years of work that went into
> >> >netlink? NO to the HELL.
> >>
> >> Wait, I don't think John suggests anything like that. He just suggests
> >> to have the tables as eBPF maps.
> >
> >What's the difference? Unless maps can do netlink.
> >
> >> Honestly, I don't understand the
> >> fixation on netlink. Its socket messaging, memcpies, processing
> >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> >> that and I bet you'll get drastically different results.
> >>
> >> I mean, netlink is good for a lot of things, but does not mean it is an
> >> universal answer to userspace<->kernel data passing.
> >
> >Here's a small sample of our requirements that are satisfied by
> >netlink for P4 object hierarchy[1]:
> >1. Msg construction/parsing
> >2. Multi-user request/response messaging
>
> What is actually a usecase for having multiple users program p4 pipeline
> in parallel?

First of all - this is Linux; multiple users are a way of life, and you
shouldn't have to ask that question unless you are trying to be
Socratic. Meaning multiple control plane apps can be allowed to
program different parts and even different tables - think of a multi-tier
pipeline.

> >3. Multi-user event subscribe/publish messaging
>
> Same here. What is the usecase for multiple users receiving p4 events?

Same thing.
Note: Events are really not part of P4 but we added them for
flexibility - and as you well know they are useful.

>
> >
> >I dont think i need to provide an explanation on the differences here
> >visavis what ebpf system calls provide vs what netlink provides and
> >how netlink is a clear fit. If it is not clear i can give more
>
> It is not :/

I thought it was obvious for someone like you, but fine - here goes for those 3:

1. Msg construction/parsing: a lot of infra for sending attributes
back and forth is already built into netlink. I would have to create
mine from scratch for ebpf. This includes not just the
construction/parsing but all the detailed attribute content policy
validation (even in the presence of hierarchies) that comes with it
(see the sketch after this list), not to mention the state transform
between kernel and user space.

2. Multi-user request/response messaging
If you can write all the code for #1 above then this should work fine for ebpf.

3. Event publish/subscribe
You would have to create mechanisms for ebpf which are either non-trivial
or incomplete. Example 1: you can put hooks in the ebpf code to watch
map manipulations and then interface them to some event management
scheme which checks for subscribed users. Example 2: it may also be
feasible to create your own map for subscription vs. something like a
perf ring for event publication (something I have done in the past),
but that is also limited in many ways.
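
For what it's worth, here is a minimal, purely illustrative sketch of the
nested-policy validation referenced in point 1 above (the attribute names
are made up for illustration, they are not from the patches): one
nla_parse_nested() call type-checks the whole hierarchy and reports
precise errors via extack.

#include <net/netlink.h>

enum { HYP_ENT_UNSPEC, HYP_ENT_PRIO, HYP_ENT_KEY, __HYP_ENT_MAX };
#define HYP_ENT_MAX (__HYP_ENT_MAX - 1)

enum { HYP_TBL_UNSPEC, HYP_TBL_PIPEID, HYP_TBL_TBLID, HYP_TBL_ENTRY,
       __HYP_TBL_MAX };
#define HYP_TBL_MAX (__HYP_TBL_MAX - 1)

static const struct nla_policy hyp_ent_policy[HYP_ENT_MAX + 1] = {
	[HYP_ENT_PRIO] = { .type = NLA_U32 },
	[HYP_ENT_KEY]  = { .type = NLA_BINARY, .len = 64 },
};

static const struct nla_policy hyp_tbl_policy[HYP_TBL_MAX + 1] = {
	[HYP_TBL_PIPEID] = { .type = NLA_U32 },
	[HYP_TBL_TBLID]  = { .type = NLA_U32 },
	[HYP_TBL_ENTRY]  = NLA_POLICY_NESTED(hyp_ent_policy),
};

static int hyp_parse_tbl(struct nlattr *opt, struct netlink_ext_ack *extack)
{
	struct nlattr *tb[HYP_TBL_MAX + 1];
	int err;

	/* validates types, lengths and the nested hierarchy in one call */
	err = nla_parse_nested(tb, HYP_TBL_MAX, opt, hyp_tbl_policy, extack);
	if (err < 0)
		return err;

	/* tb[] now holds validated attributes for this level */
	return 0;
}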

>
> >breakdown. And of course there's more but above is a good sample.
> >
> >The part that is taken for granted is the control plane code and
> >interaction which is an extremely important detail. P4 Abstraction
> >requires hierarchies with different compiler generated encoded path
> >ids etc. This ID mapping gets exacerbated by having multitudes of  P4
>
> Why the actual eBFP mapping does not serve the same purpose as ID?
> ID:mapping 1 :1

Identifying an object requires hierarchical IDs: a pipeline/program ID,
a table ID, a table entry identification, an action identification and,
for each individual action content parameter, an ID, etc. These same
IDs would be what hardware would recognize as well (in the case of
offload). Given the dynamic nature of these IDs it is essentially up to
the compiler to define them. These hierarchies are much easier to
validate in netlink.
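
To make that concrete, a purely illustrative sketch (these are not the
structures from the patches) of the kind of ID tuple that names one
action parameter of one table entry:

#include <linux/types.h>

struct hyp_p4_param_path {
	u32 pipeid;	/* which P4 program/pipeline	     */
	u32 tblid;	/* which table within that pipeline  */
	u32 entryid;	/* which entry within that table     */
	u32 actid;	/* which action bound to that entry  */
	u32 paramid;	/* which parameter of that action    */
};

Every level of such a tuple is compiler-assigned, and an offload driver
would key on the same tuple.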

We don't want to be constrained to a generic infra like eBPF for these
objects. Again, eBPF is a means to an end (and not the goal here!).

cheers,
jamal
>
>
> >programs which have different requirements. Netlink is a natural fit
> >for this P4 abstraction. Not to mention the netlink/tc path (and in
> >particular the ID mapping) provides a conduit for offload when that is
> >needed.
> >eBPF is just a tool - and the objects are intended to be generic - and
> >i dont see how any of this could be achieved without retooling to make
> >it more specific to P4.
> >
> >cheers,
> >jamal
> >
> >
> >
> >>
> >> >I should note: that there was an interesting talk at netdevconf 0x17
> >> >where the speaker showed the challenges of dealing with ebpf on "day
> >> >two" - slides or videos are not up yet, but link is:
> >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >> >The point the speaker was making is it's always easy to whip an ebpf
> >> >program that can slice and dice packets and maybe even flush LEDs but
> >> >the real work and challenge is in the control plane. I agree with the
> >> >speaker based on my experiences. This discussion of replacing netlink
> >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> >> >discussion and agree to disagree if you are going to keep insisting on
> >> >that.
> >>
> >>
> >> [...]
John Fastabend Nov. 20, 2023, 8:41 p.m. UTC | #9
Jamal Hadi Salim wrote:
> On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >
> > Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> > >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> > >>
> > >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> > >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >> >>
> > >> >> Jamal Hadi Salim wrote:
> > >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Jamal Hadi Salim wrote:
> > >>
> > >> [...]
> > >>
> > >>
> > >> >>
> > >> >> I think I'm judging the technical work here. Bullet points.
> > >> >>
> > >> >> 1. p4c-tc implementation looks like it should be slower than a
> > >> >>    in terms of pkts/sec than a bpf implementation. Meaning
> > >> >>    I suspect pipeline and objects laid out like this will lose
> > >> >>    to a BPF program with an parser and single lookup. The p4c-ebpf
> > >> >>    compiler should look to create optimized EBPF code not some
> > >> >>    emulated switch topology.
> > >> >>
> > >> >
> > >> >The parser is ebpf based. The other objects which require control
> > >> >plane interaction are not - those interact via netlink.
> > >> >We published perf data a while back - presented at the P4 workshop
> > >> >back in April (was in the cover letter)
> > >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > >> >But do note: the correct abstraction is the first priority.
> > >> >Optimization is something we can teach the compiler over time. But
> > >> >even with the minimalist code generation you can see that our approach
> > >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> > >>
> > >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> > >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> > >
> > >We didnt look closely; however, that is not the point - the point is
> > >the perf difference if there is one, is not big with the big win being
> > >proper P4 abstraction. For LPM for sure our algorithmic approach is
> > >better. For ternary the compute intensity in looping is better done in
> > >C. And for exact i believe that ebpf uses better hashing.
> > >Again, that is not the point we were trying to validate in those experiments..

If you compared your implementation to the bpf lpm_trie it's a bit
misleading. The data structure is an rhashtable vs. a trie doing LPM.
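
For reference, a minimal sketch of how the bpf lpm_trie is typically used
from a BPF program (IPv4, map and function names here are hypothetical):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct lpm_v4_key {
	__u32 prefixlen;	/* significant bits in addr */
	__u32 addr;		/* IPv4 address, network byte order */
};

struct {
	__uint(type, BPF_MAP_TYPE_LPM_TRIE);
	__uint(max_entries, 1024);
	__type(key, struct lpm_v4_key);
	__type(value, __u32);		/* e.g. next-hop id */
	__uint(map_flags, BPF_F_NO_PREALLOC);
} fib_lpm SEC(".maps");

static __always_inline __u32 lookup_nh(__u32 daddr)
{
	struct lpm_v4_key key = {
		.prefixlen = 32,	/* lookups use the full key */
		.addr = daddr,
	};
	__u32 *nh = bpf_map_lookup_elem(&fib_lpm, &key);

	return nh ? *nh : 0;
}

The trie does the longest-prefix walk internally; the lookup key always
carries the full prefix length.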

Also I can't see how __p4tc_table_entry_lookup() is going to scale.
That looks like a bucket per key? If so, that won't scale well with
thousands of entries and lots of duplicate masks. I did a quick scan
of the code, but it would be nice to detail the algorithm in the commit
msg so we can dissect it.

This doesn't look like what we would want for an LPM, unless
I've taken this out of context.

+static struct p4tc_table_entry *
+__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
+       __must_hold(RCU)
+{
+       struct p4tc_table_entry *entry = NULL;
+       struct rhlist_head *tmp, *bucket_list;
+       struct p4tc_table_entry *entry_curr;
+       u32 smallest_prio = U32_MAX;
+
+       bucket_list =
+               rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+       if (!bucket_list)
+               return NULL;
+
+       rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
+               struct p4tc_table_entry_value *value =
+                       p4tc_table_entry_value(entry_curr);
+               if (value->prio <= smallest_prio) {
+                       smallest_prio = value->prio;
+                       entry = entry_curr;
+               }
+       }
+
+       return entry;
+}


Also, I don't see why 'better done in C' matters; the TCAM data structure
can be written in C and used as a BPF map. At least that is how we would
normally approach it from the BPF side.

> > >
> > >On your point of "maps are not that suitable" P4 tables tend to have
> > >very specific attributes (examples associated meters, counters,
> > >default hit and miss actions, etc).

The typical way we handle this from BPF is to either use the 0 entry
for stats, annotations, etc., or create a blob of memory (another map,
variables, a global struct, ...) and stash the info there. If we care
about performance we make those per-CPU and deal with it in
userland.
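
A minimal sketch of that pattern, with hypothetical names (a one-slot
per-CPU array holding a table's hit/miss counters):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct hyp_tbl_stats {
	__u64 hits;
	__u64 misses;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);		/* slot 0 = this table's stats */
	__type(key, __u32);
	__type(value, struct hyp_tbl_stats);
} hyp_tbl0_stats SEC(".maps");

static __always_inline void hyp_count(int hit)
{
	__u32 zero = 0;
	struct hyp_tbl_stats *st;

	st = bpf_map_lookup_elem(&hyp_tbl0_stats, &zero);
	if (!st)
		return;
	if (hit)
		st->hits++;	/* per-CPU, plain increments are fine */
	else
		st->misses++;
}

User space then sums the per-CPU copies it gets back from a lookup on
the map fd.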

> > >
> > >> >we can optimize over time.
> > >> >Your view of "single lookup" is true for simple programs but if you
> > >> >have 10 tables trying to model a 5G function then it doesnt make sense
> > >> >(and i think the data we published was clear that you gain no
> > >> >advantage using ebpf - as a matter of fact there was no perf
> > >> >difference between XDP and tc in such cases).
> > >> >
> > >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> > >> >>    map. Doing a simple update vs a netlink msg. The argument
> > >> >>    that BPF can't do CRUD (which we had offlist) seems incorrect
> > >> >>    to me. Correct me if I'm wrong with details about why.
> > >> >>
> > >> >
> > >> >So let me see....
> > >> >you want me to replace netlink and all its features and rewrite it
> > >> >using the ebpf system calls? Congestion control, event handling,
> > >> >arbitrary message crafting, etc and the years of work that went into
> > >> >netlink? NO to the HELL.
> > >>
> > >> Wait, I don't think John suggests anything like that. He just suggests
> > >> to have the tables as eBPF maps.
> > >
> > >What's the difference? Unless maps can do netlink.
> > >

I'm going to argue that map update time matters and we should use the fastest
updates possible. If it complicates the user space side somewhat, I would
prefer that to slow updates. I don't think you can get much faster than an
mmaped block of memory, and even syscall updates are probably faster than
netlink msgs.
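
A rough sketch of that update path, with hypothetical names (a plain
ARRAY map created with BPF_F_MMAPABLE and mapped into the control
plane, so entries are updated with plain stores):

#include <stddef.h>
#include <sys/mman.h>
#include <bpf/bpf.h>

int make_mmaped_table(unsigned int nentries, unsigned int value_sz,
		      void **entries)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
	size_t sz = (size_t)nentries * value_sz;
	void *mem;
	int fd;

	fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "hyp_tbl",
			    sizeof(unsigned int), value_sz, nentries, &opts);
	if (fd < 0)
		return fd;

	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		return -1;

	*entries = mem;	/* stores here are visible to the BPF program */
	return fd;
}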

> > >> Honestly, I don't understand the
> > >> fixation on netlink. Its socket messaging, memcpies, processing
> > >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> > >> that and I bet you'll get drastically different results.
> > >>
> > >> I mean, netlink is good for a lot of things, but does not mean it is an
> > >> universal answer to userspace<->kernel data passing.
> > >
> > >Here's a small sample of our requirements that are satisfied by
> > >netlink for P4 object hierarchy[1]:
> > >1. Msg construction/parsing
> > >2. Multi-user request/response messaging
> >
> > What is actually a usecase for having multiple users program p4 pipeline
> > in parallel?
> 
> First of all - this is Linux, multiple users is a way of life, you
> shouldnt have to ask that question unless you are trying to be
> socratic. Meaning multiple control plane apps can be allowed to
> program different parts and even different tables - think multi-tier
> pipeline.

Linux has always been opinionated and rejects code all the time because
it's not the "right" way. I've been on the reject-your-stuff side before.

Partitioning ownership of the pipeline is different from multiple
users of the same elements. From the BPF side (to show it's doable) this is
done by pinning maps to files and giving that file to different
programs. The DDOS thing can own the DDOS map and the router can own
its router tables. BPF handles this mostly using the filesystem.
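
A minimal sketch of that partitioning with libbpf's pinning calls (the
paths and names are hypothetical): the loader pins each table map once
and each control-plane app is handed only the path(s) it owns.

#include <bpf/bpf.h>

int pin_tables(int ddos_map_fd, int router_map_fd)
{
	if (bpf_obj_pin(ddos_map_fd, "/sys/fs/bpf/p4/ddos_tbl"))
		return -1;
	if (bpf_obj_pin(router_map_fd, "/sys/fs/bpf/p4/router_tbl"))
		return -1;
	return 0;
}

/* e.g. the routing daemon only ever opens its own table */
int open_router_table(void)
{
	return bpf_obj_get("/sys/fs/bpf/p4/router_tbl");
}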

> 
> > >3. Multi-user event subscribe/publish messaging
> >
> > Same here. What is the usecase for multiple users receiving p4 events?
> 
> Same thing.
> Note: Events are really not part of P4 but we added them for
> flexibility - and as you well know they are useful.

Per the above, I wouldn't sacrifice update perf for this. Also it's doable
from userspace if you need to. Another thing I've come to dislike a bit
is teaching the kernel a specific DSL. P4 is my favorite, but still,
going so far as to encode a specific P4 spec into the kernel seems
unnecessary. Also, will we now have to have kernel X support P4.16 and
kernel X+N support P4.18? It seems like a pain.

> 
> >
> > >
> > >I dont think i need to provide an explanation on the differences here
> > >visavis what ebpf system calls provide vs what netlink provides and
> > >how netlink is a clear fit. If it is not clear i can give more
> >
> > It is not :/
> 
> I thought it was obvious for someone like you, but fine - here goes for those 3:
> 
> 1. Msg construction/parsing: A lot of infra for sending attributes
> back and forth is already built into netlink. I would have to create
> mine from scratch for ebpf.  This will include not just the
> construction/parsing but all the detailed attribute content policy
> validations(even in the presence of hierarchies) that comes with it.
> And not to forget the state transform between kernel and user space.

But the series here does that as well; you could probably reuse that on
top of BPF. We have lots of libraries to deal with ebpf to help.
I don't see anything problematic here for BPF.

> 
> 2. Multi-user request/response messaging
> If you can write all the code for #1 above then this should work fine for ebpf
> 
> 3. Event publish subscribe
> You would have to create mechanisms for ebpf which either are non
> trivial or non complete: Example 1: you can put surgeries in the ebpf
> code to look at map manipulations and then interface it to some event
> management scheme which checks for subscribed users. Example 2: It may
> also be feasible to create your own map for subscription vs something
> like perf ring for event publication(something i have done in the
> past), but that is also limited in many ways.

I would just push them out over a single perf ring and build the
subscription on top of gRPC (pick your protocol of choice).
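
A rough sketch of the consumer side of that model, using a BPF ring
buffer (the newer alternative to the perf ring) via libbpf; the event
layout and names here are hypothetical:

#include <linux/types.h>
#include <bpf/libbpf.h>

struct hyp_tbl_event {
	__u32 pipeid;
	__u32 tblid;
	__u32 op;	/* add/delete/update */
};

static int on_event(void *ctx, void *data, size_t len)
{
	struct hyp_tbl_event *ev = data;

	/* fan out ev to subscribed clients (gRPC or whatever) here */
	(void)ev;
	return 0;
}

int consume_events(int ringbuf_map_fd)
{
	struct ring_buffer *rb;

	rb = ring_buffer__new(ringbuf_map_fd, on_event, NULL, NULL);
	if (!rb)
		return -1;

	while (ring_buffer__poll(rb, -1 /* block */) >= 0)
		;	/* on_event() runs once per record */

	ring_buffer__free(rb);
	return 0;
}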

> 
> >
> > >breakdown. And of course there's more but above is a good sample.
> > >
> > >The part that is taken for granted is the control plane code and
> > >interaction which is an extremely important detail. P4 Abstraction
> > >requires hierarchies with different compiler generated encoded path
> > >ids etc. This ID mapping gets exacerbated by having multitudes of  P4
> >
> > Why the actual eBFP mapping does not serve the same purpose as ID?
> > ID:mapping 1 :1
> 
> An identification of an object requires hierarchical IDs: A
> pipeline/program ID, A table id, a table entry Identification, an
> action identification and for each individual action content
> parameter, an ID etc. These same IDs would be what hardware would
> recognize as well (in case of offload).  Given the dynamic nature of
> these IDs it is essentially up to the compiler to define them. These
> hierarchies  are much easier to validate in netlink.

I'm on board for offloads, but this series says no offloads and we
have no one with hardware in Linux for offloads yet. If we have a
series with a P4 driver and a NIC I can get my hands on, then we have
an entirely different conversation.

None of the above is a problem in eBPF. It's just mapping IDs around.

> 
> We dont want to be constrained to a generic infra like eBPF for these
> objects. Again eBPF is a means to an end (and not the goal here!).

I don't see any constraints from eBPF above, just a list of things
that of course you would have to code up. And none of that is something
that doesn't already exist in other projects.

> 
> cheers,
> jamal
> >
> >
> > >programs which have different requirements. Netlink is a natural fit
> > >for this P4 abstraction. Not to mention the netlink/tc path (and in
> > >particular the ID mapping) provides a conduit for offload when that is
> > >needed.
> > >eBPF is just a tool - and the objects are intended to be generic - and
> > >i dont see how any of this could be achieved without retooling to make
> > >it more specific to P4.
> > >
> > >cheers,
> > >jamal
> > >
> > >
> > >
> > >>
> > >> >I should note: that there was an interesting talk at netdevconf 0x17
> > >> >where the speaker showed the challenges of dealing with ebpf on "day
> > >> >two" - slides or videos are not up yet, but link is:
> > >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> > >> >The point the speaker was making is it's always easy to whip an ebpf
> > >> >program that can slice and dice packets and maybe even flush LEDs but
> > >> >the real work and challenge is in the control plane. I agree with the
> > >> >speaker based on my experiences. This discussion of replacing netlink
> > >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> > >> >discussion and agree to disagree if you are going to keep insisting on
> > >> >that.
> > >>
> > >>
> > >> [...]
Daniel Borkmann Nov. 20, 2023, 9:48 p.m. UTC | #10
On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>>> On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>>>>> On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>> Jamal Hadi Salim wrote:
>>>>>>> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>>>> Jamal Hadi Salim wrote:
>>>>
>>>> [...]
>>>>
>>>>>> I think I'm judging the technical work here. Bullet points.
>>>>>>
>>>>>> 1. p4c-tc implementation looks like it should be slower than a
>>>>>>     in terms of pkts/sec than a bpf implementation. Meaning
>>>>>>     I suspect pipeline and objects laid out like this will lose
>>>>>>     to a BPF program with an parser and single lookup. The p4c-ebpf
>>>>>>     compiler should look to create optimized EBPF code not some
>>>>>>     emulated switch topology.
>>>>>
>>>>> The parser is ebpf based. The other objects which require control
>>>>> plane interaction are not - those interact via netlink.
>>>>> We published perf data a while back - presented at the P4 workshop
>>>>> back in April (was in the cover letter)
>>>>> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>>>>> But do note: the correct abstraction is the first priority.
>>>>> Optimization is something we can teach the compiler over time. But
>>>>> even with the minimalist code generation you can see that our approach
>>>>> always beats ebpf in LPM and ternary. The other ones I am pretty sure
>>>>
>>>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
>>>> this kinds of lookups? I mean in theory, eBPF should be always faster.
>>>
>>> We didnt look closely; however, that is not the point - the point is
>>> the perf difference if there is one, is not big with the big win being
>>> proper P4 abstraction. For LPM for sure our algorithmic approach is
>>> better. For ternary the compute intensity in looping is better done in
>>> C. And for exact i believe that ebpf uses better hashing.
>>> Again, that is not the point we were trying to validate in those experiments..
>>>
>>> On your point of "maps are not that suitable" P4 tables tend to have
>>> very specific attributes (examples associated meters, counters,
>>> default hit and miss actions, etc).
>>>
>>>>> we can optimize over time.
>>>>> Your view of "single lookup" is true for simple programs but if you
>>>>> have 10 tables trying to model a 5G function then it doesnt make sense
>>>>> (and i think the data we published was clear that you gain no
>>>>> advantage using ebpf - as a matter of fact there was no perf
>>>>> difference between XDP and tc in such cases).
>>>>>
>>>>>> 2. p4c-tc control plan looks slower than a directly mmaped bpf
>>>>>>     map. Doing a simple update vs a netlink msg. The argument
>>>>>>     that BPF can't do CRUD (which we had offlist) seems incorrect
>>>>>>     to me. Correct me if I'm wrong with details about why.
>>>>>
>>>>> So let me see....
>>>>> you want me to replace netlink and all its features and rewrite it
>>>>> using the ebpf system calls? Congestion control, event handling,
>>>>> arbitrary message crafting, etc and the years of work that went into
>>>>> netlink? NO to the HELL.
>>>>
>>>> Wait, I don't think John suggests anything like that. He just suggests
>>>> to have the tables as eBPF maps.
>>>
>>> What's the difference? Unless maps can do netlink.
>>>
>>>> Honestly, I don't understand the
>>>> fixation on netlink. Its socket messaging, memcpies, processing
>>>> overhead, etc can't keep up with mmaped memory access at scale. Measure
>>>> that and I bet you'll get drastically different results.
>>>>
>>>> I mean, netlink is good for a lot of things, but does not mean it is an
>>>> universal answer to userspace<->kernel data passing.
>>>
>>> Here's a small sample of our requirements that are satisfied by
>>> netlink for P4 object hierarchy[1]:
>>> 1. Msg construction/parsing
>>> 2. Multi-user request/response messaging
>>
>> What is actually a usecase for having multiple users program p4 pipeline
>> in parallel?
> 
> First of all - this is Linux, multiple users is a way of life, you
> shouldnt have to ask that question unless you are trying to be
> socratic. Meaning multiple control plane apps can be allowed to
> program different parts and even different tables - think multi-tier
> pipeline.
> 
>>> 3. Multi-user event subscribe/publish messaging
>>
>> Same here. What is the usecase for multiple users receiving p4 events?
> 
> Same thing.
> Note: Events are really not part of P4 but we added them for
> flexibility - and as you well know they are useful.
> 
>>> I dont think i need to provide an explanation on the differences here
>>> visavis what ebpf system calls provide vs what netlink provides and
>>> how netlink is a clear fit. If it is not clear i can give more
>>
>> It is not :/
> 
> I thought it was obvious for someone like you, but fine - here goes for those 3:
> 
> 1. Msg construction/parsing: A lot of infra for sending attributes
> back and forth is already built into netlink. I would have to create
> mine from scratch for ebpf.  This will include not just the
> construction/parsing but all the detailed attribute content policy
> validations(even in the presence of hierarchies) that comes with it.
> And not to forget the state transform between kernel and user space.
> 
> 2. Multi-user request/response messaging
> If you can write all the code for #1 above then this should work fine for ebpf
> 
> 3. Event publish subscribe
> You would have to create mechanisms for ebpf which either are non
> trivial or non complete: Example 1: you can put surgeries in the ebpf
> code to look at map manipulations and then interface it to some event
> management scheme which checks for subscribed users. Example 2: It may
> also be feasible to create your own map for subscription vs something
> like perf ring for event publication(something i have done in the
> past), but that is also limited in many ways.

I still don't think this answers all the questions on why the netlink
shim layer. The kfuncs are essentially available to all of tc BPF and
I don't think there was a discussion on why they cannot be made generic
in a way that would benefit all tc/XDP BPF users. With patch
14 you are more or less copying what already exists with {cls,act}_bpf,
except that you also allow XDP loading from tc(?). We do have existing
interfaces for XDP program management.

tc BPF and XDP already have widely used infrastructure and can be developed
against libbpf or other user space libraries for a user space control plane.
With 'control plane' you refer here to the tc / netlink shim you've built,
but looking at the tc command line examples, this doesn't really provide a
good user experience (you call it p4 but people load bpf obj files). If the
expectation is that an operator should run tc commands, then it's a nice
experience neither for p4 nor for BPF folks. From a BPF PoV, we moved over
to bpf_mprog and plan to also extend this for XDP to have a common look and
feel wrt networking for developers. Why can't this be reused?

I don't quite follow why most of this could not be implemented entirely in
user space without this detour, with you providing a developer
library which could then be integrated into a p4 runtime/frontend. This
way users never interface with the ebpf parts or tc, given they also shouldn't
have to - it's an implementation detail. This is what John was also pointing
out earlier.

If you need a notification/subscribe mechanism for map updates, then this
could be extended - the same way BPF internals got extended along with the
sched_ext work, making the core pieces more useful also outside of the latter.

The link to the slides below is not public, so it's hard to see what is really
meant here, but I have also never seen an email from the speaker on the BPF
mailing list providing concrete feedback(?). People do build control planes
around BPF in the wild. I'm not sure where you get 'flush LEDs' from; to
me this all sounds rather hand-wavy and like trying to brute-force the
fixation on netlink you went with, which is raising questions. I don't think
there was an objection to going with eBPF, but rather to all this infra for
the former for a SW-only extension.

[...]
>>>>> I should note: that there was an interesting talk at netdevconf 0x17
>>>>> where the speaker showed the challenges of dealing with ebpf on "day
>>>>> two" - slides or videos are not up yet, but link is:
>>>>> https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>>>>> The point the speaker was making is it's always easy to whip an ebpf
>>>>> program that can slice and dice packets and maybe even flush LEDs but
>>>>> the real work and challenge is in the control plane. I agree with the
>>>>> speaker based on my experiences. This discussion of replacing netlink
>>>>> with ebpf system calls is absolutely a non-starter. Let's just end the
>>>>> discussion and agree to disagree if you are going to keep insisting on
>>>>> that.
Jamal Hadi Salim Nov. 20, 2023, 10:13 p.m. UTC | #11
On Mon, Nov 20, 2023 at 3:41 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > >
> > > Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> > > >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> > > >>
> > > >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> > > >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >> >>
> > > >> >> Jamal Hadi Salim wrote:
> > > >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >> >> > >
> > > >> >> > > Jamal Hadi Salim wrote:
> > > >>
> > > >> [...]
> > > >>
> > > >>
> > > >> >>
> > > >> >> I think I'm judging the technical work here. Bullet points.
> > > >> >>
> > > >> >> 1. p4c-tc implementation looks like it should be slower than a
> > > >> >>    in terms of pkts/sec than a bpf implementation. Meaning
> > > >> >>    I suspect pipeline and objects laid out like this will lose
> > > >> >>    to a BPF program with an parser and single lookup. The p4c-ebpf
> > > >> >>    compiler should look to create optimized EBPF code not some
> > > >> >>    emulated switch topology.
> > > >> >>
> > > >> >
> > > >> >The parser is ebpf based. The other objects which require control
> > > >> >plane interaction are not - those interact via netlink.
> > > >> >We published perf data a while back - presented at the P4 workshop
> > > >> >back in April (was in the cover letter)
> > > >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > > >> >But do note: the correct abstraction is the first priority.
> > > >> >Optimization is something we can teach the compiler over time. But
> > > >> >even with the minimalist code generation you can see that our approach
> > > >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> > > >>
> > > >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> > > >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> > > >
> > > >We didnt look closely; however, that is not the point - the point is
> > > >the perf difference if there is one, is not big with the big win being
> > > >proper P4 abstraction. For LPM for sure our algorithmic approach is
> > > >better. For ternary the compute intensity in looping is better done in
> > > >C. And for exact i believe that ebpf uses better hashing.
> > > >Again, that is not the point we were trying to validate in those experiments..
>
> If you compared your implementation to the bpf lpm_trie its a bit
> misleading. The data structure is a rhashtable vs a Trie doing LPM.
>
> Also I can't see how __p4tc_table_entry_lookup() is going to scale?
> That looks like a bucket per key? If so that wont scale well with
> 1000's of entries and lots of duplicate masks.

I think you are misreading the code - there are no duplicate masks;
IIUC, by scale you mean lookup performance, and the numbers we got
show very different results (the more entries and masks, the better
the numbers we showed).
Again - I don't want to make this a topic. Whether we beat you or you
beat us in numbers is not relevant to begin with.

>I did a quick scan
> of code, but would be nice to detail the algorithm in the commit
> msg so we can disect it.
>
> This doesn't look what we would want though for an LPM unless
> I've dropped this out of context.
>
> +static struct p4tc_table_entry *
> +__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
> +       __must_hold(RCU)
> +{
> +       struct p4tc_table_entry *entry = NULL;
> +       struct rhlist_head *tmp, *bucket_list;
> +       struct p4tc_table_entry *entry_curr;
> +       u32 smallest_prio = U32_MAX;
> +
> +       bucket_list =
> +               rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
> +       if (!bucket_list)
> +               return NULL;
> +
> +       rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
> +               struct p4tc_table_entry_value *value =
> +                       p4tc_table_entry_value(entry_curr);
> +               if (value->prio <= smallest_prio) {
> +                       smallest_prio = value->prio;
> +                       entry = entry_curr;
> +               }
> +       }
> +
> +       return entry;
> +}

You are quoting the ternary (not LPM) matching code. It iterates all
entries (we could only do ~190 when we tested in plain ebpf, that's why
our test was restricted to that number).

> Also I don't know what 'better done in C' matters the TCAM data structure
> can be written in C and used as a BPF map. At least that is how we would
> normally approach it from BPF side.

See the code you quoted - you have to loop and pick the best of N
matches, where N could be arbitrarily large.

> > > >
> > > >On your point of "maps are not that suitable" P4 tables tend to have
> > > >very specific attributes (examples associated meters, counters,
> > > >default hit and miss actions, etc).
>
> The typical way we handle this from BPF is to either use the 0 entry
> for stats, annotations, etc. or create a blob of memory (another map,
> variables, global struct, ...) and stash the info there. If we care
> about performance we make those per cpu and deal with it in user
> land.
>

This goes back to the abstraction overhead in user space being high. The
whole point is to minimize all that.

> > > >
> > > >> >we can optimize over time.
> > > >> >Your view of "single lookup" is true for simple programs but if you
> > > >> >have 10 tables trying to model a 5G function then it doesnt make sense
> > > >> >(and i think the data we published was clear that you gain no
> > > >> >advantage using ebpf - as a matter of fact there was no perf
> > > >> >difference between XDP and tc in such cases).
> > > >> >
> > > >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> > > >> >>    map. Doing a simple update vs a netlink msg. The argument
> > > >> >>    that BPF can't do CRUD (which we had offlist) seems incorrect
> > > >> >>    to me. Correct me if I'm wrong with details about why.
> > > >> >>
> > > >> >
> > > >> >So let me see....
> > > >> >you want me to replace netlink and all its features and rewrite it
> > > >> >using the ebpf system calls? Congestion control, event handling,
> > > >> >arbitrary message crafting, etc and the years of work that went into
> > > >> >netlink? NO to the HELL.
> > > >>
> > > >> Wait, I don't think John suggests anything like that. He just suggests
> > > >> to have the tables as eBPF maps.
> > > >
> > > >What's the difference? Unless maps can do netlink.
> > > >
>
> I'm going to argue map update time matters and we should use the fastest
> updates possible. If it complicates user space side some I would prefer
> that to slow updates. I don't think you can get much faster than a
> mmaped block of memory. Or even syscall updates are probably faster than
> netlink msgs.

So let's put this to rest:
It's about the P4 abstraction first (as I mentioned earlier) - I am
sure mmaping would be faster, but that is secondary - correct
abstraction first.
I am OK with some level of abstraction wrangling (for example, match-action
in P4 to match-value in ebpf) but there is a limit.

> > > >> Honestly, I don't understand the
> > > >> fixation on netlink. Its socket messaging, memcpies, processing
> > > >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> > > >> that and I bet you'll get drastically different results.
> > > >>
> > > >> I mean, netlink is good for a lot of things, but does not mean it is an
> > > >> universal answer to userspace<->kernel data passing.
> > > >
> > > >Here's a small sample of our requirements that are satisfied by
> > > >netlink for P4 object hierarchy[1]:
> > > >1. Msg construction/parsing
> > > >2. Multi-user request/response messaging
> > >
> > > What is actually a usecase for having multiple users program p4 pipeline
> > > in parallel?
> >
> > First of all - this is Linux, multiple users is a way of life, you
> > shouldnt have to ask that question unless you are trying to be
> > socratic. Meaning multiple control plane apps can be allowed to
> > program different parts and even different tables - think multi-tier
> > pipeline.
>
> Linux is always been opinionated and rejects code all the time because
> its not the "right" way. I've been on the reject your stuff side before.
>
> Partitioning ownershiip of the pipeline is different than multiple
> users of the same elements. From BPF side (to show its doable) is
> done by pinning maps to files and giving that file to different
> programs. The DDOS thing can own the DDOS map and the router can own
> its router tables. BPF handles this using the file systems mostly.
>

And with tc it just fits right in without any of those tricks...

> >
> > > >3. Multi-user event subscribe/publish messaging
> > >
> > > Same here. What is the usecase for multiple users receiving p4 events?
> >
> > Same thing.
> > Note: Events are really not part of P4 but we added them for
> > flexibility - and as you well know they are useful.
>
> Per above I wouldn't sacrafice update perf for this. Also its doable
> from userspace if you need to. Other thing I've come to dislike a bit
> is teaching the kernel a specific DSL. P4 is my favorte, but still
> going so far as to encode a specific P4 spec into the kernel seems
> unnecessary. Also now will we have to have kernel X supports P4.16 and
> kernel X+N supports P4.18 it seems like a pain.
>

I believe you are misunderstanding, so let me explain. While our focus is on PNA:
there is no change in the kernel infra (upstream code) for PNA or PSA
or PXXXX. The compiler may end up generating different code depending
on the architecture selected on the compile command line. The control
constructs are very static with their hierarchy IDs.
In regards to what you prophesy above about the language going from
P4.16 to P4.18 - I don't mean to be rude, but: pot calling the kettle
black much?;-> i.e. what happens when the eBPF ISA gets extended? The safety
feature we have for P4TC is externs - most of these will be
implemented as self-fulfilling kfuncs.
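
For illustration only (this is not the externs code from the series, and
all names here are made up), the general shape of exposing such an extern
as a kfunc to tc BPF programs would look roughly like this:

#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/init.h>
#include <linux/module.h>

/* hypothetical P4 "counter" extern exposed to the generated datapath */
__bpf_kfunc u64 hyp_p4_counter_bump(u32 pipeid, u32 counterid)
{
	/* resolve the counter by its hierarchical IDs and bump it */
	return 0;
}

BTF_SET8_START(hyp_p4_kfunc_ids)
BTF_ID_FLAGS(func, hyp_p4_counter_bump)
BTF_SET8_END(hyp_p4_kfunc_ids)

static const struct btf_kfunc_id_set hyp_p4_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &hyp_p4_kfunc_ids,
};

static int __init hyp_p4_externs_init(void)
{
	/* make the kfunc callable from tc (SCHED_CLS) BPF programs */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
					 &hyp_p4_kfunc_set);
}
late_initcall(hyp_p4_externs_init);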

> >
> > >
> > > >
> > > >I dont think i need to provide an explanation on the differences here
> > > >visavis what ebpf system calls provide vs what netlink provides and
> > > >how netlink is a clear fit. If it is not clear i can give more
> > >
> > > It is not :/
> >
> > I thought it was obvious for someone like you, but fine - here goes for those 3:
> >
> > 1. Msg construction/parsing: A lot of infra for sending attributes
> > back and forth is already built into netlink. I would have to create
> > mine from scratch for ebpf.  This will include not just the
> > construction/parsing but all the detailed attribute content policy
> > validations(even in the presence of hierarchies) that comes with it.
> > And not to forget the state transform between kernel and user space.
>
> But the series here does that as well probably could reuse that on
> top of BPF. We have lots of libraries to deal with ebpf to help.
> I don't see anything problematic here for BPF.

Which library does all of these (netlink features) in eBPF and has
something matching it in the kernel? We did try to write our own, but
it was a huge waste of time.

> >
> > 2. Multi-user request/response messaging
> > If you can write all the code for #1 above then this should work fine for ebpf
> >
> > 3. Event publish subscribe
> > You would have to create mechanisms for ebpf which either are non
> > trivial or non complete: Example 1: you can put surgeries in the ebpf
> > code to look at map manipulations and then interface it to some event
> > management scheme which checks for subscribed users. Example 2: It may
> > also be feasible to create your own map for subscription vs something
> > like perf ring for event publication(something i have done in the
> > past), but that is also limited in many ways.
>
> I would just push them out over a single perf ring and build the
> subscription on top of GRPC (pick your protocol of choice).
>


Why - just so I could use ebpf? I've never understood that single-user-mode
perf ring thing.

> >
> > >
> > > >breakdown. And of course there's more but above is a good sample.
> > > >
> > > >The part that is taken for granted is the control plane code and
> > > >interaction which is an extremely important detail. P4 Abstraction
> > > >requires hierarchies with different compiler generated encoded path
> > > >ids etc. This ID mapping gets exacerbated by having multitudes of  P4
> > >
> > > Why the actual eBFP mapping does not serve the same purpose as ID?
> > > ID:mapping 1 :1
> >
> > An identification of an object requires hierarchical IDs: A
> > pipeline/program ID, A table id, a table entry Identification, an
> > action identification and for each individual action content
> > parameter, an ID etc. These same IDs would be what hardware would
> > recognize as well (in case of offload).  Given the dynamic nature of
> > these IDs it is essentially up to the compiler to define them. These
> > hierarchies  are much easier to validate in netlink.
>
> I'm on board for offloads, but this series says no offloads and we
> have no one with hardware in Linux for offloads yet. If we have a
> series with a P4 driver and NIC I can get my hands on now we have
> an entirely different conversation.

The argument made is that P4 s/w stands on its own merit regardless of the
presence of hardware offload (there are s/w versions in DPDK and Rust
that I believe are used in production). As an example, the DASH
project quoted in the cover letter uses P4 as a datapath specification
language. The datapath is then verified to be working in s/w. So let's
not argue that there is no merit to a s/w P4 version without h/w
offload.

I do have a NIC (Intel e2000) that does P4 offloads but I am afraid I
can't give it to you. Folks who are doing offloads will present
drivers when they are ready, and when/if those patches show up there will
be extensions to deal with ndo_tc. But I know you already know and are
on the p4tc mailing list and are quite aware of these developments - I
am not sure I understand your motivation for bringing this up a few
times now. I read it as some sort of insinuation that there is some
secret vendor hardware that is going to benefit from some secret
trojan we are doing here. Again, P4 s/w stands on its own.

> None of this above is a problem in eBPF. Its just mapping ids around.
>
> >
> > We dont want to be constrained to a generic infra like eBPF for these
> > objects. Again eBPF is a means to an end (and not the goal here!).
>
> I don't see any constraints from eBPF above just a list of things
> that of course you would have to code up. But none of that doesn't
> already exist in other projects.
>

And we can agree to disagree.

cheers,
jamal

> >
> > cheers,
> > jamal
> > >
> > >
> > > >programs which have different requirements. Netlink is a natural fit
> > > >for this P4 abstraction. Not to mention the netlink/tc path (and in
> > > >particular the ID mapping) provides a conduit for offload when that is
> > > >needed.
> > > >eBPF is just a tool - and the objects are intended to be generic - and
> > > >i dont see how any of this could be achieved without retooling to make
> > > >it more specific to P4.
> > > >
> > > >cheers,
> > > >jamal
> > > >
> > > >
> > > >
> > > >>
> > > >> >I should note: that there was an interesting talk at netdevconf 0x17
> > > >> >where the speaker showed the challenges of dealing with ebpf on "day
> > > >> >two" - slides or videos are not up yet, but link is:
> > > >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> > > >> >The point the speaker was making is it's always easy to whip an ebpf
> > > >> >program that can slice and dice packets and maybe even flush LEDs but
> > > >> >the real work and challenge is in the control plane. I agree with the
> > > >> >speaker based on my experiences. This discussion of replacing netlink
> > > >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> > > >> >discussion and agree to disagree if you are going to keep insisting on
> > > >> >that.
> > > >>
> > > >>
> > > >> [...]
>
>
Jamal Hadi Salim Nov. 20, 2023, 10:56 p.m. UTC | #12
On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >>> On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >>>>> On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>>>> Jamal Hadi Salim wrote:
> >>>>>>> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>>>>>> Jamal Hadi Salim wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>>> I think I'm judging the technical work here. Bullet points.
> >>>>>>
> >>>>>> 1. p4c-tc implementation looks like it should be slower than a
> >>>>>>     in terms of pkts/sec than a bpf implementation. Meaning
> >>>>>>     I suspect pipeline and objects laid out like this will lose
> >>>>>>     to a BPF program with an parser and single lookup. The p4c-ebpf
> >>>>>>     compiler should look to create optimized EBPF code not some
> >>>>>>     emulated switch topology.
> >>>>>
> >>>>> The parser is ebpf based. The other objects which require control
> >>>>> plane interaction are not - those interact via netlink.
> >>>>> We published perf data a while back - presented at the P4 workshop
> >>>>> back in April (was in the cover letter)
> >>>>> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >>>>> But do note: the correct abstraction is the first priority.
> >>>>> Optimization is something we can teach the compiler over time. But
> >>>>> even with the minimalist code generation you can see that our approach
> >>>>> always beats ebpf in LPM and ternary. The other ones I am pretty sure
> >>>>
> >>>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> >>>> this kinds of lookups? I mean in theory, eBPF should be always faster.
> >>>
> >>> We didnt look closely; however, that is not the point - the point is
> >>> the perf difference if there is one, is not big with the big win being
> >>> proper P4 abstraction. For LPM for sure our algorithmic approach is
> >>> better. For ternary the compute intensity in looping is better done in
> >>> C. And for exact i believe that ebpf uses better hashing.
> >>> Again, that is not the point we were trying to validate in those experiments..
> >>>
> >>> On your point of "maps are not that suitable" P4 tables tend to have
> >>> very specific attributes (examples associated meters, counters,
> >>> default hit and miss actions, etc).
> >>>
> >>>>> we can optimize over time.
> >>>>> Your view of "single lookup" is true for simple programs but if you
> >>>>> have 10 tables trying to model a 5G function then it doesnt make sense
> >>>>> (and i think the data we published was clear that you gain no
> >>>>> advantage using ebpf - as a matter of fact there was no perf
> >>>>> difference between XDP and tc in such cases).
> >>>>>
> >>>>>> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >>>>>>     map. Doing a simple update vs a netlink msg. The argument
> >>>>>>     that BPF can't do CRUD (which we had offlist) seems incorrect
> >>>>>>     to me. Correct me if I'm wrong with details about why.
> >>>>>
> >>>>> So let me see....
> >>>>> you want me to replace netlink and all its features and rewrite it
> >>>>> using the ebpf system calls? Congestion control, event handling,
> >>>>> arbitrary message crafting, etc and the years of work that went into
> >>>>> netlink? NO to the HELL.
> >>>>
> >>>> Wait, I don't think John suggests anything like that. He just suggests
> >>>> to have the tables as eBPF maps.
> >>>
> >>> What's the difference? Unless maps can do netlink.
> >>>
> >>>> Honestly, I don't understand the
> >>>> fixation on netlink. Its socket messaging, memcpies, processing
> >>>> overhead, etc can't keep up with mmaped memory access at scale. Measure
> >>>> that and I bet you'll get drastically different results.
> >>>>
> >>>> I mean, netlink is good for a lot of things, but does not mean it is an
> >>>> universal answer to userspace<->kernel data passing.
> >>>
> >>> Here's a small sample of our requirements that are satisfied by
> >>> netlink for P4 object hierarchy[1]:
> >>> 1. Msg construction/parsing
> >>> 2. Multi-user request/response messaging
> >>
> >> What is actually a usecase for having multiple users program p4 pipeline
> >> in parallel?
> >
> > First of all - this is Linux, multiple users is a way of life, you
> > shouldnt have to ask that question unless you are trying to be
> > socratic. Meaning multiple control plane apps can be allowed to
> > program different parts and even different tables - think multi-tier
> > pipeline.
> >
> >>> 3. Multi-user event subscribe/publish messaging
> >>
> >> Same here. What is the usecase for multiple users receiving p4 events?
> >
> > Same thing.
> > Note: Events are really not part of P4 but we added them for
> > flexibility - and as you well know they are useful.
> >
> >>> I dont think i need to provide an explanation on the differences here
> >>> visavis what ebpf system calls provide vs what netlink provides and
> >>> how netlink is a clear fit. If it is not clear i can give more
> >>
> >> It is not :/
> >
> > I thought it was obvious for someone like you, but fine - here goes for those 3:
> >
> > 1. Msg construction/parsing: A lot of infra for sending attributes
> > back and forth is already built into netlink. I would have to create
> > mine from scratch for ebpf.  This will include not just the
> > construction/parsing but all the detailed attribute content policy
> > validations(even in the presence of hierarchies) that comes with it.
> > And not to forget the state transform between kernel and user space.
> >
> > 2. Multi-user request/response messaging
> > If you can write all the code for #1 above then this should work fine for ebpf
> >
> > 3. Event publish subscribe
> > You would have to create mechanisms for ebpf which either are non
> > trivial or non complete: Example 1: you can put surgeries in the ebpf
> > code to look at map manipulations and then interface it to some event
> > management scheme which checks for subscribed users. Example 2: It may
> > also be feasible to create your own map for subscription vs something
> > like perf ring for event publication(something i have done in the
> > past), but that is also limited in many ways.
>
> I still don't think this answers all the questions on why the netlink
> shim layer. The kfuncs are essentially available to all of tc BPF and
> I don't think there was a discussion why they cannot be done generic
> in a way that they could benefit all tc/XDP BPF users. With the patch
> 14 you are more or less copying what is existing with {cls,act}_bpf
> just that you also allow XDP loading from tc(?). We do have existing
> interfaces for XDP program management.
>

I am not sure I followed - but we are open to suggestions to improve
operational usability.

> tc BPF and XDP already have widely used infrastructure and can be developed
> against libbpf or other user space libraries for a user space control plane.
> With 'control plane' you refer here to the tc / netlink shim you've built,
> but looking at the tc command line examples, this doesn't really provide a
> good user experience (you call it p4 but people load bpf obj files). If the
> expectation is that an operator should run tc commands, then neither it's
> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> to bpf_mprog and plan to also extend this for XDP to have a common look and
> feel wrt networking for developers. Why can't this be reused?

The filter loading which loads the program is considered pipeline
instantiation - consider it "provisioning" rather than "control",
which runs at runtime. "Control" is purely netlink based. The iproute2
code we use links libbpf, for example, for the filter. If we can achieve
the same with bpf_mprog then sure - we just don't want to lose
functionality though. Off the top of my head, some sample space:
- we could have multiple pipelines with different priorities (which tc
provides to us) - and each pipeline may have its own logic with many
tables etc (and the choice to iterate to the next one is essentially
encoded in the tc action codes)
- we use tc blocks to map groups of ports (which I don't think bpf has
internal access to)

In regards to usability: no, I don't expect someone doing things at
scale to use the command line tc. The APIs are via netlink. But the tc cli
is a must for the rest of the masses per our traditions. Also I really
didn't even want to use ebpf at all, for operator experience reasons -
it requires a compilation of the code and an extra loading step compared to
what our original u32/pedit code offered.
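
To make the "provisioning vs control" split concrete, here is a rough
sketch (the "p4" classifier arguments, block number and section names
below are illustrative placeholders based on this series, not a confirmed
syntax):

  # provisioning: instantiate two pipelines on one block, at different prios
  tc filter add block 21 ingress protocol all prio 10 p4 pname myprog1 \
      action bpf obj myprog1.o section p4tc/main
  tc filter add block 21 ingress protocol all prio 20 p4 pname myprog2 \
      action bpf obj myprog2.o section p4tc/main

  # control: runtime table manipulation stays pure netlink via tc p4ctrl
  tc p4ctrl create myprog1/table/mytable dstAddr 10.0.1.2/32 \
      action send_to_port param port eno1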

> I don't quite follow why not most of this could be implemented entirely in
> user space without the detour of this and you would provide a developer
> library which could then be integrated into a p4 runtime/frontend? This
> way users never interface with ebpf parts nor tc given they also shouldn't
> have to - it's an implementation detail. This is what John was also pointing
> out earlier.
>

Netlink is the API. We will provide a library for object manipulation
which abstracts away the need to know netlink. Someone who for their
own reasons wants to use p4runtime or TDI could write on top of this.
I would not design a kernel interface to just meet p4runtime (we
already have TDI which came later which does things differently). So i
expect us to support both those two. And if i was to do something on
SDN that was more robust i would write my own that still uses these
netlink interfaces.

> If you need notifications/subscribe mechanism for map updates, then this
> could be extended.. same way like BPF internals got extended along with the
> sched_ext work, making the core pieces more useful also outside of the latter.
>

Why? I already have this working great right now with netlink.

> The link to below slides are not public, so it's hard to see what is really
> meant here, but I have also never seen an email from the speaker on the BPF
> mailing list providing concrete feedback(?). People do build control planes
> around BPF in the wild, I'm not sure where you take 'flush LEDs' from, to
> me this all sounds rather hand-wavy and trying to brute-force the fixation
> on netlink you went with that is raising questions. I don't think there was
> objection on going with eBPF but rather all this infra for the former for
> a SW-only extension.

There are a handful of people who are holding up the release of the
slides (I will go and chase them after this).

BTW, our experience with usability of an eBPF control plane is the
same as Ivan's. I was listening to the talk and just nodding along.
You focused too much on the datapath and did a good job there, but I am
afraid not so much on usability of the control path. My view is: to
create a back and forth with the kernel for something as complex as we
have, using the ebpf system calls vs netlink, you would need to spend a
lot more developer resources in the ebpf case. If you want to call
what I have "the fixation on netlink", maybe you are fixated on the ebpf
syscall? ;->

cheers,
jamal


> [...]
> >>>>> I should note: that there was an interesting talk at netdevconf 0x17
> >>>>> where the speaker showed the challenges of dealing with ebpf on "day
> >>>>> two" - slides or videos are not up yet, but link is:
> >>>>> https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >>>>> The point the speaker was making is it's always easy to whip an ebpf
> >>>>> program that can slice and dice packets and maybe even flush LEDs but
> >>>>> the real work and challenge is in the control plane. I agree with the
> >>>>> speaker based on my experiences. This discussion of replacing netlink
> >>>>> with ebpf system calls is absolutely a non-starter. Let's just end the
> >>>>> discussion and agree to disagree if you are going to keep insisting on
> >>>>> that.
Jiri Pirko Nov. 21, 2023, 1:06 p.m. UTC | #13
Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:

[...]

>
>> tc BPF and XDP already have widely used infrastructure and can be developed
>> against libbpf or other user space libraries for a user space control plane.
>> With 'control plane' you refer here to the tc / netlink shim you've built,
>> but looking at the tc command line examples, this doesn't really provide a
>> good user experience (you call it p4 but people load bpf obj files). If the
>> expectation is that an operator should run tc commands, then neither it's
>> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> feel wrt networking for developers. Why can't this be reused?
>
>The filter loading which loads the program is considered pipeline
>instantiation - consider it as "provisioning" more than "control"
>which runs at runtime. "control" is purely netlink based. The iproute2
>code we use links libbpf for example for the filter. If we can achieve
>the same with bpf_mprog then sure - we just dont want to loose
>functionality though.  off top of my head, some sample space:
>- we could have multiple pipelines with different priorities (which tc
>provides to us) - and each pipeline may have its own logic with many
>tables etc (and the choice to iterate the next one is essentially
>encoded in the tc action codes)
>- we use tc block to map groups of ports (which i dont think bpf has
>internal access of)
>
>In regards to usability: no i dont expect someone doing things at
>scale to use command line tc. The APIs are via netlink. But the tc cli
>is must for the rest of the masses per our traditions. Also i really

I don't follow. You repeatedly mention "the must of the traditional tc
cli", but what part of the existing traditional cli do you use for p4tc?
If I look at the examples, pretty much everything looks new to me.
Example:

  tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
    action send_to_port param port eno1

This is just TC/RTnetlink used as a channel to pass new things over. If
that is the case, what's traditional here?


>didnt even want to use ebpf at all for operator experience reasons -
>it requires a compilation of the code and an extra loading compared to
>what our original u32/pedit code offered.
>
>> I don't quite follow why not most of this could be implemented entirely in
>> user space without the detour of this and you would provide a developer
>> library which could then be integrated into a p4 runtime/frontend? This
>> way users never interface with ebpf parts nor tc given they also shouldn't
>> have to - it's an implementation detail. This is what John was also pointing
>> out earlier.
>>
>
>Netlink is the API. We will provide a library for object manipulation
>which abstracts away the need to know netlink. Someone who for their
>own reasons wants to use p4runtime or TDI could write on top of this.
>I would not design a kernel interface to just meet p4runtime (we
>already have TDI which came later which does things differently). So i
>expect us to support both those two. And if i was to do something on
>SDN that was more robust i would write my own that still uses these
>netlink interfaces.

Actually, what Daniel says about the p4 library used as a backend to a p4
frontend is pretty much aligned with what I claimed on the p4 calls a couple
of times. If you have this p4 userspace tooling, it is easy for offloads to
replace the backend with a vendor-specific library, which allows p4 offload
suitable for all vendors (your plan for p4tc offload does not work well
for our hw, as we repeatedly claimed).

As I also said on the p4 call a couple of times, I don't see the kernel
as the correct place to do the p4 abstractions. Why don't you do it in
userspace and give vendors the possibility to have p4 backends with compilers,
runtime optimizations etc in userspace, talking to the HW in the
vendor-suitable way too? Then the SW implementation could easily be eBPF,
and the main reason (I believe) why you need to have this in TC
(offload) is then void.

The "everyone wants to use TC/netlink" claim does not seem correct
to me. Why not have one Linux p4 solution that fits everyone's needs?

[...]
Jamal Hadi Salim Nov. 21, 2023, 1:47 p.m. UTC | #14
On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
> >
> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> against libbpf or other user space libraries for a user space control plane.
> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> but looking at the tc command line examples, this doesn't really provide a
> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> expectation is that an operator should run tc commands, then neither it's
> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> feel wrt networking for developers. Why can't this be reused?
> >
> >The filter loading which loads the program is considered pipeline
> >instantiation - consider it as "provisioning" more than "control"
> >which runs at runtime. "control" is purely netlink based. The iproute2
> >code we use links libbpf for example for the filter. If we can achieve
> >the same with bpf_mprog then sure - we just dont want to loose
> >functionality though.  off top of my head, some sample space:
> >- we could have multiple pipelines with different priorities (which tc
> >provides to us) - and each pipeline may have its own logic with many
> >tables etc (and the choice to iterate the next one is essentially
> >encoded in the tc action codes)
> >- we use tc block to map groups of ports (which i dont think bpf has
> >internal access of)
> >
> >In regards to usability: no i dont expect someone doing things at
> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >is must for the rest of the masses per our traditions. Also i really
>
> I don't follow. You repeatedly mention "the must of the traditional tc
> cli", but what of the existing traditional cli you use for p4tc?
> If I look at the examples, pretty much everything looks new to me.
> Example:
>
>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>     action send_to_port param port eno1
>
> This is just TC/RTnetlink used as a channel to pass new things over. If
> that is the case, what's traditional here?
>


What is not traditional about it?

>
> >didnt even want to use ebpf at all for operator experience reasons -
> >it requires a compilation of the code and an extra loading compared to
> >what our original u32/pedit code offered.
> >
> >> I don't quite follow why not most of this could be implemented entirely in
> >> user space without the detour of this and you would provide a developer
> >> library which could then be integrated into a p4 runtime/frontend? This
> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> have to - it's an implementation detail. This is what John was also pointing
> >> out earlier.
> >>
> >
> >Netlink is the API. We will provide a library for object manipulation
> >which abstracts away the need to know netlink. Someone who for their
> >own reasons wants to use p4runtime or TDI could write on top of this.
> >I would not design a kernel interface to just meet p4runtime (we
> >already have TDI which came later which does things differently). So i
> >expect us to support both those two. And if i was to do something on
> >SDN that was more robust i would write my own that still uses these
> >netlink interfaces.
>
> Actually, what Daniel says about the p4 library used as a backend to p4
> frontend is pretty much aligned what I claimed on the p4 calls couple of
> times. If you have this p4 userspace tooling, it is easy for offloads to
> replace the backed by vendor-specific library which allows p4 offload
> suitable for all vendors (your plan of p4tc offload does not work well
> for our hw, as we repeatedly claimed).
>

That's you - NVIDIA. You have chosen a path away from the kernel
towards DOCA. I understand NVIDIA's frustration with dealing with the
upstream process (which has been cited to me as a good reason for
DOCA) but please don't impose these values and your politics on other
vendors (Intel, AMD for example) who are more than willing to invest
in making the kernel interfaces the path forward. Your choice.
Nobody is stopping you from offering your customers proprietary
solutions which include a specific ebpf approach alongside DOCA. We
believe that a singular interface, regardless of the vendor, is the
right way forward. IMHO, this siloing - which unfortunately is also aided
by eBPF being a double-edged sword - is not good for the community.

> As I also said on the p4 call couple of times, I don't see the kernel
> as the correct place to do the p4 abstractions. Why don't you do it in
> userspace and give vendors possiblity to have p4 backends with compilers,
> runtime optimizations etc in userspace, talking to the HW in the
> vendor-suitable way too. Then the SW implementation could be easily eBPF
> and the main reason (I believe) why you need to have this is TC
> (offload) is then void.
>
> The "everyone wants to use TC/netlink" claim does not seem correct
> to me. Why not to have one Linux p4 solution that fits everyones needs?

You mean more fitting to the DOCA world? No, because I am a kernel-first
person and kernel interfaces are good for everyone.

cheers,
jamal
Jiri Pirko Nov. 21, 2023, 2:19 p.m. UTC | #15
Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >>
>> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>>
>> [...]
>>
>> >
>> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> against libbpf or other user space libraries for a user space control plane.
>> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> but looking at the tc command line examples, this doesn't really provide a
>> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> expectation is that an operator should run tc commands, then neither it's
>> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> feel wrt networking for developers. Why can't this be reused?
>> >
>> >The filter loading which loads the program is considered pipeline
>> >instantiation - consider it as "provisioning" more than "control"
>> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >code we use links libbpf for example for the filter. If we can achieve
>> >the same with bpf_mprog then sure - we just dont want to loose
>> >functionality though.  off top of my head, some sample space:
>> >- we could have multiple pipelines with different priorities (which tc
>> >provides to us) - and each pipeline may have its own logic with many
>> >tables etc (and the choice to iterate the next one is essentially
>> >encoded in the tc action codes)
>> >- we use tc block to map groups of ports (which i dont think bpf has
>> >internal access of)
>> >
>> >In regards to usability: no i dont expect someone doing things at
>> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >is must for the rest of the masses per our traditions. Also i really
>>
>> I don't follow. You repeatedly mention "the must of the traditional tc
>> cli", but what of the existing traditional cli you use for p4tc?
>> If I look at the examples, pretty much everything looks new to me.
>> Example:
>>
>>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>>     action send_to_port param port eno1
>>
>> This is just TC/RTnetlink used as a channel to pass new things over. If
>> that is the case, what's traditional here?
>>
>
>
>What is not traditional about it?

Okay, so in that case, the following example, communicating with a
userspace daemon using an imaginary "p4ctrl" app, is equally traditional:
  $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
     action send_to_port param port eno1


>
>>
>> >didnt even want to use ebpf at all for operator experience reasons -
>> >it requires a compilation of the code and an extra loading compared to
>> >what our original u32/pedit code offered.
>> >
>> >> I don't quite follow why not most of this could be implemented entirely in
>> >> user space without the detour of this and you would provide a developer
>> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> have to - it's an implementation detail. This is what John was also pointing
>> >> out earlier.
>> >>
>> >
>> >Netlink is the API. We will provide a library for object manipulation
>> >which abstracts away the need to know netlink. Someone who for their
>> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >I would not design a kernel interface to just meet p4runtime (we
>> >already have TDI which came later which does things differently). So i
>> >expect us to support both those two. And if i was to do something on
>> >SDN that was more robust i would write my own that still uses these
>> >netlink interfaces.
>>
>> Actually, what Daniel says about the p4 library used as a backend to p4
>> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> times. If you have this p4 userspace tooling, it is easy for offloads to
>> replace the backed by vendor-specific library which allows p4 offload
>> suitable for all vendors (your plan of p4tc offload does not work well
>> for our hw, as we repeatedly claimed).
>>
>
>That's you - NVIDIA. You have chosen a path away from the kernel
>towards DOCA. I understand NVIDIA's frustration with dealing with
>upstream process (which has been cited to me as a good reason for
>DOCA) but please dont impose these values and your politics on other
>vendors(Intel, AMD for example) who are more than willing to invest
>into making the kernel interfaces the path forward. Your choice.

No, you are missing the point. This has nothing to do with DOCA. This
has to do with the simple limitation of your offload assuming there are
no runtime changes in the compiled pipeline. For Intel, maybe there
aren't, and it's a good fit for them. All I am saying is that it is not a
good fit for everyone.


>Nobody is stopping you from offering your customers proprietary
>solutions which include a specific ebpf approach alongside DOCA. We
>believe that a singular interface regardless of the vendor is the
>right way forward. IMHO, this siloing that unfortunately is also added
>by eBPF being a double edged sword is not good for the community.
>
>> As I also said on the p4 call couple of times, I don't see the kernel
>> as the correct place to do the p4 abstractions. Why don't you do it in
>> userspace and give vendors possiblity to have p4 backends with compilers,
>> runtime optimizations etc in userspace, talking to the HW in the
>> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> and the main reason (I believe) why you need to have this is TC
>> (offload) is then void.
>>
>> The "everyone wants to use TC/netlink" claim does not seem correct
>> to me. Why not to have one Linux p4 solution that fits everyones needs?
>
>You mean more fitting to the DOCA world? no, because iam a kernel

Again, this has 0 relation to DOCA.


>first person and kernel interfaces are good for everyone.

Yeah, not really. The kernel is not always the right answer. Your/Intel's
plan to handle the offload by:
1) abusing devlink to flash the p4 binary
2) parsing the binary in the kernel to match the table ids of rules coming
   from the p4tc ndo_setup_tc
3) abusing devlink to flash the p4 binary for tc-flower
4) parsing the binary in the kernel to match the table ids of rules coming
   from the tc-flower ndo_setup_tc
is really something that makes me a little bit nauseous.

If you don't have a feasible plan to do the offload, p4tc does not make
sense to me, to be honest.


>
>cheers,
>jamal
Jamal Hadi Salim Nov. 21, 2023, 3:21 p.m. UTC | #16
On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >>
> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >>
> >> [...]
> >>
> >> >
> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >
> >> >The filter loading which loads the program is considered pipeline
> >> >instantiation - consider it as "provisioning" more than "control"
> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >functionality though.  off top of my head, some sample space:
> >> >- we could have multiple pipelines with different priorities (which tc
> >> >provides to us) - and each pipeline may have its own logic with many
> >> >tables etc (and the choice to iterate the next one is essentially
> >> >encoded in the tc action codes)
> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >internal access of)
> >> >
> >> >In regards to usability: no i dont expect someone doing things at
> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >is must for the rest of the masses per our traditions. Also i really
> >>
> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> cli", but what of the existing traditional cli you use for p4tc?
> >> If I look at the examples, pretty much everything looks new to me.
> >> Example:
> >>
> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >>     action send_to_port param port eno1
> >>
> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> that is the case, what's traditional here?
> >>
> >
> >
> >What is not traditional about it?
>
> Okay, so in that case, the following example communitating with
> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>      action send_to_port param port eno1

Huh? That's just an application - classical tc, which is part of iproute2,
sending to the kernel, no different than "tc flower ...".
Where do you get the "userspace" daemon part? Yes, you can write a
daemon, but it will use the same APIs as tc.

>
> >
> >>
> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >it requires a compilation of the code and an extra loading compared to
> >> >what our original u32/pedit code offered.
> >> >
> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> user space without the detour of this and you would provide a developer
> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> out earlier.
> >> >>
> >> >
> >> >Netlink is the API. We will provide a library for object manipulation
> >> >which abstracts away the need to know netlink. Someone who for their
> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >already have TDI which came later which does things differently). So i
> >> >expect us to support both those two. And if i was to do something on
> >> >SDN that was more robust i would write my own that still uses these
> >> >netlink interfaces.
> >>
> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> replace the backed by vendor-specific library which allows p4 offload
> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> for our hw, as we repeatedly claimed).
> >>
> >
> >That's you - NVIDIA. You have chosen a path away from the kernel
> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >upstream process (which has been cited to me as a good reason for
> >DOCA) but please dont impose these values and your politics on other
> >vendors(Intel, AMD for example) who are more than willing to invest
> >into making the kernel interfaces the path forward. Your choice.
>
> No, you are missing the point. This has nothing to do with DOCA.

Right Jiri ;->

> This
> has to do with the simple limitation of your offload assuming there are
> no runtime changes in the compiled pipeline. For Intel, maybe they
> aren't, and it's a good fit for them. All I say is, that it is not the
> good fit for everyone.

a) It is not part of the P4 spec to dynamically make changes to the
datapath pipeline after it is created, and we are discussing a P4
implementation, not an extension that would add more value. b) We are
more than happy to add extensions in the future to accommodate such
features, but first the _P4 spec_ must be met. c) We had longer discussions
with Matty, Khalid and the Rice folks who wrote a paper on that topic
(which you probably didn't attend), and everything that needs to be done
can be done from user space today for all those optimizations.

The conclusion is: for what you need to do (which I don't believe is a
limitation of your hardware but rather a design decision on your part), run
your user space daemon, do the optimizations and update the datapath.
Everybody is happy.

>
> >Nobody is stopping you from offering your customers proprietary
> >solutions which include a specific ebpf approach alongside DOCA. We
> >believe that a singular interface regardless of the vendor is the
> >right way forward. IMHO, this siloing that unfortunately is also added
> >by eBPF being a double edged sword is not good for the community.
> >
> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> runtime optimizations etc in userspace, talking to the HW in the
> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> and the main reason (I believe) why you need to have this is TC
> >> (offload) is then void.
> >>
> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >
> >You mean more fitting to the DOCA world? no, because iam a kernel
>
> Again, this has 0 relation to DOCA.
>
>
> >first person and kernel interfaces are good for everyone.
>
> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> plan to handle the offload by:
> 1) abuse devlink to flash p4 binary
> 2) parse the binary in kernel to match to the table ids of rules coming
>    from p4tc ndo_setup_tc
> 3) abuse devlink to flash p4 binary for tc-flower
> 4) parse the binary in kernel to match to the table ids of rules coming
>    from tc-flower ndo_setup_tc
> is really something that is making me a little bit nauseous.
>
> If you don't have a feasible plan to do the offload, p4tc does not make
> sense to me to be honest.

You mean if there's no plan to match your (NVIDIA?) point of view.
For #1 - how is this different from DDP? Wasn't that your suggestion to
begin with? For #2, nobody is proposing to do anything of the sort. The
ndo is passed IDs for the objects and their associated contents. For #3+#4,
the tc flower thing has nothing to do with P4TC; that was just some random
proposal someone made to see if they could ride on top of P4TC.

Besides this, nobody really has to satisfy your point of view - like I
said earlier, feel free to provide proprietary solutions. From a
consumer perspective I would not want to deal with 4 different
vendors with 4 different proprietary approaches. The kernel is the
unifying part. You seemed happier with tc flower, just not with the
kernel process - which is ironically the same thing we are going
through here ;->

cheers,
jamal

>
> >
> >cheers,
> >jamal
Jiri Pirko Nov. 22, 2023, 9:25 a.m. UTC | #17
Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >>
>> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> [...]
>> >>
>> >> >
>> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >
>> >> >The filter loading which loads the program is considered pipeline
>> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >functionality though.  off top of my head, some sample space:
>> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >encoded in the tc action codes)
>> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >internal access of)
>> >> >
>> >> >In regards to usability: no i dont expect someone doing things at
>> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >is must for the rest of the masses per our traditions. Also i really
>> >>
>> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> If I look at the examples, pretty much everything looks new to me.
>> >> Example:
>> >>
>> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >>     action send_to_port param port eno1
>> >>
>> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> that is the case, what's traditional here?
>> >>
>> >
>> >
>> >What is not traditional about it?
>>
>> Okay, so in that case, the following example communitating with
>> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>>      action send_to_port param port eno1
>
>Huh? Thats just an application - classical tc which part of iproute2
>that is sending to the kernel, no different than "tc flower.."
>Where do you get the "userspace" daemon part? Yes, you can write a
>daemon but it will use the same APIs as tc.

Okay, so which part is the "tradition"?


>
>>
>> >
>> >>
>> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >it requires a compilation of the code and an extra loading compared to
>> >> >what our original u32/pedit code offered.
>> >> >
>> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> user space without the detour of this and you would provide a developer
>> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> out earlier.
>> >> >>
>> >> >
>> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >already have TDI which came later which does things differently). So i
>> >> >expect us to support both those two. And if i was to do something on
>> >> >SDN that was more robust i would write my own that still uses these
>> >> >netlink interfaces.
>> >>
>> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> replace the backed by vendor-specific library which allows p4 offload
>> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> for our hw, as we repeatedly claimed).
>> >>
>> >
>> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >upstream process (which has been cited to me as a good reason for
>> >DOCA) but please dont impose these values and your politics on other
>> >vendors(Intel, AMD for example) who are more than willing to invest
>> >into making the kernel interfaces the path forward. Your choice.
>>
>> No, you are missing the point. This has nothing to do with DOCA.
>
>Right Jiri ;->
>
>> This
>> has to do with the simple limitation of your offload assuming there are
>> no runtime changes in the compiled pipeline. For Intel, maybe they
>> aren't, and it's a good fit for them. All I say is, that it is not the
>> good fit for everyone.
>
> a) it is not part of the P4 spec to dynamically make changes to the
>datapath pipeline after it is create and we are discussing a P4

Isn't this up to the implementation? I mean, from the p4 perspective,
everything is static. HW might need to reshuffle the pipeline internally
during rule insertion/removal in order to optimize the layout.


>implementation not an extension that would add more value b) We are
>more than happy to add extensions in the future to accomodate for
>features but first _P4 spec_ must be met c) we had longer discussions
>with Matty, Khalid and the Rice folks who wrote a paper on that topic
>which you probably didnt attend and everything that needs to be done
>can be from user space today for all those optimizations.
>
>Conclusion is: For what you need to do (which i dont believe is a
>limitation in your hardware rather a design decision on your part) run
>your user space daemon, do optimizations and update the datapath.
>Everybody is happy.

Should the userspace daemon listen for inserted rules to be offloaded
over netlink?


>
>>
>> >Nobody is stopping you from offering your customers proprietary
>> >solutions which include a specific ebpf approach alongside DOCA. We
>> >believe that a singular interface regardless of the vendor is the
>> >right way forward. IMHO, this siloing that unfortunately is also added
>> >by eBPF being a double edged sword is not good for the community.
>> >
>> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> and the main reason (I believe) why you need to have this is TC
>> >> (offload) is then void.
>> >>
>> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >
>> >You mean more fitting to the DOCA world? no, because iam a kernel
>>
>> Again, this has 0 relation to DOCA.
>>
>>
>> >first person and kernel interfaces are good for everyone.
>>
>> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> plan to handle the offload by:
>> 1) abuse devlink to flash p4 binary
>> 2) parse the binary in kernel to match to the table ids of rules coming
>>    from p4tc ndo_setup_tc
>> 3) abuse devlink to flash p4 binary for tc-flower
>> 4) parse the binary in kernel to match to the table ids of rules coming
>>    from tc-flower ndo_setup_tc
>> is really something that is making me a little bit nauseous.
>>
>> If you don't have a feasible plan to do the offload, p4tc does not make
>> sense to me to be honest.
>
>You mean if there's no plan to match your (NVIDIA?)  point of view.
>For #1 - how's this different from DDP? Wasnt that your suggestion to

I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
opposed to from day 1.


>begin with? For #2 Nobody is proposing to do anything of the sort. The
>ndo is passed IDs for the objects and associated contents. For #3+#4

During offload, you need to parse the blob in the driver to be able to match
the ids with blob entities. That was presented by you/Intel in the past,
IIRC.


>tc flower thing has nothing to do with P4TC that was just some random
>proposal someone made seeing if they could ride on top of P4TC.

Yeah, it's not yet merged and it's already being lined up for abuse. I love
that :)


>
>Besides this nobody really has to satisfy your point of view - like i
>said earlier feel free to provide proprietary solutions. From a
>consumer perspective  I would not want to deal with 4 different
>vendors with 4 different proprietary approaches. The kernel is the
>unifying part. You seemed happier with tc flower just not with the

Yeah, that is my point: why can't the unifying part be a userspace
daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?

I just don't see the kernel as a good fit for the abstraction here,
given the fact that the vendor compilers do not run in the kernel.
That breaks your model.


>kernel process - which is ironically the same thing we are going
>through here ;->
>
>cheers,
>jamal
>
>>
>> >
>> >cheers,
>> >jamal
Jamal Hadi Salim Nov. 22, 2023, 3:14 p.m. UTC | #18
On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >>
> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >>
> >> >> [...]
> >> >>
> >> >> >
> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >
> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >functionality though.  off top of my head, some sample space:
> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >encoded in the tc action codes)
> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >internal access of)
> >> >> >
> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >>
> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> Example:
> >> >>
> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >>     action send_to_port param port eno1
> >> >>
> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> that is the case, what's traditional here?
> >> >>
> >> >
> >> >
> >> >What is not traditional about it?
> >>
> >> Okay, so in that case, the following example communitating with
> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >>      action send_to_port param port eno1
> >
> >Huh? Thats just an application - classical tc which part of iproute2
> >that is sending to the kernel, no different than "tc flower.."
> >Where do you get the "userspace" daemon part? Yes, you can write a
> >daemon but it will use the same APIs as tc.
>
> Okay, so which part is the "tradition"?
>

It provides tooling via the tc cli that _everyone_ in the tc world is
familiar with - it uses the same syntax as other tc extensions do, with the
same expectations (e.g. events, request/responses, familiar commands for
dumping, flushing etc). Basically someone familiar with tc will pick
this up and operate it very quickly and will have an easier time
debugging it.
There are caveats - as there will be with all new classifiers - but those
are within reason.
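
To illustrate what "same expectations" means in practice, a rough sketch of
the familiar tc-style verbs (the get/del/flush spellings here are
illustrative, mirroring other tc objects, not a confirmed syntax):

  tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
      action send_to_port param port eno1
  tc p4ctrl get myprog/table/mytable
  tc p4ctrl del myprog/table/mytable dstAddr 10.0.1.2/32
  tc p4ctrl flush myprog/table/mytable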

> >>
> >> >
> >> >>
> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >what our original u32/pedit code offered.
> >> >> >
> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> out earlier.
> >> >> >>
> >> >> >
> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >already have TDI which came later which does things differently). So i
> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >netlink interfaces.
> >> >>
> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> for our hw, as we repeatedly claimed).
> >> >>
> >> >
> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >upstream process (which has been cited to me as a good reason for
> >> >DOCA) but please dont impose these values and your politics on other
> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >into making the kernel interfaces the path forward. Your choice.
> >>
> >> No, you are missing the point. This has nothing to do with DOCA.
> >
> >Right Jiri ;->
> >
> >> This
> >> has to do with the simple limitation of your offload assuming there are
> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> good fit for everyone.
> >
> > a) it is not part of the P4 spec to dynamically make changes to the
> >datapath pipeline after it is create and we are discussing a P4
>
> Isn't this up to the implementation? I mean from the p4 perspective,
> everything is static. Hw might need to reshuffle the pipeline internally
> during rule insertion/remove in order to optimize the layout.
>

But do note: the focus here is on P4 (hence the name P4TC).

> >implementation not an extension that would add more value b) We are
> >more than happy to add extensions in the future to accomodate for
> >features but first _P4 spec_ must be met c) we had longer discussions
> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >which you probably didnt attend and everything that needs to be done
> >can be from user space today for all those optimizations.
> >
> >Conclusion is: For what you need to do (which i dont believe is a
> >limitation in your hardware rather a design decision on your part) run
> >your user space daemon, do optimizations and update the datapath.
> >Everybody is happy.
>
> Should the userspace daemon listen on inserted rules to be offloade
> over netlink?
>

I mean, you could if you wanted to, given this is just traditional
netlink which emits events (with some filtering once we integrate the
filter approach). But why?
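
For illustration: from the shell, the usual way to watch such events today
is "tc monitor"; a daemon would subscribe to the same rtnetlink multicast
groups programmatically (what exactly the p4 object events look like is of
course specific to this series):

  tc monitor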

> >
> >>
> >> >Nobody is stopping you from offering your customers proprietary
> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >believe that a singular interface regardless of the vendor is the
> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >by eBPF being a double edged sword is not good for the community.
> >> >
> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> (offload) is then void.
> >> >>
> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >
> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >>
> >> Again, this has 0 relation to DOCA.
> >>
> >>
> >> >first person and kernel interfaces are good for everyone.
> >>
> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> plan to handle the offload by:
> >> 1) abuse devlink to flash p4 binary
> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >>    from p4tc ndo_setup_tc
> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >>    from tc-flower ndo_setup_tc
> >> is really something that is making me a little bit nauseous.
> >>
> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> sense to me to be honest.
> >
> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>
> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> opposed to from day 1.
>
>

Oh well - it is in the kernel and it works fine tbh.

> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >ndo is passed IDs for the objects and associated contents. For #3+#4
>
> During offload, you need to parse the blob in driver to be able to match
> the ids with blob entities. That was presented by you/Intel in the past
> IIRC.
>

You are correct - in the case of offload the netlink IDs will have to be
authenticated against what the hardware can accept, but the devlink
flash use, I believe, came from you as a compromise.

>
> >tc flower thing has nothing to do with P4TC that was just some random
> >proposal someone made seeing if they could ride on top of P4TC.
>
> Yeah, it's not yet merged and already mentally used for abuse. I love
> that :)
>
> >
> >Besides this nobody really has to satisfy your point of view - like i
> >said earlier feel free to provide proprietary solutions. From a
> >consumer perspective  I would not want to deal with 4 different
> >vendors with 4 different proprietary approaches. The kernel is the
> >unifying part. You seemed happier with tc flower just not with the
>
> Yeah, that is my point, why the unifying part can't be a userspace
> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>
> I just don't see the kernel as a good fit for abstraction here,
> given the fact that the vendor compilers does not run in kernel.
> That is breaking your model.
>

Jiri - we want to support P4, first. Like you said, the P4 pipeline,
once installed, is static.
P4 doesn't allow dynamic update of the pipeline. For example, once you
say "here are my 14 tables and their associated actions and here's how
the pipeline main control (on how to iterate the tables etc) is going
to be" and after you instantiate/activate that pipeline, you don't go
back 5 minutes later and say "sorry, please introduce table 15, which
I want you to walk to after you visit table 3 if metadata foo is 5" or
"shoot, let's change that table 5 to be exact match instead of LPM". It's
not anywhere in the spec.
That doesn't mean it is not a useful thing to have - but it is an
invention that has _nothing to do with the P4 spec_; so saying a P4
implementation must support it is a bit out of scope, and there are
vendors with hardware who support P4 today that don't need any of this.
In my opinion that is a feature that could be added later out of
necessity (there is some good niche value in being able to add some
"dynamism" to any pipeline) and could influence the P4 standards on why it
is needed.
It should be doable today in a brute force way (this is just one
suggestion that came to me when Rice University/Nvidia presented[1]);
I am sure there are other approaches and the idea is by no means
proven.

1) User space creates/compiles/adds/activates your program that has 14
tables at tc prio X chain Y
2) a) 5 minutes later user space decides it wants to change and add
table 15 after table 3, visited when metadata foo=5
    b) your compiler in user space compiles a brand new program which
satisfies #2a (how this program was authored is out of scope of this
discussion)
    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
    d) user space deletes tc prio X chain Y (and makes sure your packets'
entry point is whatever #c is)
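
A minimal tc-level sketch of steps 1, 2c and 2d (the "p4" classifier
arguments and block number are illustrative placeholders assuming the filter
syntax from this series; the point is only the prio X -> X+1 add-then-delete
dance):

  # 1) instantiate the original 14-table program at prio X (here 10)
  tc filter add block 21 ingress protocol all prio 10 chain 0 p4 pname myprog \
      action bpf obj myprog.o section p4tc/main

  # 2c) add the recompiled program (now with table 15) at prio X+1
  tc filter add block 21 ingress protocol all prio 11 chain 0 p4 pname myprog_v2 \
      action bpf obj myprog_v2.o section p4tc/main

  # 2d) delete the old program so the new one becomes the entry point
  tc filter del block 21 ingress protocol all prio 10 chain 0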

cheers,
jamal

[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf

>
> >kernel process - which is ironically the same thing we are going
> >through here ;->
> >
> >cheers,
> >jamal
> >
> >>
> >> >
> >> >cheers,
> >> >jamal
Jiri Pirko Nov. 22, 2023, 6:31 p.m. UTC | #19
Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >>
>> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> >
>> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >
>> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >functionality though.  off top of my head, some sample space:
>> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >encoded in the tc action codes)
>> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >internal access of)
>> >> >> >
>> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >>
>> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> Example:
>> >> >>
>> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >>     action send_to_port param port eno1
>> >> >>
>> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> that is the case, what's traditional here?
>> >> >>
>> >> >
>> >> >
>> >> >What is not traditional about it?
>> >>
>> >> Okay, so in that case, the following example communitating with
>> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >>      action send_to_port param port eno1
>> >
>> >Huh? Thats just an application - classical tc which part of iproute2
>> >that is sending to the kernel, no different than "tc flower.."
>> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >daemon but it will use the same APIs as tc.
>>
>> Okay, so which part is the "tradition"?
>>
>
>Provides tooling via tc cli that _everyone_ in the tc world is
>familiar with - which uses the same syntax as other tc extensions do,
>same expectations (eg events, request responses, familiar commands for
>dumping, flushing etc). Basically someone familiar with tc will pick
>this up and operate it very quickly and would have an easier time
>debugging it.
>There are caveats - as will be with all new classifiers - but those
>are within reason.

Okay, so syntax-familiarity-wise, what's the difference between the
following 2 approaches:
$ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
      action send_to_port param port eno1
$ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
      action send_to_port param port eno1
?


>
>> >>
>> >> >
>> >> >>
>> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >what our original u32/pedit code offered.
>> >> >> >
>> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> out earlier.
>> >> >> >>
>> >> >> >
>> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >netlink interfaces.
>> >> >>
>> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> for our hw, as we repeatedly claimed).
>> >> >>
>> >> >
>> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >upstream process (which has been cited to me as a good reason for
>> >> >DOCA) but please dont impose these values and your politics on other
>> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >into making the kernel interfaces the path forward. Your choice.
>> >>
>> >> No, you are missing the point. This has nothing to do with DOCA.
>> >
>> >Right Jiri ;->
>> >
>> >> This
>> >> has to do with the simple limitation of your offload assuming there are
>> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> good fit for everyone.
>> >
>> > a) it is not part of the P4 spec to dynamically make changes to the
>> >datapath pipeline after it is create and we are discussing a P4
>>
>> Isn't this up to the implementation? I mean from the p4 perspective,
>> everything is static. Hw might need to reshuffle the pipeline internally
>> during rule insertion/remove in order to optimize the layout.
>>
>
>But do note: the focus here is on P4 (hence the name P4TC).
>
>> >implementation not an extension that would add more value b) We are
>> >more than happy to add extensions in the future to accomodate for
>> >features but first _P4 spec_ must be met c) we had longer discussions
>> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >which you probably didnt attend and everything that needs to be done
>> >can be from user space today for all those optimizations.
>> >
>> >Conclusion is: For what you need to do (which i dont believe is a
>> >limitation in your hardware rather a design decision on your part) run
>> >your user space daemon, do optimizations and update the datapath.
>> >Everybody is happy.
>>
>> Should the userspace daemon listen on inserted rules to be offloade
>> over netlink?
>>
>
>I mean you could if you wanted to given this is just traditional
>netlink which emits events (with some filtering when we integrate the
>filter approach). But why?

Nevermind.


>
>> >
>> >>
>> >> >Nobody is stopping you from offering your customers proprietary
>> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >believe that a singular interface regardless of the vendor is the
>> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >by eBPF being a double edged sword is not good for the community.
>> >> >
>> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> (offload) is then void.
>> >> >>
>> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >
>> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >>
>> >> Again, this has 0 relation to DOCA.
>> >>
>> >>
>> >> >first person and kernel interfaces are good for everyone.
>> >>
>> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> plan to handle the offload by:
>> >> 1) abuse devlink to flash p4 binary
>> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >>    from p4tc ndo_setup_tc
>> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >>    from tc-flower ndo_setup_tc
>> >> is really something that is making me a little bit nauseous.
>> >>
>> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> sense to me to be honest.
>> >
>> >You mean if there's no plan to match your (NVIDIA?)  point of view.
>> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>>
>> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> opposed to from day 1.
>>
>>
>
>Oh well - it is in the kernel and it works fine tbh.
>
>> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >ndo is passed IDs for the objects and associated contents. For #3+#4
>>
>> During offload, you need to parse the blob in driver to be able to match
>> the ids with blob entities. That was presented by you/Intel in the past
>> IIRC.
>>
>
>You are correct - in case of offload the netlink IDs will have to be
>authenticated against what the hardware can accept, but the devlink
>flash use i believe was from you as a compromise.

Definitely not. I'm against devlink abuse for this from day 1.


>
>>
>> >tc flower thing has nothing to do with P4TC that was just some random
>> >proposal someone made seeing if they could ride on top of P4TC.
>>
>> Yeah, it's not yet merged and already mentally used for abuse. I love
>> that :)
>>
>> >
>> >Besides this nobody really has to satisfy your point of view - like i
>> >said earlier feel free to provide proprietary solutions. From a
>> >consumer perspective  I would not want to deal with 4 different
>> >vendors with 4 different proprietary approaches. The kernel is the
>> >unifying part. You seemed happier with tc flower just not with the
>>
>> Yeah, that is my point, why the unifying part can't be a userspace
>> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>>
>> I just don't see the kernel as a good fit for abstraction here,
>> given the fact that the vendor compilers does not run in kernel.
>> That is breaking your model.
>>
>
>Jiri - we want to support P4, first. Like you said the P4 pipeline,
>once installed is static.
>P4 doesnt allow dynamic update of the pipeline. For example, once you
>say "here are my 14 tables and their associated actions and here's how
>the pipeline main control (on how to iterate the tables etc) is going
>to be" and after you instantiate/activate that pipeline, you dont go
>back 5 minutes later and say "sorry, please introduce table 15, which
>i want you to walk to after you visit table 3 if metadata foo is 5" or
>"shoot, let's change that table 5 to be exact instead of LPM". It's
>not anywhere in the spec.
>That doesnt mean it is not useful thing to have - but it is an
>invention that has _nothing to do with the P4 spec_; so saying a P4
>implementation must support it is a bit out of scope and there are
>vendors with hardware who support P4 today that dont need any of this.

I'm not talking about the spec. I'm talking about the offload
implementation, the offload compiler, the offload runtime manager. You
don't have those in the kernel. That is the issue. The runtime manager is
the one to decide and reshuffle the hw internals. Again, this has
nothing to do with the p4 frontend. This is offload implementation.

And that is why I believe your p4 kernel implementation is unoffloadable.
And if it is unoffloadable, do we really need it? IDK.


>In my opinion that is a feature that could be added later out of
>necessity (there is some good niche value in being able to add some
>"dynamicism" to any pipeline) and influence the P4 standards on why it
>is needed.
>It should be doable today in a brute force way (this is just one
>suggestion that came to me when Rice University/Nvidia presented[1]);
>i am sure there are other approaches and the idea is by no means
>proven.
>
>1) User space Creates/compiles/Adds/activate your program that has 14
>tables at tc prio X chain Y
>2) a) 5 minutes later user space decides it wants to change and add
>table 3 after table 15, visited when metadata foo=5
>    b) your compiler in user space compiles a brand new program which
>satisfies #2a (how this program was authored is out of scope of
>discussion)
>    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>    d) user space delete tc prio X chain Y (and make sure your packets
>entry point is whatever #c is)

I never suggested anything like what you describe. I'm not sure why you
think so.


>
>cheers,
>jamal
>
>[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>
>>
>> >kernel process - which is ironically the same thing we are going
>> >through here ;->
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >
>> >> >cheers,
>> >> >jamal
John Fastabend Nov. 22, 2023, 6:50 p.m. UTC | #20
Jiri Pirko wrote:
> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >>
> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >>
> >> >> >> [...]
> >> >> >>
> >> >> >> >
> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >
> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >functionality though.  off top of my head, some sample space:
> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >encoded in the tc action codes)
> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >internal access of)
> >> >> >> >
> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >>
> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> Example:
> >> >> >>
> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >>     action send_to_port param port eno1
> >> >> >>
> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> that is the case, what's traditional here?
> >> >> >>
> >> >> >
> >> >> >
> >> >> >What is not traditional about it?
> >> >>
> >> >> Okay, so in that case, the following example communitating with
> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >>      action send_to_port param port eno1
> >> >
> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >that is sending to the kernel, no different than "tc flower.."
> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >daemon but it will use the same APIs as tc.
> >>
> >> Okay, so which part is the "tradition"?
> >>
> >
> >Provides tooling via tc cli that _everyone_ in the tc world is
> >familiar with - which uses the same syntax as other tc extensions do,
> >same expectations (eg events, request responses, familiar commands for
> >dumping, flushing etc). Basically someone familiar with tc will pick
> >this up and operate it very quickly and would have an easier time
> >debugging it.
> >There are caveats - as will be with all new classifiers - but those
> >are within reason.
> 
> Okay, so syntax familiarity wise, what's the difference between
> following 2 approaches:
> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>       action send_to_port param port eno1
> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>       action send_to_port param port eno1
> ?
> 
> 
> >
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >
> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> out earlier.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >netlink interfaces.
> >> >> >>
> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >>
> >> >> >
> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >>
> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >
> >> >Right Jiri ;->
> >> >
> >> >> This
> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> good fit for everyone.
> >> >
> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >datapath pipeline after it is create and we are discussing a P4
> >>
> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> during rule insertion/remove in order to optimize the layout.
> >>
> >
> >But do note: the focus here is on P4 (hence the name P4TC).
> >
> >> >implementation not an extension that would add more value b) We are
> >> >more than happy to add extensions in the future to accomodate for
> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >which you probably didnt attend and everything that needs to be done
> >> >can be from user space today for all those optimizations.
> >> >
> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >limitation in your hardware rather a design decision on your part) run
> >> >your user space daemon, do optimizations and update the datapath.
> >> >Everybody is happy.
> >>
> >> Should the userspace daemon listen on inserted rules to be offloade
> >> over netlink?
> >>
> >
> >I mean you could if you wanted to given this is just traditional
> >netlink which emits events (with some filtering when we integrate the
> >filter approach). But why?
> 
> Nevermind.
> 
> 
> >
> >> >
> >> >>
> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >
> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> (offload) is then void.
> >> >> >>
> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >
> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >>
> >> >> Again, this has 0 relation to DOCA.
> >> >>
> >> >>
> >> >> >first person and kernel interfaces are good for everyone.
> >> >>
> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> plan to handle the offload by:
> >> >> 1) abuse devlink to flash p4 binary
> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >>    from p4tc ndo_setup_tc
> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >>    from tc-flower ndo_setup_tc
> >> >> is really something that is making me a little bit nauseous.
> >> >>
> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> sense to me to be honest.
> >> >
> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >>
> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> opposed to from day 1.
> >>
> >>
> >
> >Oh well - it is in the kernel and it works fine tbh.
> >
> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >>
> >> During offload, you need to parse the blob in driver to be able to match
> >> the ids with blob entities. That was presented by you/Intel in the past
> >> IIRC.
> >>
> >
> >You are correct - in case of offload the netlink IDs will have to be
> >authenticated against what the hardware can accept, but the devlink
> >flash use i believe was from you as a compromise.
> 
> Definitelly not. I'm against devlink abuse for this from day 1.
> 
> 
> >
> >>
> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >proposal someone made seeing if they could ride on top of P4TC.
> >>
> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> that :)
> >>
> >> >
> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >said earlier feel free to provide proprietary solutions. From a
> >> >consumer perspective  I would not want to deal with 4 different
> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >unifying part. You seemed happier with tc flower just not with the
> >>
> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >>
> >> I just don't see the kernel as a good fit for abstraction here,
> >> given the fact that the vendor compilers does not run in kernel.
> >> That is breaking your model.
> >>
> >
> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >once installed is static.
> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >say "here are my 14 tables and their associated actions and here's how
> >the pipeline main control (on how to iterate the tables etc) is going
> >to be" and after you instantiate/activate that pipeline, you dont go
> >back 5 minutes later and say "sorry, please introduce table 15, which
> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >not anywhere in the spec.
> >That doesnt mean it is not useful thing to have - but it is an
> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >implementation must support it is a bit out of scope and there are
> >vendors with hardware who support P4 today that dont need any of this.
> 
> I'm not talking about the spec. I'm talking about the offload
> implemetation, the offload compiler the offload runtime manager. You
> don't have those in kernel. That is the issue. The runtime manager is
> the one to decide and reshuffle the hw internals. Again, this has
> nothing to do with p4 frontend. This is offload implementation.
> 
> And that is why I believe your p4 kernel implementation is unoffloadable.
> And if it is unoffloadable, do we really need it? IDK.

And my point is we already have a way to do a software P4 implementation
in BPF, so I don't see the need for yet another mechanism. And if
P4TC is not amenable to hardware offload, then when we need a P4
hardware offload, what then? We write another P4TC-HW?

My opinion is that if we push a DSL into the kernel (which I also sort of
think is not a great idea) then it should at least support offloads
from the start. I know P4 can be used for software, but really its
main value is hardware offload here. Even the name PNA, Portable NIC
Architecture, implies it's about NICs. If it were software focused,
portability wouldn't matter; a general purpose CPU can emulate
almost any architecture you would like.
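
For reference, a rough sketch of what that existing BPF path looks like
(the P4-to-C compile step is elided; prog.c, the section name and eth0
are just placeholders, and a clsact qdisc is assumed to be attached):

  # compile the generated/hand-written C to a BPF object and attach it
  clang -O2 -g -target bpf -c prog.c -o prog.o
  tc filter add dev eth0 ingress bpf direct-action obj prog.o sec p4_main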

> 
> 
> >In my opinion that is a feature that could be added later out of
> >necessity (there is some good niche value in being able to add some
> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >is needed.
> >It should be doable today in a brute force way (this is just one
> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >i am sure there are other approaches and the idea is by no means
> >proven.
> >
> >1) User space Creates/compiles/Adds/activate your program that has 14
> >tables at tc prio X chain Y
> >2) a) 5 minutes later user space decides it wants to change and add
> >table 3 after table 15, visited when metadata foo=5
> >    b) your compiler in user space compiles a brand new program which
> >satisfies #2a (how this program was authored is out of scope of
> >discussion)
> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >    d) user space delete tc prio X chain Y (and make sure your packets
> >entry point is whatever #c is)
> 
> I never suggested anything like what you describe. I'm not sure why you
> think so.
> 
> 
> >
> >cheers,
> >jamal
> >
> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >
> >>
> >> >kernel process - which is ironically the same thing we are going
> >> >through here ;->
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >
> >> >> >cheers,
> >> >> >jamal
Jamal Hadi Salim Nov. 22, 2023, 7:35 p.m. UTC | #21
On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >>
> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >>
> >> >> >> [...]
> >> >> >>
> >> >> >> >
> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >
> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >functionality though.  off top of my head, some sample space:
> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >encoded in the tc action codes)
> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >internal access of)
> >> >> >> >
> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >>
> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> Example:
> >> >> >>
> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >>     action send_to_port param port eno1
> >> >> >>
> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> that is the case, what's traditional here?
> >> >> >>
> >> >> >
> >> >> >
> >> >> >What is not traditional about it?
> >> >>
> >> >> Okay, so in that case, the following example communitating with
> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >>      action send_to_port param port eno1
> >> >
> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >that is sending to the kernel, no different than "tc flower.."
> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >daemon but it will use the same APIs as tc.
> >>
> >> Okay, so which part is the "tradition"?
> >>
> >
> >Provides tooling via tc cli that _everyone_ in the tc world is
> >familiar with - which uses the same syntax as other tc extensions do,
> >same expectations (eg events, request responses, familiar commands for
> >dumping, flushing etc). Basically someone familiar with tc will pick
> >this up and operate it very quickly and would have an easier time
> >debugging it.
> >There are caveats - as will be with all new classifiers - but those
> >are within reason.
>
> Okay, so syntax familiarity wise, what's the difference between
> following 2 approaches:
> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>       action send_to_port param port eno1
> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>       action send_to_port param port eno1
> ?
>
>
> >
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >
> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> out earlier.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >netlink interfaces.
> >> >> >>
> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >>
> >> >> >
> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >>
> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >
> >> >Right Jiri ;->
> >> >
> >> >> This
> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> good fit for everyone.
> >> >
> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >datapath pipeline after it is create and we are discussing a P4
> >>
> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> during rule insertion/remove in order to optimize the layout.
> >>
> >
> >But do note: the focus here is on P4 (hence the name P4TC).
> >
> >> >implementation not an extension that would add more value b) We are
> >> >more than happy to add extensions in the future to accomodate for
> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >which you probably didnt attend and everything that needs to be done
> >> >can be from user space today for all those optimizations.
> >> >
> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >limitation in your hardware rather a design decision on your part) run
> >> >your user space daemon, do optimizations and update the datapath.
> >> >Everybody is happy.
> >>
> >> Should the userspace daemon listen on inserted rules to be offloade
> >> over netlink?
> >>
> >
> >I mean you could if you wanted to given this is just traditional
> >netlink which emits events (with some filtering when we integrate the
> >filter approach). But why?
>
> Nevermind.
>
>
> >
> >> >
> >> >>
> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >
> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> (offload) is then void.
> >> >> >>
> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >
> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >>
> >> >> Again, this has 0 relation to DOCA.
> >> >>
> >> >>
> >> >> >first person and kernel interfaces are good for everyone.
> >> >>
> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> plan to handle the offload by:
> >> >> 1) abuse devlink to flash p4 binary
> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >>    from p4tc ndo_setup_tc
> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >>    from tc-flower ndo_setup_tc
> >> >> is really something that is making me a little bit nauseous.
> >> >>
> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> sense to me to be honest.
> >> >
> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >>
> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> opposed to from day 1.
> >>
> >>
> >
> >Oh well - it is in the kernel and it works fine tbh.
> >
> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >>
> >> During offload, you need to parse the blob in driver to be able to match
> >> the ids with blob entities. That was presented by you/Intel in the past
> >> IIRC.
> >>
> >
> >You are correct - in case of offload the netlink IDs will have to be
> >authenticated against what the hardware can accept, but the devlink
> >flash use i believe was from you as a compromise.
>
> Definitelly not. I'm against devlink abuse for this from day 1.
>
>
> >
> >>
> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >proposal someone made seeing if they could ride on top of P4TC.
> >>
> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> that :)
> >>
> >> >
> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >said earlier feel free to provide proprietary solutions. From a
> >> >consumer perspective  I would not want to deal with 4 different
> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >unifying part. You seemed happier with tc flower just not with the
> >>
> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >>
> >> I just don't see the kernel as a good fit for abstraction here,
> >> given the fact that the vendor compilers does not run in kernel.
> >> That is breaking your model.
> >>
> >
> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >once installed is static.
> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >say "here are my 14 tables and their associated actions and here's how
> >the pipeline main control (on how to iterate the tables etc) is going
> >to be" and after you instantiate/activate that pipeline, you dont go
> >back 5 minutes later and say "sorry, please introduce table 15, which
> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >not anywhere in the spec.
> >That doesnt mean it is not useful thing to have - but it is an
> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >implementation must support it is a bit out of scope and there are
> >vendors with hardware who support P4 today that dont need any of this.
>
> I'm not talking about the spec. I'm talking about the offload
> implemetation, the offload compiler the offload runtime manager. You
> don't have those in kernel. That is the issue. The runtime manager is
> the one to decide and reshuffle the hw internals. Again, this has
> nothing to do with p4 frontend. This is offload implementation.
>
> And that is why I believe your p4 kernel implementation is unoffloadable.
> And if it is unoffloadable, do we really need it? IDK.
>

Say what?
It's not offloadable in your hardware, you mean? Because I have beside
me here an Intel E2000 which offloads just fine (and the AMD folks
seem fine too).
If your view is that all these runtime optimizations amount to a
compiler in the kernel/driver, that is, well, your view. In my
view (and others have said this to you already) the P4C compiler is
responsible for resource optimizations. The hardware supports P4: you
give it constraints and it knows what to do. At runtime, anything a
driver needs to do for resource optimization (re-sorting, reshuffling,
etc.) is not a P4 problem - sorry if you have issues in your
architecture approach.

> >In my opinion that is a feature that could be added later out of
> >necessity (there is some good niche value in being able to add some
> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >is needed.
> >It should be doable today in a brute force way (this is just one
> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >i am sure there are other approaches and the idea is by no means
> >proven.
> >
> >1) User space Creates/compiles/Adds/activate your program that has 14
> >tables at tc prio X chain Y
> >2) a) 5 minutes later user space decides it wants to change and add
> >table 3 after table 15, visited when metadata foo=5
> >    b) your compiler in user space compiles a brand new program which
> >satisfies #2a (how this program was authored is out of scope of
> >discussion)
> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >    d) user space delete tc prio X chain Y (and make sure your packets
> >entry point is whatever #c is)
>
> I never suggested anything like what you describe. I'm not sure why you
> think so.

It's the same class of problems - the paper I pointed to (co-authored
by Matty and others) has runtime resource optimizations which are
tantamount to changing the nature of the pipeline. We may need to
profile in the kernel, but all those optimizations can be derived in
user space using the approach I described.

cheers,
jamal


> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >
> >>
> >> >kernel process - which is ironically the same thing we are going
> >> >through here ;->
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >
> >> >> >cheers,
> >> >> >jamal
Jiri Pirko Nov. 23, 2023, 6:36 a.m. UTC | #22
Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >>
>> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >>
>> >> >> >> [...]
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >
>> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >> >functionality though.  off top of my head, some sample space:
>> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >encoded in the tc action codes)
>> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >internal access of)
>> >> >> >> >
>> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >>
>> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> Example:
>> >> >> >>
>> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >>     action send_to_port param port eno1
>> >> >> >>
>> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> that is the case, what's traditional here?
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >What is not traditional about it?
>> >> >>
>> >> >> Okay, so in that case, the following example communitating with
>> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >>      action send_to_port param port eno1
>> >> >
>> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >daemon but it will use the same APIs as tc.
>> >>
>> >> Okay, so which part is the "tradition"?
>> >>
>> >
>> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >familiar with - which uses the same syntax as other tc extensions do,
>> >same expectations (eg events, request responses, familiar commands for
>> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >this up and operate it very quickly and would have an easier time
>> >debugging it.
>> >There are caveats - as will be with all new classifiers - but those
>> >are within reason.
>>
>> Okay, so syntax familiarity wise, what's the difference between
>> following 2 approaches:
>> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>>       action send_to_port param port eno1
>> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>>       action send_to_port param port eno1
>> ?
>>
>>
>> >
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >
>> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> out earlier.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >netlink interfaces.
>> >> >> >>
>> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >>
>> >> >> >
>> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >>
>> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >
>> >> >Right Jiri ;->
>> >> >
>> >> >> This
>> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> good fit for everyone.
>> >> >
>> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >datapath pipeline after it is create and we are discussing a P4
>> >>
>> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> during rule insertion/remove in order to optimize the layout.
>> >>
>> >
>> >But do note: the focus here is on P4 (hence the name P4TC).
>> >
>> >> >implementation not an extension that would add more value b) We are
>> >> >more than happy to add extensions in the future to accomodate for
>> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >which you probably didnt attend and everything that needs to be done
>> >> >can be from user space today for all those optimizations.
>> >> >
>> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >limitation in your hardware rather a design decision on your part) run
>> >> >your user space daemon, do optimizations and update the datapath.
>> >> >Everybody is happy.
>> >>
>> >> Should the userspace daemon listen on inserted rules to be offloade
>> >> over netlink?
>> >>
>> >
>> >I mean you could if you wanted to given this is just traditional
>> >netlink which emits events (with some filtering when we integrate the
>> >filter approach). But why?
>>
>> Nevermind.
>>
>>
>> >
>> >> >
>> >> >>
>> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >
>> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> (offload) is then void.
>> >> >> >>
>> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >
>> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >>
>> >> >> Again, this has 0 relation to DOCA.
>> >> >>
>> >> >>
>> >> >> >first person and kernel interfaces are good for everyone.
>> >> >>
>> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> plan to handle the offload by:
>> >> >> 1) abuse devlink to flash p4 binary
>> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >>    from p4tc ndo_setup_tc
>> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >>    from tc-flower ndo_setup_tc
>> >> >> is really something that is making me a little bit nauseous.
>> >> >>
>> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> sense to me to be honest.
>> >> >
>> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
>> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >>
>> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> opposed to from day 1.
>> >>
>> >>
>> >
>> >Oh well - it is in the kernel and it works fine tbh.
>> >
>> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >>
>> >> During offload, you need to parse the blob in driver to be able to match
>> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> IIRC.
>> >>
>> >
>> >You are correct - in case of offload the netlink IDs will have to be
>> >authenticated against what the hardware can accept, but the devlink
>> >flash use i believe was from you as a compromise.
>>
>> Definitely not. I'm against devlink abuse for this from day 1.
>>
>>
>> >
>> >>
>> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >>
>> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> that :)
>> >>
>> >> >
>> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >consumer perspective  I would not want to deal with 4 different
>> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >unifying part. You seemed happier with tc flower just not with the
>> >>
>> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >>
>> >> I just don't see the kernel as a good fit for abstraction here,
>> >> given the fact that the vendor compilers do not run in the kernel.
>> >> That is breaking your model.
>> >>
>> >
>> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >once installed is static.
>> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >say "here are my 14 tables and their associated actions and here's how
>> >the pipeline main control (on how to iterate the tables etc) is going
>> >to be" and after you instantiate/activate that pipeline, you dont go
>> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >not anywhere in the spec.
>> >That doesnt mean it is not useful thing to have - but it is an
>> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >implementation must support it is a bit out of scope and there are
>> >vendors with hardware who support P4 today that dont need any of this.
>>
>> I'm not talking about the spec. I'm talking about the offload
>> implementation, the offload compiler, the offload runtime manager. You
>> don't have those in kernel. That is the issue. The runtime manager is
>> the one to decide and reshuffle the hw internals. Again, this has
>> nothing to do with p4 frontend. This is offload implementation.
>>
>> And that is why I believe your p4 kernel implementation is unoffloadable.
>> And if it is unoffloadable, do we really need it? IDK.
>>
>
>Say what?
>It's not offloadable in your hardware, you mean? Because i have beside
>me here an intel e2000 which offloads just fine (and the AMD folks
>seem fine too).

Will Intel and AMD have a compiler in the kernel, so that no blob transfer
and parsing of it in the kernel would be needed? No.


>If your view is that all these runtime optimizations amount to a
>compiler in the kernel/driver that is your, well, your view. In my
>view (and others have said this to you already) the P4C compiler is
>responsible for resource optimizations. The hardware supports P4, you
>give it constraints and it knows what to do. At runtime, anything a
>driver needs to do for resource optimization (resorting, reshuffling
>etc), that is not a P4 problem - sorry if you have issues in your
>architecture approach.

Sure, it is the offload implementation problem. And for them, you need
to use userspace components. And that is the problem. This discussion
leads nowhere; I don't know how differently I should describe this.


>
>> >In my opinion that is a feature that could be added later out of
>> >necessity (there is some good niche value in being able to add some
>> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >is needed.
>> >It should be doable today in a brute force way (this is just one
>> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >i am sure there are other approaches and the idea is by no means
>> >proven.
>> >
>> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >tables at tc prio X chain Y
>> >2) a) 5 minutes later user space decides it wants to change and add
>> >table 3 after table 15, visited when metadata foo=5
>> >    b) your compiler in user space compiles a brand new program which
>> >satisfies #2a (how this program was authored is out of scope of
>> >discussion)
>> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >    d) user space delete tc prio X chain Y (and make sure your packets
>> >entry point is whatever #c is)
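
A minimal sketch of the brute-force swap described in steps 1-2 above
(illustrative only; the p4 filter attach syntax here is a stand-in for
whatever the final P4TC commands look like - the point is only the
prio X -> prio X+1 -> delete sequence):

  # 1) original 14-table program live at prio 10 (X), chain 0 (Y)
  tc filter add dev eth0 ingress prio 10 chain 0 p4 pname prog14 ...

  # 2b/2c) load the recompiled program alongside it at prio 11 (X+1)
  tc filter add dev eth0 ingress prio 11 chain 0 p4 pname prog15 ...

  # 2d) retire the old pipeline; traffic now only reaches the new one
  tc filter del dev eth0 ingress prio 10 chain 0
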
>>
>> I never suggested anything like what you describe. I'm not sure why you
>> think so.
>
>It's the same class of problems - the paper i pointed to (coauthored
>by Matty and others) has runtime resource optimizations which are
>tantamount to changing the nature of the pipeline. We may need to
>profile in the kernel but all those optimizations can be derived in
>user space using the approach I described.
>
>cheers,
>jamal
>
>
>> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >
>> >>
>> >> >kernel process - which is ironically the same thing we are going
>> >> >through here ;->
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >>
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
Jamal Hadi Salim Nov. 23, 2023, 1:22 p.m. UTC | #23
On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >>
> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >>
> >> >> >> >> [...]
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >
> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
> >> >> >> >> >functionality though.  off top of my head, some sample space:
> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >internal access of)
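
A rough sketch of that provisioning vs. control split (the block and qdisc
commands are standard tc; the p4 filter line is only indicative, and the
program/object names are made up):

  # provisioning: group two ports into one block and instantiate the pipeline
  tc qdisc add dev eno1 ingress_block 22 clsact
  tc qdisc add dev eno2 ingress_block 22 clsact
  tc filter add block 22 prio 10 p4 pname myprog action bpf obj myprog.o ...

  # control, at runtime, purely over netlink:
  tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
     action send_to_port param port eno1
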
> >> >> >> >> >
> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >>
> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> Example:
> >> >> >> >>
> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >>     action send_to_port param port eno1
> >> >> >> >>
> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >What is not traditional about it?
> >> >> >>
> >> >> >> Okay, so in that case, the following example communicating with a
> >> >> >> userspace daemon using an imaginary "p4ctrl" app is equally traditional:
> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >>      action send_to_port param port eno1
> >> >> >
> >> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >daemon but it will use the same APIs as tc.
> >> >>
> >> >> Okay, so which part is the "tradition"?
> >> >>
> >> >
> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >same expectations (eg events, request responses, familiar commands for
> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >this up and operate it very quickly and would have an easier time
> >> >debugging it.
> >> >There are caveats - as will be with all new classifiers - but those
> >> >are within reason.
> >>
> >> Okay, so syntax familiarity wise, what's the difference between
> >> following 2 approaches:
> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >>       action send_to_port param port eno1
> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >>       action send_to_port param port eno1
> >> ?
> >>
> >>
> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >
> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> out earlier.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >netlink interfaces.
> >> >> >> >>
> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >>
> >> >> >> >
> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >>
> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >
> >> >> >Right Jiri ;->
> >> >> >
> >> >> >> This
> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> good fit for everyone.
> >> >> >
> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >datapath pipeline after it is created and we are discussing a P4
> >> >>
> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> during rule insertion/remove in order to optimize the layout.
> >> >>
> >> >
> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >
> >> >> >implementation not an extension that would add more value b) We are
> >> >> >more than happy to add extensions in the future to accommodate for
> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >can be from user space today for all those optimizations.
> >> >> >
> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >Everybody is happy.
> >> >>
> >> >> Should the userspace daemon listen for inserted rules to be offloaded
> >> >> over netlink?
> >> >>
> >> >
> >> >I mean you could if you wanted to given this is just traditional
> >> >netlink which emits events (with some filtering when we integrate the
> >> >filter approach). But why?
> >>
> >> Nevermind.
> >>
> >>
> >> >
> >> >> >
> >> >> >>
> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >
> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> >> (offload) is then void.
> >> >> >> >>
> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
> >> >> >> >
> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >> >>
> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >>
> >> >> >>
> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >>
> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> >> plan to handle the offload by:
> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >>    from p4tc ndo_setup_tc
> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >>    from tc-flower ndo_setup_tc
> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >>
> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> sense to me to be honest.
> >> >> >
> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >>
> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> opposed to from day 1.
> >> >>
> >> >>
> >> >
> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >
> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >>
> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> IIRC.
> >> >>
> >> >
> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >authenticated against what the hardware can accept, but the devlink
> >> >flash use i believe was from you as a compromise.
> >>
> >> Definitely not. I'm against devlink abuse for this from day 1.
> >>
> >>
> >> >
> >> >>
> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >>
> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> that :)
> >> >>
> >> >> >
> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >consumer perspective  I would not want to deal with 4 different
> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >>
> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >>
> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> given the fact that the vendor compilers do not run in the kernel.
> >> >> That is breaking your model.
> >> >>
> >> >
> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >once installed is static.
> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >say "here are my 14 tables and their associated actions and here's how
> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >not anywhere in the spec.
> >> >That doesnt mean it is not useful thing to have - but it is an
> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >implementation must support it is a bit out of scope and there are
> >> >vendors with hardware who support P4 today that dont need any of this.
> >>
> >> I'm not talking about the spec. I'm talking about the offload
> >> implementation, the offload compiler, the offload runtime manager. You
> >> don't have those in kernel. That is the issue. The runtime manager is
> >> the one to decide and reshuffle the hw internals. Again, this has
> >> nothing to do with p4 frontend. This is offload implementation.
> >>
> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> And if it is unoffloadable, do we really need it? IDK.
> >>
> >
> >Say what?
> >It's not offloadable in your hardware, you mean? Because i have beside
> >me here an intel e2000 which offloads just fine (and the AMD folks
> >seem fine too).
>
> Will Intel and AMD have a compiler in the kernel, so that no blob transfer
> and parsing of it in the kernel would be needed? No.

By that definition anything that parses anything is a compiler.

>
> >If your view is that all these runtime optimizations amount to a
> >compiler in the kernel/driver that is your, well, your view. In my
> >view (and others have said this to you already) the P4C compiler is
> >responsible for resource optimizations. The hardware supports P4, you
> >give it constraints and it knows what to do. At runtime, anything a
> >driver needs to do for resource optimization (resorting, reshuffling
> >etc), that is not a P4 problem - sorry if you have issues in your
> >architecture approach.
>
> Sure, it is the offload implementation problem. And for them, you need
> to use userspace components. And that is the problem. This discussion
> leads nowhere, I don't know how differently I should describe this.

Jiri - that's your view based on whatever design you have in your
mind. This has nothing to do with P4.
So let me repeat again:
1) A vendor's backend for P4 when it compiles ensures that resource
constraints are taken care of.
2) The same program can run in s/w.
3) It makes *ZERO* sense to mix vendor specific constraint
optimization(what you described as resorting, reshuffling etc) as part
of P4TC or P4. Absolutely nothing to do with either. Write a
background task, specific to you,  if you feel you need to move things
around at runtime.

We agree on one thing at least: This discussion is going nowhere.

cheers,
jamal



> >
> >> >In my opinion that is a feature that could be added later out of
> >> >necessity (there is some good niche value in being able to add some
> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >is needed.
> >> >It should be doable today in a brute force way (this is just one
> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >i am sure there are other approaches and the idea is by no means
> >> >proven.
> >> >
> >> >1) User space Creates/compiles/Adds/activate your program that has 14
> >> >tables at tc prio X chain Y
> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >table 3 after table 15, visited when metadata foo=5
> >> >    b) your compiler in user space compiles a brand new program which
> >> >satisfies #2a (how this program was authored is out of scope of
> >> >discussion)
> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >    d) user space delete tc prio X chain Y (and make sure your packets
> >> >entry point is whatever #c is)
> >>
> >> I never suggested anything like what you describe. I'm not sure why you
> >> think so.
> >
> >It's the same class of problems - the paper i pointed to (coauthored
> >by Matty and others) has runtime resource optimizations which are
> >tantamount to changing the nature of the pipeline. We may need to
> >profile in the kernel but all those optimizations can be derived in
> >user space using the approach I described.
> >
> >cheers,
> >jamal
> >
> >
> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >
> >> >>
> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >through here ;->
> >> >> >
> >> >> >cheers,
> >> >> >jamal
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >cheers,
> >> >> >> >jamal
Jiri Pirko Nov. 23, 2023, 1:34 p.m. UTC | #24
Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >>
>> >> >> >> >> [...]
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >
>> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >functionality though.  off top of my head, some sample space:
>> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >internal access of)
>> >> >> >> >> >
>> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >>
>> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> Example:
>> >> >> >> >>
>> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >>     action send_to_port param port eno1
>> >> >> >> >>
>> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >What is not traditional about it?
>> >> >> >>
>> >> >> >> Okay, so in that case, the following example communicating with a
>> >> >> >> userspace daemon using an imaginary "p4ctrl" app is equally traditional:
>> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >>      action send_to_port param port eno1
>> >> >> >
>> >> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >daemon but it will use the same APIs as tc.
>> >> >>
>> >> >> Okay, so which part is the "tradition"?
>> >> >>
>> >> >
>> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >same expectations (eg events, request responses, familiar commands for
>> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >this up and operate it very quickly and would have an easier time
>> >> >debugging it.
>> >> >There are caveats - as will be with all new classifiers - but those
>> >> >are within reason.
>> >>
>> >> Okay, so syntax familiarity wise, what's the difference between
>> >> following 2 approaches:
>> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >>       action send_to_port param port eno1
>> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >>       action send_to_port param port eno1
>> >> ?
>> >>
>> >>
>> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >
>> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> out earlier.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >netlink interfaces.
>> >> >> >> >>
>> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
>> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
>> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >>
>> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >
>> >> >> >Right Jiri ;->
>> >> >> >
>> >> >> >> This
>> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> good fit for everyone.
>> >> >> >
>> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >datapath pipeline after it is created and we are discussing a P4
>> >> >>
>> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >>
>> >> >
>> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >
>> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >more than happy to add extensions in the future to accommodate for
>> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >can be from user space today for all those optimizations.
>> >> >> >
>> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >Everybody is happy.
>> >> >>
>> >> >> Should the userspace daemon listen for inserted rules to be offloaded
>> >> >> over netlink?
>> >> >>
>> >> >
>> >> >I mean you could if you wanted to given this is just traditional
>> >> >netlink which emits events (with some filtering when we integrate the
>> >> >filter approach). But why?
>> >>
>> >> Nevermind.
>> >>
>> >>
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >
>> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
>> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> >> (offload) is then void.
>> >> >> >> >>
>> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
>> >> >> >> >
>> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >> >>
>> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >>
>> >> >> >>
>> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >>
>> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> plan to handle the offload by:
>> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >>    from p4tc ndo_setup_tc
>> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >>    from tc-flower ndo_setup_tc
>> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >>
>> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> sense to me to be honest.
>> >> >> >
>> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
>> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >>
>> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> opposed to from day 1.
>> >> >>
>> >> >>
>> >> >
>> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >
>> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >>
>> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> IIRC.
>> >> >>
>> >> >
>> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >authenticated against what the hardware can accept, but the devlink
>> >> >flash use i believe was from you as a compromise.
>> >>
>> >> Definitely not. I'm against devlink abuse for this from day 1.
>> >>
>> >>
>> >> >
>> >> >>
>> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >>
>> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> that :)
>> >> >>
>> >> >> >
>> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >consumer perspective  I would not want to deal with 4 different
>> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >>
>> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >>
>> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> given the fact that the vendor compilers do not run in the kernel.
>> >> >> That is breaking your model.
>> >> >>
>> >> >
>> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >once installed is static.
>> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >not anywhere in the spec.
>> >> >That doesnt mean it is not useful thing to have - but it is an
>> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >implementation must support it is a bit out of scope and there are
>> >> >vendors with hardware who support P4 today that dont need any of this.
>> >>
>> >> I'm not talking about the spec. I'm talking about the offload
>> >> implementation, the offload compiler, the offload runtime manager. You
>> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> nothing to do with p4 frontend. This is offload implementation.
>> >>
>> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> And if it is unoffloadable, do we really need it? IDK.
>> >>
>> >
>> >Say what?
>> >It's not offloadable in your hardware, you mean? Because i have beside
>> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >seem fine too).
>>
>> Will Intel and AMD have a compiler in the kernel, so that no blob transfer
>> and parsing of it in the kernel would be needed? No.
>
>By that definition anything that parses anything is a compiler.
>
>>
>> >If your view is that all these runtime optimizations amount to a
>> >compiler in the kernel/driver that is your, well, your view. In my
>> >view (and others have said this to you already) the P4C compiler is
>> >responsible for resource optimizations. The hardware supports P4, you
>> >give it constraints and it knows what to do. At runtime, anything a
>> >driver needs to do for resource optimization (resorting, reshuffling
>> >etc), that is not a P4 problem - sorry if you have issues in your
>> >architecture approach.
>>
>> Sure, it is the offload implementation problem. And for them, you need
>> to use userspace components. And that is the problem. This discussion
>> leads nowhere, I don't know how differently I should describe this.
>
>Jiri - that's your view based on whatever design you have in your
>mind. This has nothing to do with P4.
>So let me repeat again:
>1) A vendor's backend for P4 when it compiles ensures that resource
>constraints are taken care of.
>2) The same program can run in s/w.
>3) It makes *ZERO* sense to mix vendor specific constraint
>optimization(what you described as resorting, reshuffling etc) as part
>of P4TC or P4. Absolutely nothing to do with either. Write a

I never suggested it to be part of P4TC or P4. I don't know why you
think so.


>background task, specific to you,  if you feel you need to move things
>around at runtime.

Yeah, that background task is in userspace.


>
>We agree on one thing at least: This discussion is going nowhere.

Correct.


>
>cheers,
>jamal
>
>
>
>> >
>> >> >In my opinion that is a feature that could be added later out of
>> >> >necessity (there is some good niche value in being able to add some
>> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >is needed.
>> >> >It should be doable today in a brute force way (this is just one
>> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >i am sure there are other approaches and the idea is by no means
>> >> >proven.
>> >> >
>> >> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >> >tables at tc prio X chain Y
>> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >table 3 after table 15, visited when metadata foo=5
>> >> >    b) your compiler in user space compiles a brand new program which
>> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >discussion)
>> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >    d) user space delete tc prio X chain Y (and make sure your packets
>> >> >entry point is whatever #c is)
>> >>
>> >> I never suggested anything like what you describe. I'm not sure why you
>> >> think so.
>> >
>> >It's the same class of problems - the paper i pointed to (coauthored
>> >by Matty and others) has runtime resource optimizations which are
>> >tantamount to changing the nature of the pipeline. We may need to
>> >profile in the kernel but all those optimizations can be derived in
>> >user space using the approach I described.
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >
>> >> >>
>> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >through here ;->
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
Jamal Hadi Salim Nov. 23, 2023, 1:45 p.m. UTC | #25
On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >>
> >> >> >> >> >> [...]
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >> >
> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
> >> >> >> >> >> >functionality though.  off top of my head, some sample space:
> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >> >internal access of)
> >> >> >> >> >> >
> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >> >>
> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> >> Example:
> >> >> >> >> >>
> >> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >>     action send_to_port param port eno1
> >> >> >> >> >>
> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >What is not traditional about it?
> >> >> >> >>
> >> >> >> >> Okay, so in that case, the following example communicating with a
> >> >> >> >> userspace daemon using an imaginary "p4ctrl" app is equally traditional:
> >> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >>      action send_to_port param port eno1
> >> >> >> >
> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >> >daemon but it will use the same APIs as tc.
> >> >> >>
> >> >> >> Okay, so which part is the "tradition"?
> >> >> >>
> >> >> >
> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >> >same expectations (eg events, request responses, familiar commands for
> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >> >this up and operate it very quickly and would have an easier time
> >> >> >debugging it.
> >> >> >There are caveats - as will be with all new classifiers - but those
> >> >> >are within reason.
> >> >>
> >> >> Okay, so syntax familiarity wise, what's the difference between
> >> >> following 2 approaches:
> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >>       action send_to_port param port eno1
> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >>       action send_to_port param port eno1
> >> >> ?
> >> >>
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >> >
> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> >> out earlier.
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >> >netlink interfaces.
> >> >> >> >> >>
> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >> >>
> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >> >
> >> >> >> >Right Jiri ;->
> >> >> >> >
> >> >> >> >> This
> >> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> >> good fit for everyone.
> >> >> >> >
> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >> >datapath pipeline after it is created and we are discussing a P4
> >> >> >>
> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> >> during rule insertion/remove in order to optimize the layout.
> >> >> >>
> >> >> >
> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >> >
> >> >> >> >implementation not an extension that would add more value b) We are
> >> >> >> >more than happy to add extensions in the future to accommodate for
> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >> >can be from user space today for all those optimizations.
> >> >> >> >
> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >> >Everybody is happy.
> >> >> >>
> >> >> >> Should the userspace daemon listen for inserted rules to be offloaded
> >> >> >> over netlink?
> >> >> >>
> >> >> >
> >> >> >I mean you could if you wanted to given this is just traditional
> >> >> >netlink which emits events (with some filtering when we integrate the
> >> >> >filter approach). But why?
> >> >>
> >> >> Nevermind.
> >> >>
> >> >>
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >> >
> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> >> >> (offload) is then void.
> >> >> >> >> >>
> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
> >> >> >> >> >
> >> >> >> >> >You mean more fitting to the DOCA world? No, because I am a kernel
> >> >> >> >>
> >> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >> >>
> >> >> >> >> Yeah, not really. The kernel is not always the right answer. Your/Intel
> >> >> >> >> plan to handle the offload by:
> >> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >>    from p4tc ndo_setup_tc
> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >>    from tc-flower ndo_setup_tc
> >> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >> >>
> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> >> sense to me to be honest.
> >> >> >> >
> >> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >> >>
> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> >> opposed to from day 1.
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >> >
> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >> >>
> >> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> >> IIRC.
> >> >> >>
> >> >> >
> >> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >> >authenticated against what the hardware can accept, but the devlink
> >> >> >flash use i believe was from you as a compromise.
> >> >>
> >> >> Definitely not. I'm against devlink abuse for this from day 1.
> >> >>
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >> >>
> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> >> that :)
> >> >> >>
> >> >> >> >
> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >> >consumer perspective  I would not want to deal with 4 different
> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >> >>
> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >> >>
> >> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> >> given the fact that the vendor compilers do not run in the kernel.
> >> >> >> That is breaking your model.
> >> >> >>
> >> >> >
> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >> >once installed is static.
> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >> >say "here are my 14 tables and their associated actions and here's how
> >> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >> >not anywhere in the spec.
> >> >> >> >That doesnt mean it is not a useful thing to have - but it is an
> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >> >implementation must support it is a bit out of scope and there are
> >> >> >vendors with hardware who support P4 today that dont need any of this.
> >> >>
> >> >> I'm not talking about the spec. I'm talking about the offload
> >> >> implementation, the offload compiler, the offload runtime manager. You
> >> >> don't have those in kernel. That is the issue. The runtime manager is
> >> >> the one to decide and reshuffle the hw internals. Again, this has
> >> >> nothing to do with p4 frontend. This is offload implementation.
> >> >>
> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> >> And if it is unoffloadable, do we really need it? IDK.
> >> >>
> >> >
> >> >Say what?
> >> >It's not offloadable in your hardware, you mean? Because i have beside
> >> >me here an intel e2000 which offloads just fine (and the AMD folks
> >> >seem fine too).
> >>
> >> Will Intel and AMD have a compiler in the kernel, so that no blob transfer and
> >> parsing of it in the kernel would be needed? No.
> >
> >By that definition anything that parses anything is a compiler.
> >
> >>
> >> >If your view is that all these runtime optimizations amount to a
> >> >compiler in the kernel/driver that is your, well, your view. In my
> >> >view (and others have said this to you already) the P4C compiler is
> >> >responsible for resource optimizations. The hardware supports P4, you
> >> >give it constraints and it knows what to do. At runtime, anything a
> >> >driver needs to do for resource optimization (resorting, reshuffling
> >> >etc), that is not a P4 problem - sorry if you have issues in your
> >> >architecture approach.
> >>
> >> Sure, it is the offload implementation problem. And for them, you need
> >> to use userspace components. And that is the problem. This discussion
> >> leads nowhere, I don't know how differently I should describe this.
> >
> >Jiri - that's your view based on whatever design you have in your
> >mind. This has nothing to do with P4.
> >So let me repeat again:
> >1) A vendor's backend for P4 when it compiles ensures that resource
> >constraints are taken care of.
> >2) The same program can run in s/w.
> >3) It makes *ZERO* sense to mix vendor specific constraint
> >optimization(what you described as resorting, reshuffling etc) as part
> >of P4TC or P4. Absolutely nothing to do with either. Write a
>
> I never suggested for it to be part of P4TC or P4. I don't know why you
> think so.

I guess because this discussion is about P4/P4TC? I may have misread
what you are saying then because I saw the  "P4TC must be in
userspace" mantra tied to this specific optimization requirement.

>
> >background task, specific to you,  if you feel you need to move things
> >around at runtime.
>
> Yeah, that background task is in userspace.
>

I don't have a horse in this race.

cheers,
jamal

>
> >
> >We agree on one thing at least: This discussion is going nowhere.
>
> Correct.
>
> >
> >cheers,
> >jamal
> >
> >
> >
> >> >
> >> >> >In my opinion that is a feature that could be added later out of
> >> >> >necessity (there is some good niche value in being able to add some
> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >> >is needed.
> >> >> >It should be doable today in a brute force way (this is just one
> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >> >i am sure there are other approaches and the idea is by no means
> >> >> >proven.
> >> >> >
> >> >> >1) User space creates/compiles/adds/activates your program that has 14
> >> >> >tables at tc prio X chain Y
> >> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >> >table 3 after table 15, visited when metadata foo=5
> >> >> >    b) your compiler in user space compiles a brand new program which
> >> >> >satisfies #2a (how this program was authored is out of scope of
> >> >> >discussion)
> >> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >> >    d) user space deletes tc prio X chain Y (and make sure your packets'
> >> >> >entry point is whatever #c is)
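
A rough sketch of that prio/chain swap in today's tc grammar, with a plain
cls_bpf load standing in for the actual P4TC program load (eth0, prog_v1.o,
prog_v2.o and the "main" section name are placeholders, not part of the
proposal):

  # step 1: instantiate the original program at prio X=10, chain Y=0
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress prio 10 chain 0 bpf obj prog_v1.o sec main direct-action
  # step 2c: load the recompiled program alongside it at prio X+1=11
  tc filter add dev eth0 ingress prio 11 chain 0 bpf obj prog_v2.o sec main direct-action
  # step 2d: delete the old program so packets only hit the new pipeline
  tc filter del dev eth0 ingress prio 10 chain 0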
> >> >>
> >> >> I never suggested anything like what you describe. I'm not sure why you
> >> >> think so.
> >> >
> >> >It's the same class of problems - the paper i pointed to (coauthored
> >> >by Matty and others) has runtime resource optimizations which are
> >> >tantamount to changing the nature of the pipeline. We may need to
> >> >profile in the kernel but all those optimizations can be derived in
> >> >user space using the approach I described.
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >
> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >> >
> >> >> >>
> >> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >> >through here ;->
> >> >> >> >
> >> >> >> >cheers,
> >> >> >> >jamal
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >cheers,
> >> >> >> >> >jamal
Jiri Pirko Nov. 23, 2023, 2:07 p.m. UTC | #26
Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> [...]
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >
>> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >> >functionality though. Off the top of my head, some sample space:
>> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >
>> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >is a must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >>
>> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> Example:
>> >> >> >> >> >>
>> >> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >>     action send_to_port param port eno1
>> >> >> >> >> >>
>> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >What is not traditional about it?
>> >> >> >> >>
>> >> >> >> >> Okay, so in that case, the following example, communicating with a
>> >> >> >> >> userspace daemon using an imaginary "p4ctrl" app, is equally traditional:
>> >> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >>      action send_to_port param port eno1
>> >> >> >> >
>> >> >> >> >Huh? That's just an application - classical tc, part of iproute2,
>> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >>
>> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >>
>> >> >> >
>> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >debugging it.
>> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >are within reason.
>> >> >>
>> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> following 2 approaches:
>> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >>       action send_to_port param port eno1
>> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >>       action send_to_port param port eno1
>> >> >> ?
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> >I don't quite follow why most of this could not be implemented entirely in
>> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >>
>> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
>> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
>> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >> >>
>> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >
>> >> >> >> >Right Jiri ;->
>> >> >> >> >
>> >> >> >> >> This
>> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> >> good fit for everyone.
>> >> >> >> >
>> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >datapath pipeline after it is created and we are discussing a P4
>> >> >> >>
>> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >> >>
>> >> >> >
>> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >
>> >> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >> >more than happy to add extensions in the future to accommodate for
>> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >> >can be from user space today for all those optimizations.
>> >> >> >> >
>> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >> >Everybody is happy.
>> >> >> >>
>> >> >> >> Should the userspace daemon listen on inserted rules to be offloaded
>> >> >> >> over netlink?
>> >> >> >>
>> >> >> >
>> >> >> >I mean you could if you wanted to given this is just traditional
>> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >filter approach). But why?
>> >> >>
>> >> >> Nevermind.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >> >
>> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
>> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> >> and the main reason (I believe) why you need to have this in TC
>> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >>
>> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
>> >> >> >> >> >
>> >> >> >> >> >You mean more fitting to the DOCA world? No, because I am a kernel
>> >> >> >> >>
>> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >> >>
>> >> >> >> >> Yeah, not really. The kernel is not always the right answer. Your/Intel
>> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >>    from p4tc ndo_setup_tc
>> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >>    from tc-flower ndo_setup_tc
>> >> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >> >>
>> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> sense to me to be honest.
>> >> >> >> >
>> >> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
>> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >> >>
>> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> opposed to from day 1.
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >
>> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >>
>> >> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> >> IIRC.
>> >> >> >>
>> >> >> >
>> >> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >flash use i believe was from you as a compromise.
>> >> >>
>> >> >> Definitely not. I'm against devlink abuse for this from day 1.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >> >>
>> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> >> that :)
>> >> >> >>
>> >> >> >> >
>> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >> >consumer perspective  I would not want to deal with 4 different
>> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >> >>
>> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >>
>> >> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> >> given the fact that the vendor compilers do not run in the kernel.
>> >> >> >> That is breaking your model.
>> >> >> >>
>> >> >> >
>> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >> >once installed is static.
>> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >not anywhere in the spec.
>> >> >> >> >That doesnt mean it is not a useful thing to have - but it is an
>> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >implementation must support it is a bit out of scope and there are
>> >> >> >vendors with hardware who support P4 today that dont need any of this.
>> >> >>
>> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> implementation, the offload compiler, the offload runtime manager. You
>> >> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> >> nothing to do with p4 frontend. This is offload implementation.
>> >> >>
>> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >>
>> >> >
>> >> >Say what?
>> >> >It's not offloadable in your hardware, you mean? Because i have beside
>> >> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >> >seem fine too).
>> >>
>> >> Will Intel and AMD have a compiler in the kernel, so that no blob transfer and
>> >> parsing of it in the kernel would be needed? No.
>> >
>> >By that definition anything that parses anything is a compiler.
>> >
>> >>
>> >> >If your view is that all these runtime optimizations amount to a
>> >> >compiler in the kernel/driver that is your, well, your view. In my
>> >> >view (and others have said this to you already) the P4C compiler is
>> >> >responsible for resource optimizations. The hardware supports P4, you
>> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >etc), that is not a P4 problem - sorry if you have issues in your
>> >> >architecture approach.
>> >>
>> >> Sure, it is the offload implementation problem. And for them, you need
>> >> to use userspace components. And that is the problem. This discussion
>> >> leads nowhere, I don't know how differently I should describe this.
>> >
>> >Jiri - that's your view based on whatever design you have in your
>> >mind. This has nothing to do with P4.
>> >So let me repeat again:
>> >1) A vendor's backend for P4 when it compiles ensures that resource
>> >constraints are taken care of.
>> >2) The same program can run in s/w.
>> >3) It makes *ZERO* sense to mix vendor specific constraint
>> >optimization(what you described as resorting, reshuffling etc) as part
>> >of P4TC or P4. Absolutely nothing to do with either. Write a
>>
>> I never suggested for it to be part of P4TC or P4. I don't know why you
>> think so.
>
>I guess because this discussion is about P4/P4TC? I may have misread
>what you are saying then because I saw the  "P4TC must be in
>userspace" mantra tied to this specific optimization requirement.

Yeah, and again, my point is, this is unoffloadable. Do we still
need it in kernel?


>
>>
>> >background task, specific to you,  if you feel you need to move things
>> >around at runtime.
>>
>> Yeah, that background task is in userspace.
>>
>
>I don't have a horse in this race.
>
>cheers,
>jamal
>
>>
>> >
>> >We agree on one thing at least: This discussion is going nowhere.
>>
>> Correct.
>>
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >
>> >> >
>> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >> >is needed.
>> >> >> >It should be doable today in a brute force way (this is just one
>> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >> >i am sure there are other approaches and the idea is by no means
>> >> >> >proven.
>> >> >> >
>> >> >> >1) User space creates/compiles/adds/activates your program that has 14
>> >> >> >tables at tc prio X chain Y
>> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >table 3 after table 15, visited when metadata foo=5
>> >> >> >    b) your compiler in user space compiles a brand new program which
>> >> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >> >discussion)
>> >> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> >    d) user space deletes tc prio X chain Y (and make sure your packets'
>> >> >> >entry point is whatever #c is)
>> >> >>
>> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> think so.
>> >> >
>> >> >It's the same class of problems - the paper i pointed to (coauthored
>> >> >by Matty and others) has runtime resource optimizations which are
>> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >profile in the kernel but all those optimizations can be derived in
>> >> >user space using the approach I described.
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >
>> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >
>> >> >> >>
>> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >through here ;->
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >cheers,
>> >> >> >> >> >jamal
Jamal Hadi Salim Nov. 23, 2023, 2:28 p.m. UTC | #27
On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> [...]
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
> >> >> >> >> >> >> >functionality though. Off the top of my head, some sample space:
> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >> >> >internal access of)
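
For reference, the existing tc mechanics behind those two bullets look
roughly like this; the block id, device names and object files are
placeholders, with a plain cls_bpf load standing in for the P4TC program:

  # one shared tc block mapping a group of ports
  tc qdisc add dev eth0 ingress_block 22 ingress
  tc qdisc add dev eth1 ingress_block 22 ingress
  # two pipelines attached to the same block at different priorities;
  # filters run in prio order and the tc action return codes decide
  # whether evaluation continues to the next one
  tc filter add block 22 protocol all prio 10 bpf obj pipeline_a.o sec main direct-action
  tc filter add block 22 protocol all prio 20 bpf obj pipeline_b.o sec main direct-action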
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >> >> >is a must for the rest of the masses per our traditions. Also i really
> >> >> >> >> >> >>
> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> >> >> Example:
> >> >> >> >> >> >>
> >> >> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> >>     action send_to_port param port eno1
> >> >> >> >> >> >>
> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >What is not traditional about it?
> >> >> >> >> >>
> >> >> >> >> >> Okay, so in that case, the following example, communicating with a
> >> >> >> >> >> userspace daemon using an imaginary "p4ctrl" app, is equally traditional:
> >> >> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >>      action send_to_port param port eno1
> >> >> >> >> >
> >> >> >> >> >Huh? That's just an application - classical tc, part of iproute2,
> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >> >> >daemon but it will use the same APIs as tc.
> >> >> >> >>
> >> >> >> >> Okay, so which part is the "tradition"?
> >> >> >> >>
> >> >> >> >
> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >> >> >same expectations (eg events, request responses, familiar commands for
> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >> >> >this up and operate it very quickly and would have an easier time
> >> >> >> >debugging it.
> >> >> >> >There are caveats - as will be with all new classifiers - but those
> >> >> >> >are within reason.
> >> >> >>
> >> >> >> Okay, so syntax familiarity wise, what's the difference between
> >> >> >> following 2 approaches:
> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >>       action send_to_port param port eno1
> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >>       action send_to_port param port eno1
> >> >> >> ?
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >I don't quite follow why most of this could not be implemented entirely in
> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> >> >> out earlier.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >> >> >netlink interfaces.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >> >> >>
> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >> >> >
> >> >> >> >> >Right Jiri ;->
> >> >> >> >> >
> >> >> >> >> >> This
> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> >> >> good fit for everyone.
> >> >> >> >> >
> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >> >> >datapath pipeline after it is created and we are discussing a P4
> >> >> >> >>
> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> >> >> during rule insertion/remove in order to optimize the layout.
> >> >> >> >>
> >> >> >> >
> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >> >> >
> >> >> >> >> >implementation not an extension that would add more value b) We are
> >> >> >> >> >more than happy to add extensions in the future to accommodate for
> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >> >> >can be from user space today for all those optimizations.
> >> >> >> >> >
> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >> >> >Everybody is happy.
> >> >> >> >>
> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloaded
> >> >> >> >> over netlink?
> >> >> >> >>
> >> >> >> >
> >> >> >> >I mean you could if you wanted to given this is just traditional
> >> >> >> >netlink which emits events (with some filtering when we integrate the
> >> >> >> >filter approach). But why?
> >> >> >>
> >> >> >> Nevermind.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >> >> >
> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> >> >> and the main reason (I believe) why you need to have this in TC
> >> >> >> >> >> >> (offload) is then void.
> >> >> >> >> >> >>
> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
> >> >> >> >> >> >
> >> >> >> >> >> >You mean more fitting to the DOCA world? No, because I am a kernel
> >> >> >> >> >>
> >> >> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >> >> >>
> >> >> >> >> >> Yeah, not really. The kernel is not always the right answer. Your/Intel
> >> >> >> >> >> plan to handle the offload by:
> >> >> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >>    from p4tc ndo_setup_tc
> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >>    from tc-flower ndo_setup_tc
> >> >> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >> >> >>
> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> >> >> sense to me to be honest.
> >> >> >> >> >
> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >> >> >>
> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> >> >> opposed to from day 1.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >> >> >
> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >> >> >>
> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> >> >> IIRC.
> >> >> >> >>
> >> >> >> >
> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >> >> >authenticated against what the hardware can accept, but the devlink
> >> >> >> >flash use i believe was from you as a compromise.
> >> >> >>
> >> >> >> Definitely not. I'm against devlink abuse for this from day 1.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >> >> >>
> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> >> >> that :)
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >> >> >consumer perspective  I would not want to deal with 4 different
> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >> >> >>
> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >> >> >>
> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> >> >> given the fact that the vendor compilers do not run in the kernel.
> >> >> >> >> That is breaking your model.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >> >> >once installed is static.
> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >> >> >say "here are my 14 tables and their associated actions and here's how
> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >> >> >not anywhere in the spec.
> >> >> >> >> >That doesnt mean it is not a useful thing to have - but it is an
> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >> >> >implementation must support it is a bit out of scope and there are
> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
> >> >> >>
> >> >> >> I'm not talking about the spec. I'm talking about the offload
> >> >> >> implementation, the offload compiler, the offload runtime manager. You
> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
> >> >> >> nothing to do with p4 frontend. This is offload implementation.
> >> >> >>
> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> >> >> And if it is unoffloadable, do we really need it? IDK.
> >> >> >>
> >> >> >
> >> >> >Say what?
> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
> >> >> >seem fine too).
> >> >>
> >> >> Will Intel and AMD have a compiler in the kernel, so that no blob transfer and
> >> >> parsing of it in the kernel would be needed? No.
> >> >
> >> >By that definition anything that parses anything is a compiler.
> >> >
> >> >>
> >> >> >If your view is that all these runtime optimizations amount to a
> >> >> >compiler in the kernel/driver that is your, well, your view. In my
> >> >> >view (and others have said this to you already) the P4C compiler is
> >> >> >responsible for resource optimizations. The hardware supports P4, you
> >> >> >give it constraints and it knows what to do. At runtime, anything a
> >> >> >driver needs to do for resource optimization (resorting, reshuffling
> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
> >> >> >architecture approach.
> >> >>
> >> >> Sure, it is the offload implementation problem. And for them, you need
> >> >> to use userspace components. And that is the problem. This discussion
> >> >> leads nowhere, I don't know how differently I should describe this.
> >> >
> >> >Jiri - that's your view based on whatever design you have in your
> >> >mind. This has nothing to do with P4.
> >> >So let me repeat again:
> >> >1) A vendor's backend for P4 when it compiles ensures that resource
> >> >constraints are taken care of.
> >> >2) The same program can run in s/w.
> >> >3) It makes *ZERO* sense to mix vendor specific constraint
> >> >optimization(what you described as resorting, reshuffling etc) as part
> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
> >>
> >> I never suggested for it to be part of P4TC or P4. I don't know why you
> >> think so.
> >
> >I guess because this discussion is about P4/P4TC? I may have misread
> >what you are saying then because I saw the  "P4TC must be in
> >userspace" mantra tied to this specific optimization requirement.
>
> Yeah, and again, my point is, this is unoffloadable.

Here we go again with this weird claim. I guess we need to give an
award to the other vendors for doing the "impossible"?

>Do we still  need it in kernel?

Didn't you just say it has nothing to do with P4TC?

You "It can't be offloaded".
Me "It can be offloaded, other vendors are doing it and it has nothing
to do with P4 or P4TC and here's why..."
You "I didn't say it has anything to do with P4 or P4TC"
Me "OK, I misunderstood - I thought you said P4 can't be offloaded via
P4TC and has to be done in user space"
You "It can't be offloaded"

Circular non-ending discussion.

Then there's John
John "ebpf, ebpf, ebpf"
Me "we gave you ebpf"
John "but you are not using ebpf system call"
Me "but it doesn't make sense for the following reasons..."
John "but someone has already implemented ebpf.."
Me "yes, but here's how ..."
John "ebpf, ebpf, ebpf"

Another circular non-ending discussion.

Let's just end this electron-wasting lawyering discussion.

cheers,
jamal






Bizarre. Unoffloadable according to you.

>
> >
> >>
> >> >background task, specific to you,  if you feel you need to move things
> >> >around at runtime.
> >>
> >> Yeah, that background task is in userspace.
> >>
> >
> >I don't have a horse in this race.
> >
> >cheers,
> >jamal
> >
> >>
> >> >
> >> >We agree on one thing at least: This discussion is going nowhere.
> >>
> >> Correct.
> >>
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >
> >> >
> >> >> >
> >> >> >> >In my opinion that is a feature that could be added later out of
> >> >> >> >necessity (there is some good niche value in being able to add some
> >> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >> >> >is needed.
> >> >> >> >It should be doable today in a brute force way (this is just one
> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >> >> >i am sure there are other approaches and the idea is by no means
> >> >> >> >proven.
> >> >> >> >
> >> >> >> >1) User space creates/compiles/adds/activates your program that has 14
> >> >> >> >tables at tc prio X chain Y
> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >> >> >table 3 after table 15, visited when metadata foo=5
> >> >> >> >    b) your compiler in user space compiles a brand new program which
> >> >> >> >satisfies #2a (how this program was authored is out of scope of
> >> >> >> >discussion)
> >> >> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >> >> >    d) user space deletes tc prio X chain Y (and make sure your packets'
> >> >> >> >entry point is whatever #c is)
> >> >> >>
> >> >> >> I never suggested anything like what you describe. I'm not sure why you
> >> >> >> think so.
> >> >> >
> >> >> >It's the same class of problems - the paper i pointed to (coauthored
> >> >> >by Matty and others) has runtime resource optimizations which are
> >> >> >tantamount to changing the nature of the pipeline. We may need to
> >> >> >profile in the kernel but all those optimizations can be derived in
> >> >> >user space using the approach I described.
> >> >> >
> >> >> >cheers,
> >> >> >jamal
> >> >> >
> >> >> >
> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >> >> >through here ;->
> >> >> >> >> >
> >> >> >> >> >cheers,
> >> >> >> >> >jamal
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >cheers,
> >> >> >> >> >> >jamal
Jiri Pirko Nov. 23, 2023, 3:27 p.m. UTC | #28
Thu, Nov 23, 2023 at 03:28:07PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> [...]
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >> >> >functionality though. Off the top of my head, some sample space:
>> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> >> Example:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >>     action send_to_port param port eno1
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >What is not traditional about it?
>> >> >> >> >> >>
>> >> >> >> >> >> Okay, so in that case, the following example communitating with
>> >> >> >> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> >> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >>      action send_to_port param port eno1
>> >> >> >> >> >
>> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >> >>
>> >> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >> >debugging it.
>> >> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >> >are within reason.
>> >> >> >>
>> >> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> >> following 2 approaches:
>> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >>       action send_to_port param port eno1
>> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >>       action send_to_port param port eno1
>> >> >> >> ?
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >> >> >>
>> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >> >
>> >> >> >> >> >Right Jiri ;->
>> >> >> >> >> >
>> >> >> >> >> >> This
>> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> >> >> good fit for everyone.
>> >> >> >> >> >
>> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >> >datapath pipeline after it is create and we are discussing a P4
>> >> >> >> >>
>> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >> >
>> >> >> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >> >> >more than happy to add extensions in the future to accomodate for
>> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >> >> >can be from user space today for all those optimizations.
>> >> >> >> >> >
>> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >> >> >Everybody is happy.
>> >> >> >> >>
>> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloade
>> >> >> >> >> over netlink?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >I mean you could if you wanted to given this is just traditional
>> >> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >> >filter approach). But why?
>> >> >> >>
>> >> >> >> Nevermind.
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >> >> >> >
>> >> >> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >> >> >> >>
>> >> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >>    from p4tc ndo_setup_tc
>> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >>    from tc-flower ndo_setup_tc
>> >> >> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >> >> >>
>> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> >> sense to me to be honest.
>> >> >> >> >> >
>> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?)  point of view.
>> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >> >> >>
>> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> >> opposed to from day 1.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >> >
>> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >> >>
>> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> >> >> IIRC.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >> >flash use i believe was from you as a compromise.
>> >> >> >>
>> >> >> >> Definitelly not. I'm against devlink abuse for this from day 1.
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >> >> >>
>> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> >> >> that :)
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >> >> >consumer perspective  I would not want to deal with 4 different
>> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >> >> >>
>> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >> >>
>> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> >> >> given the fact that the vendor compilers does not run in kernel.
>> >> >> >> >> That is breaking your model.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >> >> >once installed is static.
>> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >> >not anywhere in the spec.
>> >> >> >> >That doesnt mean it is not useful thing to have - but it is an
>> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >> >implementation must support it is a bit out of scope and there are
>> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
>> >> >> >>
>> >> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> >> implemetation, the offload compiler the offload runtime manager. You
>> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> >> >> nothing to do with p4 frontend. This is offload implementation.
>> >> >> >>
>> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >> >>
>> >> >> >
>> >> >> >Say what?
>> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
>> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >> >> >seem fine too).
>> >> >>
>> >> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
>> >> >> parsing it in kernel wound not be needed? No.
>> >> >
>> >> >By that definition anything that parses anything is a compiler.
>> >> >
>> >> >>
>> >> >> >If your view is that all these runtime optimization surmount to a
>> >> >> >compiler in the kernel/driver that is your, well, your view. In my
>> >> >> >view (and others have said this to you already) the P4C compiler is
>> >> >> >responsible for resource optimizations. The hardware supports P4, you
>> >> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
>> >> >> >architecture approach.
>> >> >>
>> >> >> Sure, it is the offload implementation problem. And for them, you need
>> >> >> to use userspace components. And that is the problem. This discussion
>> >> >> leads nowhere, I don't know how differently I should describe this.
>> >> >
>> >> >Jiri's - that's your view based on whatever design you have in your
>> >> >mind. This has nothing to do with P4.
>> >> >So let me repeat again:
>> >> >1) A vendor's backend for P4 when it compiles ensures that resource
>> >> >constraints are taken care of.
>> >> >2) The same program can run in s/w.
>> >> >3) It makes *ZERO* sense to mix vendor specific constraint
>> >> >optimization(what you described as resorting, reshuffling etc) as part
>> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
>> >>
>> >> I never suggested for it to be part of P4tc of P4. I don't know why you
>> >> think so.
>> >
>> >I guess because this discussion is about P4/P4TC? I may have misread
>> >what you are saying then because I saw the  "P4TC must be in
>> >userspace" mantra tied to this specific optimization requirement.
>>
>> Yeah, and again, my point is, this is unoffloadable.
>
>Here we go again with this weird claim. I guess we need to give an
>award to the other vendors for doing the "impossible"?

Having the compiler in the kernel would be awesome. Clear offload
from kernel to device.

That's not the case. Trampolines, binary blob parsing in the kernel doing
the match with tc structures in drivers, abuse of devlink flash,
tc-flower offload using this facility. All this was already seriously
discussed before p4tc was even merged. Great, love that.



>
>>Do we still  need it in kernel?
>
>Didnt you just say it has nothing to do with P4TC?
>
>You "It cant be offloaded".
>Me "it can be offloaded, other vendors are doing it and it has nothing
>to do with P4 or P4TC and here's why..."
>You " i didnt say it has anything to do with P4 or P4TC"
>Me "ok i misunderstood i thought you said P4 cant be offloaded via
>P4TC and has to be done in user space"
>You "It cant be offloaded"

Let me do my own misinterpretation please.


>
>Circular non-ending discussion.
>
>Then there's John
>John "ebpf, ebpf, ebpf"
>Me "we gave you ebpf"
>John "but you are not using ebpf system call"
>Me " but it doesnt make sense for the following reasons..."
>John "but someone has already implemented ebpf.."
>Me "yes, but here's how ..."
>John "ebpf, ebpf, ebpf"
>
>Another circular non-ending discussion.
>
>Let's just end this electron-wasting lawyering discussion.
>
>cheers,
>jamal
>
>
>
>
>
>
>Bizare. Unoffloadable according to you.
>
>>
>> >
>> >>
>> >> >background task, specific to you,  if you feel you need to move things
>> >> >around at runtime.
>> >>
>> >> Yeah, that backgroud task is in userspace.
>> >>
>> >
>> >I don't have a horse in this race.
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >
>> >> >We agree on one thing at least: This discussion is going nowhere.
>> >>
>> >> Correct.
>> >>
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >> >> >is needed.
>> >> >> >> >It should be doable today in a brute force way (this is just one
>> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >> >> >i am sure there are other approaches and the idea is by no means
>> >> >> >> >proven.
>> >> >> >> >
>> >> >> >> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >> >> >> >tables at tc prio X chain Y
>> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >> >table 3 after table 15, visited when metadata foo=5
>> >> >> >> >    b) your compiler in user space compiles a brand new program which
>> >> >> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >> >> >discussion)
>> >> >> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> >> >    d) user space delete tc prio X chain Y (and make sure your packets
>> >> >> >> >entry point is whatever #c is)
>> >> >> >>
>> >> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> >> think so.
>> >> >> >
>> >> >> >It's the same class of problems - the paper i pointed to (coauthored
>> >> >> >by Matty and others) has runtime resource optimizations which are
>> >> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >> >profile in the kernel but all those optimizations can be derived in
>> >> >> >user space using the approach I described.
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >
>> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >> >through here ;->
>> >> >> >> >> >
>> >> >> >> >> >cheers,
>> >> >> >> >> >jamal
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >jamal
Jamal Hadi Salim Nov. 23, 2023, 4:30 p.m. UTC | #29
On Thu, Nov 23, 2023 at 10:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> [...]
> >>
> >> Yeah, and again, my point is, this is unoffloadable.
> >
> >Here we go again with this weird claim. I guess we need to give an
> >award to the other vendors for doing the "impossible"?
>
> Having the compiler in the kernel would be awesome. Clear offload
> from kernel to device.
>
> That's not the case. Trampolines, binary blob parsing in the kernel doing
> the match with tc structures in drivers, abuse of devlink flash,
> tc-flower offload using this facility. All this was already seriously
> discussed before p4tc was even merged. Great, love that.
>

I was hoping not to say anything but my fingers couldn't help themselves:
So "unoffloadable" means there is a binary blob and this doesn't work
per your design idea of how it should work?
Not that it can't be implemented (clearly it has been implemented), it
is just not how _you_ would implement it? All along I thought this was
an issue with your hardware.
I know that when someone says devlink your answer is N.O - but that is
a different topic.

cheers,
jamal

Edward Cree Nov. 23, 2023, 5:53 p.m. UTC | #30
On 23/11/2023 16:30, Jamal Hadi Salim wrote:
> I was hoping not to say anything but my fingers couldn't help themselves:
> So "unoffloadable" means there is a binary blob and this doesn't work
> per your design idea of how it should work?
> Not that it can't be implemented (clearly it has been implemented), it
> is just not how _you_ would implement it? All along I thought this was
> an issue with your hardware.

The kernel doesn't like to trust offload blobs from a userspace compiler,
 because it has no way to be sure that what comes out of the compiler
 matches the rules/tables/whatever it has in the SW datapath.
It's also a support nightmare because it's basically like each user
 compiling their own device firmware.  At least normally with device
 firmware the driver side is talking to something with narrow/fixed
 semantics that went through upstream review, even if the firmware side is
 still a black box.
Just to prove I'm not playing favourites: this is *also* a problem with
 eBPF offloads like Nanotubes, and I'm not convinced we have a viable
 solution yet.

The only way I can see to handle it is something analogous to proof-
 carrying code, where the kernel (driver, since the blob is likely to be
 wholly vendor-specific) can inspect the binary blob and verify somehow
 that (assuming the HW behaves according to its datasheet) it implements
 the same thing that exists in SW.
Or simplify the hardware design enough that the compiler can be small
 and tight enough to live in-kernel, but that's often impossible.

-ed
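
As a rough sketch of the proof-carrying idea above: a driver could walk
per-table descriptors parsed out of the vendor blob and refuse the offload
when they do not match what the software datapath declared. Everything
below is hypothetical (these structures and helpers exist neither in the
kernel nor in the P4TC patches); it only illustrates the shape of such a
check.

  #include <errno.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical: one table as declared by the software datapath. */
  struct sw_table_desc {
          uint32_t id;
          uint32_t key_bits;
          uint32_t max_entries;
  };

  /* Hypothetical: one table record parsed out of the vendor blob. */
  struct blob_table_desc {
          uint32_t id;
          uint32_t key_bits;
          uint32_t max_entries;
  };

  /*
   * Accept the blob only if every table the software pipeline declares has
   * a compatible counterpart in the blob.  Returns 0 on success, -EINVAL
   * if the blob does not implement what the software pipeline claims.
   */
  static int verify_blob_against_sw(const struct blob_table_desc *blob,
                                    size_t n_blob,
                                    const struct sw_table_desc *sw,
                                    size_t n_sw)
  {
          for (size_t i = 0; i < n_sw; i++) {
                  bool found = false;

                  for (size_t j = 0; j < n_blob; j++) {
                          if (blob[j].id == sw[i].id &&
                              blob[j].key_bits == sw[i].key_bits &&
                              blob[j].max_entries >= sw[i].max_entries) {
                                  found = true;
                                  break;
                          }
                  }
                  if (!found)
                          return -EINVAL;
          }
          return 0;
  }

Whether a check of this kind can ever be made strong enough to count as a
proof for an arbitrary vendor blob is exactly the open question raised
above.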
Jiri Pirko Nov. 23, 2023, 6:04 p.m. UTC | #31
Thu, Nov 23, 2023 at 05:30:58PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 10:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 03:28:07PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>> >> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> [...]
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >> >> >> >> >> >functionality though.  off top of my head, some sample space:
>> >> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> >> >> Example:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >> >>     action send_to_port param port eno1
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >What is not traditional about it?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Okay, so in that case, the following example communitating with
>> >> >> >> >> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> >> >> >> >> >>   $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >>      action send_to_port param port eno1
>> >> >> >> >> >> >
>> >> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >> >> >>
>> >> >> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >> >> >debugging it.
>> >> >> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >> >> >are within reason.
>> >> >> >> >>
>> >> >> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> >> >> following 2 approaches:
>> >> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >>       action send_to_port param port eno1
>> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >>       action send_to_port param port eno1
>> >> >> >> >> ?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with the
>> >> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >> >> >DOCA), but please don't impose these values and your politics on other
>> >> >> >> >> >> >> >vendors (Intel and AMD, for example) who are more than willing to invest
>> >> >> >> >> >> >> >in making the kernel interfaces the path forward. Your choice.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Right Jiri ;->
>> >> >> >> >> >> >
>> >> >> >> >> >> >> This
>> >> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe there
>> >> >> >> >> >> >> aren't, and it's a good fit for them. All I'm saying is that it is not a
>> >> >> >> >> >> >> good fit for everyone.
>> >> >> >> >> >> >
>> >> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >> >> >datapath pipeline after it is created, and we are discussing a P4
>> >> >> >> >> >>
>> >> >> >> >> >> Isn't this up to the implementation? I mean, from the p4 perspective,
>> >> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> >> >> during rule insertion/removal in order to optimize the layout.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >> >> >
>> >> >> >> >> >> >implementation, not an extension that would add more value. b) We are
>> >> >> >> >> >> >more than happy to add extensions in the future to accommodate
>> >> >> >> >> >> >features, but first the _P4 spec_ must be met. c) we had longer
>> >> >> >> >> >> >discussions with Matty, Khalid and the Rice folks who wrote a paper on
>> >> >> >> >> >> >that topic, which you probably didn't attend, and everything that needs
>> >> >> >> >> >> >to be done can be done from user space today for all those optimizations.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Conclusion is: for what you need to do (which I don't believe is a
>> >> >> >> >> >> >limitation in your hardware but rather a design decision on your part),
>> >> >> >> >> >> >run your user space daemon, do the optimizations and update the
>> >> >> >> >> >> >datapath. Everybody is happy.
>> >> >> >> >> >>
>> >> >> >> >> >> Should the userspace daemon listen for inserted rules to be offloaded
>> >> >> >> >> >> over netlink?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >I mean, you could if you wanted to, given this is just traditional
>> >> >> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >> >> >filter approach). But why?
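>> >> >> >> >> >
>> >> >> >> >> >For illustration only - assuming P4TC events ride the same netlink
>> >> >> >> >> >notifications the rest of tc uses, and that the p4ctrl spelling below
>> >> >> >> >> >survives review - a daemon could watch them much the way tc monitor is
>> >> >> >> >> >used today:
>> >> >> >> >> >
>> >> >> >> >> >  # terminal 1: subscribe to tc events as they are emitted
>> >> >> >> >> >  $ tc monitor
>> >> >> >> >> >  # terminal 2: a table entry add like this generates an event the
>> >> >> >> >> >  # daemon could then translate for its hardware
>> >> >> >> >> >  $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >        action send_to_port param port eno1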
>> >> >> >> >>
>> >> >> >> >> Nevermind.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >> >> >believe that a singular interface, regardless of the vendor, is the
>> >> >> >> >> >> >> >right way forward. IMHO, the siloing that eBPF (a double-edged sword
>> >> >> >> >> >> >> >here) unfortunately also enables is not good for the community.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> As I also said on the p4 calls a couple of times, I don't see the kernel
>> >> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with
>> >> >> >> >> >> >> >> compilers, runtime optimizations etc in userspace, talking to the HW in
>> >> >> >> >> >> >> >> the vendor-suitable way too? Then the SW implementation could easily be
>> >> >> >> >> >> >> >> eBPF, and the main reason (I believe) why you need to have this in TC
>> >> >> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> >> >> to me. Why not have one Linux p4 solution that fits everyone's needs?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >You mean more fitting to the DOCA world? No, because I am a kernel-
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >first person, and kernel interfaces are good for everyone.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yeah, not really. The kernel is not always the right answer. Your/Intel
>> >> >> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> >>    from p4tc ndo_setup_tc
>> >> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> >>    from tc-flower ndo_setup_tc
>> >> >> >> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> >> >> sense to me to be honest.
>> >> >> >> >> >> >
>> >> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >> >> >> >> >For #1 - how's this different from DDP? Wasn't that your suggestion to
>> >> >> >> >> >>
>> >> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> >> >> opposed to from day 1.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >> >> >
>> >> >> >> >> >> >begin with? For #2, nobody is proposing to do anything of the sort. The
>> >> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >> >> >>
>> >> >> >> >> >> During offload, you need to parse the blob in the driver to be able to
>> >> >> >> >> >> match the IDs with blob entities. That was presented by you/Intel in the
>> >> >> >> >> >> past IIRC.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >You are correct - in the case of offload, the netlink IDs will have to be
>> >> >> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >> >> >flash use, I believe, was from you as a compromise.
>> >> >> >> >>
>> >> >> >> >> Definitely not. I'm against devlink abuse for this from day 1.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >The tc flower thing has nothing to do with P4TC; that was just some random
>> >> >> >> >> >> >proposal someone made to see if they could ride on top of P4TC.
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, it's not yet merged and people are already planning to abuse it. I
>> >> >> >> >> >> love that :)
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >Besides this, nobody really has to satisfy your point of view - like I
>> >> >> >> >> >> >said earlier, feel free to provide proprietary solutions. From a
>> >> >> >> >> >> >consumer perspective I would not want to deal with 4 different
>> >> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >> >> >unifying part. You seemed happier with tc flower, just not with the
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, that is my point: why can't the unifying part be a userspace
>> >> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >> >> >>
>> >> >> >> >> >> I just don't see the kernel as a good fit for the abstraction here,
>> >> >> >> >> >> given the fact that the vendor compilers do not run in the kernel.
>> >> >> >> >> >> That breaks your model.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Jiri - we want to support P4, first. Like you said, the P4 pipeline,
>> >> >> >> >> >once installed, is static.
>> >> >> >> >> >P4 doesn't allow dynamic updates of the pipeline. For example, once you
>> >> >> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >> >> >to be" and after you instantiate/activate that pipeline, you don't go
>> >> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >> >> >I want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >> >> >not anywhere in the spec.
>> >> >> >> >> >That doesn't mean it is not a useful thing to have - but it is an
>> >> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >> >> >implementation must support it is a bit out of scope, and there are
>> >> >> >> >> >vendors with hardware who support P4 today that don't need any of this.
>> >> >> >> >>
>> >> >> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> >> >> implementation, the offload compiler, the offload runtime manager. You
>> >> >> >> >> don't have those in the kernel. That is the issue. The runtime manager
>> >> >> >> >> is the one that decides and reshuffles the hw internals. Again, this has
>> >> >> >> >> nothing to do with the p4 frontend. This is offload implementation.
>> >> >> >> >>
>> >> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Say what?
>> >> >> >> >It's not offloadable in your hardware, you mean? Because I have an Intel
>> >> >> >> >E2000 right here beside me which offloads just fine (and the AMD folks
>> >> >> >> >seem fine too).
>> >> >> >>
>> >> >> >> Will Intel and AMD have a compiler in the kernel, so that no blob
>> >> >> >> transfer and no in-kernel parsing of it would be needed? No.
>> >> >> >
>> >> >> >By that definition anything that parses anything is a compiler.
>> >> >> >
>> >> >> >>
>> >> >> >> >If your view is that all these runtime optimizations amount to a
>> >> >> >> >compiler in the kernel/driver, that is, well, your view. In my
>> >> >> >> >view (and others have said this to you already) the P4C compiler is
>> >> >> >> >responsible for resource optimizations. The hardware supports P4; you
>> >> >> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >> >> >etc) is not a P4 problem - sorry if you have issues in your
>> >> >> >> >architecture approach.
>> >> >> >>
>> >> >> >> Sure, it is the offload implementation's problem. And for that, you need
>> >> >> >> to use userspace components. And that is the problem. This discussion
>> >> >> >> leads nowhere; I don't know how differently I should describe this.
>> >> >> >
>> >> >> >Jiri - that's your view based on whatever design you have in your
>> >> >> >mind. This has nothing to do with P4.
>> >> >> >So let me repeat again:
>> >> >> >1) A vendor's backend for P4, when it compiles, ensures that resource
>> >> >> >constraints are taken care of.
>> >> >> >2) The same program can run in s/w.
>> >> >> >3) It makes *ZERO* sense to mix vendor-specific constraint
>> >> >> >optimization (what you described as resorting, reshuffling etc) into
>> >> >> >P4TC or P4. It has absolutely nothing to do with either. Write a
>> >> >>
>> >> >> I never suggested it be part of P4TC or P4. I don't know why you
>> >> >> think so.
>> >> >
>> >> >I guess because this discussion is about P4/P4TC? I may have misread
>> >> >what you are saying then, because I saw the "P4TC must be in
>> >> >userspace" mantra tied to this specific optimization requirement.
>> >>
>> >> Yeah, and again, my point is, this is unoffloadable.
>> >
>> >Here we go again with this weird claim. I guess we need to give an
>> >award to the other vendors for doing the "impossible"?
>>
>> Having the compiler in the kernel would be awesome: a clear offload
>> from kernel to device.
>>
>> That's not the case. Trampolines, binary blob parsing in the kernel doing
>> the match with tc structures in drivers, abuse of devlink flash,
>> tc-flower offload using this facility. All this was already seriously
>> discussed before p4tc was even merged. Great, love that.
>>
>
>I was hoping not to say anything but my fingers couldn't help themselves:
>so "unoffloadable" means there is a binary blob and this doesn't work
>per your design idea of how it should work?

Not going to repeat myself.


>Not that it can't be implemented (clearly it has been implemented), it
>is just not how _you_ would implement it? All along I thought this was
>an issue with your hardware.

The subset of issues is present even with the Intel approach. I thought
that was clear; apparently it's not. I'm done.


>I know that when someone says devlink your answer is N.O - but that is
>a different topic.

That's probably because a lot of the time people tend to abuse it.


>
>cheers,
>jamal
>
>>
>> >
>> >>Do we still need it in the kernel?
>> >
>> >Didn't you just say it has nothing to do with P4TC?
>> >
>> >You "It can't be offloaded".
>> >Me "it can be offloaded, other vendors are doing it and it has nothing
>> >to do with P4 or P4TC and here's why..."
>> >You "I didn't say it has anything to do with P4 or P4TC"
>> >Me "ok, I misunderstood - I thought you said P4 can't be offloaded via
>> >P4TC and has to be done in user space"
>> >You "It can't be offloaded"
>>
>> Let me do my own misinterpretation please.
>>
>>
>>
>> >
>> >Circular non-ending discussion.
>> >
>> >Then there's John
>> >John "ebpf, ebpf, ebpf"
>> >Me "we gave you ebpf"
>> >John "but you are not using ebpf system call"
>> >Me " but it doesn't make sense for the following reasons..."
>> >John "but someone has already implemented ebpf.."
>> >Me "yes, but here's how ..."
>> >John "ebpf, ebpf, ebpf"
>> >
>> >Another circular non-ending discussion.
>> >
>> >Let's just end this electron-wasting lawyering discussion.
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >
>> >
>> >
>> >
>> >Bizarre. Unoffloadable according to you.
>> >
>> >>
>> >> >
>> >> >>
>> >> >> >background task, specific to you, if you feel you need to move things
>> >> >> >around at runtime.
>> >> >>
>> >> >> Yeah, that background task is in userspace.
>> >> >>
>> >> >
>> >> >I don't have a horse in this race.
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >>
>> >> >> >
>> >> >> >We agree on one thing at least: This discussion is going nowhere.
>> >> >>
>> >> >> Correct.
>> >> >>
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >> >> >"dynamism" to any pipeline) and could influence the P4 standards on why
>> >> >> >> >> >it is needed.
>> >> >> >> >> >It should be doable today in a brute-force way (this is just one
>> >> >> >> >> >suggestion that came to me when Rice University/Nvidia presented [1]);
>> >> >> >> >> >I am sure there are other approaches and the idea is by no means
>> >> >> >> >> >proven.
>> >> >> >> >> >
>> >> >> >> >> >1) User space creates/compiles/adds/activates your program that has 14
>> >> >> >> >> >tables at tc prio X chain Y
>> >> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >> >> >table 15 after table 3, visited when metadata foo=5
>> >> >> >> >> >    b) your compiler in user space compiles a brand new program which
>> >> >> >> >> >satisfies #2a (how this program was authored is out of scope of this
>> >> >> >> >> >discussion)
>> >> >> >> >> >    c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> >> >> >    d) user space deletes tc prio X chain Y (and makes sure your packets'
>> >> >> >> >> >entry point is whatever #c is); a rough command sketch follows below
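>> >> >> >> >> >
>> >> >> >> >> >As a rough sketch (the device name, priorities and the p4/bpf object
>> >> >> >> >> >spellings here are illustrative assumptions, not the final CLI):
>> >> >> >> >> >
>> >> >> >> >> >  # 1) attach the original 14-table program at prio 10, chain 0
>> >> >> >> >> >  $ tc filter add dev eth0 ingress prio 10 chain 0 p4 pname myprog \
>> >> >> >> >> >        action bpf obj myprog.o
>> >> >> >> >> >  # 2b/2c) compile the new program in user space, add it alongside
>> >> >> >> >> >  $ tc filter add dev eth0 ingress prio 11 chain 0 p4 pname myprog2 \
>> >> >> >> >> >        action bpf obj myprog2.o
>> >> >> >> >> >  # 2d) remove the old entry point; packets now hit the new program
>> >> >> >> >> >  $ tc filter del dev eth0 ingress prio 10 chain 0 p4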
>> >> >> >> >>
>> >> >> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> >> >> think so.
>> >> >> >> >
>> >> >> >> >It's the same class of problems - the paper I pointed to (coauthored
>> >> >> >> >by Matty and others) has runtime resource optimizations which are
>> >> >> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >> >> >profile in the kernel, but all those optimizations can be derived in
>> >> >> >> >user space using the approach I described.
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >> >> >through here ;->
>> >> >> >> >> >> >
>> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >jamal
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >> >jamal
Jiri Pirko Nov. 23, 2023, 6:09 p.m. UTC | #32
Thu, Nov 23, 2023 at 06:53:42PM CET, ecree.xilinx@gmail.com wrote:
>On 23/11/2023 16:30, Jamal Hadi Salim wrote:
>> I was hoping not to say anything but my fingers couldnt help themselves:
>> So "unoffloadable" means there is a binary blob and this doesnt work
>> per your design idea of how it should work?
>> Not that it cant be implemented (clearly it has been implemented), it
>> is just not how _you_ would implement it? All along I thought this was
>> an issue with your hardware.
>
>The kernel doesn't like to trust offload blobs from a userspace compiler,
> because it has no way to be sure that what comes out of the compiler
> matches the rules/tables/whatever it has in the SW datapath.
>It's also a support nightmare because it's basically like each user
> compiling their own device firmware.  At least normally with device
> firmware the driver side is talking to something with narrow/fixed
> semantics and went through upstream review, even if the firmware side is
> still a black box.
>Just to prove I'm not playing favourites: this is *also* a problem with
> eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> solution yet.

Just for the record, I'm not aware of anyone suggesting p4 eBPF offload
in this thread.


>
>The only way I can see to handle it is something analogous to proof-
> carrying code, where the kernel (driver, since the blob is likely to be
> wholly vendor-specific) can inspect the binary blob and verify somehow
> that (assuming the HW behaves according to its datasheet) it implements
> the same thing that exists in SW.
>Or simplify the hardware design enough that the compiler can be small
> and tight enough to live in-kernel, but that's often impossible.

Yeah, that would solve the offloading problem. From what I'm hearing
from multiple sides, not going to happen.

>
>-ed
Jakub Kicinski Nov. 23, 2023, 6:53 p.m. UTC | #33
On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
> The kernel doesn't like to trust offload blobs from a userspace compiler,
>  because it has no way to be sure that what comes out of the compiler
>  matches the rules/tables/whatever it has in the SW datapath.
> It's also a support nightmare because it's basically like each user
>  compiling their own device firmware.  

Practically speaking every high speed NIC runs a huge binary blob of FW.
First, let's acknowledge that as reality.

Second, there is no equivalent for arbitrary packet parsing in the
>kernel proper. Offload means take something from the host and put it
on the device. If there's nothing in the kernel, we can't consider
the new functionality an offload.

>I understand that "we offload SW functionality" is our general policy,
>but we should remember why this policy is in place, and not
>automatically jump to conclusions.

>  At least normally with device firmware the driver side is talking to
>  something with narrow/fixed semantics and went through upstream
>  review, even if the firmware side is still a black box.

>We should be building things which are useful and open (as in
extensible by people "from the street"). With that in mind, to me,
a more practical approach would be to try to figure out a common
and rigid FW interface for expressing the parsing graph.

But that's an interface going from the binary blob to the kernel.

> Just to prove I'm not playing favourites: this is *also* a problem with
>  eBPF offloads like Nanotubes, and I'm not convinced we have a viable
>  solution yet.

>BPF offloads are actual offloads. Config/state is in the kernel;
>you need to pop it out to user space, then prove that it's what
>the user intended.
Jamal Hadi Salim Nov. 23, 2023, 6:58 p.m. UTC | #34
On Thu, Nov 23, 2023 at 1:09 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 06:53:42PM CET, ecree.xilinx@gmail.com wrote:
> >On 23/11/2023 16:30, Jamal Hadi Salim wrote:
> >> I was hoping not to say anything but my fingers couldnt help themselves:
> >> So "unoffloadable" means there is a binary blob and this doesnt work
> >> per your design idea of how it should work?
> >> Not that it cant be implemented (clearly it has been implemented), it
> >> is just not how _you_ would implement it? All along I thought this was
> >> an issue with your hardware.
> >
> >The kernel doesn't like to trust offload blobs from a userspace compiler,
> > because it has no way to be sure that what comes out of the compiler
> > matches the rules/tables/whatever it has in the SW datapath.
> >It's also a support nightmare because it's basically like each user
> > compiling their own device firmware.  At least normally with device
> > firmware the driver side is talking to something with narrow/fixed
> > semantics and went through upstream review, even if the firmware side is
> > still a black box.
> >Just to prove I'm not playing favourites: this is *also* a problem with
> > eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> > solution yet.
>
> Just for the record, I'm not aware of anyone suggesting p4 eBPF offload
> in this thread.
>
>
> >
> >The only way I can see to handle it is something analogous to proof-
> > carrying code, where the kernel (driver, since the blob is likely to be
> > wholly vendor-specific) can inspect the binary blob and verify somehow
> > that (assuming the HW behaves according to its datasheet) it implements
> > the same thing that exists in SW.
> >Or simplify the hardware design enough that the compiler can be small
> > and tight enough to live in-kernel, but that's often impossible.
>
> Yeah, that would solve the offloading problem. From what I'm hearing
> from multiple sides, not going to happen.

This is a topic that has been discussed many times. The idea Tom is
describing has been the basis of the discussion, i.e. some form of
signature that is tied to the binary as well as the s/w side of things
when you do offload. I am not an attestation expert - but isn't that
sufficient?

cheers,
jamal

> >
> >-ed
Tom Herbert Nov. 23, 2023, 7:42 p.m. UTC | #35
On Thu, Nov 23, 2023 at 10:53 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
> > The kernel doesn't like to trust offload blobs from a userspace compiler,
> >  because it has no way to be sure that what comes out of the compiler
> >  matches the rules/tables/whatever it has in the SW datapath.
> > It's also a support nightmare because it's basically like each user
> >  compiling their own device firmware.
>

Hi Jakub,

> Practically speaking every high speed NIC runs a huge binary blob of FW.
> First, let's acknowledge that as reality.
>
Yes. But we're also seeing a trend for programmable NICs. It's an
interesting question as to how the kernel can leverage that
programmability for the benefit of the user.

> Second, there is no equivalent for arbitrary packet parsing in the
> kernel proper. Offload means take something form the host and put it
> on the device. If there's nothing in the kernel, we can't consider
> the new functionality an offload.

That's completely true; however, I believe that eBPF has expanded our
definition of "what's in the kernel". For instance, we can do
arbitrary parsing in an XDP/eBPF program (in fact, it's still on my
list of things to do to rip out the flow dissector C code and replace it
with eBPF).

(https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf,
https://www.youtube.com/watch?v=zVnmVDSEoXc&list=PLrninrcyMo3L-hsJv23hFyDGRaeBY1EJO)

>
> I understand that "we offload SW functionality" is our general policy,
> but we should remember why this policy is in place, and not
> automatically jump to the conclusion.
>
> >  At least normally with device firmware the driver side is talking to
> >  something with narrow/fixed semantics and went through upstream
> >  review, even if the firmware side is still a black box.
>
> We should be buildings things which are useful and open (as in
> extensible by people "from the street"). With that in mind, to me,
> a more practical approach would be to try to figure out a common
> and rigid FW interface for expressing the parsing graph.

Parse graphs are best represented by a declarative representation, not
an imperative one. This is a main reason why I want to replace the flow
dissector: a parser written in imperative C code is difficult to
maintain, as evidenced by the myriad of bugs in that code (particularly
when people added support for uncommon protocols). P4 got this part
right; however, I don't believe we need to boil the ocean by
programming the kernel in a new language. A better alternative is to
define an IR for this purpose. We do that in Common Parser Language
(CPL), which is a .json schema to describe parse graphs. With an IR we
can compile into arbitrary backends including P4, eBPF, C, and even
custom assembly instructions for parsing (arbitrary front-end languages
are facilitated as well).

(https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf)

>
> But that's an interface going from the binary blob to the kernel.
>
> > Just to prove I'm not playing favourites: this is *also* a problem with
> >  eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> >  solution yet.
>
> BPF offloads are actual offloads. Config/state is in the kernel,
> you need to pop it out to user space, then prove that it's what
> user intended.

Seems like offloading eBPF byte code and running a VM in the offload
device is pretty much considered a non-starter. But, what if we could
offload the _functionality_ of an eBPF program with confidence that
the functionality _exactly_ matches that of the eBPF program running
in the kernel? I believe that could be beneficial.

For instance, we all know that LRO never gained traction. The reason
is that each vendor does it however they want and no one can match
the exact functionality that SW GRO provides.
kernel SW, so it's not viable. But, suppose we wrote GRO in some
program that could be compiled into eBPF and a device binary. Using
something like that hash technique I described, it seems like we could
properly do a kernel offload of GRO where the offload functionality
matches the software in the kernel.

Tom
Jiri Pirko Nov. 24, 2023, 10:39 a.m. UTC | #36
Thu, Nov 23, 2023 at 07:53:05PM CET, kuba@kernel.org wrote:
>On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
>> The kernel doesn't like to trust offload blobs from a userspace compiler,
>>  because it has no way to be sure that what comes out of the compiler
>>  matches the rules/tables/whatever it has in the SW datapath.
>> It's also a support nightmare because it's basically like each user
>>  compiling their own device firmware.  
>
>Practically speaking every high speed NIC runs a huge binary blob of FW.
>First, let's acknowledge that as reality.

True, but I believe we need to differentiate:
1) a vendor-created, versioned, signed binary fw blob
2) a blob the user compiled on demand

I look at 2) as "a configuration" of some sort.


>
>Second, there is no equivalent for arbitrary packet parsing in the
>kernel proper. Offload means take something form the host and put it
>on the device. If there's nothing in the kernel, we can't consider
>the new functionality an offload.
>
>I understand that "we offload SW functionality" is our general policy,
>but we should remember why this policy is in place, and not
>automatically jump to the conclusion.

It is in place so that there is a well-defined SW definition of what the
device offloads.


>
>>  At least normally with device firmware the driver side is talking to
>>  something with narrow/fixed semantics and went through upstream
>>  review, even if the firmware side is still a black box.
>
>We should be buildings things which are useful and open (as in
>extensible by people "from the street"). With that in mind, to me,
>a more practical approach would be to try to figure out a common
>and rigid FW interface for expressing the parsing graph.

Hmm, could you elaborate a bit more on this one please?

>
>But that's an interface going from the binary blob to the kernel.
>
>> Just to prove I'm not playing favourites: this is *also* a problem with
>>  eBPF offloads like Nanotubes, and I'm not convinced we have a viable
>>  solution yet.
>
>BPF offloads are actual offloads. Config/state is in the kernel,
>you need to pop it out to user space, then prove that it's what
>user intended.