mbox series

[net-next,RFC,00/20] Introducing P4TC

Message ID 20230124170346.316866-1-jhs@mojatatu.com (mailing list archive)
Headers show
Series Introducing P4TC | expand

Message

Jamal Hadi Salim Jan. 24, 2023, 5:03 p.m. UTC
We are seeking community feedback on P4TC patches.
Apologies, I know this is a large number of patches but it is the best we could
do so as not to miss the essence of the work. We have a few more patches but
took them out for brevity of review.

P4TC is an implementation of the Programming Protocol-independent Packet
Processors (P4) that is kernel based, building on top of many years of Linux TC
experiences:

 * P4TC is scriptable - building on and extending the implementation/deployment
   concepts of the TC u32 classifier, pedit action, etc.
 * P4TC is designed to allow hardware offload based on experiences derived from
   TC classifiers flower, u32, matchall, etc.

By "scriptable" we mean: these patches enable kernel and user space code change
independency for any P4 program that describes a new datapath. The workflow is
as follows:
  1) A developer writes a P4 program, "myprog"
  2) Compiles it using the P4C compiler. The compiler generates output in the
     form of shell scripts which form template definitions for the different P4
     objects "myprog" utilizes (objects described below in the patch list).
  3) The developer (or operator) executes the shell scripts to manifest
     the functional equivalent of "myprog" into the kernel.
  4) The developer (or operator) instantiates "myprog" via the tc P4 filter
     to ingress/egress of one or more netdevs/ports. Example:
       "tc filter add block 22 ingress protocol ip prio 6 p4 pname myprog"

Once "myprog" is instantiated one can start updating table entries that are
associated with "myprog". Example:
  tc p4runtime create myprog/mytable dstAddr 10.0.1.2/32 prio 10 \
    action send param port type dev port1

Of course one can be more explicit and specify "skip_sw" or "skip_hw" to either
offload the entry (if a NIC or switch driver is capable) or make it purely run
entirely in the kernel or in a cooperative mode between kernel and user space.

Note: You do not need a compiler to create the template scripts used in
step #3. You can hand code them - however, there will be cases where you have
complex programs that would require the compiler.
Note2: There are no binary blobs being loaded into the kernel, rather a bunch
of "policies" to activate mechanisms in the kernel.

There have been many discussions and meetings since about 2015 in regards to
P4 over TC and now that the market has chosen P4 as the datapath specification
lingua franca we are finally proving the naysayers that we do get stuff done!

P4TC is designed to have very little impact on the core code for other users
of TC. We make one change to the core - to be specific we change the
implementation of action lists to use IDR instead of a linked list (see patch
#1); however, that change can be considered to be a control plane performance
improvement since IDR is faster in most cases.
The rest of the core changes(patches 2-9) are to enable P4TC and are minimalist
in nature. IOW, P4TC is self-contained and reuses the tc infrastructure without
affecting other consumers of the TC infra.

The core P4TC code implements several P4 objects.

1) Patch #10 implements the parser, kparser, which is based on Panda to allow
   for a scriptable approach for describing the equivalence to a P4 parser.
2) Patch #11 introduces P4 data types which are consumed by the rest of the code
3) Patch #12 introduces the concept of templating Pipelines. i.e CRUD commands
   for P4 pipelines.
4) Patch #13 introduces the concept of P4 user metadata and associated CRUD
   template commands.
5) Patch #14 introduces the concept of P4 header fields and associated CRUD
   template commands. Note header fields tie into the parser from patch #10.
6) Patch #15 introduces the concept of action templates and associated
   CRUD commands.
7) Patch #16 introduces the concept of P4 table templates and associated
   CRUD commands for tables
8) Patch #17 introduces the concept of table _runtime control_ and associated
   CRUD commands.
9) Patch #18 introduces the concept of P4 register templates and associated
   CRUD commands for registers.
9) Patch #19 introduces the concept of dynamic actions commands that are
    used by actions (see patch #15).
11) Patch #20 introduces the TC P4 classifier used at runtime.

Speaking of testing - we have about 400 tdc test cases (which are left out
from this patch series). This number is growing.
These tests are run on our CICD system after commits are approved. The CICD does
a lot of other tests including:
checkpatch, sparse, 32 bit and 64 bit builds tested on both X86, ARM 64
and emulated BE via qemu s390. We trigger performance testing in the CICD
to catch performance regressions (currently only on the control path, but in
the future for the datapath).
Syzkaller runs 24/7 on dedicated hardware, and before main releases we put
the code via coverity. All of this has helped find bugs and ensure stability.
In addition we are working on a tool that will take a p4 program, run it through
the compiler, and generate permutations of traffic patterns that will test both
positive and negative code paths. The test generator tool is still work in
progress and will be generated by the P4 compiler.

There's a lot more info for the curious that we are leaving out for the sake
of brevity. A good starting point is to checkout recent material on the subject.
There is a presentation on P4TC as well as a workshop that took place in
Netdevconf 0x16), see:
https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
https://netdevconf.info/0x16/session.html?P4TC-Workshop

Jamal Hadi Salim (26):
  net/sched: act_api: change act_base into an IDR
  net/sched: act_api: increase action kind string length
  net/sched: act_api: increase TCA_ID_MAX
  net/sched: act_api: add init_ops to struct tc_action_op
  net/sched: act_api: introduce tc_lookup_action_byid()
  net/sched: act_api: export generic tc action searcher
  net/sched: act_api: create and export __tcf_register_action
  net/sched: act_api: add struct p4tc_action_ops as a parameter to
    lookup callback
  net: introduce rcu_replace_pointer_rtnl
  p4tc: add P4 data types
  p4tc: add pipeline create, get, update, delete
  p4tc: add metadata create, update, delete, get, flush and dump
  p4tc: add header field create, get, delete, flush and dump
  p4tc: add action template create, update, delete, get, flush and dump
  p4tc: add table create, update, delete, get, flush and dump
  p4tc: add table entry create, update, get, delete, flush and dump
  p4tc: add register create, update, delete, get, flush and dump
  p4tc: add dynamic action commands
  p4tc: add P4 classifier
  selftests: tc-testing: add P4TC pipeline control path tdc tests
  selftests: tc-testing: add P4TC metadata control path tdc tests
  selftests: tc-testing: add P4TC action templates tdc tests
  selftests: tc-testing: add P4TC table control path tdc tests
  selftests: tc-testing: add P4TC table entries control path tdc tests
  selftests: tc-testing: add P4TC register tdc tests
  MAINTAINERS: add p4tc entry

Pratyush Khan (2):
  net/kparser: add kParser
  net/kparser: add kParser documentation

 Documentation/networking/kParser.rst          |   327 +
 .../networking/parse_graph_example.svg        |  2039 +++
 MAINTAINERS                                   |    14 +
 include/linux/rtnetlink.h                     |    12 +
 include/linux/skbuff.h                        |    17 +
 include/net/act_api.h                         |    17 +-
 include/net/kparser.h                         |   110 +
 include/net/p4tc.h                            |   665 +
 include/net/p4tc_types.h                      |    61 +
 include/net/sch_generic.h                     |     5 +
 include/net/tc_act/p4tc.h                     |    25 +
 include/uapi/linux/kparser.h                  |   674 +
 include/uapi/linux/p4tc.h                     |   510 +
 include/uapi/linux/pkt_cls.h                  |    17 +-
 include/uapi/linux/rtnetlink.h                |    14 +
 net/Kconfig                                   |     9 +
 net/Makefile                                  |     1 +
 net/core/skbuff.c                             |    17 +
 net/kparser/Makefile                          |    17 +
 net/kparser/kparser.h                         |   418 +
 net/kparser/kparser_cmds.c                    |   917 ++
 net/kparser/kparser_cmds_dump_ops.c           |   586 +
 net/kparser/kparser_cmds_ops.c                |  3778 +++++
 net/kparser/kparser_condexpr.h                |    52 +
 net/kparser/kparser_datapath.c                |  1266 ++
 net/kparser/kparser_main.c                    |   329 +
 net/kparser/kparser_metaextract.h             |   891 ++
 net/kparser/kparser_types.h                   |   586 +
 net/sched/Kconfig                             |    20 +
 net/sched/Makefile                            |     3 +
 net/sched/act_api.c                           |   156 +-
 net/sched/cls_p4.c                            |   339 +
 net/sched/p4tc/Makefile                       |     7 +
 net/sched/p4tc/p4tc_action.c                  |  1907 +++
 net/sched/p4tc/p4tc_cmds.c                    |  3492 +++++
 net/sched/p4tc/p4tc_hdrfield.c                |   625 +
 net/sched/p4tc/p4tc_meta.c                    |   884 ++
 net/sched/p4tc/p4tc_parser_api.c              |   229 +
 net/sched/p4tc/p4tc_pipeline.c                |   996 ++
 net/sched/p4tc/p4tc_register.c                |   749 +
 net/sched/p4tc/p4tc_table.c                   |  1636 ++
 net/sched/p4tc/p4tc_tbl_api.c                 |  1895 +++
 net/sched/p4tc/p4tc_tmpl_api.c                |   609 +
 net/sched/p4tc/p4tc_types.c                   |  1294 ++
 net/sched/p4tc/trace.c                        |    10 +
 net/sched/p4tc/trace.h                        |    45 +
 security/selinux/nlmsgtab.c                   |     8 +-
 .../tc-tests/p4tc/action_templates.json       | 12378 ++++++++++++++++
 .../tc-testing/tc-tests/p4tc/metadata.json    |  2652 ++++
 .../tc-testing/tc-tests/p4tc/pipeline.json    |  3212 ++++
 .../tc-testing/tc-tests/p4tc/register.json    |  2752 ++++
 .../tc-testing/tc-tests/p4tc/table.json       |  8956 +++++++++++
 .../tc-tests/p4tc/table_entries.json          |  3818 +++++
 53 files changed, 62001 insertions(+), 45 deletions(-)
 create mode 100644 Documentation/networking/kParser.rst
 create mode 100644 Documentation/networking/parse_graph_example.svg
 create mode 100644 include/net/kparser.h
 create mode 100644 include/net/p4tc.h
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/kparser.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 net/kparser/Makefile
 create mode 100644 net/kparser/kparser.h
 create mode 100644 net/kparser/kparser_cmds.c
 create mode 100644 net/kparser/kparser_cmds_dump_ops.c
 create mode 100644 net/kparser/kparser_cmds_ops.c
 create mode 100644 net/kparser/kparser_condexpr.h
 create mode 100644 net/kparser/kparser_datapath.c
 create mode 100644 net/kparser/kparser_main.c
 create mode 100644 net/kparser/kparser_metaextract.h
 create mode 100644 net/kparser/kparser_types.h
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_action.c
 create mode 100644 net/sched/p4tc/p4tc_cmds.c
 create mode 100644 net/sched/p4tc/p4tc_hdrfield.c
 create mode 100644 net/sched/p4tc/p4tc_meta.c
 create mode 100644 net/sched/p4tc/p4tc_parser_api.c
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c
 create mode 100644 net/sched/p4tc/p4tc_register.c
 create mode 100644 net/sched/p4tc/p4tc_table.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_api.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
 create mode 100644 net/sched/p4tc/p4tc_types.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/action_templates.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/metadata.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/pipeline.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/register.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/table.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/p4tc/table_entries.json

Comments

Jakub Kicinski Jan. 26, 2023, 11:30 p.m. UTC | #1
On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> There have been many discussions and meetings since about 2015 in regards to
> P4 over TC and now that the market has chosen P4 as the datapath specification
> lingua franca

Which market?

Barely anyone understands the existing TC offloads. We'd need strong,
and practical reasons to merge this. Speaking with my "have suffered
thru the TC offloads working for a vendor" hat on, not the "junior
maintainer" hat.
Jamal Hadi Salim Jan. 27, 2023, 1:33 p.m. UTC | #2
On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > There have been many discussions and meetings since about 2015 in regards to
> > P4 over TC and now that the market has chosen P4 as the datapath specification
> > lingua franca
>
> Which market?

Network programmability involving hardware  - where at minimal the
specification of the datapath is in P4 and
often the implementation is. For samples of specification using P4
(that are public) see for example MS Azure:
https://github.com/sonic-net/DASH/tree/main/dash-pipeline
If you are a vendor and want to sell a NIC in that space, the spec you
get is in P4. Your underlying hardware
doesnt have to be P4 native, but at minimal the abstraction (as we are
trying to provide with P4TC) has to be
able to consume the P4 specification.
For implementations where P4 is in use, there are many - some public
others not, sample space:
https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks

There are NICs and switches which are P4 native in the market. IOW,
there is beacoup $ investment
in this space that makes it worth pursuing. TC is the kernel offload
mechanism that has gathered deployment
experience over many years - hence P4TC.

> Barely anyone understands the existing TC offloads.

Hyperboles like these are never helpful in a discussion.
TC offloads are deployed today, they work and many folks are actively
working on them.
Are there challenges? yes. For one (and this applies to all kernel
offloads) the process gets
in the way of exposing new features. So there are learnings that we
try to resolve in P4TC.
I'd be curious to hear about your suffering with TC offloads and see
if we can take that
experience and make things better.

>We'd need strong,
> and practical reasons to merge this. Speaking with my "have suffered
> thru the TC offloads working for a vendor" hat on, not the "junior
> maintainer" hat.

P4TC is "standalone" in that it does not affect other TC consumers or
any other subsystems on performance; it is also
sufficiently isolated in that  you can choose to compile it out
altogether and more importantly it comes with committed
support.
And i should emphasize this discussion on getting P4 on TC has been
going on for a few years in the community
culminating with this.

cheers,
jamal
Jakub Kicinski Jan. 27, 2023, 5:18 p.m. UTC | #3
On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote:
> On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:  
> > > There have been many discussions and meetings since about 2015 in regards to
> > > P4 over TC and now that the market has chosen P4 as the datapath specification
> > > lingua franca  
> >
> > Which market?  
> 
> Network programmability involving hardware  - where at minimal the
> specification of the datapath is in P4 and
> often the implementation is. For samples of specification using P4
> (that are public) see for example MS Azure:
> https://github.com/sonic-net/DASH/tree/main/dash-pipeline

That's an IPU thing?

> If you are a vendor and want to sell a NIC in that space, the spec you
> get is in P4.

s/NIC/IPU/ ?

> Your underlying hardware
> doesnt have to be P4 native, but at minimal the abstraction (as we are
> trying to provide with P4TC) has to be
> able to consume the P4 specification.

P4 is certainly an option, especially for specs, but I haven't seen much
adoption myself.
What's the benefit / use case?

> For implementations where P4 is in use, there are many - some public
> others not, sample space:
> https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks

Hyper-scaler proprietary.

> There are NICs and switches which are P4 native in the market.

Link to docs?

> IOW, there is beacoup $ investment in this space that makes it worth pursuing.

Pursuing $ is good! But the community IMO should maximize
a different function.

> TC is the kernel offload mechanism that has gathered deployment
> experience over many years - hence P4TC.

I don't wanna argue. I thought it'd be more fair towards you if I made
my lack of conviction known, rather than sit quiet and ignore it since
it's just an RFC.
Jiri Pirko Jan. 27, 2023, 6:26 p.m. UTC | #4
Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
>On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
>> There have been many discussions and meetings since about 2015 in regards to
>> P4 over TC and now that the market has chosen P4 as the datapath specification
>> lingua franca
>
>Which market?
>
>Barely anyone understands the existing TC offloads. We'd need strong,
>and practical reasons to merge this. Speaking with my "have suffered
>thru the TC offloads working for a vendor" hat on, not the "junior
>maintainer" hat.

You talk about offload, yet I don't see any offload code in this RFC.
It's pure sw implementation.

But speaking about offload, how exactly do you plan to offload this
Jamal? AFAIK there is some HW-specific compiler magic needed to generate
HW acceptable blob. How exactly do you plan to deliver it to the driver?
If HW offload offload is the motivation for this RFC work and we cannot
pass the TC in kernel objects to drivers, I fail to see why exactly do
you need the SW implementation...
Jamal Hadi Salim Jan. 27, 2023, 7:42 p.m. UTC | #5
On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote:
> > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:

[..]
> > Network programmability involving hardware  - where at minimal the
> > specification of the datapath is in P4 and
> > often the implementation is. For samples of specification using P4
> > (that are public) see for example MS Azure:
> > https://github.com/sonic-net/DASH/tree/main/dash-pipeline
>
> That's an IPU thing?
>

Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays
a role there.

> > If you are a vendor and want to sell a NIC in that space, the spec you
> > get is in P4.
>
> s/NIC/IPU/ ?

I do believe that one can write a P4 program to express things a
regular NIC could express
that may be harder to expose with current interfaces.

> > Your underlying hardware
> > doesnt have to be P4 native, but at minimal the abstraction (as we are
> > trying to provide with P4TC) has to be
> > able to consume the P4 specification.
>
> P4 is certainly an option, especially for specs, but I haven't seen much
> adoption myself.

The xPU market outside of hyper-scalers is emerging now. Hyperscalers
looking at xPUs
are looking at P4 as the datapath language - that sets the trend
forward to large enterprises.
That's my experience.
Some of the vendors on the Cc should be able to point to adoption.
Anjali? Matty?

> What's the benefit / use case?

Of P4 or xPUs?
Unified approach to standardize how a datapath is defined is a value for P4.
Providing a singular abstraction via the kernel (as opposed to every
vendor pitching
their API) is what the kernel brings.

> > For implementations where P4 is in use, there are many - some public
> > others not, sample space:
> > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runtime-to-build-smart-networks
>
> Hyper-scaler proprietary.

The control abstraction (P4 runtime) is certainly not proprietary.
The datapath that is targetted by the runtime is.
Hopefully we can fix that with P4TC.
The majority of the discussions i have with some of the folks who do
kernel bypass have one theme in common:
The kernel process is just too long. Trying to add one feature to
flower could take anywhere from 6 months to 3
years to finally show up in some supported distro. With P4TC we are
taking the approach of scriptability to allow
for speacilized datapaths (which P4 excels in). The google datapath
maybe proprietary while their hardware may
even(or not) be using native P4 - but the important detail is we have
_a way_ to abstract those datapaths.

> > There are NICs and switches which are P4 native in the market.
>
> Link to docs?
>

Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The
point is to bring them together
under the linux umbrella.

> > IOW, there is beacoup $ investment in this space that makes it worth pursuing.
>
> Pursuing $ is good! But the community IMO should maximize
> a different function.

While I agree $ is not the primary motivator it is a factor, it is a
good indicator. No different than the network stack
being tweaked to do certain things that certain hyperscalers need
because they invest $.
I have no problems with a large harmonious tent.

cheers,
jamal

> > TC is the kernel offload mechanism that has gathered deployment
> > experience over many years - hence P4TC.
>
> I don't wanna argue. I thought it'd be more fair towards you if I made
> my lack of conviction known, rather than sit quiet and ignore it since
> it's just an RFC.
Jamal Hadi Salim Jan. 27, 2023, 8:04 p.m. UTC | #6
On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> >> There have been many discussions and meetings since about 2015 in regards to
> >> P4 over TC and now that the market has chosen P4 as the datapath specification
> >> lingua franca
> >
> >Which market?
> >
> >Barely anyone understands the existing TC offloads. We'd need strong,
> >and practical reasons to merge this. Speaking with my "have suffered
> >thru the TC offloads working for a vendor" hat on, not the "junior
> >maintainer" hat.
>
> You talk about offload, yet I don't see any offload code in this RFC.
> It's pure sw implementation.
>
> But speaking about offload, how exactly do you plan to offload this
> Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> HW acceptable blob. How exactly do you plan to deliver it to the driver?
> If HW offload offload is the motivation for this RFC work and we cannot
> pass the TC in kernel objects to drivers, I fail to see why exactly do
> you need the SW implementation...

Our rule in TC is: _if you want to offload using TC you must have a
s/w equivalent_.
We enforced this rule multiple times (as you know).
P4TC has a sw equivalent to whatever the hardware would do. We are pushing that
first. Regardless, it has value on its own merit:
I can run P4 equivalent in s/w in a scriptable (as in no compilation
in the same spirit as u32 and pedit),
by programming the kernel datapath without changing any kernel code.

To answer your question in regards to what the interfaces "P4
speaking" hardware or drivers
are going to be programmed, there are discussions going on right now:
There is a strong
leaning towards devlink for the hardware side loading.... The idea
from the driver side is to
reuse the tc ndos.
We have biweekly meetings which are open. We do have Nvidia folks, but
would be great if
we can have you there. Let me find the link and send it to you.
Do note however, our goal is to get s/w first as per tradition of
other offloads with TC .

cheers,
jamal
Stanislav Fomichev Jan. 27, 2023, 10:26 p.m. UTC | #7
On 01/27, Jamal Hadi Salim wrote:
> On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >
> > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > >> There have been many discussions and meetings since about 2015 in  
> regards to
> > >> P4 over TC and now that the market has chosen P4 as the datapath  
> specification
> > >> lingua franca
> > >
> > >Which market?
> > >
> > >Barely anyone understands the existing TC offloads. We'd need strong,
> > >and practical reasons to merge this. Speaking with my "have suffered
> > >thru the TC offloads working for a vendor" hat on, not the "junior
> > >maintainer" hat.
> >
> > You talk about offload, yet I don't see any offload code in this RFC.
> > It's pure sw implementation.
> >
> > But speaking about offload, how exactly do you plan to offload this
> > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > If HW offload offload is the motivation for this RFC work and we cannot
> > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > you need the SW implementation...

> Our rule in TC is: _if you want to offload using TC you must have a
> s/w equivalent_.
> We enforced this rule multiple times (as you know).
> P4TC has a sw equivalent to whatever the hardware would do. We are  
> pushing that
> first. Regardless, it has value on its own merit:
> I can run P4 equivalent in s/w in a scriptable (as in no compilation
> in the same spirit as u32 and pedit),
> by programming the kernel datapath without changing any kernel code.

Not to derail too much, but maybe you can clarify the following for me:
In my (in)experience, P4 is usually constrained by the vendor
specific extensions. So how real is that goal where we can have a generic
P4@TC with an option to offload? In my view, the reality (at least
currently) is that there are NIC-specific P4 programs which won't have
a chance of running generically at TC (unless we implement those vendor
extensions).

And regarding custom parser, someone has to ask that 'what about bpf
question': let's say we have a P4 frontend at TC, can we use bpfilter-like
usermode helper to transparently compile it to bpf (for SW path) instead
inventing yet another packet parser? Wrestling with the verifier won't be
easy here, but I trust it more than this new kParser.

> To answer your question in regards to what the interfaces "P4
> speaking" hardware or drivers
> are going to be programmed, there are discussions going on right now:
> There is a strong
> leaning towards devlink for the hardware side loading.... The idea
> from the driver side is to
> reuse the tc ndos.
> We have biweekly meetings which are open. We do have Nvidia folks, but
> would be great if
> we can have you there. Let me find the link and send it to you.
> Do note however, our goal is to get s/w first as per tradition of
> other offloads with TC .

> cheers,
> jamal
Daniel Borkmann Jan. 27, 2023, 11:02 p.m. UTC | #8
On 1/27/23 9:04 PM, Jamal Hadi Salim wrote:
> On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
>>> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
>>>> There have been many discussions and meetings since about 2015 in regards to
>>>> P4 over TC and now that the market has chosen P4 as the datapath specification
>>>> lingua franca
>>>
>>> Which market?
>>>
>>> Barely anyone understands the existing TC offloads. We'd need strong,
>>> and practical reasons to merge this. Speaking with my "have suffered
>>> thru the TC offloads working for a vendor" hat on, not the "junior
>>> maintainer" hat.
>>
>> You talk about offload, yet I don't see any offload code in this RFC.
>> It's pure sw implementation.
>>
>> But speaking about offload, how exactly do you plan to offload this
>> Jamal? AFAIK there is some HW-specific compiler magic needed to generate
>> HW acceptable blob. How exactly do you plan to deliver it to the driver?
>> If HW offload offload is the motivation for this RFC work and we cannot
>> pass the TC in kernel objects to drivers, I fail to see why exactly do
>> you need the SW implementation...
> 
> Our rule in TC is: _if you want to offload using TC you must have a
> s/w equivalent_.
> We enforced this rule multiple times (as you know).
> P4TC has a sw equivalent to whatever the hardware would do. We are pushing that
> first. Regardless, it has value on its own merit:
> I can run P4 equivalent in s/w in a scriptable (as in no compilation
> in the same spirit as u32 and pedit),

`62001 insertions(+), 45 deletions(-)` and more to come for a software
datapath which in the end no-one will use (assuming you'll have the hw
offloads) is a pretty heavy lift.. imo the layer of abstraction is wrong
here as Stan hinted. What if tomorrow P4 programming language is not the
'lingua franca' anymore and something else comes along? Then all of it is
still baked into uapi instead of having a generic/versatile intermediate
later.
Tom Herbert Jan. 27, 2023, 11:06 p.m. UTC | #9
On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote:
>
> On 01/27, Jamal Hadi Salim wrote:
> > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > >
> > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > >> There have been many discussions and meetings since about 2015 in
> > regards to
> > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > specification
> > > >> lingua franca
> > > >
> > > >Which market?
> > > >
> > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > >and practical reasons to merge this. Speaking with my "have suffered
> > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > >maintainer" hat.
> > >
> > > You talk about offload, yet I don't see any offload code in this RFC.
> > > It's pure sw implementation.
> > >
> > > But speaking about offload, how exactly do you plan to offload this
> > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > If HW offload offload is the motivation for this RFC work and we cannot
> > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > you need the SW implementation...
>
> > Our rule in TC is: _if you want to offload using TC you must have a
> > s/w equivalent_.
> > We enforced this rule multiple times (as you know).
> > P4TC has a sw equivalent to whatever the hardware would do. We are
> > pushing that
> > first. Regardless, it has value on its own merit:
> > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > in the same spirit as u32 and pedit),
> > by programming the kernel datapath without changing any kernel code.
>
> Not to derail too much, but maybe you can clarify the following for me:
> In my (in)experience, P4 is usually constrained by the vendor
> specific extensions. So how real is that goal where we can have a generic
> P4@TC with an option to offload? In my view, the reality (at least
> currently) is that there are NIC-specific P4 programs which won't have
> a chance of running generically at TC (unless we implement those vendor
> extensions).
>
> And regarding custom parser, someone has to ask that 'what about bpf
> question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> usermode helper to transparently compile it to bpf (for SW path) instead
> inventing yet another packet parser? Wrestling with the verifier won't be
> easy here, but I trust it more than this new kParser.

Yes, wrestling with the verifier is tricky, however we do have a
solution to compile arbitrarily complex parsers into eBFP. We
presented this work at Netdev 0x15
https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser.
Of course this has the obvious advantage that we don't have to change
the kernel (however, as we talk about in the presentation, this method
actually produces a faster more extensible parser than flow dissector,
so it's still on my radar to replace flow dissector itself with an
eBPF parser :-) )

The value of kParser is that it is not compiled code, but dynamically
scriptable. It's much easier to change on the fly and depends on a CLI
interface which works well with P4TC. The front end is the same as
what we are using for PANDA parser, that is the same parser frontend
(in C code or other) can be compiled into XDP/eBPF, kParser CLI, or
other targets (this is based on establishing a IR which we talked
about in https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers

Tom

>
>
> > To answer your question in regards to what the interfaces "P4
> > speaking" hardware or drivers
> > are going to be programmed, there are discussions going on right now:
> > There is a strong
> > leaning towards devlink for the hardware side loading.... The idea
> > from the driver side is to
> > reuse the tc ndos.
> > We have biweekly meetings which are open. We do have Nvidia folks, but
> > would be great if
> > we can have you there. Let me find the link and send it to you.
> > Do note however, our goal is to get s/w first as per tradition of
> > other offloads with TC .
>
> > cheers,
> > jamal
Jamal Hadi Salim Jan. 27, 2023, 11:27 p.m. UTC | #10
On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
>
> On 01/27, Jamal Hadi Salim wrote:
> > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > >
> > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > >> There have been many discussions and meetings since about 2015 in
> > regards to
> > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > specification
> > > >> lingua franca
> > > >
> > > >Which market?
> > > >
> > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > >and practical reasons to merge this. Speaking with my "have suffered
> > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > >maintainer" hat.
> > >
> > > You talk about offload, yet I don't see any offload code in this RFC.
> > > It's pure sw implementation.
> > >
> > > But speaking about offload, how exactly do you plan to offload this
> > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > If HW offload offload is the motivation for this RFC work and we cannot
> > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > you need the SW implementation...
>
> > Our rule in TC is: _if you want to offload using TC you must have a
> > s/w equivalent_.
> > We enforced this rule multiple times (as you know).
> > P4TC has a sw equivalent to whatever the hardware would do. We are
> > pushing that
> > first. Regardless, it has value on its own merit:
> > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > in the same spirit as u32 and pedit),
> > by programming the kernel datapath without changing any kernel code.
>
> Not to derail too much, but maybe you can clarify the following for me:
> In my (in)experience, P4 is usually constrained by the vendor
> specific extensions. So how real is that goal where we can have a generic
> P4@TC with an option to offload? In my view, the reality (at least
> currently) is that there are NIC-specific P4 programs which won't have
> a chance of running generically at TC (unless we implement those vendor
> extensions).

We are going to implement all the PSA/PNA externs. Most of these
programs tend to
be set or ALU operations on headers or metadata which we can handle.
Do you have
any examples of NIC-vendor-specific features that cant be generalized?

> And regarding custom parser, someone has to ask that 'what about bpf
> question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> usermode helper to transparently compile it to bpf (for SW path) instead
> inventing yet another packet parser? Wrestling with the verifier won't be
> easy here, but I trust it more than this new kParser.
>

We dont compile anything, the parser (and rest of infra) is scriptable.

cheers,
jamal
Jamal Hadi Salim Jan. 27, 2023, 11:57 p.m. UTC | #11
On Fri, Jan 27, 2023 at 6:02 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/27/23 9:04 PM, Jamal Hadi Salim wrote:
> > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> >>> On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> >>>> There have been many discussions and meetings since about 2015 in regards to
> >>>> P4 over TC and now that the market has chosen P4 as the datapath specification
> >>>> lingua franca
> >>>
> >>> Which market?
> >>>

[..]
> >
> > Our rule in TC is: _if you want to offload using TC you must have a
> > s/w equivalent_.
> > We enforced this rule multiple times (as you know).
> > P4TC has a sw equivalent to whatever the hardware would do. We are pushing that
> > first. Regardless, it has value on its own merit:
> > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > in the same spirit as u32 and pedit),
>
> `62001 insertions(+), 45 deletions(-)` and more to come for a software
> datapath which in the end no-one will use (assuming you'll have the hw
> offloads) is a pretty heavy lift..

I am not sure i fully parsed what you said - but the sw stands on its own
merit. The consumption of P4 specification is one - but ability to define
arbitrary pipelines without changing the kernel code (u32/pedit like, etc) is
of value.
Note (in case i misunderstood what you are saying):
As mentioned there is commitment to support; its clean standalone and
can be compiled out
and even when compiled in has no effect on the rest of the code performance
or otherwise.

> imo the layer of abstraction is wrong
> here as Stan hinted. What if tomorrow P4 programming language is not the
> 'lingua franca' anymore and something else comes along? Then all of it is
> still baked into uapi instead of having a generic/versatile intermediate
> later.

Match-action pipeline as an approach to defining datapaths is what we
implement here.
It is what P4 defines. I dont think P4 covers everything that is
needed under the shining sun
but a lot of effort has gone into standardizing common things. And if
there are gaps we fill them.
That is a solid, well understood way to build hardware and sw (TC has
been around all these
years implementing that paradigm). So that is the intended abstraction
being implemented.
The interface is designed to be scriptable to remove the burden of
making kernel (and btw user space
as well to iproute2) changes for new processing functions (whether in
s/w or hardware).

cheers,
jamal
Stanislav Fomichev Jan. 28, 2023, 12:47 a.m. UTC | #12
On Fri, Jan 27, 2023 at 3:06 PM Tom Herbert <tom@sipanda.io> wrote:
>
> On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote:
> >
> > On 01/27, Jamal Hadi Salim wrote:
> > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > >
> > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > >> There have been many discussions and meetings since about 2015 in
> > > regards to
> > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > specification
> > > > >> lingua franca
> > > > >
> > > > >Which market?
> > > > >
> > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > >maintainer" hat.
> > > >
> > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > It's pure sw implementation.
> > > >
> > > > But speaking about offload, how exactly do you plan to offload this
> > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > you need the SW implementation...
> >
> > > Our rule in TC is: _if you want to offload using TC you must have a
> > > s/w equivalent_.
> > > We enforced this rule multiple times (as you know).
> > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > pushing that
> > > first. Regardless, it has value on its own merit:
> > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > in the same spirit as u32 and pedit),
> > > by programming the kernel datapath without changing any kernel code.
> >
> > Not to derail too much, but maybe you can clarify the following for me:
> > In my (in)experience, P4 is usually constrained by the vendor
> > specific extensions. So how real is that goal where we can have a generic
> > P4@TC with an option to offload? In my view, the reality (at least
> > currently) is that there are NIC-specific P4 programs which won't have
> > a chance of running generically at TC (unless we implement those vendor
> > extensions).
> >
> > And regarding custom parser, someone has to ask that 'what about bpf
> > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > usermode helper to transparently compile it to bpf (for SW path) instead
> > inventing yet another packet parser? Wrestling with the verifier won't be
> > easy here, but I trust it more than this new kParser.
>
> Yes, wrestling with the verifier is tricky, however we do have a
> solution to compile arbitrarily complex parsers into eBFP. We
> presented this work at Netdev 0x15
> https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser.

Thanks Tom, I'll check it out. I've yet to go through the netdev recordings :-(

> Of course this has the obvious advantage that we don't have to change
> the kernel (however, as we talk about in the presentation, this method
> actually produces a faster more extensible parser than flow dissector,
> so it's still on my radar to replace flow dissector itself with an
> eBPF parser :-) )

Since there is already a bpf flow dissector, I'm assuming you're
talking about replacing the existing C flow dissector with a
PANDA-based one?
I was hoping that at some point, we can have a BPF flow dissector
program that supports everything the existing C-one does, and maybe we
can ship this program with the kernel and load it by default. We can
keep the C-based one for some minimal non-bpf configurations. But idk,
the benefit is not 100% clear to me; except maybe bpf-based flow
dissector can be treated as more "secure" due to all verifier
constraints...

> The value of kParser is that it is not compiled code, but dynamically
> scriptable. It's much easier to change on the fly and depends on a CLI
> interface which works well with P4TC. The front end is the same as
> what we are using for PANDA parser, that is the same parser frontend
> (in C code or other) can be compiled into XDP/eBPF, kParser CLI, or
> other targets (this is based on establishing a IR which we talked
> about in https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers

That seems like a technicality? A BPF-based parser can also be driven
by maps/tables; or, worst case, can be recompiled and replaced on the
fly without any downtime.


> Tom
>
> >
> >
> > > To answer your question in regards to what the interfaces "P4
> > > speaking" hardware or drivers
> > > are going to be programmed, there are discussions going on right now:
> > > There is a strong
> > > leaning towards devlink for the hardware side loading.... The idea
> > > from the driver side is to
> > > reuse the tc ndos.
> > > We have biweekly meetings which are open. We do have Nvidia folks, but
> > > would be great if
> > > we can have you there. Let me find the link and send it to you.
> > > Do note however, our goal is to get s/w first as per tradition of
> > > other offloads with TC .
> >
> > > cheers,
> > > jamal
Stanislav Fomichev Jan. 28, 2023, 12:47 a.m. UTC | #13
On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
> >
> > On 01/27, Jamal Hadi Salim wrote:
> > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > >
> > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > >> There have been many discussions and meetings since about 2015 in
> > > regards to
> > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > specification
> > > > >> lingua franca
> > > > >
> > > > >Which market?
> > > > >
> > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > >maintainer" hat.
> > > >
> > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > It's pure sw implementation.
> > > >
> > > > But speaking about offload, how exactly do you plan to offload this
> > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > you need the SW implementation...
> >
> > > Our rule in TC is: _if you want to offload using TC you must have a
> > > s/w equivalent_.
> > > We enforced this rule multiple times (as you know).
> > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > pushing that
> > > first. Regardless, it has value on its own merit:
> > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > in the same spirit as u32 and pedit),
> > > by programming the kernel datapath without changing any kernel code.
> >
> > Not to derail too much, but maybe you can clarify the following for me:
> > In my (in)experience, P4 is usually constrained by the vendor
> > specific extensions. So how real is that goal where we can have a generic
> > P4@TC with an option to offload? In my view, the reality (at least
> > currently) is that there are NIC-specific P4 programs which won't have
> > a chance of running generically at TC (unless we implement those vendor
> > extensions).
>
> We are going to implement all the PSA/PNA externs. Most of these
> programs tend to
> be set or ALU operations on headers or metadata which we can handle.
> Do you have
> any examples of NIC-vendor-specific features that cant be generalized?

I don't think I can share more without giving away something that I
shouldn't give away :-)
But IIUC, and I might be missing something, it's totally within the
standard for vendors to differentiate and provide non-standard
'extern' extensions.
I'm mostly wondering what are your thoughts on this. If I have a p4
program depending on one of these externs, we can't sw-emulate it
unless we also implement the extension. Are we gonna ask NICs that
have those custom extensions to provide a SW implementation as well?
Or are we going to prohibit vendors to differentiate that way?

> > And regarding custom parser, someone has to ask that 'what about bpf
> > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > usermode helper to transparently compile it to bpf (for SW path) instead
> > inventing yet another packet parser? Wrestling with the verifier won't be
> > easy here, but I trust it more than this new kParser.
> >
>
> We dont compile anything, the parser (and rest of infra) is scriptable.

As I've replied to Tom, that seems like a technicality. BPF programs
can also be scriptable with some maps/tables. Or it can be made to
look like "scriptable" by recompiling it on every configuration change
and updating it on the fly. Or am I missing something?

Can we have a P4TC frontend and whenever configuration is updated, we
upcall into userspace to compile this whatever p4 representation into
whatever bpf bytecode that we then run. No new/custom/scriptable
parsers needed.

> cheers,
> jamal
Tom Herbert Jan. 28, 2023, 1:32 a.m. UTC | #14
On Fri, Jan 27, 2023 at 4:47 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:06 PM Tom Herbert <tom@sipanda.io> wrote:
> >
> > On Fri, Jan 27, 2023 at 2:26 PM <sdf@google.com> wrote:
> > >
> > > On 01/27, Jamal Hadi Salim wrote:
> > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > > >
> > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > > >> There have been many discussions and meetings since about 2015 in
> > > > regards to
> > > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > > specification
> > > > > >> lingua franca
> > > > > >
> > > > > >Which market?
> > > > > >
> > > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > > >maintainer" hat.
> > > > >
> > > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > > It's pure sw implementation.
> > > > >
> > > > > But speaking about offload, how exactly do you plan to offload this
> > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > > you need the SW implementation...
> > >
> > > > Our rule in TC is: _if you want to offload using TC you must have a
> > > > s/w equivalent_.
> > > > We enforced this rule multiple times (as you know).
> > > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > > pushing that
> > > > first. Regardless, it has value on its own merit:
> > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > > in the same spirit as u32 and pedit),
> > > > by programming the kernel datapath without changing any kernel code.
> > >
> > > Not to derail too much, but maybe you can clarify the following for me:
> > > In my (in)experience, P4 is usually constrained by the vendor
> > > specific extensions. So how real is that goal where we can have a generic
> > > P4@TC with an option to offload? In my view, the reality (at least
> > > currently) is that there are NIC-specific P4 programs which won't have
> > > a chance of running generically at TC (unless we implement those vendor
> > > extensions).
> > >
> > > And regarding custom parser, someone has to ask that 'what about bpf
> > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > > usermode helper to transparently compile it to bpf (for SW path) instead
> > > inventing yet another packet parser? Wrestling with the verifier won't be
> > > easy here, but I trust it more than this new kParser.
> >
> > Yes, wrestling with the verifier is tricky, however we do have a
> > solution to compile arbitrarily complex parsers into eBFP. We
> > presented this work at Netdev 0x15
> > https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser.
>
> Thanks Tom, I'll check it out. I've yet to go through the netdev recordings :-(
>
> > Of course this has the obvious advantage that we don't have to change
> > the kernel (however, as we talk about in the presentation, this method
> > actually produces a faster more extensible parser than flow dissector,
> > so it's still on my radar to replace flow dissector itself with an
> > eBPF parser :-) )
>
> Since there is already a bpf flow dissector, I'm assuming you're
> talking about replacing the existing C flow dissector with a
> PANDA-based one?

Yes

> I was hoping that at some point, we can have a BPF flow dissector
> program that supports everything the existing C-one does, and maybe we
> can ship this program with the kernel and load it by default.

Yes, we have that. Actually, we can provide a superset to include
things like TCP options which flow dissector doesn't support

> We can
> keep the C-based one for some minimal non-bpf configurations. But idk,
> the benefit is not 100% clear to me; except maybe bpf-based flow
> dissector can be treated as more "secure" due to all verifier
> constraints...

Not just more secure, more robust and extensible. I call flow
dissector the "function we love to hate". On one hand it has proven to
be incredibly useful, on the other hand it's been a major pain to
maintain and isn't remotely extensible. We have seen many problems
over the years, particularly when people have added support for less
common protocols. Collapsing all the protocol layers, ensuring that
the bookkeeping is correct, and trying to maintain some reasonable
level of performance has led to it being spaghetti code (I wrote the
first instantiation of flow dissector for RPS, so I accept my fair
share of blame for the predicament of flow dissector :-) ). The
optimized eBPF code we're generating also qualifies as spaghetti code
(i.e. a whole bunch of loop unrolling, inlining tables, and so on).
The difference is that the front end code in PANDA-C, is well
organized and abstracts out all the bookkeeping so that the programmer
doesn't have to worry about it.

>
> > The value of kParser is that it is not compiled code, but dynamically
> > scriptable. It's much easier to change on the fly and depends on a CLI
> > interface which works well with P4TC. The front end is the same as
> > what we are using for PANDA parser, that is the same parser frontend
> > (in C code or other) can be compiled into XDP/eBPF, kParser CLI, or
> > other targets (this is based on establishing a IR which we talked
> > about in https://myfoobar2022.sched.com/event/1BhCX/high-performance-programmable-parsers
>
> That seems like a technicality? A BPF-based parser can also be driven
> by maps/tables; or, worst case, can be recompiled and replaced on the
> fly without any downtime.

Perhaps. Also, in the spirit of full transparency, kParser is in its
nature interpreted, so we have to expect that it will have lower
performance than an optimized compiled parser.

Tom

>
>
> > Tom
> >
> > >
> > >
> > > > To answer your question in regards to what the interfaces "P4
> > > > speaking" hardware or drivers
> > > > are going to be programmed, there are discussions going on right now:
> > > > There is a strong
> > > > leaning towards devlink for the hardware side loading.... The idea
> > > > from the driver side is to
> > > > reuse the tc ndos.
> > > > We have biweekly meetings which are open. We do have Nvidia folks, but
> > > > would be great if
> > > > we can have you there. Let me find the link and send it to you.
> > > > Do note however, our goal is to get s/w first as per tradition of
> > > > other offloads with TC .
> > >
> > > > cheers,
> > > > jamal
Singhai, Anjali Jan. 28, 2023, 1:34 a.m. UTC | #15
P4 is definitely the language of choice for defining a Dataplane in HW for IPUs/DPUs/FNICs and Switches. As a vendor I can definitely say that the smart devices implement a very programmable ASIC as each customer Dataplane defers quite a bit and P4 is the language of choice for specifying the Dataplane definitions. A lot of customer deploy proprietary protocols that run in HW and there is no good way right now in kernel to support these proprietary protcols. If we enable these protocol in the kernel it takes a huge effort and they don’t evolve well.
Being able to define in P4 and offload into HW using tc mechanism really helps in supporting the customer's Dataplane and protcols without having to wait months and years to get the kernel updated. Here is a link to our IPU offering that is P4 programmable
 https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
Here are some other useful links
https://ipdk.io/

Anjali

-----Original Message-----
From: Jamal Hadi Salim <hadi@mojatatu.com> 
Sent: Friday, January 27, 2023 11:43 AM
To: Jakub Kicinski <kuba@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com>
Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC

On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote:
> > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:

[..]
> > Network programmability involving hardware  - where at minimal the 
> > specification of the datapath is in P4 and often the implementation 
> > is. For samples of specification using P4 (that are public) see for 
> > example MS Azure:
> > https://github.com/sonic-net/DASH/tree/main/dash-pipeline
>
> That's an IPU thing?
>

Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays a role there.

> > If you are a vendor and want to sell a NIC in that space, the spec 
> > you get is in P4.
>
> s/NIC/IPU/ ?

I do believe that one can write a P4 program to express things a regular NIC could express that may be harder to expose with current interfaces.

> > Your underlying hardware
> > doesnt have to be P4 native, but at minimal the abstraction (as we 
> > are trying to provide with P4TC) has to be able to consume the P4 
> > specification.
>
> P4 is certainly an option, especially for specs, but I haven't seen 
> much adoption myself.

The xPU market outside of hyper-scalers is emerging now. Hyperscalers looking at xPUs are looking at P4 as the datapath language - that sets the trend forward to large enterprises.
That's my experience.
Some of the vendors on the Cc should be able to point to adoption.
Anjali? Matty?

> What's the benefit / use case?

Of P4 or xPUs?
Unified approach to standardize how a datapath is defined is a value for P4.
Providing a singular abstraction via the kernel (as opposed to every vendor pitching their API) is what the kernel brings.

> > For implementations where P4 is in use, there are many - some public 
> > others not, sample space:
> > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runt
> > ime-to-build-smart-networks
>
> Hyper-scaler proprietary.

The control abstraction (P4 runtime) is certainly not proprietary.
The datapath that is targetted by the runtime is.
Hopefully we can fix that with P4TC.
The majority of the discussions i have with some of the folks who do kernel bypass have one theme in common:
The kernel process is just too long. Trying to add one feature to flower could take anywhere from 6 months to 3 years to finally show up in some supported distro. With P4TC we are taking the approach of scriptability to allow for speacilized datapaths (which P4 excels in). The google datapath maybe proprietary while their hardware may even(or not) be using native P4 - but the important detail is we have _a way_ to abstract those datapaths.

> > There are NICs and switches which are P4 native in the market.
>
> Link to docs?
>

Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The point is to bring them together under the linux umbrella.

> > IOW, there is beacoup $ investment in this space that makes it worth pursuing.
>
> Pursuing $ is good! But the community IMO should maximize a different 
> function.

While I agree $ is not the primary motivator it is a factor, it is a good indicator. No different than the network stack being tweaked to do certain things that certain hyperscalers need because they invest $.
I have no problems with a large harmonious tent.

cheers,
jamal

> > TC is the kernel offload mechanism that has gathered deployment 
> > experience over many years - hence P4TC.
>
> I don't wanna argue. I thought it'd be more fair towards you if I made 
> my lack of conviction known, rather than sit quiet and ignore it since 
> it's just an RFC.
Willem de Bruijn Jan. 28, 2023, 1:37 p.m. UTC | #16
On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
> > >
> > > On 01/27, Jamal Hadi Salim wrote:
> > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > > >
> > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > > >> There have been many discussions and meetings since about 2015 in
> > > > regards to
> > > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > > specification
> > > > > >> lingua franca
> > > > > >
> > > > > >Which market?
> > > > > >
> > > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > > >maintainer" hat.
> > > > >
> > > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > > It's pure sw implementation.
> > > > >
> > > > > But speaking about offload, how exactly do you plan to offload this
> > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > > you need the SW implementation...
> > >
> > > > Our rule in TC is: _if you want to offload using TC you must have a
> > > > s/w equivalent_.
> > > > We enforced this rule multiple times (as you know).
> > > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > > pushing that
> > > > first. Regardless, it has value on its own merit:
> > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > > in the same spirit as u32 and pedit),
> > > > by programming the kernel datapath without changing any kernel code.
> > >
> > > Not to derail too much, but maybe you can clarify the following for me:
> > > In my (in)experience, P4 is usually constrained by the vendor
> > > specific extensions. So how real is that goal where we can have a generic
> > > P4@TC with an option to offload? In my view, the reality (at least
> > > currently) is that there are NIC-specific P4 programs which won't have
> > > a chance of running generically at TC (unless we implement those vendor
> > > extensions).
> >
> > We are going to implement all the PSA/PNA externs. Most of these
> > programs tend to
> > be set or ALU operations on headers or metadata which we can handle.
> > Do you have
> > any examples of NIC-vendor-specific features that cant be generalized?
>
> I don't think I can share more without giving away something that I
> shouldn't give away :-)
> But IIUC, and I might be missing something, it's totally within the
> standard for vendors to differentiate and provide non-standard
> 'extern' extensions.
> I'm mostly wondering what are your thoughts on this. If I have a p4
> program depending on one of these externs, we can't sw-emulate it
> unless we also implement the extension. Are we gonna ask NICs that
> have those custom extensions to provide a SW implementation as well?
> Or are we going to prohibit vendors to differentiate that way?
>
> > > And regarding custom parser, someone has to ask that 'what about bpf
> > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > > usermode helper to transparently compile it to bpf (for SW path) instead
> > > inventing yet another packet parser? Wrestling with the verifier won't be
> > > easy here, but I trust it more than this new kParser.
> > >
> >
> > We dont compile anything, the parser (and rest of infra) is scriptable.
>
> As I've replied to Tom, that seems like a technicality. BPF programs
> can also be scriptable with some maps/tables. Or it can be made to
> look like "scriptable" by recompiling it on every configuration change
> and updating it on the fly. Or am I missing something?
>
> Can we have a P4TC frontend and whenever configuration is updated, we
> upcall into userspace to compile this whatever p4 representation into
> whatever bpf bytecode that we then run. No new/custom/scriptable
> parsers needed.

I would also think that if we need another programmable component in
the kernel, that this would be based on BPF, and compiled outside the
kernel.

Is the argument for an explicit TC objects API purely that this API
can be passed through to hardware, as well as implemented in the
kernel directly? Something that would be lost if the datapath is
implement as a single BPF program at the TC hook.

Can you elaborate some more why this needs yet another in-kernel
parser separate from BPF? The flow dissection case is solved fine by
the BPF flow dissector. (I also hope one day the kernel can load a BPF
dissector by default and we avoid the majority of the unsafe C code
entirely.)
Jamal Hadi Salim Jan. 28, 2023, 1:41 p.m. UTC | #17
On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
> > >
> > > On 01/27, Jamal Hadi Salim wrote:
> > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > > >
> > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:

[..]
> > > Not to derail too much, but maybe you can clarify the following for me:
> > > In my (in)experience, P4 is usually constrained by the vendor
> > > specific extensions. So how real is that goal where we can have a generic
> > > P4@TC with an option to offload? In my view, the reality (at least
> > > currently) is that there are NIC-specific P4 programs which won't have
> > > a chance of running generically at TC (unless we implement those vendor
> > > extensions).
> >
> > We are going to implement all the PSA/PNA externs. Most of these
> > programs tend to
> > be set or ALU operations on headers or metadata which we can handle.
> > Do you have
> > any examples of NIC-vendor-specific features that cant be generalized?
>
> I don't think I can share more without giving away something that I
> shouldn't give away :-)

Fair enough.

> But IIUC, and I might be missing something, it's totally within the
> standard for vendors to differentiate and provide non-standard
> 'extern' extensions.
> I'm mostly wondering what are your thoughts on this. If I have a p4
> program depending on one of these externs, we can't sw-emulate it
> unless we also implement the extension. Are we gonna ask NICs that
> have those custom extensions to provide a SW implementation as well?
> Or are we going to prohibit vendors to differentiate that way?
>

It will dilute the value to prohibit any extern.
What you referred to as "differentiation" is most of the time just
implementation
differences i.e someone may use a TCAM vs SRAM or some specific hw
to implement crypto foobar; however, the "signature" of the extern is
no different
in its abstraction than an action. IOW, an Input X would produce an output Y in
an extern regardless of the black box implementation.
I understand the cases where some vendor may have some ASIC features that
noone else cares about and that said functions can be exposed as externs.
We really dont want these to be part of kernel proper.

In our templating above would mean using the command abstraction to
create the extern.

There are three threads:
1) PSA/PNA externs like crc, checksums, hash etc. Those are part of P4TC as
template commands. They are defined in the generic spec, they are not
vendor specific
and for almost all cases there's already kernel code that implements their
features. So we will make them accessible to P4 programs.
Vendor specific - we dont want them to be part of P4TC and we provide two
ways to address them.
2) We can emulate them without offering the equivalent functionality just so
someone can load a P4 program. This will work with P4TC as is today
but it means for that extern you dont have functional equivalence to hardware.
3) Commands, to be specific for externs can be written as kernel modules.
It's not my favorite option since we want everything to be scriptable but it
is an option available.

cheers,
jamal




> > > And regarding custom parser, someone has to ask that 'what about bpf
> > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > > usermode helper to transparently compile it to bpf (for SW path) instead
> > > inventing yet another packet parser? Wrestling with the verifier won't be
> > > easy here, but I trust it more than this new kParser.
> > >
> >
> > We dont compile anything, the parser (and rest of infra) is scriptable.
>
> As I've replied to Tom, that seems like a technicality. BPF programs
> can also be scriptable with some maps/tables. Or it can be made to
> look like "scriptable" by recompiling it on every configuration change
> and updating it on the fly. Or am I missing something?
>
> Can we have a P4TC frontend and whenever configuration is updated, we
> upcall into userspace to compile this whatever p4 representation into
> whatever bpf bytecode that we then run. No new/custom/scriptable
> parsers needed.
>
> > cheers,
> > jamal
Jamal Hadi Salim Jan. 28, 2023, 3:10 p.m. UTC | #18
On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote:
>
> On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
> > > >
> > > > On 01/27, Jamal Hadi Salim wrote:
> > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > > > >
> > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > > > >> There have been many discussions and meetings since about 2015 in
> > > > > regards to
> > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > > > specification
> > > > > > >> lingua franca
> > > > > > >
> > > > > > >Which market?
> > > > > > >
> > > > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > > > >maintainer" hat.
> > > > > >
> > > > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > > > It's pure sw implementation.
> > > > > >
> > > > > > But speaking about offload, how exactly do you plan to offload this
> > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > > > you need the SW implementation...
> > > >
> > > > > Our rule in TC is: _if you want to offload using TC you must have a
> > > > > s/w equivalent_.
> > > > > We enforced this rule multiple times (as you know).
> > > > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > > > pushing that
> > > > > first. Regardless, it has value on its own merit:
> > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > > > in the same spirit as u32 and pedit),
> > > > > by programming the kernel datapath without changing any kernel code.
> > > >
> > > > Not to derail too much, but maybe you can clarify the following for me:
> > > > In my (in)experience, P4 is usually constrained by the vendor
> > > > specific extensions. So how real is that goal where we can have a generic
> > > > P4@TC with an option to offload? In my view, the reality (at least
> > > > currently) is that there are NIC-specific P4 programs which won't have
> > > > a chance of running generically at TC (unless we implement those vendor
> > > > extensions).
> > >
> > > We are going to implement all the PSA/PNA externs. Most of these
> > > programs tend to
> > > be set or ALU operations on headers or metadata which we can handle.
> > > Do you have
> > > any examples of NIC-vendor-specific features that cant be generalized?
> >
> > I don't think I can share more without giving away something that I
> > shouldn't give away :-)
> > But IIUC, and I might be missing something, it's totally within the
> > standard for vendors to differentiate and provide non-standard
> > 'extern' extensions.
> > I'm mostly wondering what are your thoughts on this. If I have a p4
> > program depending on one of these externs, we can't sw-emulate it
> > unless we also implement the extension. Are we gonna ask NICs that
> > have those custom extensions to provide a SW implementation as well?
> > Or are we going to prohibit vendors to differentiate that way?
> >
> > > > And regarding custom parser, someone has to ask that 'what about bpf
> > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > > > usermode helper to transparently compile it to bpf (for SW path) instead
> > > > inventing yet another packet parser? Wrestling with the verifier won't be
> > > > easy here, but I trust it more than this new kParser.
> > > >
> > >
> > > We dont compile anything, the parser (and rest of infra) is scriptable.
> >
> > As I've replied to Tom, that seems like a technicality. BPF programs
> > can also be scriptable with some maps/tables. Or it can be made to
> > look like "scriptable" by recompiling it on every configuration change
> > and updating it on the fly. Or am I missing something?
> >
> > Can we have a P4TC frontend and whenever configuration is updated, we
> > upcall into userspace to compile this whatever p4 representation into
> > whatever bpf bytecode that we then run. No new/custom/scriptable
> > parsers needed.
>
> I would also think that if we need another programmable component in
> the kernel, that this would be based on BPF, and compiled outside the
> kernel.
>
> Is the argument for an explicit TC objects API purely that this API
> can be passed through to hardware, as well as implemented in the
> kernel directly? Something that would be lost if the datapath is
> implement as a single BPF program at the TC hook.
>

We use the skip_sw and skip_hw knobs in tc to indicate whether a
policy is targeting hw or sw. Not sure if you are familiar with it but its
been around (and deployed) for a few years now. So a P4 program
policy can target either.

In regards to the parser - we need a scriptable parser which is offered
by kparser in kernel. P4 doesnt describe how to offload the parser
just the matches and actions; however, as Tom alluded there's nothing
that obstructs us offer the same tc controls to offload the parser or pieces
of it.

cheers,
jamal

> Can you elaborate some more why this needs yet another in-kernel
> parser separate from BPF? The flow dissection case is solved fine by
> the BPF flow dissector. (I also hope one day the kernel can load a BPF
> dissector by default and we avoid the majority of the unsafe C code
> entirely.)
Willem de Bruijn Jan. 28, 2023, 3:33 p.m. UTC | #19
On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 7:48 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 5:26 PM <sdf@google.com> wrote:
> > > > >
> > > > > On 01/27, Jamal Hadi Salim wrote:
> > > > > > On Fri, Jan 27, 2023 at 1:26 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > > > > > >
> > > > > > > Fri, Jan 27, 2023 at 12:30:22AM CET, kuba@kernel.org wrote:
> > > > > > > >On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
> > > > > > > >> There have been many discussions and meetings since about 2015 in
> > > > > > regards to
> > > > > > > >> P4 over TC and now that the market has chosen P4 as the datapath
> > > > > > specification
> > > > > > > >> lingua franca
> > > > > > > >
> > > > > > > >Which market?
> > > > > > > >
> > > > > > > >Barely anyone understands the existing TC offloads. We'd need strong,
> > > > > > > >and practical reasons to merge this. Speaking with my "have suffered
> > > > > > > >thru the TC offloads working for a vendor" hat on, not the "junior
> > > > > > > >maintainer" hat.
> > > > > > >
> > > > > > > You talk about offload, yet I don't see any offload code in this RFC.
> > > > > > > It's pure sw implementation.
> > > > > > >
> > > > > > > But speaking about offload, how exactly do you plan to offload this
> > > > > > > Jamal? AFAIK there is some HW-specific compiler magic needed to generate
> > > > > > > HW acceptable blob. How exactly do you plan to deliver it to the driver?
> > > > > > > If HW offload offload is the motivation for this RFC work and we cannot
> > > > > > > pass the TC in kernel objects to drivers, I fail to see why exactly do
> > > > > > > you need the SW implementation...
> > > > >
> > > > > > Our rule in TC is: _if you want to offload using TC you must have a
> > > > > > s/w equivalent_.
> > > > > > We enforced this rule multiple times (as you know).
> > > > > > P4TC has a sw equivalent to whatever the hardware would do. We are
> > > > > > pushing that
> > > > > > first. Regardless, it has value on its own merit:
> > > > > > I can run P4 equivalent in s/w in a scriptable (as in no compilation
> > > > > > in the same spirit as u32 and pedit),
> > > > > > by programming the kernel datapath without changing any kernel code.
> > > > >
> > > > > Not to derail too much, but maybe you can clarify the following for me:
> > > > > In my (in)experience, P4 is usually constrained by the vendor
> > > > > specific extensions. So how real is that goal where we can have a generic
> > > > > P4@TC with an option to offload? In my view, the reality (at least
> > > > > currently) is that there are NIC-specific P4 programs which won't have
> > > > > a chance of running generically at TC (unless we implement those vendor
> > > > > extensions).
> > > >
> > > > We are going to implement all the PSA/PNA externs. Most of these
> > > > programs tend to
> > > > be set or ALU operations on headers or metadata which we can handle.
> > > > Do you have
> > > > any examples of NIC-vendor-specific features that cant be generalized?
> > >
> > > I don't think I can share more without giving away something that I
> > > shouldn't give away :-)
> > > But IIUC, and I might be missing something, it's totally within the
> > > standard for vendors to differentiate and provide non-standard
> > > 'extern' extensions.
> > > I'm mostly wondering what are your thoughts on this. If I have a p4
> > > program depending on one of these externs, we can't sw-emulate it
> > > unless we also implement the extension. Are we gonna ask NICs that
> > > have those custom extensions to provide a SW implementation as well?
> > > Or are we going to prohibit vendors to differentiate that way?
> > >
> > > > > And regarding custom parser, someone has to ask that 'what about bpf
> > > > > question': let's say we have a P4 frontend at TC, can we use bpfilter-like
> > > > > usermode helper to transparently compile it to bpf (for SW path) instead
> > > > > inventing yet another packet parser? Wrestling with the verifier won't be
> > > > > easy here, but I trust it more than this new kParser.
> > > > >
> > > >
> > > > We dont compile anything, the parser (and rest of infra) is scriptable.
> > >
> > > As I've replied to Tom, that seems like a technicality. BPF programs
> > > can also be scriptable with some maps/tables. Or it can be made to
> > > look like "scriptable" by recompiling it on every configuration change
> > > and updating it on the fly. Or am I missing something?
> > >
> > > Can we have a P4TC frontend and whenever configuration is updated, we
> > > upcall into userspace to compile this whatever p4 representation into
> > > whatever bpf bytecode that we then run. No new/custom/scriptable
> > > parsers needed.
> >
> > I would also think that if we need another programmable component in
> > the kernel, that this would be based on BPF, and compiled outside the
> > kernel.
> >
> > Is the argument for an explicit TC objects API purely that this API
> > can be passed through to hardware, as well as implemented in the
> > kernel directly? Something that would be lost if the datapath is
> > implement as a single BPF program at the TC hook.
> >
>
> We use the skip_sw and skip_hw knobs in tc to indicate whether a
> policy is targeting hw or sw. Not sure if you are familiar with it but its
> been around (and deployed) for a few years now. So a P4 program
> policy can target either.

I know. So the only reason the kernel ABI needs to be extended with P4
objects is to be able to pass the same commands to hardware. The whole
kernel dataplane could be implemented as a BPF program, correct?

> In regards to the parser - we need a scriptable parser which is offered
> by kparser in kernel. P4 doesnt describe how to offload the parser
> just the matches and actions; however, as Tom alluded there's nothing
> that obstructs us offer the same tc controls to offload the parser or pieces
> of it.

And this is the only reason that the parser needs to be in the kernel.
Because the API is at the kernel ABI level. If the P4 program is compiled
to BPF in userspace, then the parser would be compiled in userspace
too. A preferable option, as it would not require adding yet another
parser in C in the kernel.

I understand the value of PANDA as a high level declarative language
to describe network protocols. I'm just trying to get more explicit
why compilation from PANDA to BPF is not sufficient for your use-case.


> cheers,
> jamal
>
> > Can you elaborate some more why this needs yet another in-kernel
> > parser separate from BPF? The flow dissection case is solved fine by
> > the BPF flow dissector. (I also hope one day the kernel can load a BPF
> > dissector by default and we avoid the majority of the unsafe C code
> > entirely.)
Tom Herbert Jan. 28, 2023, 9:17 p.m. UTC | #20
On Fri, Jan 27, 2023 at 5:34 PM Singhai, Anjali
<anjali.singhai@intel.com> wrote:
>
> P4 is definitely the language of choice for defining a Dataplane in HW for IPUs/DPUs/FNICs and Switches. As a vendor I can definitely say that the smart devices implement a very programmable ASIC as each customer Dataplane defers quite a bit and P4 is the language of choice for specifying the Dataplane definitions. A lot of customer deploy proprietary protocols that run in HW and there is no good way right now in kernel to support these proprietary protcols. If we enable these protocol in the kernel it takes a huge effort and they don’t evolve well.
> Being able to define in P4 and offload into HW using tc mechanism really helps in supporting the customer's Dataplane and protcols without having to wait months and years to get the kernel updated. Here is a link to our IPU offering that is P4 programmable

Anjali,

P4 may be the language of choice for programming HW datapath, however
it's not the language of choice for programming SW datapaths-- that's
C over XDP/eBPF. And while XDP/eBPF also doesn't depend on kernel
updates, it has a major advantage over P4 in that it doesn't require
fancy hardware either.

Even at full data center deployment of P4 devices, there will be at
least an order of magnitude more deployment of SW programmed
datapaths; and unless someone is using P4 hardware, there's zero value
in rewriting programs in P4 instead of C. IMO, we will never see
networking developers moving to P4 en masse-- P4 will always be a
niche market relative to the programmable datapath space and the skill
sets required to support serious scalable deployment. That being said,
there will be a nontrivial contingent of users who need to run the
same programs in both SW and HW environments. Expecting them to
maintain two very different code bases to support two disparate models
is costly and prohibitive to them. So for their benefit, we need a
solution to reconcile these two models. P4TC is one means to
accomplish that.

We want to consider both the permutations: 1) compile C code to run in
P4 hardware 2) compile P4 to run in SW. If we establish a common IR,
then we can generalize the problem: programmer writes their datapath
in the language of their choosing (P4, C, Python, Rust, etc.), they
compile the program to whatever backend they are using (HW, SW,
XDP/eBPF, etc.). The P4TC CLI serves as one such IR as there's nothing
that prevents someone from compiling a program from another language
to the CLI (for instance, we've implemented the compiler to output the
parser CLI from PANDA-C). The CLI natively runs in kernel SW, and with
the right hooks could be offloaded to HW-- not just P4 hardware but
potentially other hardware targets as well.

Tom

>  https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> Here are some other useful links
> https://ipdk.io/
>
> Anjali
>
> -----Original Message-----
> From: Jamal Hadi Salim <hadi@mojatatu.com>
> Sent: Friday, January 27, 2023 11:43 AM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com>
> Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC
>
> On Fri, Jan 27, 2023 at 12:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 27 Jan 2023 08:33:39 -0500 Jamal Hadi Salim wrote:
> > > On Thu, Jan 26, 2023 at 6:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > On Tue, 24 Jan 2023 12:03:46 -0500 Jamal Hadi Salim wrote:
>
> [..]
> > > Network programmability involving hardware  - where at minimal the
> > > specification of the datapath is in P4 and often the implementation
> > > is. For samples of specification using P4 (that are public) see for
> > > example MS Azure:
> > > https://github.com/sonic-net/DASH/tree/main/dash-pipeline
> >
> > That's an IPU thing?
> >
>
> Yes, DASH is xPU. But the whole Sonic/SAI thing includes switches and P4 plays a role there.
>
> > > If you are a vendor and want to sell a NIC in that space, the spec
> > > you get is in P4.
> >
> > s/NIC/IPU/ ?
>
> I do believe that one can write a P4 program to express things a regular NIC could express that may be harder to expose with current interfaces.
>
> > > Your underlying hardware
> > > doesnt have to be P4 native, but at minimal the abstraction (as we
> > > are trying to provide with P4TC) has to be able to consume the P4
> > > specification.
> >
> > P4 is certainly an option, especially for specs, but I haven't seen
> > much adoption myself.
>
> The xPU market outside of hyper-scalers is emerging now. Hyperscalers looking at xPUs are looking at P4 as the datapath language - that sets the trend forward to large enterprises.
> That's my experience.
> Some of the vendors on the Cc should be able to point to adoption.
> Anjali? Matty?
>
> > What's the benefit / use case?
>
> Of P4 or xPUs?
> Unified approach to standardize how a datapath is defined is a value for P4.
> Providing a singular abstraction via the kernel (as opposed to every vendor pitching their API) is what the kernel brings.
>
> > > For implementations where P4 is in use, there are many - some public
> > > others not, sample space:
> > > https://cloud.google.com/blog/products/gcp/google-cloud-using-p4runt
> > > ime-to-build-smart-networks
> >
> > Hyper-scaler proprietary.
>
> The control abstraction (P4 runtime) is certainly not proprietary.
> The datapath that is targetted by the runtime is.
> Hopefully we can fix that with P4TC.
> The majority of the discussions i have with some of the folks who do kernel bypass have one theme in common:
> The kernel process is just too long. Trying to add one feature to flower could take anywhere from 6 months to 3 years to finally show up in some supported distro. With P4TC we are taking the approach of scriptability to allow for speacilized datapaths (which P4 excels in). The google datapath maybe proprietary while their hardware may even(or not) be using native P4 - but the important detail is we have _a way_ to abstract those datapaths.
>
> > > There are NICs and switches which are P4 native in the market.
> >
> > Link to docs?
> >
>
> Off top of my head Intel Mount Evans, Pensando, Xilinx FPGAs, etc. The point is to bring them together under the linux umbrella.
>
> > > IOW, there is beacoup $ investment in this space that makes it worth pursuing.
> >
> > Pursuing $ is good! But the community IMO should maximize a different
> > function.
>
> While I agree $ is not the primary motivator it is a factor, it is a good indicator. No different than the network stack being tweaked to do certain things that certain hyperscalers need because they invest $.
> I have no problems with a large harmonious tent.
>
> cheers,
> jamal
>
> > > TC is the kernel offload mechanism that has gathered deployment
> > > experience over many years - hence P4TC.
> >
> > I don't wanna argue. I thought it'd be more fair towards you if I made
> > my lack of conviction known, rather than sit quiet and ignore it since
> > it's just an RFC.
Stephen Hemminger Jan. 29, 2023, 2:09 a.m. UTC | #21
On Sat, 28 Jan 2023 13:17:35 -0800
Tom Herbert <tom@herbertland.com> wrote:

> On Fri, Jan 27, 2023 at 5:34 PM Singhai, Anjali
> <anjali.singhai@intel.com> wrote:
> >
> > P4 is definitely the language of choice for defining a Dataplane in HW for IPUs/DPUs/FNICs and Switches. As a vendor I can definitely say that the smart devices implement a very programmable ASIC as each customer Dataplane defers quite a bit and P4 is the language of choice for specifying the Dataplane definitions. A lot of customer deploy proprietary protocols that run in HW and there is no good way right now in kernel to support these proprietary protcols. If we enable these protocol in the kernel it takes a huge effort and they don’t evolve well.
> > Being able to define in P4 and offload into HW using tc mechanism really helps in supporting the customer's Dataplane and protcols without having to wait months and years to get the kernel updated. Here is a link to our IPU offering that is P4 programmable  
> 
> Anjali,
> 
> P4 may be the language of choice for programming HW datapath, however
> it's not the language of choice for programming SW datapaths-- that's
> C over XDP/eBPF. And while XDP/eBPF also doesn't depend on kernel
> updates, it has a major advantage over P4 in that it doesn't require
> fancy hardware either.
> 
> Even at full data center deployment of P4 devices, there will be at
> least an order of magnitude more deployment of SW programmed
> datapaths; and unless someone is using P4 hardware, there's zero value
> in rewriting programs in P4 instead of C. IMO, we will never see
> networking developers moving to P4 en masse-- P4 will always be a
> niche market relative to the programmable datapath space and the skill
> sets required to support serious scalable deployment. That being said,
> there will be a nontrivial contingent of users who need to run the
> same programs in both SW and HW environments. Expecting them to
> maintain two very different code bases to support two disparate models
> is costly and prohibitive to them. So for their benefit, we need a
> solution to reconcile these two models. P4TC is one means to
> accomplish that.
> 
> We want to consider both the permutations: 1) compile C code to run in
> P4 hardware 2) compile P4 to run in SW. If we establish a common IR,
> then we can generalize the problem: programmer writes their datapath
> in the language of their choosing (P4, C, Python, Rust, etc.), they
> compile the program to whatever backend they are using (HW, SW,
> XDP/eBPF, etc.). The P4TC CLI serves as one such IR as there's nothing
> that prevents someone from compiling a program from another language
> to the CLI (for instance, we've implemented the compiler to output the
> parser CLI from PANDA-C). The CLI natively runs in kernel SW, and with
> the right hooks could be offloaded to HW-- not just P4 hardware but
> potentially other hardware targets as well.

Rather than more kernel network software, if instead this was
targeting userspace or eBPF for the SW version; then there would
be less exposed security risk and also less long term technical debt
here.
John Fastabend Jan. 29, 2023, 5:39 a.m. UTC | #22
Willem de Bruijn wrote:
> On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >

[...]

> > >
> > > I would also think that if we need another programmable component in
> > > the kernel, that this would be based on BPF, and compiled outside the
> > > kernel.
> > >
> > > Is the argument for an explicit TC objects API purely that this API
> > > can be passed through to hardware, as well as implemented in the
> > > kernel directly? Something that would be lost if the datapath is
> > > implement as a single BPF program at the TC hook.
> > >
> >
> > We use the skip_sw and skip_hw knobs in tc to indicate whether a
> > policy is targeting hw or sw. Not sure if you are familiar with it but its
> > been around (and deployed) for a few years now. So a P4 program
> > policy can target either.
> 
> I know. So the only reason the kernel ABI needs to be extended with P4
> objects is to be able to pass the same commands to hardware. The whole
> kernel dataplane could be implemented as a BPF program, correct?
> 
> > In regards to the parser - we need a scriptable parser which is offered
> > by kparser in kernel. P4 doesnt describe how to offload the parser
> > just the matches and actions; however, as Tom alluded there's nothing
> > that obstructs us offer the same tc controls to offload the parser or pieces
> > of it.
> 
> And this is the only reason that the parser needs to be in the kernel.
> Because the API is at the kernel ABI level. If the P4 program is compiled
> to BPF in userspace, then the parser would be compiled in userspace
> too. A preferable option, as it would not require adding yet another
> parser in C in the kernel.

Also there already exists a P4 backend that targets BPF.

 https://github.com/p4lang/p4c

So as a SW object we can just do the P4 compilation step in user
space and run it in BPF as suggested. Then for hw offload we really
would need to see some hardware to have any concrete ideas on how
to make it work.

Also P4 defines a runtime API so would be good to see how all that
works with any proposed offload.

> 
> I understand the value of PANDA as a high level declarative language
> to describe network protocols. I'm just trying to get more explicit
> why compilation from PANDA to BPF is not sufficient for your use-case.
> 
> 
> > cheers,
> > jamal
> >
> > > Can you elaborate some more why this needs yet another in-kernel
> > > parser separate from BPF? The flow dissection case is solved fine by
> > > the BPF flow dissector. (I also hope one day the kernel can load a BPF
> > > dissector by default and we avoid the majority of the unsafe C code
> > > entirely.)
>
Jamal Hadi Salim Jan. 29, 2023, 11:02 a.m. UTC | #23
On Sat, Jan 28, 2023 at 10:33 AM Willem de Bruijn <willemb@google.com> wrote:
>
> On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Sat, Jan 28, 2023 at 8:37 AM Willem de Bruijn <willemb@google.com> wrote:
> > >
>

[..]

> > We use the skip_sw and skip_hw knobs in tc to indicate whether a
> > policy is targeting hw or sw. Not sure if you are familiar with it but its
> > been around (and deployed) for a few years now. So a P4 program
> > policy can target either.
>
> I know. So the only reason the kernel ABI needs to be extended with P4
> objects is to be able to pass the same commands to hardware. The whole
> kernel dataplane could be implemented as a BPF program, correct?
>

It's more than an ABI (although that is important as well).
It is about reuse of the infra which provides a transparent symbiosis
between hardware offload and software that has matured over time: For
example, you can take a pipeline or a table or actions (lately) and
split them between hardware and software transparently, etc. To
re-iterate, we are reusing and plugging into a proven and deployed
mechanism which enables our goal (of HW + SW scripting of arbitrary
P4-enabled datapaths which are functionally equivalent).

> > In regards to the parser - we need a scriptable parser which is offered
> > by kparser in kernel. P4 doesnt describe how to offload the parser
> > just the matches and actions; however, as Tom alluded there's nothing
> > that obstructs us offer the same tc controls to offload the parser or pieces
> > of it.
>
> And this is the only reason that the parser needs to be in the kernel.
> Because the API is at the kernel ABI level. If the P4 program is compiled
> to BPF in userspace, then the parser would be compiled in userspace
> too. A preferable option, as it would not require adding yet another
> parser in C in the kernel.
>

Kparser while based on PANDA has the important detail to note is that it is an
infra for creating arbitrary parsers. The infra sits in the kernel and
i can create
arbitrary parsers with policy scripts. The emphasis is on scriptability.

cheers,
jamal

> I understand the value of PANDA as a high level declarative language
> to describe network protocols. I'm just trying to get more explicit
> why compilation from PANDA to BPF is not sufficient for your use-case.
>
Jamal Hadi Salim Jan. 29, 2023, 11:11 a.m. UTC | #24
On Sun, Jan 29, 2023 at 12:39 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Willem de Bruijn wrote:
> > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
>
> [...]
>
>
> Also there already exists a P4 backend that targets BPF.
>
>  https://github.com/p4lang/p4c

There's also one based on rust - does that mean we should rewrite our
code in rust?
Joking aside - rust was a suggestion made at a talk i did. I ended up
adding a slide for the next talk which read:

Title: So... how is this better than KDE?
  Attributed to Rusty Russell
     Who attributes it to Cort Dougan
      s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g

We have very specific goals - of which the most important is met by
what works today and we are reusing that.

cheers,
jamal

> So as a SW object we can just do the P4 compilation step in user
> space and run it in BPF as suggested. Then for hw offload we really
> would need to see some hardware to have any concrete ideas on how
> to make it work.
>


> Also P4 defines a runtime API so would be good to see how all that
> works with any proposed offload.
Jamal Hadi Salim Jan. 29, 2023, 11:19 a.m. UTC | #25
Sorry, John - to answer your question on P4runtime; that runs on top
of netlink. Netlink can express
a lot more than P4runtime so we are letting it sit in userspace. I
could describe the netlink interfaces but easier if you look at the
code and ping me privately unless there are more folks interested in
that to which i can respond on the list.

cheers,
jamal

On Sun, Jan 29, 2023 at 6:11 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sun, Jan 29, 2023 at 12:39 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Willem de Bruijn wrote:
> > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> >
> > [...]
> >
> >
> > Also there already exists a P4 backend that targets BPF.
> >
> >  https://github.com/p4lang/p4c
>
> There's also one based on rust - does that mean we should rewrite our
> code in rust?
> Joking aside - rust was a suggestion made at a talk i did. I ended up
> adding a slide for the next talk which read:
>
> Title: So... how is this better than KDE?
>   Attributed to Rusty Russell
>      Who attributes it to Cort Dougan
>       s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g
>
> We have very specific goals - of which the most important is met by
> what works today and we are reusing that.
>
> cheers,
> jamal
>
> > So as a SW object we can just do the P4 compilation step in user
> > space and run it in BPF as suggested. Then for hw offload we really
> > would need to see some hardware to have any concrete ideas on how
> > to make it work.
> >
>
>
> > Also P4 defines a runtime API so would be good to see how all that
> > works with any proposed offload.
Toke Høiland-Jørgensen Jan. 29, 2023, 10:14 p.m. UTC | #26
Jamal Hadi Salim <jhs@mojatatu.com> writes:

>> > We use the skip_sw and skip_hw knobs in tc to indicate whether a
>> > policy is targeting hw or sw. Not sure if you are familiar with it but its
>> > been around (and deployed) for a few years now. So a P4 program
>> > policy can target either.
>>
>> I know. So the only reason the kernel ABI needs to be extended with P4
>> objects is to be able to pass the same commands to hardware. The whole
>> kernel dataplane could be implemented as a BPF program, correct?
>>
>
> It's more than an ABI (although that is important as well).
> It is about reuse of the infra which provides a transparent symbiosis
> between hardware offload and software that has matured over time: For
> example, you can take a pipeline or a table or actions (lately) and
> split them between hardware and software transparently, etc. To
> re-iterate, we are reusing and plugging into a proven and deployed
> mechanism which enables our goal (of HW + SW scripting of arbitrary
> P4-enabled datapaths which are functionally equivalent).

But you're doing this in a way that completely ignores the existing
ecosystem for creating programmable software datapaths in the kernel
(i.e., eBPF/XDP) in favour of adding *yet another* interpreter to the
kernel.

In particular, completely excluding the XDP from this is misguided.
Programmable networking in Linux operates at three layers:

- HW: for stuff that's supported and practical there
- XDP: software fast-path for high-performance bits that can't go into HW
- TC/rest of stack: SW slow path for functional equivalence

I can see P4 playing a role as a higher-level data plane definition
language even for Linux SW stacks, but let's have it integrate with the
full ecosystem, not be its own little island in a corner...

-Toke
Singhai, Anjali Jan. 30, 2023, 3:09 a.m. UTC | #27
I am agreeing with you Tom. P4tc does not restrict the high level language to be P4, it can be anything as long as it can be compiled to create an IR that can be used to teach/program the SW and the HW, which is what the script-ability of p4tc provides.

Ultimately the devices are evolving as combination of highly efficient Domain specific architecture and the traditional Generic cores, and SW in the kernel has to evolve to program them both in a way that the user can decide whether to run a particular functionality in Domain specific HW or SW that runs on general purpose cores or in some cases the functionality runs in both places and the intelligent (and some-day AI managed Infrastructure controller) entity decides whether the flow should use the HW path or the SW path. There is no other way forward because a SW dataplane can only provide an overflow region for flows and the HW will have to run the most demanding flows, as the Network demand and capacity of the data-center keeps reaching higher and higher levels. From a HW vendor's point of view we have already  entered the 3rd epoch of computer architecture. 

A domain specific architecture still has to be programmable but for a specific domain, linux kernel which has remained fixed function (and fixed protocol) traditionally needs to evolve to support these domain specific architecture that are protocol and dataplane programmable. I think p4tc definitely is the right way forward.
 
There were some arguments made earlier about but the big Datacenters are programming these domain specific architecture from user space already, no doubt but isn't the whole argument for linux kernel is democratizing of the goodness the HW brings to all , the small users and the big ones?

There is also argument that is being made about using ebpf for implementing the SW path, may be I am missing the part as to how do you offload if not to another general purpose core even if it is not as evolved as the current day Xeon's. And we know that even the simplest of the general purpose cores ( example RISC-V) right now cannot sustain the rate at which the network needs to feed the business logic running on the CPUs or GPUs or TPUs in an economically viable solution. All data points to the fact that Network processing running on general purpose cores eats up more than half of the cores and that’s expensive. Because the performance/power unit math when using an IPU/DPU/Smart NIC for network work load is so much better than that of a General purpose core. So I do not see a way forward for epbf to be offloaded on anything but general purpose cores and in the meantime Domain specific programmable ASICs need to be still programmed as they are the right solution for the economy of scale.

Having said that we do have to find a good solution for p4 externs in SW and may be there is room for some helpers ( may be even ebpf) ( as long as you don’t ask me to offload that in HW 
John Fastabend Jan. 30, 2023, 4:30 a.m. UTC | #28
Jamal Hadi Salim wrote:
> On Sun, Jan 29, 2023 at 12:39 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Willem de Bruijn wrote:
> > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> >
> > [...]
> >
> >
> > Also there already exists a P4 backend that targets BPF.
> >
> >  https://github.com/p4lang/p4c
> 
> There's also one based on rust - does that mean we should rewrite our
> code in rust?
> Joking aside - rust was a suggestion made at a talk i did. I ended up
> adding a slide for the next talk which read:
> 
> Title: So... how is this better than KDE?
>   Attributed to Rusty Russell
>      Who attributes it to Cort Dougan
>       s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g
> 
> We have very specific goals - of which the most important is met by
> what works today and we are reusing that.

OK, I may have missed your goals I read the cover letter and merely
scanned the patches. But, seeing we've chatted about this before
let me put my critique here.

P4TC as a software datapath:

1. We can already run P4 in software with P4C which compiles into an
   existing BPF implementations, nothing new needed. If we object
   to p4c implementation there are others (VMWare has one for XDP)
   or feel free to write any other DSL or abstraction over BPF.

2. 'tc' layer is not going to be as fast as XDP so without an XDP
   implementation we can't get best possible implementation.

3. Happy to admit I don't have data, but I'm not convinced a match
   action pipeline is an ideal implementation for software. It is
   done specifically in HW to facilitate CAMs/TCAMs and other special
   logic blocks that do not map well to general purpose CPU. BPF or
   other insn are better abstraction for software.

So I struggle to find upside as a purely software implementation.
If you took an XDP P4 backend and then had this implementation
showing performance or some other vector where a XDP implementation
underperformed that would be interesting. Then either we would have
good reason to try another datapath or 

P4TC as a hardware datapath:

1. We don't have a hardware/driver implementation to review so its
   difficult to even judge if this is a good idea or not.

2. I imagine most hardware can not create TCAMs/CAMs out of
   nothing. So there is a hard problem that I believe is not
   addressed here around how user knows their software blob
   can ever be offloaded at all. How you move to new hw and
   the blob can continue to work so and an so forth.

3. FPGA P4 implementations as far as I recall can use P4 to build
   the pipeline up front. But, once its built its not like you
   would (re)build it or (re)configure it on the fly. But the workflow
   doesn't align with how I understand these patches.

4. Has any vendor with a linux driver (maybe not even in kernel yet)
   open sourced anything that resembles a P4 pipeline? Without
   this its again hard to understand what is possible and what
   vendors will let users do.

P4TC as SW/HW running same P4:

1. This doesn't need to be done in kernel. If one compiler runs
   P4 into XDP or TC-BPF that is good and another compiler runs
   it into hw specific backend. This satisifies having both
   software and hardware implementation.

Extra commentary: I agree we've been chatting about this for a long
time but until some vendor (Intel?) will OSS and support a linux
driver and hardware with open programmable parser and MAT. I'm not
sure how we get P4 for Linux users. Does it exist and I missed it?

Thanks,
John

> 
> cheers,
> jamal
> 
> > So as a SW object we can just do the P4 compilation step in user
> > space and run it in BPF as suggested. Then for hw offload we really
> > would need to see some hardware to have any concrete ideas on how
> > to make it work.
> >
> 
> 
> > Also P4 defines a runtime API so would be good to see how all that
> > works with any proposed offload.

Yep agree with your other comment not really important can be built
on top of Netlink or BPF today.
Jiri Pirko Jan. 30, 2023, 10:13 a.m. UTC | #29
Mon, Jan 30, 2023 at 05:30:17AM CET, john.fastabend@gmail.com wrote:
>Jamal Hadi Salim wrote:
>> On Sun, Jan 29, 2023 at 12:39 AM John Fastabend
>> <john.fastabend@gmail.com> wrote:
>> >
>> > Willem de Bruijn wrote:
>> > > On Sat, Jan 28, 2023 at 10:10 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> > > >
>> >
>> > [...]
>> >
>> >
>> > Also there already exists a P4 backend that targets BPF.
>> >
>> >  https://github.com/p4lang/p4c
>> 
>> There's also one based on rust - does that mean we should rewrite our
>> code in rust?
>> Joking aside - rust was a suggestion made at a talk i did. I ended up
>> adding a slide for the next talk which read:
>> 
>> Title: So... how is this better than KDE?
>>   Attributed to Rusty Russell
>>      Who attributes it to Cort Dougan
>>       s/KDE/[rust/ebpf/dpdk/vpp/ovs]/g
>> 
>> We have very specific goals - of which the most important is met by
>> what works today and we are reusing that.
>
>OK, I may have missed your goals I read the cover letter and merely
>scanned the patches. But, seeing we've chatted about this before
>let me put my critique here.
>
>P4TC as a software datapath:
>
>1. We can already run P4 in software with P4C which compiles into an
>   existing BPF implementations, nothing new needed. If we object
>   to p4c implementation there are others (VMWare has one for XDP)
>   or feel free to write any other DSL or abstraction over BPF.
>
>2. 'tc' layer is not going to be as fast as XDP so without an XDP
>   implementation we can't get best possible implementation.
>
>3. Happy to admit I don't have data, but I'm not convinced a match
>   action pipeline is an ideal implementation for software. It is
>   done specifically in HW to facilitate CAMs/TCAMs and other special
>   logic blocks that do not map well to general purpose CPU. BPF or
>   other insn are better abstraction for software.
>
>So I struggle to find upside as a purely software implementation.
>If you took an XDP P4 backend and then had this implementation
>showing performance or some other vector where a XDP implementation
>underperformed that would be interesting. Then either we would have
>good reason to try another datapath or 
>
>P4TC as a hardware datapath:
>
>1. We don't have a hardware/driver implementation to review so its
>   difficult to even judge if this is a good idea or not.
>
>2. I imagine most hardware can not create TCAMs/CAMs out of
>   nothing. So there is a hard problem that I believe is not
>   addressed here around how user knows their software blob
>   can ever be offloaded at all. How you move to new hw and
>   the blob can continue to work so and an so forth.
>
>3. FPGA P4 implementations as far as I recall can use P4 to build
>   the pipeline up front. But, once its built its not like you
>   would (re)build it or (re)configure it on the fly. But the workflow
>   doesn't align with how I understand these patches.
>
>4. Has any vendor with a linux driver (maybe not even in kernel yet)
>   open sourced anything that resembles a P4 pipeline? Without
>   this its again hard to understand what is possible and what
>   vendors will let users do.
>
>P4TC as SW/HW running same P4:
>
>1. This doesn't need to be done in kernel. If one compiler runs
>   P4 into XDP or TC-BPF that is good and another compiler runs
>   it into hw specific backend. This satisifies having both
>   software and hardware implementation.
>
>Extra commentary: I agree we've been chatting about this for a long
>time but until some vendor (Intel?) will OSS and support a linux
>driver and hardware with open programmable parser and MAT. I'm not
>sure how we get P4 for Linux users. Does it exist and I missed it?


John, I think that your summary is quite accurate. Regarding SW
implementation, I have to admit I also fail to see motivation to have P4
specific datapath instead of having XDP/eBPF one, that could run P4
compiled program. The only motivation would be that if somehow helps to
offload to HW. But can it?

Regarding HW implementation. I believe that every HW implementation is
very specific and to find some common intermediate kernel uAPI is
probably not possible (correct me if I'm wrong, that that is the
impression I'm getting from all parties). Then the only option is to
allow userspace to insert HW-speficic blob that is an output of
per-vendor P4 compilator.

Now is this blob uAPI channel possible to be introduced? How it should
look like? How to enforce limitations so it is not exploited for other
purposes as a kernel bypass?



>
>Thanks,
>John
>
>> 
>> cheers,
>> jamal
>> 
>> > So as a SW object we can just do the P4 compilation step in user
>> > space and run it in BPF as suggested. Then for hw offload we really
>> > would need to see some hardware to have any concrete ideas on how
>> > to make it work.
>> >
>> 
>> 
>> > Also P4 defines a runtime API so would be good to see how all that
>> > works with any proposed offload.
>
>Yep agree with your other comment not really important can be built
>on top of Netlink or BPF today.
Toke Høiland-Jørgensen Jan. 30, 2023, 11:26 a.m. UTC | #30
Jiri Pirko <jiri@resnulli.us> writes:

>>P4TC as SW/HW running same P4:
>>
>>1. This doesn't need to be done in kernel. If one compiler runs
>>   P4 into XDP or TC-BPF that is good and another compiler runs
>>   it into hw specific backend. This satisifies having both
>>   software and hardware implementation.
>>
>>Extra commentary: I agree we've been chatting about this for a long
>>time but until some vendor (Intel?) will OSS and support a linux
>>driver and hardware with open programmable parser and MAT. I'm not
>>sure how we get P4 for Linux users. Does it exist and I missed it?
>
>
> John, I think that your summary is quite accurate. Regarding SW
> implementation, I have to admit I also fail to see motivation to have P4
> specific datapath instead of having XDP/eBPF one, that could run P4
> compiled program. The only motivation would be that if somehow helps to
> offload to HW. But can it?

According to the slides from the netdev talk[0], it seems that
offloading will have to have a component that goes outside of TC anyway
(see "Model 3: Joint loading" where it says "this is impossible"). So I
don't really see why having this interpreter in TC help any.

Also, any control plane management feature specific to managing P4 state
in hardware could just as well manage a BPF-based software path on the
kernel side instead of the P4 interpreter stuff...

-Toke

[0] https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
Jamal Hadi Salim Jan. 30, 2023, 2:06 p.m. UTC | #31
So i dont have to respond to each email individually, I will respond
here in no particular order. First let me provide some context, if
that was already clear please skip it. Hopefully providing the context
will help us to focus otherwise that bikeshed's color and shape will
take forever to settle on.

__Context__

I hope we all agree that when you have 2x100G NIC (and i have seen
people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
visualize: one 25G port is 35Mpps unidirectional. So "software stack"
is not the answer. You need to offload. I would argue further that in
the near future a lot of the stuff including transport will eventually
have to partially or fully move to hardware (see the HOMA keynote for
a sample space[0]). CPUs are not going to keep up with the massive IO
requirements. I am not talking about offload meaning NIC vendors
providing you Checksum or clever RSS or some basic metadata or
timestamp offload; I think those will continue to be needed - but that
is a different scope altogether. Neither are we trying to address
transport offload in P4TC.

I hope we also agree that the MAT construct is well understood and
that we have good experience in both sw (TC)
and hardware deployments over many years. P4 is a _standardized_
specification for addressing these constructs.
P4 is by no means perfect but it is an established standard. It is
already being used to provide requirements to NIC vendors today
(regardless of the underlying implementation)

So what are we trying to achieve with P4TC? John, I could have done a
better job in describing the goals in the cover letter:
We are going for MAT sw equivalence to what is in hardware. A two-fer
that is already provided by the existing TC infrastructure.
Scriptability is not a new idea in TC (see u32 and pedit and others in
TC). IOW, we are reusing and plugging into a proven and deployed
mechanism with a built-in policy driven, transparent symbiosis between
hardware offload and software that has matured over time. You can take
a pipeline or a table or actions and split them between hardware and
software transparently, etc. This hammer already meets our goals.
It's about using the appropriate tool for the right problem. We are
not going to rewrite that infra in rust or ebpf just because. If the
argument is about performance (see point above on 200G ports): We care
about sw performance but more importantly we care about equivalence. I
will put it this way: if we are confronted with a design choice of
picking whether we forgo equivalence in order to get better sw
performance, we are going to trade off performance. You want wire
speed performance then offload.

__End Context__

So now let me respond to the points raised.

Jiri, i think one of the concerns you have is that there is no way to
generalize the different hardware by using a single abstraction since
all hardware may have different architectures (eg whether using RMT vs
DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM,  etc) which may
necessitate doing things like underlying table reordering, merging,
sorting etc. The consensus is that it is the vendor driver that is
responsible for “transforming” P4 abstractions into whatever your
hardware does. The standardized abstraction is P4. Each P4 object
(match or action) has an ID and attributes - just like we do today
with flower with exception it is not hard coded in the kernel as we do
today. So if the tc ndo gets a callback to add an entry that will
match header and/or metadata X on table Y and execute action Z, it
should take care of figuring out how that transforms into its
appropriate hardware layout. IOW, your hardware doesnt have to be P4
native it just has to accept the constructs.
To emphasize again that folks are already doing this:  see the MS DASH
project where you have many NIC vendors (if i am not mistaken xilinx,
pensando, intel mev, nvidia bluefield, some startups, etc) -  they all
consume P4 and may implement differently.

The next question is how do you teach the driver what the different P4
object IDs mean and load the P4 objects for the hardware? We need to
have a consensus on that for sure - there are multiple approaches that
we explored: you could go directly from netlink using the templating
DSL; you could go via devlink or you can have a hybrid of the two.
Initially different vendors thought differently but they seem to
settle on devlink. From a TC perspective the ndo callbacks for runtime
dont change.
Toke, I labelled that one option as IMpossible as a parody - it is
what the vendors are saying today and my play on words is "even
impossible says IM possible". The challenge we have is that while some
vendor may have a driver and an approach that works, we need to have a
consensus instead of one vendor dictating the approach we use.

To John, I hope i have addressed some of your commentary above. The
current approach vendors are taking is total bypass of the kernel for
offload (we are getting our asses handed to us). The kernel is used to
configure control then it punts to user space and then you invoke a
vendor proprietary API. And every vendor has their own API. If you are
sourcing the NICs from multiple vendors then this is bad for the
consumer (unless you are a hyperscaler in which case almost all are
writing their own proprietary user space stacks). Are you pitching
that model?
The synced hardware + sw is already provided by TC - why punt to user space?

cheers,
jamal

[0] https://netdevconf.info/0x16/session.html?keynote-ousterhout

On Mon, Jan 30, 2023 at 6:27 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jiri Pirko <jiri@resnulli.us> writes:
>
> >>P4TC as SW/HW running same P4:
> >>
> >>1. This doesn't need to be done in kernel. If one compiler runs
> >>   P4 into XDP or TC-BPF that is good and another compiler runs
> >>   it into hw specific backend. This satisifies having both
> >>   software and hardware implementation.
> >>
> >>Extra commentary: I agree we've been chatting about this for a long
> >>time but until some vendor (Intel?) will OSS and support a linux
> >>driver and hardware with open programmable parser and MAT. I'm not
> >>sure how we get P4 for Linux users. Does it exist and I missed it?
> >
> >
> > John, I think that your summary is quite accurate. Regarding SW
> > implementation, I have to admit I also fail to see motivation to have P4
> > specific datapath instead of having XDP/eBPF one, that could run P4
> > compiled program. The only motivation would be that if somehow helps to
> > offload to HW. But can it?
>
> According to the slides from the netdev talk[0], it seems that
> offloading will have to have a component that goes outside of TC anyway
> (see "Model 3: Joint loading" where it says "this is impossible"). So I
> don't really see why having this interpreter in TC help any.
>
> Also, any control plane management feature specific to managing P4 state
> in hardware could just as well manage a BPF-based software path on the
> kernel side instead of the P4 interpreter stuff...
>
> -Toke
>
> [0] https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
>
Andrew Lunn Jan. 30, 2023, 2:42 p.m. UTC | #32
Hi Jamal

I'm mostly sat watching and eating popcorn, and i have little
knowledge in the area.

> Jiri, i think one of the concerns you have is that there is no way to
> generalize the different hardware by using a single abstraction since
> all hardware may have different architectures (eg whether using RMT vs
> DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM,  etc) which may
> necessitate doing things like underlying table reordering, merging,
> sorting etc. The consensus is that it is the vendor driver that is
> responsible for “transforming” P4 abstractions into whatever your
> hardware does.

What is the complexity involved in this 'transformation'? Are we
talking about putting a P4 'compiler' into each driver, each vendor
having there own compiler? Performing an upcall into user space with a
P4 blob and asking the vendor tool to give us back a blob for the
hardware? Or is it relatively simple, a few hundred lines of code,
simple transformations?

As far as i know, all offloading done so far in the network stack has
been purely in kernel. We transform a kernel representation of
networking state into something the hardware understands and pass it
to the hardware. That means, except for bugs, what happens in SW
should be the same as what happens in HW, just faster. But there have
been mention of P4 extensions. Stuff that the SW P4 implementation
cannot do, but the hardware can, and vendors appear to think such
extensions are part of their magic sauce. How will that work? Is the
'compiler' supposed to recognise plain P4 equivalent of these
extensions and replace it with those extensions?

I suppose what i'm trying to get at, is are we going to enforce the SW
and HW equivalence by doing the transformation in kernel, or could we
be heading towards in userspace we take our P4 and compile it with one
toolchain for the SW path, another toolchain for the HW path, and we
have no guarantee that the resulting blobs actually came from the same
sources and are supposed to be equivalent? And that then makes the SW
path somewhat pointless?

     Andrew
Jamal Hadi Salim Jan. 30, 2023, 3:31 p.m. UTC | #33
On Mon, Jan 30, 2023 at 9:42 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> Hi Jamal
>
> I'm mostly sat watching and eating popcorn, and i have little
> knowledge in the area.
>
> > Jiri, i think one of the concerns you have is that there is no way to
> > generalize the different hardware by using a single abstraction since
> > all hardware may have different architectures (eg whether using RMT vs
> > DRMT, a mesh processing xbar, TCAM, SRAM, host DRAM,  etc) which may
> > necessitate doing things like underlying table reordering, merging,
> > sorting etc. The consensus is that it is the vendor driver that is
> > responsible for “transforming” P4 abstractions into whatever your
> > hardware does.
>
> What is the complexity involved in this 'transformation'? Are we
> talking about putting a P4 'compiler' into each driver, each vendor
> having there own compiler? Performing an upcall into user space with a
> P4 blob and asking the vendor tool to give us back a blob for the
> hardware? Or is it relatively simple, a few hundred lines of code,
> simple transformations?
>

The current model is you compile the kernel vs hardware output as two
separate files. They are loaded separately
The compiler has a vendor specific backend and a P4TC one. There has
to be a authentication sync that the two are one and the same;
essentially each program/pipeline has a name and an ID and some hash
for validation. See slide #49 in the presentation at
https://netdevconf.info/0x16/session.html?Your-Network-Datapath-Will-Be-P4-Scripted
Only the vendor will be able to create something reasonable for their
specific hardware.
The issue is how to load the hardware part - the three methods were
discussed are listed in slides 50-52. The vendors seem to be in
agreement that the best option is #1.

BTW, these discussions happen in a high bandwidth medium at the moment
every two weeks here:
https://www.google.com/url?q=https://teams.microsoft.com/l/meetup-join/1.&sa=D&source=calendar&ust=1675366175958603&usg=AOvVaw1UZo8g5Ir6OcC-SRFM9PF1
It would be helpful if other folks will show up in those meetings.

> As far as i know, all offloading done so far in the network stack has
> been purely in kernel. We transform a kernel representation of
> networking state into something the hardware understands and pass it
> to the hardware. That means, except for bugs, what happens in SW
> should be the same as what happens in HW, just faster.

Exactly - that is what is refered to as "hardcoding" in slides 43-44
with what P4TC would do described in slide #45.

> But there have
> been mention of P4 extensions. Stuff that the SW P4 implementation
> cannot do, but the hardware can, and vendors appear to think such
> extensions are part of their magic sauce. How will that work? Is the
> 'compiler' supposed to recognise plain P4 equivalent of these
> extensions and replace it with those extensions?

I think the "magic sauce" angle is mostly the idea of how one would
implement foobar differently than the other vendor. If someone uses a
little ASIC and the next person uses FW to program a TCAM they may
feel they have an advantage in their hardware that the other guy
doesnt have.  At the end of the day that thing looks like a box with
input Y that produces output X. In P4 they call them "externs".  From
a P4TC backend perspective, we hope that we can allow foobar to be
implemented by multiple vendors without caring about the details of
the implementation. The vendor backend can describe it to whatever
detail it wants to its hardware.

> I suppose what i'm trying to get at, is are we going to enforce the SW
> and HW equivalence by doing the transformation in kernel, or could we
> be heading towards in userspace we take our P4 and compile it with one
> toolchain for the SW path, another toolchain for the HW path, and we
> have no guarantee that the resulting blobs actually came from the same
> sources and are supposed to be equivalent? And that then makes the SW
> path somewhat pointless?

See above - the two have to map to the same equivalence and validated as such.
It is also about providing a singular interface through the kernel as
opposed to dealing
with multiple vendor APIs.

cheers,
jamal
Toke Høiland-Jørgensen Jan. 30, 2023, 5:04 p.m. UTC | #34
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> So i dont have to respond to each email individually, I will respond
> here in no particular order. First let me provide some context, if
> that was already clear please skip it. Hopefully providing the context
> will help us to focus otherwise that bikeshed's color and shape will
> take forever to settle on.
>
> __Context__
>
> I hope we all agree that when you have 2x100G NIC (and i have seen
> people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> is not the answer. You need to offload.

I'm not disputing the need to offload, and I'm personally delighted that
P4 is breaking open the vendor black boxes to provide a standardised
interface for this.

However, while it's true that software can't keep up at the high end,
not everything runs at the high end, and today's high end is tomorrow's
mid end, in which XDP can very much play a role. So being able to move
smoothly between the two, and even implement functions that split
processing between them, is an essential feature of a programmable
networking path in Linux. Which is why I'm objecting to implementing the
P4 bits as something that's hanging off the side of the stack in its own
thing and is not integrated with the rest of the stack. You were touting
this as a feature ("being self-contained"). I consider it a bug.

> Scriptability is not a new idea in TC (see u32 and pedit and others in
> TC).

u32 is notoriously hard to use. The others are neat, but obviously
limited to particular use cases. Do you actually expect anyone to use P4
by manually entering TC commands to build a pipeline? I really find that
hard to believe...

> IOW, we are reusing and plugging into a proven and deployed mechanism
> with a built-in policy driven, transparent symbiosis between hardware
> offload and software that has matured over time. You can take a
> pipeline or a table or actions and split them between hardware and
> software transparently, etc.

That's a control plane feature though, it's not an argument for adding
another interpreter to the kernel.

> This hammer already meets our goals.

That 60k+ line patch submission of yours says otherwise...

> It's about using the appropriate tool for the right problem. We are
> not going to rewrite that infra in rust or ebpf just because.

"The right tool for the job" also means something that integrates well
with the wider ecosystem. For better or worse, in the kernel that
ecosystem (of datapath programmability) is BPF-based. Dismissing request
to integrate with that as, essentially, empty fanboyism, comes across as
incredibly arrogant.

> Toke, I labelled that one option as IMpossible as a parody - it is
> what the vendors are saying today and my play on words is "even
> impossible says IM possible".

Side note: I think it would be helpful if you dropped all the sarcasm
and snide remarks when communicating this stuff in writing, especially
to a new audience. It just confuses things, and doesn't exactly help
with the perception of arrogance either...

-Toke
Tom Herbert Jan. 30, 2023, 5:05 p.m. UTC | #35
On Sun, Jan 29, 2023 at 7:09 PM Singhai, Anjali
<anjali.singhai@intel.com> wrote:
>
> I am agreeing with you Tom. P4tc does not restrict the high level language to be P4, it can be anything as long as it can be compiled to create an IR that can be used to teach/program the SW and the HW, which is what the script-ability of p4tc provides.
>
> Ultimately the devices are evolving as combination of highly efficient Domain specific architecture and the traditional Generic cores, and SW in the kernel has to evolve to program them both in a way that the user can decide whether to run a particular functionality in Domain specific HW or SW that runs on general purpose cores or in some cases the functionality runs in both places and the intelligent (and some-day AI managed Infrastructure controller) entity decides whether the flow should use the HW path or the SW path. There is no other way forward because a SW dataplane can only provide an overflow region for flows and the HW will have to run the most demanding flows, as the Network demand and capacity of the data-center keeps reaching higher and higher levels. From a HW vendor's point of view we have already  entered the 3rd epoch of computer architecture.
>
> A domain specific architecture still has to be programmable but for a specific domain, linux kernel which has remained fixed function (and fixed protocol)

I believe the majority of people on this list would disagree with
that. XDP and eBPF were invented precisely to make the Linux kernel
extensible. As Toke said, any proposed solution for programmable
datapaths cannot ignore XDP/eBPF.

> traditionally needs to evolve to support these domain specific architecture that are protocol and dataplane programmable. I think p4tc definitely is the right way forward.
>
> There were some arguments made earlier about but the big Datacenters are programming these domain specific architecture from user space already, no doubt but isn't the whole argument for linux kernel is democratizing of the goodness the HW brings to all , the small users and the big ones?
>
> There is also argument that is being made about using ebpf for implementing the SW path, may be I am missing the part as to how do you offload if not to another general purpose core even if it is not as evolved as the current day Xeon's. And we know that even the simplest of the general purpose cores ( example RISC-V) right now cannot sustain the rate at which the network needs to feed the business logic running on the CPUs or GPUs or TPUs in an economically viable solution. All data points to the fact that Network processing running on general purpose cores eats up more than half of the cores and that’s expensive.

You are making the incorrect assumption that we are restricted to
using off the shelf commodity CPUs. With an open ISA like RISC-V we
are free to customize it and build domain specific CPUs following the
same principles of domain specific architectures. I believe it is
quite feasible with current technology to build a fully programmable
and very high performance datapath through CPUs. The solution involves
ripping out things we don't need like FPU and MMU, and putting in
things like optimized instructions for parsing, primitives for
maximizing parallelism, arithmetic instructions to optimize processing
nature of specific nature data, and inline accelerators. Running a
datapath on CPU avoids the rigid structures of hardware pipeline (like
John mention, a match-action pipeline won't work for all problems)

> Because the performance/power unit math when using an IPU/DPU/Smart NIC for network work load is so much better than that of a General purpose core. So I do not see a way forward for epbf to be offloaded on anything but general purpose cores

We can already do that. The CLI command examples in the kParser were
generated from a parser written to the PANDA-C. PANDA is a library API
for C and a parser defines a set of data structures and macros. It's
just plain C code and doesn't even use #pragma like CUDA does which is
the analogue programming model for GPUs. PANDA-C code can be compiled
into eBPF, userspace, the kParser CLI, and other targets. Assuming
that the HW offload for P4TC is supported then the same parser would
be running in both eBPF and the P4 hardware, and hence we have
factually offloaded a parser that runs in eBPF to P4 hardware. This is
supported for the parser, but adding the rest of the processing can be
similarly achieved.

> and in the meantime Domain specific programmable ASICs need to be still programmed as they are the right solution for the economy of scale.
>
> Having said that we do have to find a good solution for p4 externs in SW and may be there is room for some helpers ( may be even ebpf) ( as long as you don’t ask me to offload that in HW 
Jamal Hadi Salim Jan. 30, 2023, 7:02 p.m. UTC | #36
On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > So i dont have to respond to each email individually, I will respond
> > here in no particular order. First let me provide some context, if
> > that was already clear please skip it. Hopefully providing the context
> > will help us to focus otherwise that bikeshed's color and shape will
> > take forever to settle on.
> >
> > __Context__
> >
> > I hope we all agree that when you have 2x100G NIC (and i have seen
> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> > is not the answer. You need to offload.
>
> I'm not disputing the need to offload, and I'm personally delighted that
> P4 is breaking open the vendor black boxes to provide a standardised
> interface for this.
>
> However, while it's true that software can't keep up at the high end,
> not everything runs at the high end, and today's high end is tomorrow's
> mid end, in which XDP can very much play a role. So being able to move
> smoothly between the two, and even implement functions that split
> processing between them, is an essential feature of a programmable
> networking path in Linux. Which is why I'm objecting to implementing the
> P4 bits as something that's hanging off the side of the stack in its own
> thing and is not integrated with the rest of the stack. You were touting
> this as a feature ("being self-contained"). I consider it a bug.
>
> > Scriptability is not a new idea in TC (see u32 and pedit and others in
> > TC).
>
> u32 is notoriously hard to use. The others are neat, but obviously
> limited to particular use cases.

Despite my love for u32, I admit its user interface is cryptic. I just
wanted to point out to existing samples of scriptable and offloadable
TC objects.

> Do you actually expect anyone to use P4
> by manually entering TC commands to build a pipeline? I really find that
> hard to believe...

You dont have to manually hand code anything - its the compilers job.
But of course for simple P4 programs, yes i think you can handcode
something if you understand the templating syntax.

> > IOW, we are reusing and plugging into a proven and deployed mechanism
> > with a built-in policy driven, transparent symbiosis between hardware
> > offload and software that has matured over time. You can take a
> > pipeline or a table or actions and split them between hardware and
> > software transparently, etc.
>
> That's a control plane feature though, it's not an argument for adding
> another interpreter to the kernel.

I am not sure what you mean by control, but what i described is kernel built in.
Of course i could do more complex things from user space (if that is
what you mean as control).

> > This hammer already meets our goals.
>
> That 60k+ line patch submission of yours says otherwise...

This is pretty much covered in the cover letter and a few responses in
the thread since.

> > It's about using the appropriate tool for the right problem. We are
> > not going to rewrite that infra in rust or ebpf just because.
>
> "The right tool for the job" also means something that integrates well
> with the wider ecosystem. For better or worse, in the kernel that
> ecosystem (of datapath programmability) is BPF-based. Dismissing request
> to integrate with that as, essentially, empty fanboyism, comes across as
> incredibly arrogant.
> > Toke, I labelled that one option as IMpossible as a parody - it is
> > what the vendors are saying today and my play on words is "even
> > impossible says IM possible".
>
> Side note: I think it would be helpful if you dropped all the sarcasm
> and snide remarks when communicating this stuff in writing, especially
> to a new audience. It just confuses things, and doesn't exactly help
> with the perception of arrogance either...

I apologize if i offended you - you quoted a slide i did and I was
describing what that slide was supposed to relay.

cheers,
jamal

> -Toke
>
Toke Høiland-Jørgensen Jan. 30, 2023, 8:21 p.m. UTC | #37
Jamal Hadi Salim <hadi@mojatatu.com> writes:

> On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > So i dont have to respond to each email individually, I will respond
>> > here in no particular order. First let me provide some context, if
>> > that was already clear please skip it. Hopefully providing the context
>> > will help us to focus otherwise that bikeshed's color and shape will
>> > take forever to settle on.
>> >
>> > __Context__
>> >
>> > I hope we all agree that when you have 2x100G NIC (and i have seen
>> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
>> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
>> > is not the answer. You need to offload.
>>
>> I'm not disputing the need to offload, and I'm personally delighted that
>> P4 is breaking open the vendor black boxes to provide a standardised
>> interface for this.
>>
>> However, while it's true that software can't keep up at the high end,
>> not everything runs at the high end, and today's high end is tomorrow's
>> mid end, in which XDP can very much play a role. So being able to move
>> smoothly between the two, and even implement functions that split
>> processing between them, is an essential feature of a programmable
>> networking path in Linux. Which is why I'm objecting to implementing the
>> P4 bits as something that's hanging off the side of the stack in its own
>> thing and is not integrated with the rest of the stack. You were touting
>> this as a feature ("being self-contained"). I consider it a bug.
>>
>> > Scriptability is not a new idea in TC (see u32 and pedit and others in
>> > TC).
>>
>> u32 is notoriously hard to use. The others are neat, but obviously
>> limited to particular use cases.
>
> Despite my love for u32, I admit its user interface is cryptic. I just
> wanted to point out to existing samples of scriptable and offloadable
> TC objects.
>
>> Do you actually expect anyone to use P4
>> by manually entering TC commands to build a pipeline? I really find that
>> hard to believe...
>
> You dont have to manually hand code anything - its the compilers job.

Right, that was kinda my point: in that case the compiler could just as
well generate a (set of) BPF program(s) instead of this TC script thing.

>> > IOW, we are reusing and plugging into a proven and deployed mechanism
>> > with a built-in policy driven, transparent symbiosis between hardware
>> > offload and software that has matured over time. You can take a
>> > pipeline or a table or actions and split them between hardware and
>> > software transparently, etc.
>>
>> That's a control plane feature though, it's not an argument for adding
>> another interpreter to the kernel.
>
> I am not sure what you mean by control, but what i described is kernel
> built in. Of course i could do more complex things from user space (if
> that is what you mean as control).

"Control plane" as in SDN parlance. I.e., the bits that keep track of
configuration of the flow/pipeline/table configuration.

There's no reason you can't have all that infrastructure and use BPF as
the datapath language. I.e., instead of:

tc p4template create pipeline/aP4proggie numtables 1
... + all the other stuff to populate it

you could just do:

tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o

and still have all the management infrastructure without the new
interpreter and associated complexity in the kernel.

>> > This hammer already meets our goals.
>>
>> That 60k+ line patch submission of yours says otherwise...
>
> This is pretty much covered in the cover letter and a few responses in
> the thread since.

The only argument for why your current approach makes sense I've seen
you make is "I don't want to rewrite it in BPF". Which is not a
technical argument.

I'm not trying to be disingenuous here, BTW: I really don't see the
technical argument for why the P4 data plane has to be implemented as
its own interpreter instead of integrating with what we have already
(i.e., BPF).

-Toke
John Fastabend Jan. 30, 2023, 9:10 p.m. UTC | #38
Toke Høiland-Jørgensen wrote:
> Jamal Hadi Salim <hadi@mojatatu.com> writes:
> 
> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >> > So i dont have to respond to each email individually, I will respond
> >> > here in no particular order. First let me provide some context, if
> >> > that was already clear please skip it. Hopefully providing the context
> >> > will help us to focus otherwise that bikeshed's color and shape will
> >> > take forever to settle on.
> >> >
> >> > __Context__
> >> >
> >> > I hope we all agree that when you have 2x100G NIC (and i have seen
> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> >> > is not the answer. You need to offload.
> >>
> >> I'm not disputing the need to offload, and I'm personally delighted that
> >> P4 is breaking open the vendor black boxes to provide a standardised
> >> interface for this.
> >>
> >> However, while it's true that software can't keep up at the high end,
> >> not everything runs at the high end, and today's high end is tomorrow's
> >> mid end, in which XDP can very much play a role. So being able to move
> >> smoothly between the two, and even implement functions that split
> >> processing between them, is an essential feature of a programmable
> >> networking path in Linux. Which is why I'm objecting to implementing the
> >> P4 bits as something that's hanging off the side of the stack in its own
> >> thing and is not integrated with the rest of the stack. You were touting
> >> this as a feature ("being self-contained"). I consider it a bug.
> >>
> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in
> >> > TC).
> >>
> >> u32 is notoriously hard to use. The others are neat, but obviously
> >> limited to particular use cases.
> >
> > Despite my love for u32, I admit its user interface is cryptic. I just
> > wanted to point out to existing samples of scriptable and offloadable
> > TC objects.
> >
> >> Do you actually expect anyone to use P4
> >> by manually entering TC commands to build a pipeline? I really find that
> >> hard to believe...
> >
> > You dont have to manually hand code anything - its the compilers job.
> 
> Right, that was kinda my point: in that case the compiler could just as
> well generate a (set of) BPF program(s) instead of this TC script thing.
> 
> >> > IOW, we are reusing and plugging into a proven and deployed mechanism
> >> > with a built-in policy driven, transparent symbiosis between hardware
> >> > offload and software that has matured over time. You can take a
> >> > pipeline or a table or actions and split them between hardware and
> >> > software transparently, etc.
> >>
> >> That's a control plane feature though, it's not an argument for adding
> >> another interpreter to the kernel.
> >
> > I am not sure what you mean by control, but what i described is kernel
> > built in. Of course i could do more complex things from user space (if
> > that is what you mean as control).
> 
> "Control plane" as in SDN parlance. I.e., the bits that keep track of
> configuration of the flow/pipeline/table configuration.
> 
> There's no reason you can't have all that infrastructure and use BPF as
> the datapath language. I.e., instead of:
> 
> tc p4template create pipeline/aP4proggie numtables 1
> ... + all the other stuff to populate it
> 
> you could just do:
> 
> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
> 
> and still have all the management infrastructure without the new
> interpreter and associated complexity in the kernel.
> 
> >> > This hammer already meets our goals.
> >>
> >> That 60k+ line patch submission of yours says otherwise...
> >
> > This is pretty much covered in the cover letter and a few responses in
> > the thread since.
> 
> The only argument for why your current approach makes sense I've seen
> you make is "I don't want to rewrite it in BPF". Which is not a
> technical argument.
> 
> I'm not trying to be disingenuous here, BTW: I really don't see the
> technical argument for why the P4 data plane has to be implemented as
> its own interpreter instead of integrating with what we have already
> (i.e., BPF).
> 
> -Toke
> 

I'll just take this here becaues I think its mostly related.

Still not convinced the P4TC has any value for sw. From the
slide you say vendors prefer you have this picture roughtly.


   [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
        |
        |
   [ P4 Vendor backend ]
        |
        |
        V
   [ Devlink ]


Now just replace P4TC backend with P4C and your only work is to
replace devlink with the current hw specific bits and you have
a sw and hw components. Then you get XDP-BPF pretty easily from
P4XDP backend if you like. The compat piece is handled by compiler
where it should be. My CPU is not a MAT so pretending it is seems
not ideal to me, I don't have a TCAM on my cores.

For runtime get those vendors to write their SDKs over Devlink
and no need for this software thing. The runtime for P4c should
already work over BPF. Giving this picture

   [ P4 compiler ] ------ [ P4C backend ] ----> BPF
        |
        |
   [ P4 Vendor backend ]
        |
        |
        V
   [ Devlink ]

And much less work for us to maintain.

.John
Toke Høiland-Jørgensen Jan. 30, 2023, 9:20 p.m. UTC | #39
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Jamal Hadi Salim <hadi@mojatatu.com> writes:
>> 
>> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>> >>
>> >> > So i dont have to respond to each email individually, I will respond
>> >> > here in no particular order. First let me provide some context, if
>> >> > that was already clear please skip it. Hopefully providing the context
>> >> > will help us to focus otherwise that bikeshed's color and shape will
>> >> > take forever to settle on.
>> >> >
>> >> > __Context__
>> >> >
>> >> > I hope we all agree that when you have 2x100G NIC (and i have seen
>> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
>> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
>> >> > is not the answer. You need to offload.
>> >>
>> >> I'm not disputing the need to offload, and I'm personally delighted that
>> >> P4 is breaking open the vendor black boxes to provide a standardised
>> >> interface for this.
>> >>
>> >> However, while it's true that software can't keep up at the high end,
>> >> not everything runs at the high end, and today's high end is tomorrow's
>> >> mid end, in which XDP can very much play a role. So being able to move
>> >> smoothly between the two, and even implement functions that split
>> >> processing between them, is an essential feature of a programmable
>> >> networking path in Linux. Which is why I'm objecting to implementing the
>> >> P4 bits as something that's hanging off the side of the stack in its own
>> >> thing and is not integrated with the rest of the stack. You were touting
>> >> this as a feature ("being self-contained"). I consider it a bug.
>> >>
>> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in
>> >> > TC).
>> >>
>> >> u32 is notoriously hard to use. The others are neat, but obviously
>> >> limited to particular use cases.
>> >
>> > Despite my love for u32, I admit its user interface is cryptic. I just
>> > wanted to point out to existing samples of scriptable and offloadable
>> > TC objects.
>> >
>> >> Do you actually expect anyone to use P4
>> >> by manually entering TC commands to build a pipeline? I really find that
>> >> hard to believe...
>> >
>> > You dont have to manually hand code anything - its the compilers job.
>> 
>> Right, that was kinda my point: in that case the compiler could just as
>> well generate a (set of) BPF program(s) instead of this TC script thing.
>> 
>> >> > IOW, we are reusing and plugging into a proven and deployed mechanism
>> >> > with a built-in policy driven, transparent symbiosis between hardware
>> >> > offload and software that has matured over time. You can take a
>> >> > pipeline or a table or actions and split them between hardware and
>> >> > software transparently, etc.
>> >>
>> >> That's a control plane feature though, it's not an argument for adding
>> >> another interpreter to the kernel.
>> >
>> > I am not sure what you mean by control, but what i described is kernel
>> > built in. Of course i could do more complex things from user space (if
>> > that is what you mean as control).
>> 
>> "Control plane" as in SDN parlance. I.e., the bits that keep track of
>> configuration of the flow/pipeline/table configuration.
>> 
>> There's no reason you can't have all that infrastructure and use BPF as
>> the datapath language. I.e., instead of:
>> 
>> tc p4template create pipeline/aP4proggie numtables 1
>> ... + all the other stuff to populate it
>> 
>> you could just do:
>> 
>> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
>> 
>> and still have all the management infrastructure without the new
>> interpreter and associated complexity in the kernel.
>> 
>> >> > This hammer already meets our goals.
>> >>
>> >> That 60k+ line patch submission of yours says otherwise...
>> >
>> > This is pretty much covered in the cover letter and a few responses in
>> > the thread since.
>> 
>> The only argument for why your current approach makes sense I've seen
>> you make is "I don't want to rewrite it in BPF". Which is not a
>> technical argument.
>> 
>> I'm not trying to be disingenuous here, BTW: I really don't see the
>> technical argument for why the P4 data plane has to be implemented as
>> its own interpreter instead of integrating with what we have already
>> (i.e., BPF).
>> 
>> -Toke
>> 
>
> I'll just take this here becaues I think its mostly related.
>
> Still not convinced the P4TC has any value for sw. From the
> slide you say vendors prefer you have this picture roughtly.
>
>
>    [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
>         |
>         |
>    [ P4 Vendor backend ]
>         |
>         |
>         V
>    [ Devlink ]
>
>
> Now just replace P4TC backend with P4C and your only work is to
> replace devlink with the current hw specific bits and you have
> a sw and hw components. Then you get XDP-BPF pretty easily from
> P4XDP backend if you like. The compat piece is handled by compiler
> where it should be. My CPU is not a MAT so pretending it is seems
> not ideal to me, I don't have a TCAM on my cores.
>
> For runtime get those vendors to write their SDKs over Devlink
> and no need for this software thing. The runtime for P4c should
> already work over BPF. Giving this picture
>
>    [ P4 compiler ] ------ [ P4C backend ] ----> BPF
>         |
>         |
>    [ P4 Vendor backend ]
>         |
>         |
>         V
>    [ Devlink ]
>
> And much less work for us to maintain.

Yes, this was basically my point as well. Thank you for putting it into
ASCII diagrams! :)

There's still the control plane bit: some kernel component that
configures the pieces (pipelines?) created in the top-right and
bottom-left corners of your diagram(s), keeping track of which pipelines
are in HW/SW, maybe updating some match tables dynamically and
extracting statistics. I'm totally OK with having that bit be in the
kernel, but that can be added on top of your second diagram just as well
as on top of the first one...

-Toke
Tom Herbert Jan. 30, 2023, 10:41 p.m. UTC | #40
On Mon, Jan 30, 2023 at 1:10 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Toke Høiland-Jørgensen wrote:
> > Jamal Hadi Salim <hadi@mojatatu.com> writes:
> >
> > > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> > >>
> > >> > So i dont have to respond to each email individually, I will respond
> > >> > here in no particular order. First let me provide some context, if
> > >> > that was already clear please skip it. Hopefully providing the context
> > >> > will help us to focus otherwise that bikeshed's color and shape will
> > >> > take forever to settle on.
> > >> >
> > >> > __Context__
> > >> >
> > >> > I hope we all agree that when you have 2x100G NIC (and i have seen
> > >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> > >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> > >> > is not the answer. You need to offload.
> > >>
> > >> I'm not disputing the need to offload, and I'm personally delighted that
> > >> P4 is breaking open the vendor black boxes to provide a standardised
> > >> interface for this.
> > >>
> > >> However, while it's true that software can't keep up at the high end,
> > >> not everything runs at the high end, and today's high end is tomorrow's
> > >> mid end, in which XDP can very much play a role. So being able to move
> > >> smoothly between the two, and even implement functions that split
> > >> processing between them, is an essential feature of a programmable
> > >> networking path in Linux. Which is why I'm objecting to implementing the
> > >> P4 bits as something that's hanging off the side of the stack in its own
> > >> thing and is not integrated with the rest of the stack. You were touting
> > >> this as a feature ("being self-contained"). I consider it a bug.
> > >>
> > >> > Scriptability is not a new idea in TC (see u32 and pedit and others in
> > >> > TC).
> > >>
> > >> u32 is notoriously hard to use. The others are neat, but obviously
> > >> limited to particular use cases.
> > >
> > > Despite my love for u32, I admit its user interface is cryptic. I just
> > > wanted to point out to existing samples of scriptable and offloadable
> > > TC objects.
> > >
> > >> Do you actually expect anyone to use P4
> > >> by manually entering TC commands to build a pipeline? I really find that
> > >> hard to believe...
> > >
> > > You dont have to manually hand code anything - its the compilers job.
> >
> > Right, that was kinda my point: in that case the compiler could just as
> > well generate a (set of) BPF program(s) instead of this TC script thing.
> >
> > >> > IOW, we are reusing and plugging into a proven and deployed mechanism
> > >> > with a built-in policy driven, transparent symbiosis between hardware
> > >> > offload and software that has matured over time. You can take a
> > >> > pipeline or a table or actions and split them between hardware and
> > >> > software transparently, etc.
> > >>
> > >> That's a control plane feature though, it's not an argument for adding
> > >> another interpreter to the kernel.
> > >
> > > I am not sure what you mean by control, but what i described is kernel
> > > built in. Of course i could do more complex things from user space (if
> > > that is what you mean as control).
> >
> > "Control plane" as in SDN parlance. I.e., the bits that keep track of
> > configuration of the flow/pipeline/table configuration.
> >
> > There's no reason you can't have all that infrastructure and use BPF as
> > the datapath language. I.e., instead of:
> >
> > tc p4template create pipeline/aP4proggie numtables 1
> > ... + all the other stuff to populate it
> >
> > you could just do:
> >
> > tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
> >
> > and still have all the management infrastructure without the new
> > interpreter and associated complexity in the kernel.
> >
> > >> > This hammer already meets our goals.
> > >>
> > >> That 60k+ line patch submission of yours says otherwise...
> > >
> > > This is pretty much covered in the cover letter and a few responses in
> > > the thread since.
> >
> > The only argument for why your current approach makes sense I've seen
> > you make is "I don't want to rewrite it in BPF". Which is not a
> > technical argument.
> >
> > I'm not trying to be disingenuous here, BTW: I really don't see the
> > technical argument for why the P4 data plane has to be implemented as
> > its own interpreter instead of integrating with what we have already
> > (i.e., BPF).
> >
> > -Toke
> >
>
> I'll just take this here becaues I think its mostly related.
>
> Still not convinced the P4TC has any value for sw. From the
> slide you say vendors prefer you have this picture roughtly.
>
>
>    [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
>         |
>         |
>    [ P4 Vendor backend ]
>         |
>         |
>         V
>    [ Devlink ]
>
>
> Now just replace P4TC backend with P4C and your only work is to
> replace devlink with the current hw specific bits and you have
> a sw and hw components. Then you get XDP-BPF pretty easily from
> P4XDP backend if you like. The compat piece is handled by compiler
> where it should be. My CPU is not a MAT so pretending it is seems
> not ideal to me, I don't have a TCAM on my cores.
>
> For runtime get those vendors to write their SDKs over Devlink
> and no need for this software thing. The runtime for P4c should
> already work over BPF. Giving this picture
>
>    [ P4 compiler ] ------ [ P4C backend ] ----> BPF
>         |
>         |
>    [ P4 Vendor backend ]
>         |
>         |
>         V
>    [ Devlink ]
>

John, that's a good direction. If we go one step further and define a
common Intermediate Representation for programmable datapaths, we can
create a general solution that gives the user maximum flexibility and
freedom on both the frontend and the backend. For the front end they
can use whatever language they want as long as it supports an API that
can be compiled into the common IR (this is what PANDA does for
defining data paths in C). Similarly, for the backend we want to
support multiple targets both hardware and software. This is "write
once, run anywhere, run well": the developer writes their program
once, the same program runs on different targets, and on any
particular target the program runs as fast as possible given the
capabilities of the target.

There is another problem that a common IR addresses. The salient
requirement of kernel offload is that the offloaded functionality is
precisely equivalent to the kernel functionality that is being
offloaded. The traditional way this has been done is that the kernel
has to manage the bits offloaded to the device and provide all the
API. The problem is that it doesn't scale and quickly leads to
complexities like callouts to a jit compiler. My proposal is that we
compute an MD-5 hash of the IR and tag the program compiled from it
for the kernel (e.g. eBPF bytecode) and also tag the executable
compiled for the hardware (e.g. the P4 run-time).  At run time, there
kernel would query the device to see what program its running, if the
reported hash is equal to that of the loaded eBPF program, then the
device is running a functionally equivalent program and the offload
can safely be performed (via whatever datapath interfaces are needed).
This means that the device can be managed through a side channel, but
the kernel retains the necessary transparency to instantiate the
offload.

Here is a diagram of what this might look like:

[ P4 program ] ---- [ P4 compiler ] -----------------------+

           |
[ PANDA-C program ] ---- [ LLVM ]-----------------------+

           |
[ PANDA-Python program ] --- {Python compiler] ---+

           |
[ PANDA-Rust program ] --- [Rust compiler] ----------+

           |
[GUI] -------------[GUI to IR]---------------------------------+

           |
[CLI] --------------[CLI to IR]---------------------------------+

           |
                              +-----------------------------------------+
                              |
                              V
                  [Common IR (.json)]
                              |
+-----------------------+
|
+----[P4 Vendor Backend] ---- [Devlink]
|
+----[IR to eBPF backend compiler] --- [eBPF bytecode code]
|
+----[IR to CPU instructions] --- [Executable Binary]
|
+----[IR to P4TC CLI] --- [Script of commands]


> And much less work for us to maintain.

+1

>
> .John
Jamal Hadi Salim Jan. 30, 2023, 10:53 p.m. UTC | #41
I think we are going in cycles. John I asked you earlier and i think
you answered my question: You are actually pitching an out of band
runtime using some vendor sdk via devlink (why even bother with
devlink interface, not sure). P4TC is saying the runtime API is via
the kernel to the drivers.

Toke, i dont think i have managed to get across that there is an
"autonomous" control built into the kernel. It is not just things that
come across netlink. It's about the whole infra. I think without that
clarity we are going to speak past each other and it's a frustrating
discussion which could get emotional. You cant just displace, for
example flower and say "use ebpf because it works on tc", theres a lot
of tribal knowledge gluing relationship between hardware and software.
Maybe take a look at this patchset below to see an example which shows
how part of an action graph will work in hardware and partially in sw
under certain conditions:
https://www.spinics.net/lists/netdev/msg877507.html and then we can
have a better discussion.

cheers,
jamal


On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> John Fastabend <john.fastabend@gmail.com> writes:
>
> > Toke Høiland-Jørgensen wrote:
> >> Jamal Hadi Salim <hadi@mojatatu.com> writes:
> >>
> >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >> >>
> >> >> > So i dont have to respond to each email individually, I will respond
> >> >> > here in no particular order. First let me provide some context, if
> >> >> > that was already clear please skip it. Hopefully providing the context
> >> >> > will help us to focus otherwise that bikeshed's color and shape will
> >> >> > take forever to settle on.
> >> >> >
> >> >> > __Context__
> >> >> >
> >> >> > I hope we all agree that when you have 2x100G NIC (and i have seen
> >> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> >> >> > is not the answer. You need to offload.
> >> >>
> >> >> I'm not disputing the need to offload, and I'm personally delighted that
> >> >> P4 is breaking open the vendor black boxes to provide a standardised
> >> >> interface for this.
> >> >>
> >> >> However, while it's true that software can't keep up at the high end,
> >> >> not everything runs at the high end, and today's high end is tomorrow's
> >> >> mid end, in which XDP can very much play a role. So being able to move
> >> >> smoothly between the two, and even implement functions that split
> >> >> processing between them, is an essential feature of a programmable
> >> >> networking path in Linux. Which is why I'm objecting to implementing the
> >> >> P4 bits as something that's hanging off the side of the stack in its own
> >> >> thing and is not integrated with the rest of the stack. You were touting
> >> >> this as a feature ("being self-contained"). I consider it a bug.
> >> >>
> >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in
> >> >> > TC).
> >> >>
> >> >> u32 is notoriously hard to use. The others are neat, but obviously
> >> >> limited to particular use cases.
> >> >
> >> > Despite my love for u32, I admit its user interface is cryptic. I just
> >> > wanted to point out to existing samples of scriptable and offloadable
> >> > TC objects.
> >> >
> >> >> Do you actually expect anyone to use P4
> >> >> by manually entering TC commands to build a pipeline? I really find that
> >> >> hard to believe...
> >> >
> >> > You dont have to manually hand code anything - its the compilers job.
> >>
> >> Right, that was kinda my point: in that case the compiler could just as
> >> well generate a (set of) BPF program(s) instead of this TC script thing.
> >>
> >> >> > IOW, we are reusing and plugging into a proven and deployed mechanism
> >> >> > with a built-in policy driven, transparent symbiosis between hardware
> >> >> > offload and software that has matured over time. You can take a
> >> >> > pipeline or a table or actions and split them between hardware and
> >> >> > software transparently, etc.
> >> >>
> >> >> That's a control plane feature though, it's not an argument for adding
> >> >> another interpreter to the kernel.
> >> >
> >> > I am not sure what you mean by control, but what i described is kernel
> >> > built in. Of course i could do more complex things from user space (if
> >> > that is what you mean as control).
> >>
> >> "Control plane" as in SDN parlance. I.e., the bits that keep track of
> >> configuration of the flow/pipeline/table configuration.
> >>
> >> There's no reason you can't have all that infrastructure and use BPF as
> >> the datapath language. I.e., instead of:
> >>
> >> tc p4template create pipeline/aP4proggie numtables 1
> >> ... + all the other stuff to populate it
> >>
> >> you could just do:
> >>
> >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
> >>
> >> and still have all the management infrastructure without the new
> >> interpreter and associated complexity in the kernel.
> >>
> >> >> > This hammer already meets our goals.
> >> >>
> >> >> That 60k+ line patch submission of yours says otherwise...
> >> >
> >> > This is pretty much covered in the cover letter and a few responses in
> >> > the thread since.
> >>
> >> The only argument for why your current approach makes sense I've seen
> >> you make is "I don't want to rewrite it in BPF". Which is not a
> >> technical argument.
> >>
> >> I'm not trying to be disingenuous here, BTW: I really don't see the
> >> technical argument for why the P4 data plane has to be implemented as
> >> its own interpreter instead of integrating with what we have already
> >> (i.e., BPF).
> >>
> >> -Toke
> >>
> >
> > I'll just take this here becaues I think its mostly related.
> >
> > Still not convinced the P4TC has any value for sw. From the
> > slide you say vendors prefer you have this picture roughtly.
> >
> >
> >    [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
> >         |
> >         |
> >    [ P4 Vendor backend ]
> >         |
> >         |
> >         V
> >    [ Devlink ]
> >
> >
> > Now just replace P4TC backend with P4C and your only work is to
> > replace devlink with the current hw specific bits and you have
> > a sw and hw components. Then you get XDP-BPF pretty easily from
> > P4XDP backend if you like. The compat piece is handled by compiler
> > where it should be. My CPU is not a MAT so pretending it is seems
> > not ideal to me, I don't have a TCAM on my cores.
> >
> > For runtime get those vendors to write their SDKs over Devlink
> > and no need for this software thing. The runtime for P4c should
> > already work over BPF. Giving this picture
> >
> >    [ P4 compiler ] ------ [ P4C backend ] ----> BPF
> >         |
> >         |
> >    [ P4 Vendor backend ]
> >         |
> >         |
> >         V
> >    [ Devlink ]
> >
> > And much less work for us to maintain.
>
> Yes, this was basically my point as well. Thank you for putting it into
> ASCII diagrams! :)
>
> There's still the control plane bit: some kernel component that
> configures the pieces (pipelines?) created in the top-right and
> bottom-left corners of your diagram(s), keeping track of which pipelines
> are in HW/SW, maybe updating some match tables dynamically and
> extracting statistics. I'm totally OK with having that bit be in the
> kernel, but that can be added on top of your second diagram just as well
> as on top of the first one...
>
> -Toke
>
Singhai, Anjali Jan. 30, 2023, 11:24 p.m. UTC | #42
Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane.

Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries. 
Please explain through ebpf model how do I program the HW at runtime? 

Thanks
Anjali


-----Original Message-----
From: Jamal Hadi Salim <jhs@mojatatu.com> 
Sent: Monday, January 30, 2023 2:54 PM
To: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>; Jamal Hadi Salim <hadi@mojatatu.com>; Jiri Pirko <jiri@resnulli.us>; Willem de Bruijn <willemb@google.com>; Stanislav Fomichev <sdf@google.com>; Jakub Kicinski <kuba@kernel.org>; netdev@vger.kernel.org; kernel@mojatatu.com; Chatterjee, Deb <deb.chatterjee@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>; Limaye, Namrata <namrata.limaye@intel.com>; khalidm@nvidia.com; tom@sipanda.io; pratyush@sipanda.io; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; vladbu@nvidia.com; simon.horman@corigine.com; stefanc@marvell.com; seong.kim@amd.com; mattyk@nvidia.com; Daly, Dan <dan.daly@intel.com>; Fingerhut, John Andy <john.andy.fingerhut@intel.com>
Subject: Re: [PATCH net-next RFC 00/20] Introducing P4TC

I think we are going in cycles. John I asked you earlier and i think you answered my question: You are actually pitching an out of band runtime using some vendor sdk via devlink (why even bother with devlink interface, not sure). P4TC is saying the runtime API is via the kernel to the drivers.

Toke, i dont think i have managed to get across that there is an "autonomous" control built into the kernel. It is not just things that come across netlink. It's about the whole infra. I think without that clarity we are going to speak past each other and it's a frustrating discussion which could get emotional. You cant just displace, for example flower and say "use ebpf because it works on tc", theres a lot of tribal knowledge gluing relationship between hardware and software.
Maybe take a look at this patchset below to see an example which shows how part of an action graph will work in hardware and partially in sw under certain conditions:
https://www.spinics.net/lists/netdev/msg877507.html and then we can have a better discussion.

cheers,
jamal


On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> John Fastabend <john.fastabend@gmail.com> writes:
>
> > Toke Høiland-Jørgensen wrote:
> >> Jamal Hadi Salim <hadi@mojatatu.com> writes:
> >>
> >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >> >>
> >> >> > So i dont have to respond to each email individually, I will 
> >> >> > respond here in no particular order. First let me provide some 
> >> >> > context, if that was already clear please skip it. Hopefully 
> >> >> > providing the context will help us to focus otherwise that 
> >> >> > bikeshed's color and shape will take forever to settle on.
> >> >> >
> >> >> > __Context__
> >> >> >
> >> >> > I hope we all agree that when you have 2x100G NIC (and i have 
> >> >> > seen people asking for 2x800G NICs) no XDP or DPDK is going to 
> >> >> > save you. To
> >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> >> >> > is not the answer. You need to offload.
> >> >>
> >> >> I'm not disputing the need to offload, and I'm personally 
> >> >> delighted that
> >> >> P4 is breaking open the vendor black boxes to provide a 
> >> >> standardised interface for this.
> >> >>
> >> >> However, while it's true that software can't keep up at the high 
> >> >> end, not everything runs at the high end, and today's high end 
> >> >> is tomorrow's mid end, in which XDP can very much play a role. 
> >> >> So being able to move smoothly between the two, and even 
> >> >> implement functions that split processing between them, is an 
> >> >> essential feature of a programmable networking path in Linux. 
> >> >> Which is why I'm objecting to implementing the
> >> >> P4 bits as something that's hanging off the side of the stack in 
> >> >> its own thing and is not integrated with the rest of the stack. 
> >> >> You were touting this as a feature ("being self-contained"). I consider it a bug.
> >> >>
> >> >> > Scriptability is not a new idea in TC (see u32 and pedit and 
> >> >> > others in TC).
> >> >>
> >> >> u32 is notoriously hard to use. The others are neat, but 
> >> >> obviously limited to particular use cases.
> >> >
> >> > Despite my love for u32, I admit its user interface is cryptic. I 
> >> > just wanted to point out to existing samples of scriptable and 
> >> > offloadable TC objects.
> >> >
> >> >> Do you actually expect anyone to use P4 by manually entering TC 
> >> >> commands to build a pipeline? I really find that hard to 
> >> >> believe...
> >> >
> >> > You dont have to manually hand code anything - its the compilers job.
> >>
> >> Right, that was kinda my point: in that case the compiler could 
> >> just as well generate a (set of) BPF program(s) instead of this TC script thing.
> >>
> >> >> > IOW, we are reusing and plugging into a proven and deployed 
> >> >> > mechanism with a built-in policy driven, transparent symbiosis 
> >> >> > between hardware offload and software that has matured over 
> >> >> > time. You can take a pipeline or a table or actions and split 
> >> >> > them between hardware and software transparently, etc.
> >> >>
> >> >> That's a control plane feature though, it's not an argument for 
> >> >> adding another interpreter to the kernel.
> >> >
> >> > I am not sure what you mean by control, but what i described is 
> >> > kernel built in. Of course i could do more complex things from 
> >> > user space (if that is what you mean as control).
> >>
> >> "Control plane" as in SDN parlance. I.e., the bits that keep track 
> >> of configuration of the flow/pipeline/table configuration.
> >>
> >> There's no reason you can't have all that infrastructure and use 
> >> BPF as the datapath language. I.e., instead of:
> >>
> >> tc p4template create pipeline/aP4proggie numtables 1 ... + all the 
> >> other stuff to populate it
> >>
> >> you could just do:
> >>
> >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
> >>
> >> and still have all the management infrastructure without the new 
> >> interpreter and associated complexity in the kernel.
> >>
> >> >> > This hammer already meets our goals.
> >> >>
> >> >> That 60k+ line patch submission of yours says otherwise...
> >> >
> >> > This is pretty much covered in the cover letter and a few 
> >> > responses in the thread since.
> >>
> >> The only argument for why your current approach makes sense I've 
> >> seen you make is "I don't want to rewrite it in BPF". Which is not 
> >> a technical argument.
> >>
> >> I'm not trying to be disingenuous here, BTW: I really don't see the 
> >> technical argument for why the P4 data plane has to be implemented 
> >> as its own interpreter instead of integrating with what we have 
> >> already (i.e., BPF).
> >>
> >> -Toke
> >>
> >
> > I'll just take this here becaues I think its mostly related.
> >
> > Still not convinced the P4TC has any value for sw. From the slide 
> > you say vendors prefer you have this picture roughtly.
> >
> >
> >    [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
> >         |
> >         |
> >    [ P4 Vendor backend ]
> >         |
> >         |
> >         V
> >    [ Devlink ]
> >
> >
> > Now just replace P4TC backend with P4C and your only work is to 
> > replace devlink with the current hw specific bits and you have a sw 
> > and hw components. Then you get XDP-BPF pretty easily from P4XDP 
> > backend if you like. The compat piece is handled by compiler where 
> > it should be. My CPU is not a MAT so pretending it is seems not 
> > ideal to me, I don't have a TCAM on my cores.
> >
> > For runtime get those vendors to write their SDKs over Devlink and 
> > no need for this software thing. The runtime for P4c should already 
> > work over BPF. Giving this picture
> >
> >    [ P4 compiler ] ------ [ P4C backend ] ----> BPF
> >         |
> >         |
> >    [ P4 Vendor backend ]
> >         |
> >         |
> >         V
> >    [ Devlink ]
> >
> > And much less work for us to maintain.
>
> Yes, this was basically my point as well. Thank you for putting it 
> into ASCII diagrams! :)
>
> There's still the control plane bit: some kernel component that 
> configures the pieces (pipelines?) created in the top-right and 
> bottom-left corners of your diagram(s), keeping track of which 
> pipelines are in HW/SW, maybe updating some match tables dynamically 
> and extracting statistics. I'm totally OK with having that bit be in 
> the kernel, but that can be added on top of your second diagram just 
> as well as on top of the first one...
>
> -Toke
>
John Fastabend Jan. 30, 2023, 11:32 p.m. UTC | #43
Jamal Hadi Salim wrote:
> I think we are going in cycles. John I asked you earlier and i think
> you answered my question: You are actually pitching an out of band
> runtime using some vendor sdk via devlink (why even bother with
> devlink interface, not sure). P4TC is saying the runtime API is via
> the kernel to the drivers.

Not out of band, via devlink and a common API for all vendors to
implement so userspace applications can abstract away vendor
specifics of the runtime API as much as possible. Then runtime
component can implement the Linux API and abstract the hardware
away this way.

runtime API is still via the kernel and the the driver its just going
through devlink because its fundamentally a hardware configuration
and independent of a software datapath.

I think the key here is I see no value in (re)implementing a duplicate
software stack when we already have one and even have a backend for the
one we have. Its more general. And likely more performant.

If you want a software implementation do it in rocker so its clear
its a software implementatoin of a switch.

> 
> Toke, i dont think i have managed to get across that there is an
> "autonomous" control built into the kernel. It is not just things that
> come across netlink. It's about the whole infra. I think without that
> clarity we are going to speak past each other and it's a frustrating
> discussion which could get emotional. You cant just displace, for
> example flower and say "use ebpf because it works on tc", theres a lot
> of tribal knowledge gluing relationship between hardware and software.
> Maybe take a look at this patchset below to see an example which shows
> how part of an action graph will work in hardware and partially in sw
> under certain conditions:
> https://www.spinics.net/lists/netdev/msg877507.html and then we can
> have a better discussion.
> 
> cheers,
> jamal
> 
> 
> On Mon, Jan 30, 2023 at 4:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > John Fastabend <john.fastabend@gmail.com> writes:
> >
> > > Toke Høiland-Jørgensen wrote:
> > >> Jamal Hadi Salim <hadi@mojatatu.com> writes:
> > >>
> > >> > On Mon, Jan 30, 2023 at 12:04 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >> >>
> > >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> > >> >>
> > >> >> > So i dont have to respond to each email individually, I will respond
> > >> >> > here in no particular order. First let me provide some context, if
> > >> >> > that was already clear please skip it. Hopefully providing the context
> > >> >> > will help us to focus otherwise that bikeshed's color and shape will
> > >> >> > take forever to settle on.
> > >> >> >
> > >> >> > __Context__
> > >> >> >
> > >> >> > I hope we all agree that when you have 2x100G NIC (and i have seen
> > >> >> > people asking for 2x800G NICs) no XDP or DPDK is going to save you. To
> > >> >> > visualize: one 25G port is 35Mpps unidirectional. So "software stack"
> > >> >> > is not the answer. You need to offload.
> > >> >>
> > >> >> I'm not disputing the need to offload, and I'm personally delighted that
> > >> >> P4 is breaking open the vendor black boxes to provide a standardised
> > >> >> interface for this.
> > >> >>
> > >> >> However, while it's true that software can't keep up at the high end,
> > >> >> not everything runs at the high end, and today's high end is tomorrow's
> > >> >> mid end, in which XDP can very much play a role. So being able to move
> > >> >> smoothly between the two, and even implement functions that split
> > >> >> processing between them, is an essential feature of a programmable
> > >> >> networking path in Linux. Which is why I'm objecting to implementing the
> > >> >> P4 bits as something that's hanging off the side of the stack in its own
> > >> >> thing and is not integrated with the rest of the stack. You were touting
> > >> >> this as a feature ("being self-contained"). I consider it a bug.
> > >> >>
> > >> >> > Scriptability is not a new idea in TC (see u32 and pedit and others in
> > >> >> > TC).
> > >> >>
> > >> >> u32 is notoriously hard to use. The others are neat, but obviously
> > >> >> limited to particular use cases.
> > >> >
> > >> > Despite my love for u32, I admit its user interface is cryptic. I just
> > >> > wanted to point out to existing samples of scriptable and offloadable
> > >> > TC objects.
> > >> >
> > >> >> Do you actually expect anyone to use P4
> > >> >> by manually entering TC commands to build a pipeline? I really find that
> > >> >> hard to believe...
> > >> >
> > >> > You dont have to manually hand code anything - its the compilers job.
> > >>
> > >> Right, that was kinda my point: in that case the compiler could just as
> > >> well generate a (set of) BPF program(s) instead of this TC script thing.
> > >>
> > >> >> > IOW, we are reusing and plugging into a proven and deployed mechanism
> > >> >> > with a built-in policy driven, transparent symbiosis between hardware
> > >> >> > offload and software that has matured over time. You can take a
> > >> >> > pipeline or a table or actions and split them between hardware and
> > >> >> > software transparently, etc.
> > >> >>
> > >> >> That's a control plane feature though, it's not an argument for adding
> > >> >> another interpreter to the kernel.
> > >> >
> > >> > I am not sure what you mean by control, but what i described is kernel
> > >> > built in. Of course i could do more complex things from user space (if
> > >> > that is what you mean as control).
> > >>
> > >> "Control plane" as in SDN parlance. I.e., the bits that keep track of
> > >> configuration of the flow/pipeline/table configuration.
> > >>
> > >> There's no reason you can't have all that infrastructure and use BPF as
> > >> the datapath language. I.e., instead of:
> > >>
> > >> tc p4template create pipeline/aP4proggie numtables 1
> > >> ... + all the other stuff to populate it
> > >>
> > >> you could just do:
> > >>
> > >> tc p4 create pipeline/aP4proggie obj_file aP4proggie.bpf.o
> > >>
> > >> and still have all the management infrastructure without the new
> > >> interpreter and associated complexity in the kernel.
> > >>
> > >> >> > This hammer already meets our goals.
> > >> >>
> > >> >> That 60k+ line patch submission of yours says otherwise...
> > >> >
> > >> > This is pretty much covered in the cover letter and a few responses in
> > >> > the thread since.
> > >>
> > >> The only argument for why your current approach makes sense I've seen
> > >> you make is "I don't want to rewrite it in BPF". Which is not a
> > >> technical argument.
> > >>
> > >> I'm not trying to be disingenuous here, BTW: I really don't see the
> > >> technical argument for why the P4 data plane has to be implemented as
> > >> its own interpreter instead of integrating with what we have already
> > >> (i.e., BPF).
> > >>
> > >> -Toke
> > >>
> > >
> > > I'll just take this here becaues I think its mostly related.
> > >
> > > Still not convinced the P4TC has any value for sw. From the
> > > slide you say vendors prefer you have this picture roughtly.
> > >
> > >
> > >    [ P4 compiler ] ------ [ P4TC backend ] ----> TC API
> > >         |
> > >         |
> > >    [ P4 Vendor backend ]
> > >         |
> > >         |
> > >         V
> > >    [ Devlink ]
> > >
> > >
> > > Now just replace P4TC backend with P4C and your only work is to
> > > replace devlink with the current hw specific bits and you have
> > > a sw and hw components. Then you get XDP-BPF pretty easily from
> > > P4XDP backend if you like. The compat piece is handled by compiler
> > > where it should be. My CPU is not a MAT so pretending it is seems
> > > not ideal to me, I don't have a TCAM on my cores.
> > >
> > > For runtime get those vendors to write their SDKs over Devlink
> > > and no need for this software thing. The runtime for P4c should
> > > already work over BPF. Giving this picture
> > >
> > >    [ P4 compiler ] ------ [ P4C backend ] ----> BPF
> > >         |
> > >         |
> > >    [ P4 Vendor backend ]
> > >         |
> > >         |
> > >         V
> > >    [ Devlink ]
> > >
> > > And much less work for us to maintain.
> >
> > Yes, this was basically my point as well. Thank you for putting it into
> > ASCII diagrams! :)
> >
> > There's still the control plane bit: some kernel component that
> > configures the pieces (pipelines?) created in the top-right and
> > bottom-left corners of your diagram(s), keeping track of which pipelines
> > are in HW/SW, maybe updating some match tables dynamically and
> > extracting statistics. I'm totally OK with having that bit be in the
> > kernel, but that can be added on top of your second diagram just as well
> > as on top of the first one...
> >
> > -Toke
> >
John Fastabend Jan. 31, 2023, 12:06 a.m. UTC | #44
Singhai, Anjali wrote:
> Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane.
> 
> Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries. 
> Please explain through ebpf model how do I program the HW at runtime? 
> 
> Thanks
> Anjali
> 

Didn't see this as it was top posted but, the answer is you don't program
hardware the ebpf when your underlying target is a MAT.

Use devlink for the runtime programming as well, its there to program
hardware. This "Devlink is NOT for the runtime programming" is
just an artificate of the design here which I disagree with and it feels
like many other folks also disagree.

Also if you have some flows going to SW you want to use the most
performant option you have which would be XDP-BPF at the moment in a
standard linux box or maybe af-xdp. So in these cases you should look
to divide your P4 pipeline between XDP and HW. Sure you can say
performance doesn't matter for my use case, but surely it does for
some things and anyways you have the performant thing already built
so just use it.

Thanks,
John
Jamal Hadi Salim Jan. 31, 2023, 12:26 a.m. UTC | #45
On Mon, Jan 30, 2023 at 7:06 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Singhai, Anjali wrote:
> > Devlink is only for downloading the vendor specific compiler output for a P4 program and for teaching the driver about the names of runtime P4 object as to how they map onto the HW. This helps with the Initial definition of the Dataplane.
> >
> > Devlink is NOT for the runtime programming of the Dataplane, that has to go through the P4TC block for anybody to deploy a programmable dataplane between the HW and the SW and have some flows that are in HW and some in SW or some processing HW and some in SW. ndo_setup_tc framework and support in the drivers will give us the hooks to program the HW match-action entries.
> > Please explain through ebpf model how do I program the HW at runtime?
> >
> > Thanks
> > Anjali
> >
>
> Didn't see this as it was top posted but, the answer is you don't program
> hardware the ebpf when your underlying target is a MAT.
>
> Use devlink for the runtime programming as well, its there to program
> hardware. This "Devlink is NOT for the runtime programming" is
> just an artificate of the design here which I disagree with and it feels
> like many other folks also disagree.
>

We are going to need strong justification to use devlink for
programming the  binary interface to begin with - see the driver
models discussion. And let me get this clear: you are suggesting we
use it for runtime and redo all that tc ndo and associated infra?

cheers,
jamal

> Also if you have some flows going to SW you want to use the most
> performant option you have which would be XDP-BPF at the moment in a
> standard linux box or maybe af-xdp. So in these cases you should look
> to divide your P4 pipeline between XDP and HW. Sure you can say
> performance doesn't matter for my use case, but surely it does for
> some things and anyways you have the performant thing already built
> so just use it.
> Thanks,
> John
Jakub Kicinski Jan. 31, 2023, 4:12 a.m. UTC | #46
On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote:
> > Didn't see this as it was top posted but, the answer is you don't program
> > hardware the ebpf when your underlying target is a MAT.
> >
> > Use devlink for the runtime programming as well, its there to program
> > hardware. This "Devlink is NOT for the runtime programming" is
> > just an artificate of the design here which I disagree with and it feels
> > like many other folks also disagree.
> 
> We are going to need strong justification to use devlink for
> programming the binary interface to begin with

We may disagree on direction, but we should agree status quo / reality.

What John described is what we suggested to Intel to do (2+ years ago),
and what is already implemented upstream. Grep for DDP.

IIRC my opinion back then was that unless kernel has any use for
whatever the configuration exposes - we should stay out of it.

> - see the driver models discussion. 
> 
> And let me get this clear: you are suggesting we
> use it for runtime and redo all that tc ndo and associated infra?
Jamal Hadi Salim Jan. 31, 2023, 10:27 a.m. UTC | #47
On Mon, Jan 30, 2023 at 11:12 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote:
> > > Didn't see this as it was top posted but, the answer is you don't program
> > > hardware the ebpf when your underlying target is a MAT.
> > >
> > > Use devlink for the runtime programming as well, its there to program
> > > hardware. This "Devlink is NOT for the runtime programming" is
> > > just an artificate of the design here which I disagree with and it feels
> > > like many other folks also disagree.
> >
> > We are going to need strong justification to use devlink for
> > programming the binary interface to begin with
>
> We may disagree on direction, but we should agree status quo / reality.
>
> What John described is what we suggested to Intel to do (2+ years ago),
> and what is already implemented upstream. Grep for DDP.
>

I went back and looked at the email threads - I hope i got the right
one from 2020.

Note, there are two paths in P4TC:
DDP loading via devlink is equivalent to loading the P4 binary for the hardware.
That is one of the 3 (and currently most popular) driver interfaces
suggested. Some of that drew
Second is runtime which is via standard TC. John's proposal is
equivalent to suggesting moving the flower interface Devlink. That is
not the same as loading the config.

> IIRC my opinion back then was that unless kernel has any use for
> whatever the configuration exposes - we should stay out of it.

It does for runtime and the tc infra already takes care of that. The
cover letter says:

"...one can be more explicit and specify "skip_sw" or "skip_hw" to either
offload the entry (if a NIC or switch driver is capable) or make it purely run
entirely in the kernel or in a cooperative mode between kernel and user space."

cheers,
jamal
Jamal Hadi Salim Jan. 31, 2023, 10:30 a.m. UTC | #48
On Tue, Jan 31, 2023 at 5:27 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Mon, Jan 30, 2023 at 11:12 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 30 Jan 2023 19:26:05 -0500 Jamal Hadi Salim wrote:
> > > > Didn't see this as it was top posted but, the answer is you don't program
> > > > hardware the ebpf when your underlying target is a MAT.
> > > >
> > > > Use devlink for the runtime programming as well, its there to program
> > > > hardware. This "Devlink is NOT for the runtime programming" is
> > > > just an artificate of the design here which I disagree with and it feels
> > > > like many other folks also disagree.
> > >
> > > We are going to need strong justification to use devlink for
> > > programming the binary interface to begin with
> >
> > We may disagree on direction, but we should agree status quo / reality.
> >
> > What John described is what we suggested to Intel to do (2+ years ago),
> > and what is already implemented upstream. Grep for DDP.
> >
>
> I went back and looked at the email threads - I hope i got the right
> one from 2020.
>
> Note, there are two paths in P4TC:
> DDP loading via devlink is equivalent to loading the P4 binary for the hardware.
> That is one of the 3 (and currently most popular) driver interfaces
> suggested. Some of that drew

Sorry didnt finish my thought here, wanted to say: The loading of the
P4 binary over devlink drew (to some people) suspicion it is going to
be used for loading kernel bypass.

cheers,
jamal

> Second is runtime which is via standard TC. John's proposal is
> equivalent to suggesting moving the flower interface Devlink. That is
> not the same as loading the config.
>
> > IIRC my opinion back then was that unless kernel has any use for
> > whatever the configuration exposes - we should stay out of it.
>
> It does for runtime and the tc infra already takes care of that. The
> cover letter says:
>
> "...one can be more explicit and specify "skip_sw" or "skip_hw" to either
> offload the entry (if a NIC or switch driver is capable) or make it purely run
> entirely in the kernel or in a cooperative mode between kernel and user space."
>
> cheers,
> jamal
Toke Høiland-Jørgensen Jan. 31, 2023, 12:17 p.m. UTC | #49
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> Toke, i dont think i have managed to get across that there is an
> "autonomous" control built into the kernel. It is not just things that
> come across netlink. It's about the whole infra.

I'm not disputing the need for the TC infra to configure the pipelines
and their relationship in the hardware. I'm saying that your
implementation *of the SW path* is the wrong approach and it would be
better done by using BPF (not talking about the existing TC-BPF,
either).

It's a bit hard to know your thinking for sure here, since your patch
series doesn't include any of the offload control bits. But from the
slides and your hints in this series, AFAICT, the flow goes something
like:

hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob);
sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)

tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id)

which will turn into something like:

struct p4_cls_offload ofl = {
  .classid = classid,
  .pipeline_id = hw_pipeline_id
};

if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */
  return -EINVAL;

netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl);


I.e, all that's being passed to the hardware is the ID of the
pre-programmed pipeline, because that programming is going to be
out-of-band via devlink anyway.

In which case, you could just as well replace the above:

sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)

with

sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */

and achieve exactly the same.

Having all the P4 data types and concepts exist inside the kernel
*might* make sense if the kernel could then translate those into the
hardware representations and manage their lifecycle in a uniform way.
But as far as I can tell from the slides and what you've been saying in
this thread that's not going to be possible anyway, so why do you need
anything more granular than the pipeline ID?

-Toke
Jiri Pirko Jan. 31, 2023, 12:37 p.m. UTC | #50
Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote:
>Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
>> Toke, i dont think i have managed to get across that there is an
>> "autonomous" control built into the kernel. It is not just things that
>> come across netlink. It's about the whole infra.
>
>I'm not disputing the need for the TC infra to configure the pipelines
>and their relationship in the hardware. I'm saying that your
>implementation *of the SW path* is the wrong approach and it would be
>better done by using BPF (not talking about the existing TC-BPF,
>either).
>
>It's a bit hard to know your thinking for sure here, since your patch
>series doesn't include any of the offload control bits. But from the
>slides and your hints in this series, AFAICT, the flow goes something
>like:
>
>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob);

I don't think that devlink is the correct iface for this. If you want to
tight it together with the SW pipeline configurable by TC, use TC as you
do for the BPF binary in this example. If you have the TC-block shared
among many netdevs, the HW needs to know that for binding the P4 input.

Btw, you can have multiple netdevs of different vendors sharing the same
TC-block, then you need to upload multiple HW binary blobs here.

What it eventually might result with is that the userspace would upload
a list of binaries with indication what is the target:
"BPF" -> xxx.o
"DRIVERNAMEX" -> aaa.bin
"DRIVERNAMEY" -> bbb.bin
In theory, there might be a HW to accept the BPF binary :) My point is,
userspace provides a list of binaries, individual kernel parts take what
they like.


>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>
>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id)
>
>which will turn into something like:
>
>struct p4_cls_offload ofl = {
>  .classid = classid,
>  .pipeline_id = hw_pipeline_id
>};
>
>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */

Ha! I would like to see this magic here :)


>  return -EINVAL;
>
>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl);
>
>
>I.e, all that's being passed to the hardware is the ID of the
>pre-programmed pipeline, because that programming is going to be
>out-of-band via devlink anyway.
>
>In which case, you could just as well replace the above:
>
>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>
>with
>
>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */
>
>and achieve exactly the same.
>
>Having all the P4 data types and concepts exist inside the kernel
>*might* make sense if the kernel could then translate those into the
>hardware representations and manage their lifecycle in a uniform way.
>But as far as I can tell from the slides and what you've been saying in
>this thread that's not going to be possible anyway, so why do you need
>anything more granular than the pipeline ID?
>
>-Toke
>
Jiri Pirko Jan. 31, 2023, 2:38 p.m. UTC | #51
Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote:
>Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
>> Toke, i dont think i have managed to get across that there is an
>> "autonomous" control built into the kernel. It is not just things that
>> come across netlink. It's about the whole infra.
>
>I'm not disputing the need for the TC infra to configure the pipelines
>and their relationship in the hardware. I'm saying that your
>implementation *of the SW path* is the wrong approach and it would be
>better done by using BPF (not talking about the existing TC-BPF,
>either).
>
>It's a bit hard to know your thinking for sure here, since your patch
>series doesn't include any of the offload control bits. But from the
>slides and your hints in this series, AFAICT, the flow goes something
>like:
>
>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob);
>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>
>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id)
>
>which will turn into something like:
>
>struct p4_cls_offload ofl = {
>  .classid = classid,
>  .pipeline_id = hw_pipeline_id
>};
>
>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */
>  return -EINVAL;
>
>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl);
>
>
>I.e, all that's being passed to the hardware is the ID of the
>pre-programmed pipeline, because that programming is going to be
>out-of-band via devlink anyway.
>
>In which case, you could just as well replace the above:
>
>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>
>with
>
>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */
>
>and achieve exactly the same.
>
>Having all the P4 data types and concepts exist inside the kernel
>*might* make sense if the kernel could then translate those into the
>hardware representations and manage their lifecycle in a uniform way.
>But as far as I can tell from the slides and what you've been saying in
>this thread that's not going to be possible anyway, so why do you need
>anything more granular than the pipeline ID?

Toke, I understand what what you describe above is applicable for the P4
program instantiation (pipeline definition).

What is the suggestion for the actual "rule insertions" ? Would it make
sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW
path and offloaded HW path?


>
>-Toke
>
Toke Høiland-Jørgensen Jan. 31, 2023, 5:01 p.m. UTC | #52
Jiri Pirko <jiri@resnulli.us> writes:

> Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote:
>>Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>>> Toke, i dont think i have managed to get across that there is an
>>> "autonomous" control built into the kernel. It is not just things that
>>> come across netlink. It's about the whole infra.
>>
>>I'm not disputing the need for the TC infra to configure the pipelines
>>and their relationship in the hardware. I'm saying that your
>>implementation *of the SW path* is the wrong approach and it would be
>>better done by using BPF (not talking about the existing TC-BPF,
>>either).
>>
>>It's a bit hard to know your thinking for sure here, since your patch
>>series doesn't include any of the offload control bits. But from the
>>slides and your hints in this series, AFAICT, the flow goes something
>>like:
>>
>>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob);
>>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>>
>>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id)
>>
>>which will turn into something like:
>>
>>struct p4_cls_offload ofl = {
>>  .classid = classid,
>>  .pipeline_id = hw_pipeline_id
>>};
>>
>>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */
>>  return -EINVAL;
>>
>>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl);
>>
>>
>>I.e, all that's being passed to the hardware is the ID of the
>>pre-programmed pipeline, because that programming is going to be
>>out-of-band via devlink anyway.
>>
>>In which case, you could just as well replace the above:
>>
>>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
>>
>>with
>>
>>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */
>>
>>and achieve exactly the same.
>>
>>Having all the P4 data types and concepts exist inside the kernel
>>*might* make sense if the kernel could then translate those into the
>>hardware representations and manage their lifecycle in a uniform way.
>>But as far as I can tell from the slides and what you've been saying in
>>this thread that's not going to be possible anyway, so why do you need
>>anything more granular than the pipeline ID?
>
> Toke, I understand what what you describe above is applicable for the P4
> program instantiation (pipeline definition).
>
> What is the suggestion for the actual "rule insertions" ? Would it make
> sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW
> path and offloaded HW path?

Hmm, so by "rule insertions" here you're referring to populating what P4
calls 'tables', right?

I could see a couple of ways this could be bridged between the BPF side
and the HW side:

- Create a new BPF map type that is backed by the TC-internal data
  structure, so updates from userspace go via the TC interface, but BPF
  programs access the contents via the bpf_map_*() helpers (or we could
  allow updating via the bpf() syscall as well)

- Expose the TC data structures to BPF via their own set of kfuncs,
  similar to what we did for conntrack

- Scrap the TC interface entirely and make this an offload-enabled BPF
  map type (using the BPF ndo and bpf_map_dev_ops operations to update
  it). Userspace would then populate it via the bpf() syscall like any
  other map.


I suspect the map interface is the most straight-forward to use from the
BPF side, but informing this by what existing implementations do
(thinking of the P4->XDP compiler in particular) might be a good idea?

-Toke
Jakub Kicinski Jan. 31, 2023, 7:10 p.m. UTC | #53
On Tue, 31 Jan 2023 05:30:10 -0500 Jamal Hadi Salim wrote:
> > Note, there are two paths in P4TC:
> > DDP loading via devlink is equivalent to loading the P4 binary for the hardware.
> > That is one of the 3 (and currently most popular) driver interfaces
> > suggested. Some of that drew  
> 
> Sorry didnt finish my thought here, wanted to say: The loading of the
> P4 binary over devlink drew (to some people) suspicion it is going to
> be used for loading kernel bypass.

The only practical use case I heard was the IPU. Worrying about devlink
programming being a bypass on an IPU is like rearranging chairs on the
Titanic.
Jamal Hadi Salim Jan. 31, 2023, 10:23 p.m. UTC | #54
So while going through this thought process, things to consider:
1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
and their packet driven iteration. Perhaps (the patch i pointed to
from Paul Blakey) where part of the action graph runs in sw.
2) The dynamicity of being able to trigger table offloads and/or
kernel table updates which are packet driven (consider scenario where
they have iterated the hardware and ingressed into the kernel).

cheers,
jamal

On Tue, Jan 31, 2023 at 12:01 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jiri Pirko <jiri@resnulli.us> writes:
>
> > Tue, Jan 31, 2023 at 01:17:14PM CET, toke@redhat.com wrote:
> >>Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >>> Toke, i dont think i have managed to get across that there is an
> >>> "autonomous" control built into the kernel. It is not just things that
> >>> come across netlink. It's about the whole infra.
> >>
> >>I'm not disputing the need for the TC infra to configure the pipelines
> >>and their relationship in the hardware. I'm saying that your
> >>implementation *of the SW path* is the wrong approach and it would be
> >>better done by using BPF (not talking about the existing TC-BPF,
> >>either).
> >>
> >>It's a bit hard to know your thinking for sure here, since your patch
> >>series doesn't include any of the offload control bits. But from the
> >>slides and your hints in this series, AFAICT, the flow goes something
> >>like:
> >>
> >>hw_pipeline_id = devlink_program_hardware(dev, p4_compiled_blob);
> >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
> >>
> >>tc_act = tc_act_create(hw_pipeline_id, sw_pipeline_id)
> >>
> >>which will turn into something like:
> >>
> >>struct p4_cls_offload ofl = {
> >>  .classid = classid,
> >>  .pipeline_id = hw_pipeline_id
> >>};
> >>
> >>if (check_sw_and_hw_equivalence(hw_pipeline_id, sw_pipeline_id)) /* some magic check here */
> >>  return -EINVAL;
> >>
> >>netdev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_P4, &ofl);
> >>
> >>
> >>I.e, all that's being passed to the hardware is the ID of the
> >>pre-programmed pipeline, because that programming is going to be
> >>out-of-band via devlink anyway.
> >>
> >>In which case, you could just as well replace the above:
> >>
> >>sw_pipeline_id = `tc p4template create ...` (etc, this is generated by P4C)
> >>
> >>with
> >>
> >>sw_pipeline_id = bpf_prog_load(BPF_PROG_TYPE_P4TC, "my_obj_file.o"); /* my_obj_file is created by P4c */
> >>
> >>and achieve exactly the same.
> >>
> >>Having all the P4 data types and concepts exist inside the kernel
> >>*might* make sense if the kernel could then translate those into the
> >>hardware representations and manage their lifecycle in a uniform way.
> >>But as far as I can tell from the slides and what you've been saying in
> >>this thread that's not going to be possible anyway, so why do you need
> >>anything more granular than the pipeline ID?
> >
> > Toke, I understand what what you describe above is applicable for the P4
> > program instantiation (pipeline definition).
> >
> > What is the suggestion for the actual "rule insertions" ? Would it make
> > sense to use TC iface (Jamal's or similar) to insert rules to both BPF SW
> > path and offloaded HW path?
>
> Hmm, so by "rule insertions" here you're referring to populating what P4
> calls 'tables', right?
>
> I could see a couple of ways this could be bridged between the BPF side
> and the HW side:
>
> - Create a new BPF map type that is backed by the TC-internal data
>   structure, so updates from userspace go via the TC interface, but BPF
>   programs access the contents via the bpf_map_*() helpers (or we could
>   allow updating via the bpf() syscall as well)
>
> - Expose the TC data structures to BPF via their own set of kfuncs,
>   similar to what we did for conntrack
>
> - Scrap the TC interface entirely and make this an offload-enabled BPF
>   map type (using the BPF ndo and bpf_map_dev_ops operations to update
>   it). Userspace would then populate it via the bpf() syscall like any
>   other map.
>
>
> I suspect the map interface is the most straight-forward to use from the
> BPF side, but informing this by what existing implementations do
> (thinking of the P4->XDP compiler in particular) might be a good idea?
>
> -Toke
>
Jamal Hadi Salim Jan. 31, 2023, 10:32 p.m. UTC | #55
On Tue, Jan 31, 2023 at 2:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 31 Jan 2023 05:30:10 -0500 Jamal Hadi Salim wrote:
> > > Note, there are two paths in P4TC:
> > > DDP loading via devlink is equivalent to loading the P4 binary for the hardware.
> > > That is one of the 3 (and currently most popular) driver interfaces
> > > suggested. Some of that drew
> >
> > Sorry didnt finish my thought here, wanted to say: The loading of the
> > P4 binary over devlink drew (to some people) suspicion it is going to
> > be used for loading kernel bypass.
>
> The only practical use case I heard was the IPU. Worrying about devlink
> programming being a bypass on an IPU is like rearranging chairs on the
> Titanic.

BTW, I do believe FNICs are heading in that direction as well. I didnt
quiet follow the titanic chairs analogy, can you elaborate on that
statement?

cheers,
jamal
Jakub Kicinski Jan. 31, 2023, 10:36 p.m. UTC | #56
On Tue, 31 Jan 2023 17:32:52 -0500 Jamal Hadi Salim wrote:
> > > Sorry didnt finish my thought here, wanted to say: The loading of the
> > > P4 binary over devlink drew (to some people) suspicion it is going to
> > > be used for loading kernel bypass.  
> >
> > The only practical use case I heard was the IPU. Worrying about devlink
> > programming being a bypass on an IPU is like rearranging chairs on the
> > Titanic.  
> 
> BTW, I do believe FNICs are heading in that direction as well.
> I didnt quiet follow the titanic chairs analogy, can you elaborate on
> that statement?

https://en.wiktionary.org/wiki/rearrange_the_deck_chairs_on_the_Titanic
Jamal Hadi Salim Jan. 31, 2023, 10:50 p.m. UTC | #57
On Tue, Jan 31, 2023 at 5:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 31 Jan 2023 17:32:52 -0500 Jamal Hadi Salim wrote:
> > > > Sorry didnt finish my thought here, wanted to say: The loading of the
> > > > P4 binary over devlink drew (to some people) suspicion it is going to
> > > > be used for loading kernel bypass.
> > >
> > > The only practical use case I heard was the IPU. Worrying about devlink
> > > programming being a bypass on an IPU is like rearranging chairs on the
> > > Titanic.
> >
> > BTW, I do believe FNICs are heading in that direction as well.
> > I didnt quiet follow the titanic chairs analogy, can you elaborate on
> > that statement?
>
> https://en.wiktionary.org/wiki/rearrange_the_deck_chairs_on_the_Titanic

LoL. Lets convince Jiri then.
On programming devlink for the runtime I would respectfully disagree
that it is the right interface.

cheers,
jamal
Toke Høiland-Jørgensen Jan. 31, 2023, 10:53 p.m. UTC | #58
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> So while going through this thought process, things to consider:
> 1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
> and their packet driven iteration. Perhaps (the patch i pointed to
> from Paul Blakey) where part of the action graph runs in sw.

Yeah, I agree that mixed-mode operation is an important consideration,
and presumably attaching metadata directly to a packet on the hardware
side, and accessing that in sw, is in scope as well? We seem to have
landed on exposing that sort of thing via kfuncs in XDP, so expanding on
that seems reasonable at a first glance.

> 2) The dynamicity of being able to trigger table offloads and/or
> kernel table updates which are packet driven (consider scenario where
> they have iterated the hardware and ingressed into the kernel).

That could be done by either interface, though: the kernel can propagate
a bpf_map_update() from a BPF program to the hardware version of the
table as well. I suspect a map-based API at least on the BPF side would
be more natural, but I don't really have a strong opinion on this :)

-Toke
Jamal Hadi Salim Jan. 31, 2023, 11:31 p.m. UTC | #59
On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > So while going through this thought process, things to consider:
> > 1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
> > and their packet driven iteration. Perhaps (the patch i pointed to
> > from Paul Blakey) where part of the action graph runs in sw.
>
> Yeah, I agree that mixed-mode operation is an important consideration,
> and presumably attaching metadata directly to a packet on the hardware
> side, and accessing that in sw, is in scope as well? We seem to have
> landed on exposing that sort of thing via kfuncs in XDP, so expanding on
> that seems reasonable at a first glance.

There is  built-in metadata chain id/prio/protocol (stored in cls
common struct) passed when the policy is installed. The hardware may
be able to handle received (probably packet encapsulated, but i
believe that is vendor specific) metadata and transform it into the
appropriate continuation point. Maybe a simpler example is to look at
the patch from Paul (since that is the most recent change, so it is
sticking in my brain); if you can follow the example,  you'll see
there's some state that is transferred for the action with a cookie
from/to the driver.

> > 2) The dynamicity of being able to trigger table offloads and/or
> > kernel table updates which are packet driven (consider scenario where
> > they have iterated the hardware and ingressed into the kernel).
>
> That could be done by either interface, though: the kernel can propagate
> a bpf_map_update() from a BPF program to the hardware version of the
> table as well. I suspect a map-based API at least on the BPF side would
> be more natural, but I don't really have a strong opinion on this :)

Should have mentioned this earlier as requirement:
Speed of update is _extremely_ important, i.e how fast you can update
could make or break things; see talk from Marcelo/Vlad[1]. My gut
feeling is dealing with feedback from some vendor firmware/driver
interface that the entry is really offloaded may cause challenges for
ebpf by stalling the program. We have seen upto several ms delays on
occasions.

cheers,
jamal
[1] https://netdevconf.info/0x15/session.html?Where-turbo-boosting-TC-flower-control-path-had-led-us-to
Toke Høiland-Jørgensen Feb. 1, 2023, 6:08 p.m. UTC | #60
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > So while going through this thought process, things to consider:
>> > 1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
>> > and their packet driven iteration. Perhaps (the patch i pointed to
>> > from Paul Blakey) where part of the action graph runs in sw.
>>
>> Yeah, I agree that mixed-mode operation is an important consideration,
>> and presumably attaching metadata directly to a packet on the hardware
>> side, and accessing that in sw, is in scope as well? We seem to have
>> landed on exposing that sort of thing via kfuncs in XDP, so expanding on
>> that seems reasonable at a first glance.
>
> There is  built-in metadata chain id/prio/protocol (stored in cls
> common struct) passed when the policy is installed. The hardware may
> be able to handle received (probably packet encapsulated, but i
> believe that is vendor specific) metadata and transform it into the
> appropriate continuation point. Maybe a simpler example is to look at
> the patch from Paul (since that is the most recent change, so it is
> sticking in my brain); if you can follow the example,  you'll see
> there's some state that is transferred for the action with a cookie
> from/to the driver.

Right, that roughly fits my understanding. Just adding a kfunc to fetch
that cookie would be the obvious way to expose it to BPF.

>> > 2) The dynamicity of being able to trigger table offloads and/or
>> > kernel table updates which are packet driven (consider scenario where
>> > they have iterated the hardware and ingressed into the kernel).
>>
>> That could be done by either interface, though: the kernel can propagate
>> a bpf_map_update() from a BPF program to the hardware version of the
>> table as well. I suspect a map-based API at least on the BPF side would
>> be more natural, but I don't really have a strong opinion on this :)
>
> Should have mentioned this earlier as requirement:
> Speed of update is _extremely_ important, i.e how fast you can update
> could make or break things; see talk from Marcelo/Vlad[1]. My gut
> feeling is dealing with feedback from some vendor firmware/driver
> interface that the entry is really offloaded may cause challenges for
> ebpf by stalling the program. We have seen upto several ms delays on
> occasions.

Right, understandable. That seems kinda orthogonal to which API is used
to expose this data, though? In the end it's all just kernel code, and,
well, if updating things in an offloaded map/table is taking too long,
we'll have to either fix the underlying code to make it faster, or the
application will have keep things only in software? :)

-Toke
Jamal Hadi Salim Feb. 2, 2023, 6:50 p.m. UTC | #61
Sorry I was distracted somewhere else.
I am not sure i fully grokked your proposal but I am willing to go
through this thought exercise with you (perhaps a higher bandwidth
media would help); however,  we should put some parameters so it
doesnt become a perpetual discussion:

The starting premise is that posted code meets our requirements so
whatever we do using ebpf has to meet our requirements; we dont want
to get into a wrestling match with any of the ebpf constraints.
Actually, I am ok with some limited degree of square hole round peg
situation but it cant be interfering in getting work done. I would
also be ok with small surgeries into the ebpf core if needed to meet
our requirements.
Performance and maintainability are also on the table.

Let me know what you think.

cheers,
jamal

On Wed, Feb 1, 2023 at 1:08 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >> > So while going through this thought process, things to consider:
> >> > 1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
> >> > and their packet driven iteration. Perhaps (the patch i pointed to
> >> > from Paul Blakey) where part of the action graph runs in sw.
> >>
> >> Yeah, I agree that mixed-mode operation is an important consideration,
> >> and presumably attaching metadata directly to a packet on the hardware
> >> side, and accessing that in sw, is in scope as well? We seem to have
> >> landed on exposing that sort of thing via kfuncs in XDP, so expanding on
> >> that seems reasonable at a first glance.
> >
> > There is  built-in metadata chain id/prio/protocol (stored in cls
> > common struct) passed when the policy is installed. The hardware may
> > be able to handle received (probably packet encapsulated, but i
> > believe that is vendor specific) metadata and transform it into the
> > appropriate continuation point. Maybe a simpler example is to look at
> > the patch from Paul (since that is the most recent change, so it is
> > sticking in my brain); if you can follow the example,  you'll see
> > there's some state that is transferred for the action with a cookie
> > from/to the driver.
>
> Right, that roughly fits my understanding. Just adding a kfunc to fetch
> that cookie would be the obvious way to expose it to BPF.
>
> >> > 2) The dynamicity of being able to trigger table offloads and/or
> >> > kernel table updates which are packet driven (consider scenario where
> >> > they have iterated the hardware and ingressed into the kernel).
> >>
> >> That could be done by either interface, though: the kernel can propagate
> >> a bpf_map_update() from a BPF program to the hardware version of the
> >> table as well. I suspect a map-based API at least on the BPF side would
> >> be more natural, but I don't really have a strong opinion on this :)
> >
> > Should have mentioned this earlier as requirement:
> > Speed of update is _extremely_ important, i.e how fast you can update
> > could make or break things; see talk from Marcelo/Vlad[1]. My gut
> > feeling is dealing with feedback from some vendor firmware/driver
> > interface that the entry is really offloaded may cause challenges for
> > ebpf by stalling the program. We have seen upto several ms delays on
> > occasions.
>
> Right, understandable. That seems kinda orthogonal to which API is used
> to expose this data, though? In the end it's all just kernel code, and,
> well, if updating things in an offloaded map/table is taking too long,
> we'll have to either fix the underlying code to make it faster, or the
> application will have keep things only in software? :)
>
> -Toke
>
Tom Herbert Feb. 2, 2023, 11:34 p.m. UTC | #62
On Thu, Feb 2, 2023 at 10:51 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Sorry I was distracted somewhere else.
> I am not sure i fully grokked your proposal but I am willing to go
> through this thought exercise with you (perhaps a higher bandwidth
> media would help); however,  we should put some parameters so it
> doesnt become a perpetual discussion:
>
> The starting premise is that posted code meets our requirements so
> whatever we do using ebpf has to meet our requirements; we dont want
> to get into a wrestling match with any of the ebpf constraints.
> Actually, I am ok with some limited degree of square hole round peg
> situation but it cant be interfering in getting work done. I would
> also be ok with small surgeries into the ebpf core if needed to meet
> our requirements.

Can you elaborate on what the problems are with using eBPF? I know
there is at least one P4->eBPF compiler, what is lacking that doesn't
meet your requirements?

> Performance and maintainability are also on the table.

Performance of the software datapath is of paramount importance. My
fundamental concern here is that if we push an underperforming
software solution, then the patches don't just enable offload, they'll
be used to *justify* it. That is, the hardware vendors might go to
their customers and show how much better the offload is than the slow
software solution; whereas if they compared to a higher performing
software solution it might meet the performance requirements of the
customer thereby saving them the cost and complexity of offload. Note
we've already been down this path once with DPDK once being touted as
being "10x faster than the kernel" with little regard to whether the
kernel could be tuned or adapted-- of course, we subsequently invented
XDP and pretty much closed the gap.

Tom

>
> Let me know what you think.
>
> cheers,
> jamal
>
> On Wed, Feb 1, 2023 at 1:08 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >
> > > On Tue, Jan 31, 2023 at 5:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> > >>
> > >> > So while going through this thought process, things to consider:
> > >> > 1) The autonomy of the tc infra, essentially the skip_sw/hw  controls
> > >> > and their packet driven iteration. Perhaps (the patch i pointed to
> > >> > from Paul Blakey) where part of the action graph runs in sw.
> > >>
> > >> Yeah, I agree that mixed-mode operation is an important consideration,
> > >> and presumably attaching metadata directly to a packet on the hardware
> > >> side, and accessing that in sw, is in scope as well? We seem to have
> > >> landed on exposing that sort of thing via kfuncs in XDP, so expanding on
> > >> that seems reasonable at a first glance.
> > >
> > > There is  built-in metadata chain id/prio/protocol (stored in cls
> > > common struct) passed when the policy is installed. The hardware may
> > > be able to handle received (probably packet encapsulated, but i
> > > believe that is vendor specific) metadata and transform it into the
> > > appropriate continuation point. Maybe a simpler example is to look at
> > > the patch from Paul (since that is the most recent change, so it is
> > > sticking in my brain); if you can follow the example,  you'll see
> > > there's some state that is transferred for the action with a cookie
> > > from/to the driver.
> >
> > Right, that roughly fits my understanding. Just adding a kfunc to fetch
> > that cookie would be the obvious way to expose it to BPF.
> >
> > >> > 2) The dynamicity of being able to trigger table offloads and/or
> > >> > kernel table updates which are packet driven (consider scenario where
> > >> > they have iterated the hardware and ingressed into the kernel).
> > >>
> > >> That could be done by either interface, though: the kernel can propagate
> > >> a bpf_map_update() from a BPF program to the hardware version of the
> > >> table as well. I suspect a map-based API at least on the BPF side would
> > >> be more natural, but I don't really have a strong opinion on this :)
> > >
> > > Should have mentioned this earlier as requirement:
> > > Speed of update is _extremely_ important, i.e how fast you can update
> > > could make or break things; see talk from Marcelo/Vlad[1]. My gut
> > > feeling is dealing with feedback from some vendor firmware/driver
> > > interface that the entry is really offloaded may cause challenges for
> > > ebpf by stalling the program. We have seen upto several ms delays on
> > > occasions.
> >
> > Right, understandable. That seems kinda orthogonal to which API is used
> > to expose this data, though? In the end it's all just kernel code, and,
> > well, if updating things in an offloaded map/table is taking too long,
> > we'll have to either fix the underlying code to make it faster, or the
> > application will have keep things only in software? :)
> >
> > -Toke
> >
Edward Cree Feb. 14, 2023, 5:07 p.m. UTC | #63
On 30/01/2023 14:06, Jamal Hadi Salim wrote:
> So what are we trying to achieve with P4TC? John, I could have done a
> better job in describing the goals in the cover letter:
> We are going for MAT sw equivalence to what is in hardware. A two-fer
> that is already provided by the existing TC infrastructure.
...
> This hammer already meets our goals.

I'd like to give a perspective from the AMD/Xilinx/Solarflare SmartNIC
 project.  Though I must stress I'm not speaking for that organisation,
 and I wasn't the one writing the P4 code; these are just my personal
 observations based on the view I had from within the project team.
We used P4 in the SN1022's datapath, but encountered a number of
 limitations that prevented a wholly P4-based implementation, in spite
 of the hardware being MAT/CAM flavoured.  Overall I would say that P4
 was not a great fit for the problem space; it was usually possible to
 get it to do what we wanted but only by bending it in unnatural ways.
 (The advantage was, of course, the strong toolchain for compiling it
 into optimised logic on the FPGA; writing the whole thing by hand in
 RTL would have taken far more effort.)
Developing a worthwhile P4-based datapath proved to be something of an
 engineer-time sink; compilation and verification weren't quick, and
 just because your P4 works in a software model doesn't necessarily
 mean it will perform well in hardware.
Thus P4 is, in my personal opinion, a poor choice for end-user/runtime
 behaviour specification, at least for FPGA-flavoured devices.  It
 works okay for a multi-month product development project, is just
 about viable for implementing something like a pipeline plugin, but
 treating it as a fully flexible software-defined datapath is not
 something that will fly.

> I would argue further that in
> the near future a lot of the stuff including transport will eventually
> have to partially or fully move to hardware (see the HOMA keynote for
> a sample space[0]).

I think HOMA is very interesting and I agree hardware doing something
 like it will eventually be needed.  But as you admit, P4TC doesn't
 address that — unsurprising, since the kind of dynamic imperative
 behaviour involved is totally outside P4's wheelhouse.  So maybe I'm
 missing your point here but I don't see why you bring it up.

Ultimately I think trying to expose the underlying hardware as a P4
 platform is the wrong abstraction layer to provide to userspace.
It's trying too hard to avoid protocol ossification, by requiring the
 entire pipeline to be user-definable at a bit level, but in the real
 world if someone wants to deploy a new low-level protocol they'll be
 better off upgrading their kernel and drivers to offload the new
 protocol-specific *feature* onto protocol-agnostic *hardware* than
 trying to develop and validate a P4 pipeline.
It is only protocol ossification in *hardware* that is a problem for
 this kind of thing (not to be confused with the ossification problem
 on a network where you can't use new proto because a middlebox
 somewhere in the path barfs on it); protocol-specific SW APIs are
 only a problem if they result in vendors designing ossified hardware
 (to implement exactly those APIs and nothing else), which hopefully
 we've all learned not to do by now.

On 30/01/2023 03:09, Singhai, Anjali wrote:
> There is also argument that is being made about using ebpf for
> implementing the SW path, may be I am missing the part as to how do
> you offload if not to another general purpose core even if it is not
> as evolved as the current day Xeon's.

I have to be a little circumspect here as I don't know how much we've
 made public, but there are good prospects for FPGA offloads of eBPF
 with high performance.  The instructions can be transformed into a
 pipeline of logic blocks which look nothing like a Von Neumann
 architecture, so can get much better perf/area and perf/power than an
 array of general-purpose cores.
My personal belief (which I don't, alas, have hard data to back up) is
 that this approach will also outperform the 'array of specialised
 packet-processor cores' that many NPU/DPU products are using.

In the situations where you do need a custom datapath (which often
 involve the kind of dynamic behaviour that's not P4-friendly), eBPF
 is, I would say, far superior to P4 as an IR.

-ed
Jamal Hadi Salim Feb. 14, 2023, 8:44 p.m. UTC | #64
Hi Ed,

On Tue, Feb 14, 2023 at 12:07 PM Edward Cree <ecree.xilinx@gmail.com> wrote:
>
> On 30/01/2023 14:06, Jamal Hadi Salim wrote:
> > So what are we trying to achieve with P4TC? John, I could have done a
> > better job in describing the goals in the cover letter:
> > We are going for MAT sw equivalence to what is in hardware. A two-fer
> > that is already provided by the existing TC infrastructure.
> ...
> > This hammer already meets our goals.
>
> I'd like to give a perspective from the AMD/Xilinx/Solarflare SmartNIC
>  project.  Though I must stress I'm not speaking for that organisation,
>  and I wasn't the one writing the P4 code; these are just my personal
>  observations based on the view I had from within the project team.
> We used P4 in the SN1022's datapath, but encountered a number of
>  limitations that prevented a wholly P4-based implementation, in spite
>  of the hardware being MAT/CAM flavoured.
>  Overall I would say that P4
>  was not a great fit for the problem space; it was usually possible to
>  get it to do what we wanted but only by bending it in unnatural ways.
>  (The advantage was, of course, the strong toolchain for compiling it
>  into optimised logic on the FPGA; writing the whole thing by hand in
>  RTL would have taken far more effort.)
> Developing a worthwhile P4-based datapath proved to be something of an
>  engineer-time sink; compilation and verification weren't quick, and
>  just because your P4 works in a software model doesn't necessarily
>  mean it will perform well in hardware.
> Thus P4 is, in my personal opinion, a poor choice for end-user/runtime
>  behaviour specification, at least for FPGA-flavoured devices.

I am curios to understand the challenges you came across specific to
P4 in what you describe above.
My gut feeling is, depending on the P4 program, you ran out of
resources. How many LUTs does this device offer? I am going to hazard
a guess that 30-40% of the resources on the FPGA  were just for P4
abstraction in which case writing a complex P4 program just wont fit.
Having said that, tooling is also very important as part of the
developer experience - if it takes forever to compile things then that
developer experience goes down the tubes. Maybe it is a tooling
challenge?
IMO:
it is also about operational experience (i.e the ops not just the
devs) and deployment infra is key. IOW, it's not just about the
datapath but also the full package integration, for example, ease of
control plane integration, field debuggability, operational usability,
etc... If you are doing a one-off you can integrate whatever
infrastructure you want. If you are a cloud vendor you have the skills
in house and it may be worth investing in them. If you are a second
tier operator or large enterprise OTOH it is not part of your business
model to stock up with smart people.

>   It
>  works okay for a multi-month product development project, is just
>  about viable for implementing something like a pipeline plugin, but
>  treating it as a fully flexible software-defined datapath is not
>  something that will fly.
>

I would argue that FPGA projects tend to be one-offs mostly
(multi-month very specialized solutions). If you want a generic,
repeatable solution you will have to pay the cost for abstraction
(both performance and resource consumption). Then you can train people
to be able to operate the repeatable solutions in some manual.

> > I would argue further that in
> > the near future a lot of the stuff including transport will eventually
> > have to partially or fully move to hardware (see the HOMA keynote for
> > a sample space[0]).
>
> I think HOMA is very interesting and I agree hardware doing something
>  like it will eventually be needed.  But as you admit, P4TC doesn't
>  address that — unsurprising, since the kind of dynamic imperative
>  behaviour involved is totally outside P4's wheelhouse.  So maybe I'm
>  missing your point here but I don't see why you bring it up.

It was a response to the sentiment that XDP or ebpf is needed to solve
the performance problem. My response was: i can't count on s/w saving
me from 800Gbps ethernet port capacity; i gave that transport offload
example as a statement of the inevitability of even things outside the
classical L2-L4 datapath infrastructure to eventually move to offload.

> Ultimately I think trying to expose the underlying hardware as a P4
>  platform is the wrong abstraction layer to provide to userspace.

If you mean transport layer exposure via P4 then I would agree. But
for L2-L4 the P4 abstraction (TC as well) is match-action pipeline
which works very well today with control plane abstraction from user
space.

> It's trying too hard to avoid protocol ossification, by requiring the
>  entire pipeline to be user-definable at a bit level, but in the real
>  world if someone wants to deploy a new low-level protocol they'll be
>  better off upgrading their kernel and drivers to offload the new
>  protocol-specific *feature* onto protocol-agnostic *hardware* than
>  trying to develop and validate a P4 pipeline.

I agree with your  view on low-level bit confusion in P4 (depending on
how you write your program); however, I dont agree with the
perspective that you can somehow write that code for your new action
or new header processing and then go ahead and upgrade the driver and
maybe install some new firmware is the right solution. If you have the
skills, sure. But if you are second tier consumer, sourcing from
multiple NIC vendors, and want to offload a new
pipeline/protocol-specific feature across those NICs i would argue
that those skills are not within your reach unless you standardize
that interface (which is what P4 and P4TC strive for). I am not saying
the abstraction is free rather that it is worth the return on
investment for this scenario.

> It is only protocol ossification in *hardware* that is a problem for
>  this kind of thing (not to be confused with the ossification problem
>  on a network where you can't use new proto because a middlebox
>  somewhere in the path barfs on it); protocol-specific SW APIs are
>  only a problem if they result in vendors designing ossified hardware
>  (to implement exactly those APIs and nothing else), which hopefully
>  we've all learned not to do by now.

It's more of a challenge on velocity-to-feature and getting the whole
package with the same effort by specification with P4 i.e starting
with the datapath all the way to the control plane. And that instead
of multi-vendor APIs for protocol-specific solutions (vendors are
pitching DPDK APIs mostly)  we are suggesting that unifying API is
P4TC etc for all vendors.

BTW: I am not arguing that on an FPGA you can generate very optimal
RTL code(that is both resource and computation efficient) which is
very specific to the target datapath. I am sure there are use cases
for that. OTOH, there is a very large set of users who would rather go
for the match-action paradigm for generality of abstraction.

BTW, in your response below to Anjali:
Sure, you can start with ebpf  - why not any other language? What is
the connection to RTL? the frontend you said you have used is P4 for
example and you could generate that into RTL.

cheers,
jamal


> On 30/01/2023 03:09, Singhai, Anjali wrote:
> > There is also argument that is being made about using ebpf for
> > implementing the SW path, may be I am missing the part as to how do
> > you offload if not to another general purpose core even if it is not
> > as evolved as the current day Xeon's.
>
> I have to be a little circumspect here as I don't know how much we've
>  made public, but there are good prospects for FPGA offloads of eBPF
>  with high performance.  The instructions can be transformed into a
>  pipeline of logic blocks which look nothing like a Von Neumann
>  architecture, so can get much better perf/area and perf/power than an
>  array of general-purpose cores.
> My personal belief (which I don't, alas, have hard data to back up) is
>  that this approach will also outperform the 'array of specialised
>  packet-processor cores' that many NPU/DPU products are using.
>
> In the situations where you do need a custom datapath (which often
>  involve the kind of dynamic behaviour that's not P4-friendly), eBPF
>  is, I would say, far superior to P4 as an IR.
>
> -ed
Jamal Hadi Salim Feb. 16, 2023, 8:24 p.m. UTC | #65
Hi,

Want to provide an update to this thread and a summary of where we are
(typing this on web browser client so i hope it doesnt come all
mangled up):

I have had high bandwidth discussions with several people offlist
(thanks to everyone who invested their time in trying to smoothen
this); sometimes cooler   headers prevail this way. We are willing
(and are starting) to invest time to  see how we can fit ebpf for the
software datapath. Should be noted that we did  look at ebpf when this
project started and we ended up not going that path. I think what is
new in this equation is the concept of kfuncs - which we didnt  have
back then. Perhaps with kfuncs we can make both worlds work together.
XDP as well is appealing.

As i have stated earlier:
The starting premise is that the posted code meets our requirements so
 whatever we do using ebpf has to meet our requirements. I am ok with
some limited degree of square hole round peg situation but it cant  be
interfering in meeting our goals.

So let me restate those goals so we dont go into some rabit hole in
the discussion:
1) Supporting P4 in the kernel both for the sw and hw datapath
utilizing the well established tc infra which allows both sw
equivalence and hw offload.  We are _not_ going to reinvent this.
Essentially we get the whole package: from the control plane to the
tooling infra, netlink messaging to s/w and h/w symbiosis, the
autonomous kernel control, etc. The advantage is that we have a
singular vendor-neutral interface via the kernel using well understood
mechanisms.
Behavioral equivalence between hw and sw is a given.

2) Operational usability - this is encoded currently in the
scriptability approach. Ex, I can just ship someone a shell script in
an email but more important if they have deployed tc offloads the
runtime semantics are unchanged. "write once, run anywhere" paradigm
is easier to state in ascii;-> The interface is designed to be
scriptable to remove the burden of  making kernel and user space code
changes for any new processing functions  (whether in s/w or
hardware).
3) Debuggability - developers and ops people who are familiar with tc
offloads  can continue using the _same existing techniques and tools_.
This also eases support.

4) Performance - note our angle on this, based on the niche we are
looking at  is "if you want performance then offload". However, one
discussion point that has been raised multiple times in the thread and
in private is that  there are performance gains when using ebpf. This
arguement is reasonable and a motivator for us to invest our time in
evaluating.

 We have started doing off the cuff measurements. With very simple P4
program which receives a packet, looks up a table, and on a hit
changes src/mac address then forwards. We have: A) implemented a
handcoded ebpf program, B) generated  P4TC sw only C) flower s/w  only
(skip_hw) rules and D) hardware offload (skip_sw) all on tc (so we can
do orange-orange comparison). The SUT has a dual port CX6 NIC capable
of offloading pedit and mirred. Trex is connected to one port and
sending http gets which goes via the box and a response comes back on
the other port which we send back to trex. The traffic is very
asymettric; data coming  back to the client fills up the 25G pipe but
ACKs going back consume a lot less. Unfortunately all 4 scenarios were
able to handle the wire rate - we are going to set up more nasty
traffic generation later; for now we opted to look at cpu utilization
for the 4 scenarios. We have the following results:

 A) 35% CPU B) 39% C) 36% D) 0%

This is by no means a good test but i wanted to illustrate the
relevance of  #D (0%) - which is a main itch for us.

We need to test more complex programs which is where probably the
performance of ebpf will shine. XDP for sure will beat all the others
- but i would rather get the facts in place first. So we are investing
effort in this direction
and will share results at some point.

There may be other low hanging fruits that have been brought up in the
discussion for ebpf (the parser being one); we will be looking at all
those as well.

Note:
 The goal of this exercise for us is to evaluate not just performance
but   also consider how it affects the other P4TC goals. There may be
a sweet happy point somewhere in there but we need to collect the data
instead of hypothesizing.

cheers,
jamal