mbox series

[RFC,net-next,0/7] 802.1Q Frame Preemption and 802.3 MAC Merge support via ethtool

Message ID 20220816222920.1952936-1-vladimir.oltean@nxp.com (mailing list archive)
Headers show
Series 802.1Q Frame Preemption and 802.3 MAC Merge support via ethtool | expand

Message

Vladimir Oltean Aug. 16, 2022, 10:29 p.m. UTC
Vinicius' progress on upstreaming frame preemption support for Intel I226
seemed to stall, so I decided to give it a go using my own view as well.
https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

Please don't take this patch set too seriously; I spent only a few
days working on this, and I'm only posting it as RFC to inform others
I've started doing this, before I spend too much time to risk colliding
with someone else's active work.

Compared to Vinicius' previous patches, this is basically a new
implementation, with the following differences:

- The MAC Merge (mm) and Frame Preemption (fp) settings are split like
  they were in Vinicius' proposal to have fp as part of tc-taprio. But
  in my proposal, the fp portion is still part of ethtool, like mm.

- We have statistics, actually 2 kinds. First we have MAC merge layer
  stats, which are exposed as protocol-specific stats:

  ethtool --json --include-statistics --show-mm eno2
  [ {
          "ifname": "eno2",
          "verify-status": "SUCCEEDED",
          "verify-time": 10,
          "supported": true,
          "enabled": true,
          "active": true,
          "add-frag-size": 0,
          "statistics": {
              "MACMergeFrameAssErrorCount": 0,
              "MACMergeFrameSmdErrorCount": 0,
              "MACMergeFrameAssOkCount": 0,
              "MACMergeFragCountRx": 0,
              "MACMergeFragCountTx": 0,
              "MACMergeHoldCount": 0
          }
      } ]

  and then we also have the usual standardized statistics counters, but
  replicated for the pMAC:

  ethtool -S eno0 --groups pmac-rmon
  Standard stats for eno0:
  pmac-rmon-etherStatsUndersizePkts: 0
  pmac-rmon-etherStatsOversizePkts: 0
  pmac-rmon-etherStatsFragments: 0
  pmac-rmon-etherStatsJabbers: 0
  rx-pmac-rmon-etherStatsPkts64to64Octets: 0
  rx-pmac-rmon-etherStatsPkts65to127Octets: 0
  rx-pmac-rmon-etherStatsPkts128to255Octets: 0
  rx-pmac-rmon-etherStatsPkts256to511Octets: 0
  rx-pmac-rmon-etherStatsPkts512to1023Octets: 0
  rx-pmac-rmon-etherStatsPkts1024to1522Octets: 0
  rx-pmac-rmon-etherStatsPkts1523to9000Octets: 0
  tx-pmac-rmon-etherStatsPkts64to64Octets: 0
  tx-pmac-rmon-etherStatsPkts65to127Octets: 0
  tx-pmac-rmon-etherStatsPkts128to255Octets: 0
  tx-pmac-rmon-etherStatsPkts256to511Octets: 0
  tx-pmac-rmon-etherStatsPkts512to1023Octets: 0
  tx-pmac-rmon-etherStatsPkts1024to1522Octets: 0
  tx-pmac-rmon-etherStatsPkts1523to9000Octets: 0

  ethtool -S eno0 --groups eth-pmac-mac
  Standard stats for eno0:
  eth-pmac-mac-FramesTransmittedOK: 0
  eth-pmac-mac-SingleCollisionFrames: 0
  eth-pmac-mac-MultipleCollisionFrames: 0
  eth-pmac-mac-FramesReceivedOK: 0
  eth-pmac-mac-FrameCheckSequenceErrors: 0
  eth-pmac-mac-AlignmentErrors: 0
  eth-pmac-mac-OctetsTransmittedOK: 0
  eth-pmac-mac-FramesWithDeferredXmissions: 0
  eth-pmac-mac-LateCollisions: 0
  eth-pmac-mac-FramesAbortedDueToXSColls: 0
  eth-pmac-mac-FramesLostDueToIntMACXmitError: 0
  eth-pmac-mac-CarrierSenseErrors: 0
  eth-pmac-mac-OctetsReceivedOK: 0
  eth-pmac-mac-FramesLostDueToIntMACRcvError: 0
  eth-pmac-mac-MulticastFramesXmittedOK: 0
  eth-pmac-mac-BroadcastFramesXmittedOK: 0
  eth-pmac-mac-MulticastFramesReceivedOK: 0
  eth-pmac-mac-BroadcastFramesReceivedOK: 0

  ethtool -S eno0 --groups eth-pmac-ctrl
  Standard stats for eno0:
  eth-pmac-ctrl-MACControlFramesTransmitted: 0
  eth-pmac-ctrl-MACControlFramesReceived: 0

  What also exists but is not exported here are PAUSE stats for the
  pMAC. Since those are also protocol-specific stats, I'm not sure how
  to mix the 2 (MAC Merge layer + PAUSE). Maybe just extend
  ETHTOOL_A_PAUSE_STAT_TX_FRAMES and ETHTOOL_A_PAUSE_STAT_RX_FRAMES with
  the pMAC variants?

- Finally, the hardware I'm working with (here, the test vehicle is the
  NXP ENETC from LS1028A, although I have patches for the Felix switch
  as well, but those need a bit of a revolution in the driver to go in
  first). This hardware is not without its flaws, but at least allows me
  to concentrate on the UAPI portions for this series.

I also have a kselftest written, but it's for the Felix switch (covers
forwarding latency) and so it's not included here.

Are there objections in exposing the UAPI for this new feature in this way?

Also, there is no documentation associated with this patch set, other
than the code. Life is too short to write documentation for an RFC, sorry.
I may get kdoc related kernel bot warnings because I copy-pasted ethtool
structure definitions from here and there, but I didn't fill in the
descriptions of all their fields. All those fields are as truthful to
the standards as possible rather than my own variables or names, so
please refer to those specs for now.

Vladimir Oltean (7):
  net: ethtool: netlink: introduce ethnl_update_bool()
  net: ethtool: add support for Frame Preemption and MAC Merge layer
  net: ethtool: stats: make stats_put_stats() take input from multiple
    sources
  net: ethtool: stats: replicate standardized counters for the pMAC
  net: enetc: parameterize port MAC stats to also cover the pMAC
  net: enetc: expose some standardized ethtool counters
  net: enetc: add support for Frame Preemption and MAC Merge layer

 .../ethernet/freescale/enetc/enetc_ethtool.c  | 399 +++++++++++++++---
 .../net/ethernet/freescale/enetc/enetc_hw.h   | 132 +++---
 include/linux/ethtool.h                       |  68 +++
 include/uapi/linux/ethtool.h                  |  15 +
 include/uapi/linux/ethtool_netlink.h          |  86 ++++
 net/ethtool/Makefile                          |   3 +-
 net/ethtool/fp.c                              | 295 +++++++++++++
 net/ethtool/mm.c                              | 228 ++++++++++
 net/ethtool/netlink.c                         |  38 ++
 net/ethtool/netlink.h                         |  34 ++
 net/ethtool/stats.c                           | 218 +++++++---
 11 files changed, 1338 insertions(+), 178 deletions(-)
 create mode 100644 net/ethtool/fp.c
 create mode 100644 net/ethtool/mm.c

Comments

Jakub Kicinski Aug. 17, 2022, 3:34 a.m. UTC | #1
On Wed, 17 Aug 2022 01:29:13 +0300 Vladimir Oltean wrote:
>   What also exists but is not exported here are PAUSE stats for the
>   pMAC. Since those are also protocol-specific stats, I'm not sure how
>   to mix the 2 (MAC Merge layer + PAUSE). Maybe just extend
>   ETHTOOL_A_PAUSE_STAT_TX_FRAMES and ETHTOOL_A_PAUSE_STAT_RX_FRAMES with
>   the pMAC variants?

I have a couple of general questions. The mm and fp are related but fp
can be implemented without mm or they must always come together? (I'd
still split patch 2 for ease of review, tho.)

When we have separate set of stats for pMAC the normal stats are sum of
all traffic, right? So normal - pMAC == eMAC, everything that's not
preemptible is express?

Did you consider adding an attribute for switching between MAC and pMAC
for stats rather than duplicating things?
Vladimir Oltean Aug. 17, 2022, 11:50 a.m. UTC | #2
On Tue, Aug 16, 2022 at 08:34:17PM -0700, Jakub Kicinski wrote:
> On Wed, 17 Aug 2022 01:29:13 +0300 Vladimir Oltean wrote:
> >   What also exists but is not exported here are PAUSE stats for the
> >   pMAC. Since those are also protocol-specific stats, I'm not sure how
> >   to mix the 2 (MAC Merge layer + PAUSE). Maybe just extend
> >   ETHTOOL_A_PAUSE_STAT_TX_FRAMES and ETHTOOL_A_PAUSE_STAT_RX_FRAMES with
> >   the pMAC variants?
> 
> I have a couple of general questions. The mm and fp are related but fp
> can be implemented without mm or they must always come together? (I'd
> still split patch 2 for ease of review, tho.)

FP cannot be implemented without MM and MM makes limited (but some)
sense without FP. Since FP just decides which packets you TX via the
pMAC and which via the eMAC, you can configure just the MM layer such
that you interoperate with a FP-capable switch, but you don't actually
generate any preemptable traffic yourself.

In fact, the reasons why I decided to split these are:
- because they are part of different specs, which call for different
  managed objects
- because in an SoC where IPs are mixed and matched from different
  vendors, it makes perfect sense to me that the FP portion (more
  related to the queue/classification system) is provided by one vendor,
  and the MM portion is provided by another. In the future, we may find
  enough commonalities to justify introducing the concept of a dedicated
  MAC driver, independent/reusable between Ethernet controller ("net_device")
  drivers. We have this today already with the PCS layer in phylink.
  So if there is a physical split between the layers, I think keeping a
  split in terms of callbacks makes some sense too.

> When we have separate set of stats for pMAC the normal stats are sum of
> all traffic, right? So normal - pMAC == eMAC, everything that's not
> preemptible is express?

Actually not quite, or at least not for the LS1028A ENETC and Felix switch.
The normal counters report just what the eMAC sees, and the pMAC counters
just what the pMAC sees. After all, only the eMAC was enabled up until now.
Nobody does the addition currently.

> Did you consider adding an attribute for switching between MAC and pMAC
> for stats rather than duplicating things?

No. Could you expand on that idea a little? Add a netlink attribute
where, and this helps reduce duplication where, and how?
Jakub Kicinski Aug. 17, 2022, 6:46 p.m. UTC | #3
On Wed, 17 Aug 2022 11:50:09 +0000 Vladimir Oltean wrote:
> On Tue, Aug 16, 2022 at 08:34:17PM -0700, Jakub Kicinski wrote:
> > I have a couple of general questions. The mm and fp are related but fp
> > can be implemented without mm or they must always come together? (I'd
> > still split patch 2 for ease of review, tho.)  
> 
> FP cannot be implemented without MM and MM makes limited (but some)
> sense without FP. Since FP just decides which packets you TX via the
> pMAC and which via the eMAC, you can configure just the MM layer such
> that you interoperate with a FP-capable switch, but you don't actually
> generate any preemptable traffic yourself.
> 
> In fact, the reasons why I decided to split these are:
> - because they are part of different specs, which call for different
>   managed objects
> - because in an SoC where IPs are mixed and matched from different
>   vendors, it makes perfect sense to me that the FP portion (more
>   related to the queue/classification system) is provided by one vendor,
>   and the MM portion is provided by another. In the future, we may find
>   enough commonalities to justify introducing the concept of a dedicated
>   MAC driver, independent/reusable between Ethernet controller ("net_device")
>   drivers. We have this today already with the PCS layer in phylink.
>   So if there is a physical split between the layers, I think keeping a
>   split in terms of callbacks makes some sense too.

Hah, interesting. I was under the impression that FP can be done
without MM, if frame is preempted it just gets scrambled (bad FCS 
gets injected or a special symbol) and dropped by the receiver.
I had it completely backwards, then.

> > When we have separate set of stats for pMAC the normal stats are sum of
> > all traffic, right? So normal - pMAC == eMAC, everything that's not
> > preemptible is express?  
> 
> Actually not quite, or at least not for the LS1028A ENETC and Felix switch.
> The normal counters report just what the eMAC sees, and the pMAC counters
> just what the pMAC sees. After all, only the eMAC was enabled up until now.
> Nobody does the addition currently.

I see. And the netdev stats are the total?

> > Did you consider adding an attribute for switching between MAC and pMAC
> > for stats rather than duplicating things?  
> 
> No. Could you expand on that idea a little? Add a netlink attribute
> where, and this helps reduce duplication where, and how?

Add a attribute to ETHTOOL_MSG_STATS_GET, let's call it
ETHTOOL_A_STATS_EXPRESS, a flag.

Plumb thru to all the stats callback an extra argument 
(a structure for future extensibility) with a bool pMAC;

Add a capability field to ethtool_ops to announce that
driver will pay attention to the bool pMAC / has support.

We can then use the existing callbacks.

Am I making sense?
Vinicius Costa Gomes Aug. 17, 2022, 10:47 p.m. UTC | #4
Hi Vladimir,

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> Vinicius' progress on upstreaming frame preemption support for Intel I226
> seemed to stall, so I decided to give it a go using my own view as well.
> https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

I was stuck with some internal projects (and some other things) for some
time which left me with very little energy/time to follow up with that
series.

Just let's say that your timing was very good, a few more days I would
have sent another version. I am kind of glad that you decided to take
this torch.

>
> Please don't take this patch set too seriously; I spent only a few
> days working on this, and I'm only posting it as RFC to inform others
> I've started doing this, before I spend too much time to risk colliding
> with someone else's active work.
>
> Compared to Vinicius' previous patches, this is basically a new
> implementation, with the following differences:
>
> - The MAC Merge (mm) and Frame Preemption (fp) settings are split like
>   they were in Vinicius' proposal to have fp as part of tc-taprio. But
>   in my proposal, the fp portion is still part of ethtool, like mm.
>

I have some questions/comments about this part. Mostly related to
"prios" in this context. Will make them in the UAPI patches.

> - We have statistics, actually 2 kinds. First we have MAC merge layer
>   stats, which are exposed as protocol-specific stats:
>
>   ethtool --json --include-statistics --show-mm eno2
>   [ {
>           "ifname": "eno2",
>           "verify-status": "SUCCEEDED",
>           "verify-time": 10,
>           "supported": true,
>           "enabled": true,
>           "active": true,
>           "add-frag-size": 0,
>           "statistics": {
>               "MACMergeFrameAssErrorCount": 0,
>               "MACMergeFrameSmdErrorCount": 0,
>               "MACMergeFrameAssOkCount": 0,
>               "MACMergeFragCountRx": 0,
>               "MACMergeFragCountTx": 0,
>               "MACMergeHoldCount": 0
>           }
>       } ]
>
>   and then we also have the usual standardized statistics counters, but
>   replicated for the pMAC:
>
>   ethtool -S eno0 --groups pmac-rmon
>   Standard stats for eno0:
>   pmac-rmon-etherStatsUndersizePkts: 0
>   pmac-rmon-etherStatsOversizePkts: 0
>   pmac-rmon-etherStatsFragments: 0
>   pmac-rmon-etherStatsJabbers: 0
>   rx-pmac-rmon-etherStatsPkts64to64Octets: 0
>   rx-pmac-rmon-etherStatsPkts65to127Octets: 0
>   rx-pmac-rmon-etherStatsPkts128to255Octets: 0
>   rx-pmac-rmon-etherStatsPkts256to511Octets: 0
>   rx-pmac-rmon-etherStatsPkts512to1023Octets: 0
>   rx-pmac-rmon-etherStatsPkts1024to1522Octets: 0
>   rx-pmac-rmon-etherStatsPkts1523to9000Octets: 0
>   tx-pmac-rmon-etherStatsPkts64to64Octets: 0
>   tx-pmac-rmon-etherStatsPkts65to127Octets: 0
>   tx-pmac-rmon-etherStatsPkts128to255Octets: 0
>   tx-pmac-rmon-etherStatsPkts256to511Octets: 0
>   tx-pmac-rmon-etherStatsPkts512to1023Octets: 0
>   tx-pmac-rmon-etherStatsPkts1024to1522Octets: 0
>   tx-pmac-rmon-etherStatsPkts1523to9000Octets: 0
>
>   ethtool -S eno0 --groups eth-pmac-mac
>   Standard stats for eno0:
>   eth-pmac-mac-FramesTransmittedOK: 0
>   eth-pmac-mac-SingleCollisionFrames: 0
>   eth-pmac-mac-MultipleCollisionFrames: 0
>   eth-pmac-mac-FramesReceivedOK: 0
>   eth-pmac-mac-FrameCheckSequenceErrors: 0
>   eth-pmac-mac-AlignmentErrors: 0
>   eth-pmac-mac-OctetsTransmittedOK: 0
>   eth-pmac-mac-FramesWithDeferredXmissions: 0
>   eth-pmac-mac-LateCollisions: 0
>   eth-pmac-mac-FramesAbortedDueToXSColls: 0
>   eth-pmac-mac-FramesLostDueToIntMACXmitError: 0
>   eth-pmac-mac-CarrierSenseErrors: 0
>   eth-pmac-mac-OctetsReceivedOK: 0
>   eth-pmac-mac-FramesLostDueToIntMACRcvError: 0
>   eth-pmac-mac-MulticastFramesXmittedOK: 0
>   eth-pmac-mac-BroadcastFramesXmittedOK: 0
>   eth-pmac-mac-MulticastFramesReceivedOK: 0
>   eth-pmac-mac-BroadcastFramesReceivedOK: 0
>
>   ethtool -S eno0 --groups eth-pmac-ctrl
>   Standard stats for eno0:
>   eth-pmac-ctrl-MACControlFramesTransmitted: 0
>   eth-pmac-ctrl-MACControlFramesReceived: 0
>
>   What also exists but is not exported here are PAUSE stats for the
>   pMAC. Since those are also protocol-specific stats, I'm not sure how
>   to mix the 2 (MAC Merge layer + PAUSE). Maybe just extend
>   ETHTOOL_A_PAUSE_STAT_TX_FRAMES and ETHTOOL_A_PAUSE_STAT_RX_FRAMES with
>   the pMAC variants?
>
> - Finally, the hardware I'm working with (here, the test vehicle is the
>   NXP ENETC from LS1028A, although I have patches for the Felix switch
>   as well, but those need a bit of a revolution in the driver to go in
>   first). This hardware is not without its flaws, but at least allows me
>   to concentrate on the UAPI portions for this series.
>
> I also have a kselftest written, but it's for the Felix switch (covers
> forwarding latency) and so it's not included here.
>
> Are there objections in exposing the UAPI for this new feature in this way?
>

I really liked the statistics part, even though the hardware I am
working right now with doesn't provide all of them.

> Also, there is no documentation associated with this patch set, other
> than the code. Life is too short to write documentation for an RFC, sorry.
> I may get kdoc related kernel bot warnings because I copy-pasted ethtool
> structure definitions from here and there, but I didn't fill in the
> descriptions of all their fields. All those fields are as truthful to
> the standards as possible rather than my own variables or names, so
> please refer to those specs for now.
>
> Vladimir Oltean (7):
>   net: ethtool: netlink: introduce ethnl_update_bool()
>   net: ethtool: add support for Frame Preemption and MAC Merge layer
>   net: ethtool: stats: make stats_put_stats() take input from multiple
>     sources
>   net: ethtool: stats: replicate standardized counters for the pMAC
>   net: enetc: parameterize port MAC stats to also cover the pMAC
>   net: enetc: expose some standardized ethtool counters
>   net: enetc: add support for Frame Preemption and MAC Merge layer
>
>  .../ethernet/freescale/enetc/enetc_ethtool.c  | 399 +++++++++++++++---
>  .../net/ethernet/freescale/enetc/enetc_hw.h   | 132 +++---
>  include/linux/ethtool.h                       |  68 +++
>  include/uapi/linux/ethtool.h                  |  15 +
>  include/uapi/linux/ethtool_netlink.h          |  86 ++++
>  net/ethtool/Makefile                          |   3 +-
>  net/ethtool/fp.c                              | 295 +++++++++++++
>  net/ethtool/mm.c                              | 228 ++++++++++
>  net/ethtool/netlink.c                         |  38 ++
>  net/ethtool/netlink.h                         |  34 ++
>  net/ethtool/stats.c                           | 218 +++++++---
>  11 files changed, 1338 insertions(+), 178 deletions(-)
>  create mode 100644 net/ethtool/fp.c
>  create mode 100644 net/ethtool/mm.c
>
> -- 
> 2.34.1
>


Cheers,
Kurt Kanzenbach Aug. 19, 2022, 8:16 a.m. UTC | #5
On Wed Aug 17 2022, Vladimir Oltean wrote:
> Vinicius' progress on upstreaming frame preemption support for Intel I226
> seemed to stall, so I decided to give it a go using my own view as well.
> https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

Great to see progress on FPE :-).

[snip]

> - Finally, the hardware I'm working with (here, the test vehicle is the
>   NXP ENETC from LS1028A, although I have patches for the Felix switch
>   as well, but those need a bit of a revolution in the driver to go in
>   first). This hardware is not without its flaws, but at least allows me
>   to concentrate on the UAPI portions for this series.
>
> I also have a kselftest written, but it's for the Felix switch (covers
> forwarding latency) and so it's not included here.

What kind of selftest did you implement? So far I've been doing this:
Using a cyclic real time application to create high priority frames and
running iperf3 in parallel to simulate low priority traffic
constantly. Afterwards, checking the NIC statistics for fragments and so
on. Also checking the latency of the RT frames with FPE on/off.

BTW, if you guys need help with testing of patches, i do have access to
i225 and stmmacs which both support FPE. Also the Hirschmann switches
should support it.

Thanks,
Kurt
Vladimir Oltean Aug. 19, 2022, 4:59 p.m. UTC | #6
Hi Kurt,

On Fri, Aug 19, 2022 at 10:16:20AM +0200, Kurt Kanzenbach wrote:
> On Wed Aug 17 2022, Vladimir Oltean wrote:
> > Vinicius' progress on upstreaming frame preemption support for Intel I226
> > seemed to stall, so I decided to give it a go using my own view as well.
> > https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/
> 
> Great to see progress on FPE :-).

Let's hope it lasts ;)

> > - Finally, the hardware I'm working with (here, the test vehicle is the
> >   NXP ENETC from LS1028A, although I have patches for the Felix switch
> >   as well, but those need a bit of a revolution in the driver to go in
> >   first). This hardware is not without its flaws, but at least allows me
> >   to concentrate on the UAPI portions for this series.
> >
> > I also have a kselftest written, but it's for the Felix switch (covers
> > forwarding latency) and so it's not included here.
> 
> What kind of selftest did you implement? So far I've been doing this:
> Using a cyclic real time application to create high priority frames and
> running iperf3 in parallel to simulate low priority traffic
> constantly. Afterwards, checking the NIC statistics for fragments and so
> on. Also checking the latency of the RT frames with FPE on/off.
> 
> BTW, if you guys need help with testing of patches, i do have access to
> i225 and stmmacs which both support FPE. Also the Hirschmann switches
> should support it.

Blah, I didn't want to spoil the surprise just yet. I am orchestrating 2
isochron senders at specific times, one of PT traffic and one of ET.

There are actually 2 variants of this: one is for endpoint FP and the
other is for bridge FP. I only had time to convert the bridge FP to
kselftest format; not the endpoint one (for enetc).

In the endpoint case, interference is created on the sender interface.
I compare HW TX timestamps to the expected TX times to calculate how
long it took until PT was preempted. I repeat the test millions of times
until I can plot the latencies having the PT <-> ET base time offset on
the X axis. It looks very cool.

The bridge case is similar, except for the fact that interference is
created on a bridge port going to a common receiver of 2 isochron
senders. What I plot is the path delay, and again, this shows actual
preemption times with a nanosecond resolution.

Here's a small snapshot of the main script. Not shown are the supporting
scripts for this:

#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
# Copyright 2022 NXP
#
# Selftest for Frame Preemption on a switch. Creates controlled packet
# collisions between a priority configured as preemptable traffic (PT) and a
# priority configured as express traffic (ET).
#
#                           +-----------------------------+
#                           |            fp.sh            |
#                           |             br0             |
#                           |     +--------+--------+     |
#                           |     |        |        |     |
#                           |   $swp1    $swp2    $swp3   |
#                           +-----------------------------+
#   +--------------------+        ^        ^        |       +----------------+
#   |                    |        |        |        |       |                |
#   | fp_node_et_send.sh |--------+        |        +------>| fp_node_rcv.sh |
#   |        $h1         |                 |                |       $h3      |
#   +--------------------+                 |                +----------------+
#                                          |
#   +--------------------+                 |
#   |                    |                 |
#   | fp_node_pt_send.sh |-----------------+
#   |        $h2         |
#   +--------------------+
#
# Normally, a packet collision plot with no frame preemption looks as follows.
# The first packet starts being transmitted by the MAC, and the second packet
# sees a latency proportional to how much transmission time of the previous
# packet there is left.
#
#        ^
#   ET   |
# frame  |        ---.
#  path  |       /    --.
# delay  |       |       --.
#        |       /          --.
#        |       |             --.
#        |      /                 --.
#        |      |                    ---.
#        |-----/                         -----------------------
#        |
#     ---+----------------------------------------------------------->
#        |      ET base time offset relative to PT
#
# Depending on the preemption point (where the express packet hits the
# preemptable packet that is undergoing transmission), the express packet will
# have a higher or lower path delay (latency).
#
# The packet collision can be broken into 3 distinct intervals:
# (a) when the packets start interfering, but not enough of the PT packet has
#     been transmitted yet so as to preempt it (MAC merge fragments must still
#     obey the minimum Ethernet frame size rule). There are no frame
#     preemptions when packets collide here.
# (b) the preemptable area, where the latency should be equal to the
#     transmission time of a MAC merge fragment rather than a full packet
#     (preemption should happen right away).
# (c) where the packets still interfere, but there are not enough bytes of the
#     PT frame left to transmit so as to construct a valid fragment out of the
#     remainder. No preemptions are expected in this region.
#
#        ^
#   ET   |      a         b            c
# frame  |   <----><---------------><---->
#  path  |
# delay  |
#        |
#        |       /|
#        |      /  -----------------.
#        |      |                    ---.
#        |-----/                         -----------------------
#        |
#     ---+----------------------------------------------------------->
#        |      ET base time offset relative to PT
#
# The selftest evaluates the preemption performance of a given switch by
# plotting the ET latency as a function of the ET base time offset.

WAIT_TIME=1
NUM_NETIFS=3
NETIF_CREATE=no
REQUIRE_MZ=no
lib_dir=$(dirname $0)/../../../net/forwarding
source $lib_dir/lib.sh
source $lib_dir/tsn_lib.sh
source $(dirname $0)/fp_topology.sh

require_command gnuplot

swp1=${NETIFS[p1]}
swp2=${NETIFS[p2]}
swp3=${NETIFS[p3]}

signal_received=false

ethtool_save_fp_admin_status()
{
	admin_status=$(ethtool --json --show-fp $1 | \
		jq "(.[].\"parameter-table\"[] | select(.prio == $2)).\"admin-status\"")
}

ethtool_restore_fp_admin_status()
{
	case $admin_status in
	express)
		ethtool --set-fp $1 admin-status $2:E
		;;
	preemptable)
		ethtool --set-fp $1 admin-status $2:P
		;;
	esac
}

avg_pdelay()
{
	local file=$1
	local tmp=$(mktemp)

	printf "print((" > $tmp
	isochron report \
		--input-file "${file}" \
		--printf-format "print(\"{} + \".format(%d - %d), end='')\n" \
		--printf-args "RT" | python3 - >> $tmp
	printf "0) / $NUM_FRAMES)" >> $tmp
	cat $tmp | python3 -
	rm -f $tmp
}

validate_coordination()
{
	local et=$1; shift
	local pt=$1; shift
	local et_start
	local pt_start

	et_start=$(($(isochron report --input-file $et \
		--printf-format "%u - %u" --printf-args "SB" --stop 1)))
	pt_start=$(($(isochron report --input-file $pt \
		--printf-format "%u - %u" --printf-args "SB" --stop 1)))
	if ! [ $et_start = $pt_start ]; then
		printf "ET test %s / PT test %s: ET starts at %s, PT starts at %s\n" \
			"$et" "$pt" \
			"$(isochron report --input-file $et \
				--printf-format "%T" --printf-args "S" --stop 1)" \
			"$(isochron report --input-file $pt \
				--printf-format "%T" --printf-args "S" --stop 1)"
		exit 1
	fi
}

plot()
{
	local pt_frame_size=$1; shift
	local num_tests=$1; shift
	local fp_label=$1; shift
	local pt_svg="plots/pt-${pt_frame_size}-${fp_label}-pt.svg"
	local pt_plot="plots/pt-${pt_frame_size}-${fp_label}-pt.plot"
	local et_svg="plots/pt-${pt_frame_size}-${fp_label}-et.svg"
	local et_plot="plots/pt-${pt_frame_size}-${fp_label}-et.plot"
	local pt_avg_pdelay
	local et_avg_pdelay
	local base_time
	local et
	local pt
	local i

	printf "Plotting charts for collision between PT@%d and ET@%d\n" \
		"${pt_frame_size}" "${ET_FRAME_SIZE}"

	mkdir -p plots
	rm -f $et_plot $pt_plot

	for ((i = 0; i < num_tests; i++)); do
		et="reports/pt-${pt_frame_size}-test-${i}-${fp_label}-et.dat"
		pt="reports/pt-${pt_frame_size}-test-${i}-${fp_label}-pt.dat"

		base_time=$(isochron report --input-file "${et}" \
				--printf-format "%u" --printf-args "B" \
				--stop 1)

		pt_avg_pdelay=$(avg_pdelay $pt)
		et_avg_pdelay=$(avg_pdelay $et)

		echo "$base_time $pt_avg_pdelay" >> $pt_plot
		echo "$base_time $et_avg_pdelay" >> $et_plot

		if [ "$signal_received" = "true" ]; then
			exit 1
		fi
	done

	gnuplot --persist <<- EOF
		set xlabel "Base time offset (ns)"
		set ylabel "Average PT path delay (ns)"
		set term svg
		set terminal svg enhanced background rgb "white"
		set output "${pt_svg}"
		plot "${pt_plot}" using 1:2 with lines title "PT latency"
	EOF

	gnuplot --persist <<- EOF
		set xlabel "Base time offset (ns)"
		set ylabel "Average ET path delay (ns)"
		set term svg
		set terminal svg enhanced background rgb "white"
		set output "${et_svg}"
		plot "${et_plot}" using 1:2 with lines title "ET latency"
	EOF

	rm -f ${et_plot} ${pt_plot}
}

test_collision()
{
	local test_num=$1; shift
	local pt_frame_size=$1; shift
	local base_time_offset=$1; shift
	local fp_label=$1; shift
	local vrf_name=$(master_name_get br0)
	local et="reports/pt-${pt_frame_size}-test-${test_num}-${fp_label}-et.dat"
	local pt="reports/pt-${pt_frame_size}-test-${test_num}-${fp_label}-pt.dat"
	local orchestration=$(mktemp)
	local et_extra_args
	local pt_extra_args

	if [ "${H1_REQUIRE_PTP4L}" = "no" ]; then
		et_extra_args="${et_extra_args} --omit-sync"
	fi

	if [ "${H2_REQUIRE_PTP4L}" = "no" ]; then
		pt_extra_args="${pt_extra_args} --omit-sync"
	fi

	if [ "${H3_REQUIRE_PTP4L}" = "no" ]; then
		et_extra_args="${et_extra_args} --omit-remote-sync"
		pt_extra_args="${pt_extra_args} --omit-remote-sync"
	fi

	mkdir -p reports

	printf "Collision between PT packets of size %d and ET packets of size %d at %s, test %d (base time offset %d)\n" \
		${pt_frame_size} ${ET_FRAME_SIZE} ${fp_label} ${test_num} ${base_time_offset}

	cat <<- EOF > ${orchestration}
	[ET]
	host = $H1_IPV4%$vrf_name
	port = $H1_PORT
	exec = isochron send \\
		--client $H3_IPV4%$vrf_name \\
		--stats-port $H3_ET_PORT \\
		--interface $ET_IF_NAME \\
		--unix-domain-socket $UDS_ADDRESS_H1 \\
		--num-frames $NUM_FRAMES \\
		--priority $ET_PRIO \\
		--vid 0 \\
		--base-time ${base_time_offset} \\
		--cycle-time ${CYCLE_TIME_NS} \\
		--frame-size $ET_FRAME_SIZE \\
		--sync-threshold $SYNC_THRESHOLD_NS \\
		--cpu-mask $((1 << ${ISOCHRON_CPU})) \\
		--sched-fifo \\
		--sched-priority 98 \\
		--etype 0xdead \\
		--txtime \\
		${et_extra_args} \\
		--output-file ${et}

	[PT]
	host = $H2_IPV4%$vrf_name
	port = $H2_PORT
	exec = isochron send \\
		--client $H3_IPV4%$vrf_name \\
		--stats-port $H3_PT_PORT \\
		--interface $PT_IF_NAME \\
		--unix-domain-socket $UDS_ADDRESS_H2 \\
		--num-frames $NUM_FRAMES \\
		--priority $PT_PRIO \\
		--vid 0 \\
		--base-time 0.000000000 \\
		--cycle-time ${CYCLE_TIME_NS} \\
		--frame-size $pt_frame_size \\
		--sync-threshold $SYNC_THRESHOLD_NS \\
		--cpu-mask $((1 << ${ISOCHRON_CPU})) \\
		--sched-fifo \\
		--sched-priority 98 \\
		--etype 0xdeaf \\
		--txtime \\
		${pt_extra_args} \\
		--output-file ${pt}
	EOF

	isochron orchestrate --input-file ${orchestration}

	rm -rf ${orchestration}

	validate_coordination $et $pt
}

test_collision_sweep_base_time()
{
	local pt_frame_size=$1; shift
	local base_time_start=$1; shift
	local base_time_stop=$1; shift
	local base_time_increment=$1; shift
	local fp_label=$1; shift
	local num_tests
	local i

	num_tests=$(((base_time_stop - base_time_start + 1) / base_time_increment))
	for ((i = 0; i < num_tests; i++)); do
		base_time=$((base_time_start + i * base_time_increment))

		test_collision "$i" "$pt_frame_size" "$base_time" "$fp_label"

		if [ "$signal_received" = "true" ]; then
			exit 1
		fi
	done

	plot $pt_frame_size $num_tests $fp_label
}

test_pt_124()
{
	local start=$1; shift
	local stop=$1; shift
	local fp_label=$1; shift

	test_collision_sweep_base_time 124 $start $stop 40 $fp_label
}

test_pt_300()
{
	local start=$1; shift
	local stop=$1; shift
	local fp_label=$1; shift

	test_collision_sweep_base_time 300 $start $stop 40 $fp_label
}

test_pt_600()
{
	local start=$1; shift
	local stop=$1; shift
	local fp_label=$1; shift

	test_collision_sweep_base_time 600 $start $stop 40 $fp_label
}

test_pt_1000()
{
	local start=$1; shift
	local stop=$1; shift
	local fp_label=$1; shift

	test_collision_sweep_base_time 1000 $start $stop 40 $fp_label
}

test_pt_1500()
{
	local start=$1; shift
	local stop=$1; shift
	local fp_label=$1

	test_collision_sweep_base_time 1500 $start $stop 40 $fp_label
}

test_no_fp_cut_through()
{
	local fp_label="no-fp"

	test_pt_124 0 1800 $fp_label
	test_pt_300 0 3400 $fp_label
	test_pt_600 0 7000 $fp_label
	test_pt_1000 1500 12000 $fp_label
	test_pt_1500 3000 17000 $fp_label
}

test_fp_quick()
{
	ethtool_save_fp_admin_status $swp3 $PT_PRIO
	ethtool --set-mm $swp3 verify-disable off enabled on add-frag-size 0
	ethtool --set-fp $swp3 admin-status $PT_PRIO:P

	test_pt_1500 12000 26000 "quick"

	ethtool_restore_fp_admin_status $swp3 $PT_PRIO
}

test_fp()
{
	local fp_label=$1; shift
	local add_frag_size=$1; shift

	ethtool_save_fp_admin_status $swp3 $PT_PRIO
	ethtool --set-mm $swp3 verify-disable off enabled on add-frag-size $add_frag_size
	ethtool --set-fp $swp3 admin-status $PT_PRIO:P

	test_pt_124 200 2000 $fp_label
	test_pt_300 1500 5000 $fp_label
	test_pt_600 4500 10000 $fp_label
	test_pt_1000 8800 17000 $fp_label
	test_pt_1500 12000 26000 $fp_label

	ethtool_restore_fp_admin_status $swp3 $PT_PRIO
}

test_fp_add_frag_size_0()
{
	test_fp "add-frag-size-0" 0
}

test_fp_add_frag_size_1()
{
	test_fp "add-frag-size-1" 1
}

test_fp_add_frag_size_2()
{
	test_fp "add-frag-size-2" 2
}

test_fp_add_frag_size_3()
{
	test_fp "add-frag-size-2" 3
}

setup_prepare()
{
	vrf_prepare

	switch_create
	ptp4l_start "$swp1 $swp2 $swp3" false ${UDS_ADDRESS_SWITCH}
	phc2sys_start ${UDS_ADDRESS_SWITCH}
}

cleanup()
{
	pre_cleanup

	ptp4l_stop "$swp1 $swp2 $swp3"
	phc2sys_stop
	switch_destroy

	vrf_cleanup
}

signal()
{
	signal_received=true
}

trap cleanup EXIT
trap signal SIGTERM
trap signal SIGINT

#	test_no_fp_cut_through
#	test_fp_add_frag_size_0
#	test_fp_add_frag_size_1
#	test_fp_add_frag_size_2
#	test_fp_add_frag_size_3
ALL_TESTS="
	test_fp_quick
"

setup_prepare
setup_wait

tests_run

exit $EXIT_STATUS
Kurt Kanzenbach Aug. 23, 2022, 10:50 a.m. UTC | #7
On Fri Aug 19 2022, Vladimir Oltean wrote:
> Hi Kurt,
>
> On Fri, Aug 19, 2022 at 10:16:20AM +0200, Kurt Kanzenbach wrote:
>> On Wed Aug 17 2022, Vladimir Oltean wrote:
>> > Vinicius' progress on upstreaming frame preemption support for Intel I226
>> > seemed to stall, so I decided to give it a go using my own view as well.
>> > https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/
>> 
>> Great to see progress on FPE :-).
>
> Let's hope it lasts ;)
>
>> > - Finally, the hardware I'm working with (here, the test vehicle is the
>> >   NXP ENETC from LS1028A, although I have patches for the Felix switch
>> >   as well, but those need a bit of a revolution in the driver to go in
>> >   first). This hardware is not without its flaws, but at least allows me
>> >   to concentrate on the UAPI portions for this series.
>> >
>> > I also have a kselftest written, but it's for the Felix switch (covers
>> > forwarding latency) and so it's not included here.
>> 
>> What kind of selftest did you implement? So far I've been doing this:
>> Using a cyclic real time application to create high priority frames and
>> running iperf3 in parallel to simulate low priority traffic
>> constantly. Afterwards, checking the NIC statistics for fragments and so
>> on. Also checking the latency of the RT frames with FPE on/off.
>> 
>> BTW, if you guys need help with testing of patches, i do have access to
>> i225 and stmmacs which both support FPE. Also the Hirschmann switches
>> should support it.
>
> Blah, I didn't want to spoil the surprise just yet. I am orchestrating 2
> isochron senders at specific times, one of PT traffic and one of ET.
>
> There are actually 2 variants of this: one is for endpoint FP and the
> other is for bridge FP. I only had time to convert the bridge FP to
> kselftest format; not the endpoint one (for enetc).
>
> In the endpoint case, interference is created on the sender interface.
> I compare HW TX timestamps to the expected TX times to calculate how
> long it took until PT was preempted. I repeat the test millions of times
> until I can plot the latencies having the PT <-> ET base time offset on
> the X axis. It looks very cool.
>
> The bridge case is similar, except for the fact that interference is
> created on a bridge port going to a common receiver of 2 isochron
> senders. What I plot is the path delay, and again, this shows actual
> preemption times with a nanosecond resolution.

That makes a lot of sense and this kind of test scenario should work for
other endpoints implementations such as igc and stmmac too.

Thanks,
Kurt
Vladimir Oltean Oct. 1, 2022, 3:53 p.m. UTC | #8
On Wed, Aug 17, 2022 at 11:46:42AM -0700, Jakub Kicinski wrote:
> > > When we have separate set of stats for pMAC the normal stats are sum of
> > > all traffic, right? So normal - pMAC == eMAC, everything that's not
> > > preemptible is express?  
> > 
> > Actually not quite, or at least not for the LS1028A ENETC and Felix switch.
> > The normal counters report just what the eMAC sees, and the pMAC counters
> > just what the pMAC sees. After all, only the eMAC was enabled up until now.
> > Nobody does the addition currently.
> 
> I see. And the netdev stats are the total?

dev->stats reports the aggregate of express and preemptable packets seen
by software, yes. I got the hint though, I should also report the
aggregate. The summation seems like a generic problem which ethtool
should be able to do internally, yet a generic implementation is riddled
with problems that must be dealt with (RMON histograms reported by eMAC
and pMAC can be different; some counters could be implemented by the
eMAC but not the pMAC or vice versa, and that needs to be handled, etc).
Additionally, the summation of counters must also be done for ndo_get_stats64(),
when those come from hardware as well. So I'd incline to do it in the
driver rn.

> > > Did you consider adding an attribute for switching between MAC and pMAC
> > > for stats rather than duplicating things?  
> > 
> > No. Could you expand on that idea a little? Add a netlink attribute
> > where, and this helps reduce duplication where, and how?
> 
> Add a attribute to ETHTOOL_MSG_STATS_GET, let's call it
> ETHTOOL_A_STATS_EXPRESS, a flag.

I'll add this to the UAPI and to internal data structures, ok?

enum ethtool_stats_src {
	ETHTOOL_STATS_SRC_AGGREGATE = 0,
	ETHTOOL_STATS_SRC_EMAC,
	ETHTOOL_STATS_SRC_PMAC,
};

> Plumb thru to all the stats callback an extra argument 
> (a structure for future extensibility) with a bool pMAC;
> 
> Add a capability field to ethtool_ops to announce that
> driver will pay attention to the bool pMAC / has support.

You mean capability field as in ethtool_ops::supported_coalesce_params,
right? (we discussed about this separately).
This won't fit the enetc driver very well. Some enetc ports on the NXP
LS1028A support the MM layer (port 0, port 2) and some don't (port 1,
port 3). Yet they share the same PF driver. So populating mm_supported =
true in the const struct enetc_pf_ethtool_ops isn't going to cover both.
I can, however, key on my ethtool_ops :: get_mm_state() function which
lets the driver report a "bool supported". Is this ok?

> We can then use the existing callbacks.
> 
> Am I making sense?

Yes, thanks.
Jakub Kicinski Oct. 3, 2022, 2:36 p.m. UTC | #9
On Sat, 1 Oct 2022 15:53:38 +0000 Vladimir Oltean wrote:
> > Add a attribute to ETHTOOL_MSG_STATS_GET, let's call it
> > ETHTOOL_A_STATS_EXPRESS, a flag.  
> 
> I'll add this to the UAPI and to internal data structures, ok?
> 
> enum ethtool_stats_src {
> 	ETHTOOL_STATS_SRC_AGGREGATE = 0,
> 	ETHTOOL_STATS_SRC_EMAC,
> 	ETHTOOL_STATS_SRC_PMAC,
> };

Yup!

> > Plumb thru to all the stats callback an extra argument 
> > (a structure for future extensibility) with a bool pMAC;
> > 
> > Add a capability field to ethtool_ops to announce that
> > driver will pay attention to the bool pMAC / has support.  
> 
> You mean capability field as in ethtool_ops::supported_coalesce_params,
> right? (we discussed about this separately).
> This won't fit the enetc driver very well. Some enetc ports on the NXP
> LS1028A support the MM layer (port 0, port 2) and some don't (port 1,
> port 3). Yet they share the same PF driver. So populating mm_supported =
> true in the const struct enetc_pf_ethtool_ops isn't going to cover both.
> I can, however, key on my ethtool_ops :: get_mm_state() function which
> lets the driver report a "bool supported". Is this ok?

That happens, I think about the capability in the ops as driver caps
rather than HW caps. The driver can still return -EOPNOTSUPP, but it
guarantees to check the field's value. 

Most (all but one) datacenter NIC vendors have uber-drivers for all
their HW generations these days, static per-driver caps can't map to 
HW caps in my world.

So weak preference for sticking to that model to avoid confusion about
the semantics of existing caps vs caps which should use a function call.
Vladimir Oltean Oct. 3, 2022, 2:51 p.m. UTC | #10
On Mon, Oct 03, 2022 at 07:36:03AM -0700, Jakub Kicinski wrote:
> On Sat, 1 Oct 2022 15:53:38 +0000 Vladimir Oltean wrote:
> > > Add a attribute to ETHTOOL_MSG_STATS_GET, let's call it
> > > ETHTOOL_A_STATS_EXPRESS, a flag.  
> > 
> > I'll add this to the UAPI and to internal data structures, ok?
> > 
> > enum ethtool_stats_src {
> > 	ETHTOOL_STATS_SRC_AGGREGATE = 0,
> > 	ETHTOOL_STATS_SRC_EMAC,
> > 	ETHTOOL_STATS_SRC_PMAC,
> > };
> 
> Yup!

Ok. I've also added enum ethtool_stats_src as the first member of struct
ethtool_eth_mac_stats, ethtool_eth_phy_stats, ethtool_eth_ctrl_stats,
ethtool_pause_stats, ethtool_rmon_stats. So I am not adding an extra
argument (another "structure for future extensibility" as you wrote
below). Hope that's ok.

> > > Plumb thru to all the stats callback an extra argument 
> > > (a structure for future extensibility) with a bool pMAC;
> > > 
> > > Add a capability field to ethtool_ops to announce that
> > > driver will pay attention to the bool pMAC / has support.  
> > 
> > You mean capability field as in ethtool_ops::supported_coalesce_params,
> > right? (we discussed about this separately).
> > This won't fit the enetc driver very well. Some enetc ports on the NXP
> > LS1028A support the MM layer (port 0, port 2) and some don't (port 1,
> > port 3). Yet they share the same PF driver. So populating mm_supported =
> > true in the const struct enetc_pf_ethtool_ops isn't going to cover both.
> > I can, however, key on my ethtool_ops :: get_mm_state() function which
> > lets the driver report a "bool supported". Is this ok?
> 
> That happens, I think about the capability in the ops as driver caps
> rather than HW caps. The driver can still return -EOPNOTSUPP, but it
> guarantees to check the field's value. 

The stats callbacks return void. We'd be relying on the ETHTOOL_STAT_NOT_SET value.

> 
> Most (all but one) datacenter NIC vendors have uber-drivers for all
> their HW generations these days, static per-driver caps can't map to 
> HW caps in my world.
> 
> So weak preference for sticking to that model to avoid confusion about
> the semantics of existing caps vs caps which should use a function call.

An even bigger uber-driver is DSA, with its own dsa_slave_ethtool_ops.
If I put "supported_mm" in ethtool_ops, and set it to true in DSA,
I become responsible for rejecting everything except ETHTOOL_STATS_SRC_AGGREGATE
for all DSA drivers, which I'd rather not do. Alternatively, I put it to
false in DSA and I won't have pMAC stats callbacks getting called even
if I do support a pMAC. Maybe DSA isn't even the only one in this situation.