
[net-next,v2,2/3] net: dsa: add Arrow SpeedChips XRS700x driver

Message ID 20201125193740.36825-3-george.mccollister@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series Arrow SpeedChips XRS700x DSA Driver

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 1 this patch: 12
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch warning
  CHECK: Prefer kernel type 'u64' over 'uint64_t'
  WARNING: DT compatible string "arrow,xrs7003e" appears un-documented -- check ./Documentation/devicetree/bindings/
  WARNING: DT compatible string "arrow,xrs7003f" appears un-documented -- check ./Documentation/devicetree/bindings/
  WARNING: DT compatible string "arrow,xrs7004e" appears un-documented -- check ./Documentation/devicetree/bindings/
  WARNING: DT compatible string "arrow,xrs7004f" appears un-documented -- check ./Documentation/devicetree/bindings/
  WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
  WARNING: line length of 81 exceeds 80 columns
  WARNING: please write a paragraph that describes the config symbol fully
netdev/build_allmodconfig_warn fail Errors and warnings before: 1 this patch: 12
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

George McCollister Nov. 25, 2020, 7:37 p.m. UTC
Add a driver with initial support for the Arrow SpeedChips XRS7000
series of gigabit Ethernet switch chips, which are typically used in
critical networking applications.

The switches have up to three RGMII ports and one RMII port.
Management of the switches can be performed over I2C or MDIO.

Support for advanced features such as PTP and
HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
may be added at a later date.

Signed-off-by: George McCollister <george.mccollister@gmail.com>
---
 drivers/net/dsa/Kconfig                |   2 +
 drivers/net/dsa/Makefile               |   1 +
 drivers/net/dsa/xrs700x/Kconfig        |  26 ++
 drivers/net/dsa/xrs700x/Makefile       |   4 +
 drivers/net/dsa/xrs700x/xrs700x.c      | 585 +++++++++++++++++++++++++++++++++
 drivers/net/dsa/xrs700x/xrs700x.h      |  38 +++
 drivers/net/dsa/xrs700x/xrs700x_i2c.c  | 150 +++++++++
 drivers/net/dsa/xrs700x/xrs700x_mdio.c | 162 +++++++++
 drivers/net/dsa/xrs700x/xrs700x_reg.h  | 205 ++++++++++++
 9 files changed, 1173 insertions(+)
 create mode 100644 drivers/net/dsa/xrs700x/Kconfig
 create mode 100644 drivers/net/dsa/xrs700x/Makefile
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x.h
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_i2c.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_mdio.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_reg.h

Comments

Jakub Kicinski Nov. 26, 2020, 1:42 a.m. UTC | #1
On Wed, 25 Nov 2020 13:37:39 -0600 George McCollister wrote:
> Add a driver with initial support for the Arrow SpeedChips XRS7000
> series of gigabit Ethernet switch chips which are typically used in
> critical networking applications.
> 
> The switches have up to three RGMII ports and one RMII port.
> Management to the switches can be performed over i2c or mdio.
> 
> Support for advanced features such as PTP and
> HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
> may be added at a later date.
> 
> Signed-off-by: George McCollister <george.mccollister@gmail.com>

You need to add symbol exports otherwise this won't build with
allmodconfig:

ERROR: modpost: "xrs7004f_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7003f_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7004f_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7003f_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
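
For illustration, the fix amounts to exporting each chip-info symbol
next to its definition in xrs700x.c; a sketch only (the struct type
name here is an assumption, the symbol names come from the errors
above):

```c
/* xrs700x.c -- sketch; "struct xrs700x_info" is assumed, the
 * symbol names are taken from the modpost errors above. */
const struct xrs700x_info xrs7003e_info = {
	/* ... chip-specific data ... */
};
EXPORT_SYMBOL(xrs7003e_info);
/* likewise for xrs7003f_info, xrs7004e_info and xrs7004f_info;
 * EXPORT_SYMBOL_GPL() would also work if that licensing is wanted */
```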

> +	{XRS_RX_UNDERSIZE_L, "rx_undersize"},
> +	{XRS_RX_FRAGMENTS_L, "rx_fragments"},
> +	{XRS_RX_OVERSIZE_L, "rx_oversize"},
> +	{XRS_RX_JABBER_L, "rx_jabber"},
> +	{XRS_RX_ERR_L, "rx_err"},
> +	{XRS_RX_CRC_L, "rx_crc"},

As Vladimir already mentioned to you the statistics which have
corresponding entries in struct rtnl_link_stats64 should be reported
the standard way. The infra for DSA may not be in place yet, so best 
if you just drop those for now.
George McCollister Nov. 26, 2020, 2:25 a.m. UTC | #2
On Wed, Nov 25, 2020 at 7:42 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 25 Nov 2020 13:37:39 -0600 George McCollister wrote:
> > Add a driver with initial support for the Arrow SpeedChips XRS7000
> > series of gigabit Ethernet switch chips which are typically used in
> > critical networking applications.
> >
> > The switches have up to three RGMII ports and one RMII port.
> > Management to the switches can be performed over i2c or mdio.
> >
> > Support for advanced features such as PTP and
> > HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
> > may be added at a later date.
> >
> > Signed-off-by: George McCollister <george.mccollister@gmail.com>
>
> You need to add symbol exports otherwise this won't build with
> allmodconfig:
>
> ERROR: modpost: "xrs7004f_info"
> [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined! ERROR: modpost:
> "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
> ERROR: modpost: "xrs7003f_info"
> [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined! ERROR: modpost:
> "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
> ERROR: modpost: "xrs7004f_info"
> [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined! ERROR: modpost:
> "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
> ERROR: modpost: "xrs7003f_info"
> [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined! ERROR: modpost:
> "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!

I was wondering if I possibly needed to do that but wasn't getting any
errors the way I was building.

>
> > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > +     {XRS_RX_ERR_L, "rx_err"},
> > +     {XRS_RX_CRC_L, "rx_crc"},
>
> As Vladimir already mentioned to you the statistics which have
> corresponding entries in struct rtnl_link_stats64 should be reported
> the standard way. The infra for DSA may not be in place yet, so best
> if you just drop those for now.

Okay, that clears it up a bit. Just drop these 6? I'll read through
that thread again and try to make sense of it.

Thanks
Vladimir Oltean Nov. 26, 2020, 1:24 p.m. UTC | #3
On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > +     {XRS_RX_ERR_L, "rx_err"},
> > > +     {XRS_RX_CRC_L, "rx_crc"},
> >
> > As Vladimir already mentioned to you the statistics which have
> > corresponding entries in struct rtnl_link_stats64 should be reported
> > the standard way. The infra for DSA may not be in place yet, so best
> > if you just drop those for now.
> 
> Okay, that clears it up a bit. Just drop these 6? I'll read through
> that thread again and try to make sense of it.

I feel that I should ask. Do you want me to look into exposing RMON
interface counters through rtnetlink (I've never done anything like that
before either, but there's a beginning for everything), or are you going
to?
Vladimir Oltean Nov. 26, 2020, 5:56 p.m. UTC | #4
On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > >
> > > As Vladimir already mentioned to you the statistics which have
> > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > the standard way. The infra for DSA may not be in place yet, so best
> > > if you just drop those for now.
> >
> > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > that thread again and try to make sense of it.
>
> I feel that I should ask. Do you want me to look into exposing RMON
> interface counters through rtnetlink (I've never done anything like that
> before either, but there's a beginning for everything), or are you going
> to?

So I started to add .ndo_get_stats64 based on the hardware counters, but
I already hit the first roadblock, as described by the wise words of
Documentation/networking/statistics.rst:

| The `.ndo_get_stats64` callback can not sleep because of accesses
| via `/proc/net/dev`. If driver may sleep when retrieving the statistics
| from the device it should do so periodically asynchronously and only return
| a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
| allows setting the frequency of refreshing statistics, if needed.


Unfortunately, I feel this is almost unacceptable for a DSA driver that
more often than not needs to retrieve these counters from a slow and
bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
option, because the only periodic interval that would not put absurdly
high pressure on the limited SPI bandwidth would be a readout interval
that gives you very old counters.

What exactly is it that incurs the atomic context? I cannot seem to
figure out from this stack trace:

[  869.692526] ------------[ cut here ]------------
[  869.697174] WARNING: CPU: 0 PID: 444 at kernel/rcu/tree_plugin.h:297 rcu_note_context_switch+0x54/0x438
[  869.706598] Modules linked in:
[  869.709662] CPU: 0 PID: 444 Comm: cat Not tainted 5.10.0-rc5-next-20201126-00006-g0598c9bbacc1-dirty #1452
[  869.724764] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[  869.730790] pc : rcu_note_context_switch+0x54/0x438
[  869.735681] lr : rcu_note_context_switch+0x44/0x438
[  869.740570] sp : ffff80001039b420
[  869.743889] x29: ffff80001039b420 x28: ffff0f7a046c8e80
[  869.749220] x27: ffffcc70e3fad000 x26: ffff80001039b9d4
[  869.754550] x25: 0000000000000000 x24: ffffcc70e27ae82c
[  869.759879] x23: ffffcc70e3f90000 x22: 0000000000000000
[  869.765208] x21: ffff0f7a02177140 x20: ffff0f7a02177140
[  869.770537] x19: ffff0f7a7b9d3bc0 x18: 00000000ffffffff
[  869.775865] x17: 0000000000000000 x16: 0000000000000000
[  869.781193] x15: 0000000000000004 x14: 0000000000000000
[  869.786523] x13: 0000000000000000 x12: 0000000000000000
[  869.791852] x11: 0000000000000000 x10: 0000000000000000
[  869.797181] x9 : ffff0f7a022d0800 x8 : 0000000000000004
[  869.802510] x7 : 0000000000000004 x6 : ffff80001039b410
[  869.807838] x5 : 0000000000000001 x4 : 0000000000000001
[  869.813168] x3 : 503c00c9a4c6a300 x2 : 0000000000000000
[  869.818496] x1 : ffffcc70e3f90b98 x0 : 0000000000000001
[  869.823826] Call trace:
[  869.826276]  rcu_note_context_switch+0x54/0x438
[  869.830819]  __schedule+0xc0/0x708
[  869.834228]  schedule+0x4c/0x108
[  869.837462]  schedule_timeout+0x1a8/0x320
[  869.841480]  wait_for_completion+0x9c/0x148
[  869.845675]  dspi_transfer_one_message+0x158/0x550
[  869.850480]  __spi_pump_messages+0x208/0x818
[  869.854760]  __spi_sync+0x2a4/0x2e0
[  869.858257]  spi_sync+0x34/0x58
[  869.861404]  spi_sync_transfer+0x94/0xb8
[  869.865337]  sja1105_xfer.isra.1+0x250/0x2e0
[  869.869618]  sja1105_xfer_buf+0x4c/0x60
[  869.873462]  sja1105_port_status_get+0x68/0x8f0
[  869.878004]  sja1105_port_get_stats64+0x58/0x100
[  869.882633]  dsa_slave_get_stats64+0x3c/0x58
[  869.886916]  dev_get_stats+0xc0/0xd8
[  869.890500]  dev_seq_printf_stats+0x44/0x118
[  869.894780]  dev_seq_show+0x30/0x60
[  869.898276]  seq_read_iter+0x330/0x450
[  869.902032]  seq_read+0xf8/0x148
[  869.905268]  proc_reg_read+0xd4/0x110
[  869.908939]  vfs_read+0xac/0x1c8
[  869.912172]  ksys_read+0x74/0xf8
[  869.915405]  __arm64_sys_read+0x24/0x30
[  869.919251]  el0_svc_common.constprop.3+0x80/0x1b0
[  869.924055]  do_el0_svc+0x34/0xa0
[  869.927378]  el0_sync_handler+0x138/0x198
[  869.931397]  el0_sync+0x140/0x180
[  869.934718] ---[ end trace fd45b387ae2c6970 ]---
George McCollister Nov. 26, 2020, 7:07 p.m. UTC | #5
On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > >
> > > > As Vladimir already mentioned to you the statistics which have
> > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > if you just drop those for now.
> > >
> > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > that thread again and try to make sense of it.
> >
> > I feel that I should ask. Do you want me to look into exposing RMON
> > interface counters through rtnetlink (I've never done anything like that
> > before either, but there's a beginning for everything), or are you going
> > to?
>
> So I started to add .ndo_get_stats64 based on the hardware counters, but
> I already hit the first roadblock, as described by the wise words of
> Documentation/networking/statistics.rst:
>
> | The `.ndo_get_stats64` callback can not sleep because of accesses
> | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> | from the device it should do so periodically asynchronously and only return
> | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> | allows setting the frequency of refreshing statistics, if needed.
>
>
> Unfortunately, I feel this is almost unacceptable for a DSA driver that
> more often than not needs to retrieve these counters from a slow and
> bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> option, because the only periodic interval that would not put absurdly
> high pressure on the limited SPI bandwidth would be a readout interval
> that gives you very old counters.

Indeed it seems ndo_get_stats64() usually gets data over something
like a local or PCIe bus or from software. I had a brief look to see
if I could find another driver that was getting the stats over a slow
bus and didn't notice anything. If you haven't already you might do a
quick grep and see if anything pops out to you.

>
> What exactly is it that incurs the atomic context? I cannot seem to
> figure out from this stack trace:

I think something in fs/seq_file.c is taking an rcu lock. I suppose it
doesn't really matter though since the documentation says we can't
sleep.

>
> [  869.692526] ------------[ cut here ]------------
> [  869.697174] WARNING: CPU: 0 PID: 444 at kernel/rcu/tree_plugin.h:297 rcu_note_context_switch+0x54/0x438
> [  869.706598] Modules linked in:
> [  869.709662] CPU: 0 PID: 444 Comm: cat Not tainted 5.10.0-rc5-next-20201126-00006-g0598c9bbacc1-dirty #1452
> [  869.724764] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
> [  869.730790] pc : rcu_note_context_switch+0x54/0x438
> [  869.735681] lr : rcu_note_context_switch+0x44/0x438
> [  869.740570] sp : ffff80001039b420
> [  869.743889] x29: ffff80001039b420 x28: ffff0f7a046c8e80
> [  869.749220] x27: ffffcc70e3fad000 x26: ffff80001039b9d4
> [  869.754550] x25: 0000000000000000 x24: ffffcc70e27ae82c
> [  869.759879] x23: ffffcc70e3f90000 x22: 0000000000000000
> [  869.765208] x21: ffff0f7a02177140 x20: ffff0f7a02177140
> [  869.770537] x19: ffff0f7a7b9d3bc0 x18: 00000000ffffffff
> [  869.775865] x17: 0000000000000000 x16: 0000000000000000
> [  869.781193] x15: 0000000000000004 x14: 0000000000000000
> [  869.786523] x13: 0000000000000000 x12: 0000000000000000
> [  869.791852] x11: 0000000000000000 x10: 0000000000000000
> [  869.797181] x9 : ffff0f7a022d0800 x8 : 0000000000000004
> [  869.802510] x7 : 0000000000000004 x6 : ffff80001039b410
> [  869.807838] x5 : 0000000000000001 x4 : 0000000000000001
> [  869.813168] x3 : 503c00c9a4c6a300 x2 : 0000000000000000
> [  869.818496] x1 : ffffcc70e3f90b98 x0 : 0000000000000001
> [  869.823826] Call trace:
> [  869.826276]  rcu_note_context_switch+0x54/0x438
> [  869.830819]  __schedule+0xc0/0x708
> [  869.834228]  schedule+0x4c/0x108
> [  869.837462]  schedule_timeout+0x1a8/0x320
> [  869.841480]  wait_for_completion+0x9c/0x148
> [  869.845675]  dspi_transfer_one_message+0x158/0x550
> [  869.850480]  __spi_pump_messages+0x208/0x818
> [  869.854760]  __spi_sync+0x2a4/0x2e0
> [  869.858257]  spi_sync+0x34/0x58
> [  869.861404]  spi_sync_transfer+0x94/0xb8
> [  869.865337]  sja1105_xfer.isra.1+0x250/0x2e0
> [  869.869618]  sja1105_xfer_buf+0x4c/0x60
> [  869.873462]  sja1105_port_status_get+0x68/0x8f0
> [  869.878004]  sja1105_port_get_stats64+0x58/0x100
> [  869.882633]  dsa_slave_get_stats64+0x3c/0x58
> [  869.886916]  dev_get_stats+0xc0/0xd8
> [  869.890500]  dev_seq_printf_stats+0x44/0x118
> [  869.894780]  dev_seq_show+0x30/0x60
> [  869.898276]  seq_read_iter+0x330/0x450
> [  869.902032]  seq_read+0xf8/0x148
> [  869.905268]  proc_reg_read+0xd4/0x110
> [  869.908939]  vfs_read+0xac/0x1c8
> [  869.912172]  ksys_read+0x74/0xf8
> [  869.915405]  __arm64_sys_read+0x24/0x30
> [  869.919251]  el0_svc_common.constprop.3+0x80/0x1b0
> [  869.924055]  do_el0_svc+0x34/0xa0
> [  869.927378]  el0_sync_handler+0x138/0x198
> [  869.931397]  el0_sync+0x140/0x180
> [  869.934718] ---[ end trace fd45b387ae2c6970 ]---

It does seem to me that this is something that needs to be sorted out
at the subsystem level and that this driver has been "caught in the
crossfire". If substantial time will be required to work this out, any
guidance on how we could proceed with this driver now and revisit this
once we have answers to these questions at the subsystem level would
be appreciated.

Thanks for taking the initiative to look into this.

-George
Vladimir Oltean Nov. 26, 2020, 10:05 p.m. UTC | #6
On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > >
> > > > > As Vladimir already mentioned to you the statistics which have
> > > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > > if you just drop those for now.
> > > >
> > > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > > that thread again and try to make sense of it.
> > >
> > > I feel that I should ask. Do you want me to look into exposing RMON
> > > interface counters through rtnetlink (I've never done anything like that
> > > before either, but there's a beginning for everything), or are you going
> > > to?
> >
> > So I started to add .ndo_get_stats64 based on the hardware counters, but
> > I already hit the first roadblock, as described by the wise words of
> > Documentation/networking/statistics.rst:
> >
> > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> > | from the device it should do so periodically asynchronously and only return
> > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> > | allows setting the frequency of refreshing statistics, if needed.
> >
> >
> > Unfortunately, I feel this is almost unacceptable for a DSA driver that
> > more often than not needs to retrieve these counters from a slow and
> > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > option, because the only periodic interval that would not put absurdly
> > high pressure on the limited SPI bandwidth would be a readout interval
> > that gives you very old counters.
>
> Indeed it seems ndo_get_stats64() usually gets data over something
> like a local or PCIe bus or from software. I had a brief look to see
> if I could find another driver that was getting the stats over a slow
> bus and didn't notice anything. If you haven't already you might do a
> quick grep and see if anything pops out to you.
>
> >
> > What exactly is it that incurs the atomic context? I cannot seem to
> > figure out from this stack trace:
>
> I think something in fs/seq_file.c is taking an rcu lock.

Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
locally from dev_seq_start in net/core/net-procfs.c. The reason is that
/proc/net/dev iterates through all interfaces from the current netns,
and it is precisely that that creates atomic context. You used to need
to hold the rwlock_t dev_base_lock, but now you can also "get away" with
the RCU read-side lock. Either way, both are atomic context, so it
doesn't help.
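
For reference, the relevant iterator in net/core/net-procfs.c looks
roughly like this (simplified; the position bookkeeping and list walk
are elided):

```c
/* net/core/net-procfs.c, simplified: the seq_file start callback
 * itself enters an RCU read-side critical section, so everything
 * called while iterating /proc/net/dev -- including each device's
 * .ndo_get_stats64 -- runs in atomic context. */
static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
	__acquires(RCU)
{
	rcu_read_lock();
	if (!*pos)
		return SEQ_START_TOKEN;
	/* ... walk the netdev list under RCU ... */
}

static void dev_seq_stop(struct seq_file *seq, void *v)
	__releases(RCU)
{
	rcu_read_unlock();
}
```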

commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed Nov 4 05:43:23 2009 -0800

    net: Introduce for_each_netdev_rcu() iterator

    Adds RCU management to the list of netdevices.

    Convert some for_each_netdev() users to RCU version, if
    it can avoid read_lock-ing dev_base_lock

    Ie:
            read_lock(&dev_base_loack);
            for_each_netdev(net, dev)
                    some_action();
            read_unlock(&dev_base_lock);

    becomes :

            rcu_read_lock();
            for_each_netdev_rcu(net, dev)
                    some_action();
            rcu_read_unlock();


    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

So... yeah. As long as this kernel interface exists, it needs to run in
atomic context, by construction. Great.

> I suppose it doesn't really matter though since the documentation says
> we can't sleep.

You're talking, I suppose, about these words of wisdom in
Documentation/filesystems/seq_file.rst?

| However, the seq_file code (by design) will not sleep between the calls
| to start() and stop(), so holding a lock during that time is a
| reasonable thing to do. The seq_file code will also avoid taking any
| other locks while the iterator is active.

It _doesn't_ say that you can't sleep between start() and stop(), right?
It just says that if you want to keep the seq_file iterator atomic, the
seq_file code is not sabotaging you by sleeping. But you still could
sleep if you wanted to.

Back to the statistics counters.

How accurate do the counters in /proc/net/dev need to be? What programs
consume those? Could they be more out of date than the ones retrieved
through rtnetlink?

I'm thinking that maybe we could introduce another ndo, something like
.ndo_get_stats64_blocking, that could be called from all places except
from net/core/net-procfs.c. That one could still call the non-blocking
variant. Then, depending on the answer to the question "how inaccurate
could we reasonably leave /proc/net/dev", we could:
- just return zeroes there
- return the counters cached from the last blocking call

> It does seem to me that this is something that needs to be sorted out
> at the subsystem level and that this driver has been "caught in the
> crossfire". Any guidance on how we could proceed with this driver and
> revisit this when we have answers to these questions at the subsystem
> level would be appreciated if substantial time will be required to
> work this out.

Now seriously, who isn't caught in the crossfire here? Let's do some
brainstorming and it will be quick and painless.
Jakub Kicinski Nov. 27, 2020, 6:35 p.m. UTC | #7
On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:  
> > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:  
> > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:  
> > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},  
> > > > > >
> > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > > > if you just drop those for now.  
> > > > >
> > > > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > > > that thread again and try to make sense of it.  
> > > >
> > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > interface counters through rtnetlink (I've never done anything like that
> > > > before either, but there's a beginning for everything), or are you going
> > > > to?  
> > >
> > > So I started to add .ndo_get_stats64 based on the hardware counters, but
> > > I already hit the first roadblock, as described by the wise words of
> > > Documentation/networking/statistics.rst:
> > >
> > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> > > | from the device it should do so periodically asynchronously and only return
> > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> > > | allows setting the frequency of refreshing statistics, if needed.

I should have probably also mentioned here that unlike most NDOs
.ndo_get_stats64 is called without rtnl lock held at all.

> > > Unfortunately, I feel this is almost unacceptable for a DSA driver that
> > > more often than not needs to retrieve these counters from a slow and
> > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > option, because the only periodic interval that would not put absurdly
> > > high pressure on the limited SPI bandwidth would be a readout interval
> > > that gives you very old counters.  

What's a high interval? It's not uncommon to refresh the stats once a
second even in high performance NICs.

> > Indeed it seems ndo_get_stats64() usually gets data over something
> > like a local or PCIe bus or from software. I had a brief look to see
> > if I could find another driver that was getting the stats over a slow
> > bus and didn't notice anything. If you haven't already you might do a
> > quick grep and see if anything pops out to you.
> >  
> > >
> > > What exactly is it that incurs the atomic context? I cannot seem to
> > > figure out from this stack trace:  
> >
> > I think something in fs/seq_file.c is taking an rcu lock.  
> 
> Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> /proc/net/dev iterates through all interfaces from the current netns,
> and it is precisely that that creates atomic context. You used to need
> to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> the RCU read-side lock. Either way, both are atomic context, so it
> doesn't help.
> 
> commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Nov 4 05:43:23 2009 -0800
> 
>     net: Introduce for_each_netdev_rcu() iterator
> 
>     Adds RCU management to the list of netdevices.
> 
>     Convert some for_each_netdev() users to RCU version, if
>     it can avoid read_lock-ing dev_base_lock
> 
>     Ie:
>             read_lock(&dev_base_loack);
>             for_each_netdev(net, dev)
>                     some_action();
>             read_unlock(&dev_base_lock);
> 
>     becomes :
> 
>             rcu_read_lock();
>             for_each_netdev_rcu(net, dev)
>                     some_action();
>             rcu_read_unlock();
> 
> 
>     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> So... yeah. As long as this kernel interface exists, it needs to run in
> atomic context, by construction. Great.
>
> > I suppose it doesn't really matter though since the documentation says
> > we can't sleep.  
> 
> You're talking, I suppose, about these words of wisdom in
> Documentation/filesystems/seq_file.rst?
> 
> | However, the seq_file code (by design) will not sleep between the calls
> | to start() and stop(), so holding a lock during that time is a
> | reasonable thing to do. The seq_file code will also avoid taking any
> | other locks while the iterator is active.
> 
> It _doesn't_ say that you can't sleep between start() and stop(), right?
> It just says that if you want to keep the seq_file iterator atomic, the
> seq_file code is not sabotaging you by sleeping. But you still could
> sleep if you wanted to.
> 
> Back to the statistics counters.
> 
> How accurate do the counters in /proc/net/dev need to be? What programs
> consume those? Could they be more out of date than the ones retrieved
> through rtnetlink?

ifconfig does for sure.

> I'm thinking that maybe we could introduce another ndo, something like
> .ndo_get_stats64_blocking, that could be called from all places except
> from net/core/net-procfs.c. That one could still call the non-blocking
> variant. Then, depending on the answer to the question "how inaccurate
> could we reasonably leave /proc/net/dev", we could:
> - just return zeroes there
> - return the counters cached from the last blocking call

I'd rather not introduce divergent behavior like that.

Is the periodic refresh really that awful? We're mostly talking error
counters here so every second or every few seconds should be perfectly
fine.
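
The pattern statistics.rst describes would look roughly like this; a
sketch only, with all names (including read_hw_counters_over_slow_bus())
invented for illustration, not taken from any real driver:

```c
/* Periodic-refresh pattern: read the hardware counters from
 * sleepable context on a timer, and return only the cached
 * snapshot from the atomic .ndo_get_stats64 path. */
struct port_priv {
	struct rtnl_link_stats64 cached;	/* last snapshot */
	spinlock_t stats_lock;			/* protects cached */
	struct delayed_work stats_work;		/* periodic readout */
};

static void stats_work_fn(struct work_struct *work)
{
	struct port_priv *p = container_of(work, struct port_priv,
					   stats_work.work);
	struct rtnl_link_stats64 fresh;

	/* May sleep: SPI/I2C/MDIO transfer happens here. */
	read_hw_counters_over_slow_bus(p, &fresh);

	spin_lock(&p->stats_lock);
	p->cached = fresh;
	spin_unlock(&p->stats_lock);

	/* Re-arm; ~1 s, per the interval suggested above. */
	schedule_delayed_work(&p->stats_work, HZ);
}

static void port_get_stats64(struct net_device *dev,
			     struct rtnl_link_stats64 *s)
{
	struct port_priv *p = netdev_priv(dev);

	/* Atomic context: copy the snapshot, never touch the bus. */
	spin_lock(&p->stats_lock);
	*s = p->cached;
	spin_unlock(&p->stats_lock);
}
```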

> > It does seem to me that this is something that needs to be sorted out
> > at the subsystem level and that this driver has been "caught in the
> > crossfire". Any guidance on how we could proceed with this driver and
> > revisit this when we have answers to these questions at the subsystem
> > level would be appreciated if substantial time will be required to
> > work this out.  
> 
> Now seriously, who isn't caught in the crossfire here? Let's do some
> brainstorming and it will be quick and painless.
Vladimir Oltean Nov. 27, 2020, 7:50 p.m. UTC | #8
On Fri, Nov 27, 2020 at 12:47:41PM -0600, George McCollister wrote:
> On Fri, Nov 27, 2020, 12:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> > > On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > > > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com>
> > wrote:
> > > > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister
> > wrote:
> > > > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > > > > >
> > > > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > > > corresponding entries in struct rtnl_link_stats64 should be
> > reported
> > > > > > > > the standard way. The infra for DSA may not be in place yet,
> > so best
> > > > > > > > if you just drop those for now.
> > > > > > >
> > > > > > > Okay, that clears it up a bit. Just drop these 6? I'll read
> > through
> > > > > > > that thread again and try to make sense of it.
> > > > > >
> > > > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > > > interface counters through rtnetlink (I've never done anything
> > like that
> > > > > > before either, but there's a beginning for everything), or are you
> > going
> > > > > > to?
> > > > >
> > > > > So I started to add .ndo_get_stats64 based on the hardware counters,
> > but
> > > > > I already hit the first roadblock, as described by the wise words of
> > > > > Documentation/networking/statistics.rst:
> > > > >
> > > > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > > > | via `/proc/net/dev`. If driver may sleep when retrieving the
> > statistics
> > > > > | from the device it should do so periodically asynchronously and
> > only return
> > > > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt
> > coalescing interface
> > > > > | allows setting the frequency of refreshing statistics, if needed.
> >
> > I should have probably also mentioned here that unlike most NDOs
> > .ndo_get_stats64 is called without rtnl lock held at all.
> >
> > > > > Unfortunately, I feel this is almost unacceptable for a DSA driver
> > that
> > > > > more often than not needs to retrieve these counters from a slow and
> > > > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > > > option, because the only periodic interval that would not put
> > absurdly
> > > > > high pressure on the limited SPI bandwidth would be a readout
> > interval
> > > > > that gives you very old counters.
> >
> > What's a high interval? It's not uncommon to refresh the stats once a
> > second even in high performance NICs.
> >
> > > > Indeed it seems ndo_get_stats64() usually gets data over something
> > > > like a local or PCIe bus or from software. I had a brief look to see
> > > > if I could find another driver that was getting the stats over a slow
> > > > bus and didn't notice anything. If you haven't already you might do a
> > > > quick grep and see if anything pops out to you.
> > > >
> > > > >
> > > > > What exactly is it that incurs the atomic context? I cannot seem to
> > > > > figure out from this stack trace:
> > > >
> > > > I think something in fs/seq_file.c is taking an rcu lock.
> > >
> > > Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> > > locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> > > /proc/net/dev iterates through all interfaces from the current netns,
> > > and it is precisely that that creates atomic context. You used to need
> > > to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> > > the RCU read-side lock. Either way, both are atomic context, so it
> > > doesn't help.
> > >
> > > commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> > > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > > Date:   Wed Nov 4 05:43:23 2009 -0800
> > >
> > >     net: Introduce for_each_netdev_rcu() iterator
> > >
> > >     Adds RCU management to the list of netdevices.
> > >
> > >     Convert some for_each_netdev() users to RCU version, if
> > >     it can avoid read_lock-ing dev_base_lock
> > >
> > >     Ie:
> > >             read_lock(&dev_base_loack);
> > >             for_each_netdev(net, dev)
> > >                     some_action();
> > >             read_unlock(&dev_base_lock);
> > >
> > >     becomes :
> > >
> > >             rcu_read_lock();
> > >             for_each_netdev_rcu(net, dev)
> > >                     some_action();
> > >             rcu_read_unlock();
> > >
> > >
> > >     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > So... yeah. As long as this kernel interface exists, it needs to run in
> > > atomic context, by construction. Great.
> > >
> > > > I suppose it doesn't really matter though since the documentation says
> > > > we can't sleep.
> > >
> > > You're talking, I suppose, about these words of wisdom in
> > > Documentation/filesystems/seq_file.rst?
> > >
> > > | However, the seq_file code (by design) will not sleep between the calls
> > > | to start() and stop(), so holding a lock during that time is a
> > > | reasonable thing to do. The seq_file code will also avoid taking any
> > > | other locks while the iterator is active.
> > >
> > > It _doesn't_ say that you can't sleep between start() and stop(), right?
> > > It just says that if you want to keep the seq_file iterator atomic, the
> > > seq_file code is not sabotaging you by sleeping. But you still could
> > > sleep if you wanted to.
> > >
> > > Back to the statistics counters.
> > >
> > > How accurate do the counters in /proc/net/dev need to be? What programs
> > > consume those? Could they be more out of date than the ones retrieved
> > > through rtnetlink?
> >
> > ifconfig does for sure.
> >
> > > I'm thinking that maybe we could introduce another ndo, something like
> > > .ndo_get_stats64_blocking, that could be called from all places except
> > > from net/core/net-procfs.c. That one could still call the non-blocking
> > > variant. Then, depending on the answer to the question "how inaccurate
> > > could we reasonably leave /proc/net/dev", we could:
> > > - just return zeroes there
> > > - return the counters cached from the last blocking call
> >
> > I'd rather not introduce divergent behavior like that.
> >
> > Is the periodic refresh really that awful? We're mostly talking error
> > counters here so every second or every few seconds should be perfectly
> > fine.
> >
> 
That sounds pretty reasonable to me. Worst case is about a 100 kbit/s
management bus, and with the amount of data we're talking about that should
be no problem so long as there's a way to raise the interval or disable it
entirely.

100 Kbps = 12.5 KB/s.
sja1105 has 93 64-bit counters, and during every counter refresh cycle I
would need to get some counters from the beginning of that range, some
from the middle and some from the end. With all the back-and-forth
between the sja1105 driver and the SPI controller driver, and the
protocol overhead associated with creating a "SPI read" message, it is
all in all more efficient to just issue a burst read operation for all
the counters, even ones that I'm not going to use. So let's go with
that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
and that's just for the raw I/O, that thing which keeps the SPI mutex
locked. You know what else I could do during that time? Anything else!
Like for example perform PTP timestamp reconstruction, which has a hard
deadline at 135 ms after the packet was received, and would appreciate
if the SPI mutex was not locked for 59 ms every second.
And all of that, for what benefit? Honestly any periodic I/O over the
management interface is too much I/O, unless there is any strong reason
to have it.

Also, even the simple idea of providing out-of-date counters to user
space running in syscall context has me scratching my head. I can only
think of all the driver selftests that check statistics counters
before, send a packet, then check the counters after. What are those
supposed to do, add a sleep to make sure the counters were updated?
Andrew Lunn Nov. 27, 2020, 8:47 p.m. UTC | #9
> Is the periodic refresh really that awful? We're mostly talking error
> counters here so every second or every few seconds should be perfectly
> fine.

Humm, i would prefer error counts to be more correct than anything
else. When debugging issues, you generally don't care how many packets
worked. It is how many failed you are interested in, and how that number
of failures increases.

So long as these counters are still in ethtool -S, i guess it does not
matter. That i do trust to be accurate, and probably consistent across
the counters it returns.

	Andrew
George McCollister Nov. 27, 2020, 8:58 p.m. UTC | #10
On Fri, Nov 27, 2020 at 1:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Fri, Nov 27, 2020 at 12:47:41PM -0600, George McCollister wrote:
> > On Fri, Nov 27, 2020, 12:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > > On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> > > > On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > > > > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com>
> > > wrote:
> > > > > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > > > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister
> > > wrote:
> > > > > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > > > > > >
> > > > > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > > > > corresponding entries in struct rtnl_link_stats64 should be
> > > reported
> > > > > > > > > the standard way. The infra for DSA may not be in place yet,
> > > so best
> > > > > > > > > if you just drop those for now.
> > > > > > > >
> > > > > > > > Okay, that clears it up a bit. Just drop these 6? I'll read
> > > through
> > > > > > > > that thread again and try to make sense of it.
> > > > > > >
> > > > > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > > > > interface counters through rtnetlink (I've never done anything
> > > like that
> > > > > > > before either, but there's a beginning for everything), or are you
> > > going
> > > > > > > to?
> > > > > >
> > > > > > So I started to add .ndo_get_stats64 based on the hardware counters,
> > > but
> > > > > > I already hit the first roadblock, as described by the wise words of
> > > > > > Documentation/networking/statistics.rst:
> > > > > >
> > > > > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > > > > | via `/proc/net/dev`. If driver may sleep when retrieving the
> > > statistics
> > > > > > | from the device it should do so periodically asynchronously and
> > > only return
> > > > > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt
> > > coalescing interface
> > > > > > | allows setting the frequency of refreshing statistics, if needed.
> > >
> > > I should have probably also mentioned here that unlike most NDOs
> > > .ndo_get_stats64 is called without rtnl lock held at all.
> > >
> > > > > > Unfortunately, I feel this is almost unacceptable for a DSA driver
> > > that
> > > > > > more often than not needs to retrieve these counters from a slow and
> > > > > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > > > > option, because the only periodic interval that would not put
> > > absurdly
> > > > > > high pressure on the limited SPI bandwidth would be a readout
> > > interval
> > > > > > that gives you very old counters.
> > >
> > > What's a high interval? It's not uncommon to refresh the stats once a
> > > second even in high performance NICs.
> > >
> > > > > Indeed it seems ndo_get_stats64() usually gets data over something
> > > > > like a local or PCIe bus or from software. I had a brief look to see
> > > > > if I could find another driver that was getting the stats over a slow
> > > > > bus and didn't notice anything. If you haven't already you might do a
> > > > > quick grep and see if anything pops out to you.
> > > > >
> > > > > >
> > > > > > What exactly is it that incurs the atomic context? I cannot seem to
> > > > > > figure out from this stack trace:
> > > > >
> > > > > I think something in fs/seq_file.c is taking an rcu lock.
> > > >
> > > > Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> > > > locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> > > > /proc/net/dev iterates through all interfaces from the current netns,
> > > > and it is precisely that that creates atomic context. You used to need
> > > > to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> > > > the RCU read-side lock. Either way, both are atomic context, so it
> > > > doesn't help.
> > > >
> > > > commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> > > > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > > > Date:   Wed Nov 4 05:43:23 2009 -0800
> > > >
> > > >     net: Introduce for_each_netdev_rcu() iterator
> > > >
> > > >     Adds RCU management to the list of netdevices.
> > > >
> > > >     Convert some for_each_netdev() users to RCU version, if
> > > >     it can avoid read_lock-ing dev_base_lock
> > > >
> > > >     Ie:
> > > >             read_lock(&dev_base_loack);
> > > >             for_each_netdev(net, dev)
> > > >                     some_action();
> > > >             read_unlock(&dev_base_lock);
> > > >
> > > >     becomes :
> > > >
> > > >             rcu_read_lock();
> > > >             for_each_netdev_rcu(net, dev)
> > > >                     some_action();
> > > >             rcu_read_unlock();
> > > >
> > > >
> > > >     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >
> > > > So... yeah. As long as this kernel interface exists, it needs to run in
> > > > atomic context, by construction. Great.
> > > >
> > > > > I suppose it doesn't really matter though since the documentation says
> > > > > we can't sleep.
> > > >
> > > > You're talking, I suppose, about these words of wisdom in
> > > > Documentation/filesystems/seq_file.rst?
> > > >
> > > > | However, the seq_file code (by design) will not sleep between the calls
> > > > | to start() and stop(), so holding a lock during that time is a
> > > > | reasonable thing to do. The seq_file code will also avoid taking any
> > > > | other locks while the iterator is active.
> > > >
> > > > It _doesn't_ say that you can't sleep between start() and stop(), right?
> > > > It just says that if you want to keep the seq_file iterator atomic, the
> > > > seq_file code is not sabotaging you by sleeping. But you still could
> > > > sleep if you wanted to.
> > > >
> > > > Back to the statistics counters.
> > > >
> > > > How accurate do the counters in /proc/net/dev need to be? What programs
> > > > consume those? Could they be more out of date than the ones retrieved
> > > > through rtnetlink?
> > >
> > > ifconfig does for sure.
> > >
> > > > I'm thinking that maybe we could introduce another ndo, something like
> > > > .ndo_get_stats64_blocking, that could be called from all places except
> > > > from net/core/net-procfs.c. That one could still call the non-blocking
> > > > variant. Then, depending on the answer to the question "how inaccurate
> > > > could we reasonably leave /proc/net/dev", we could:
> > > > - just return zeroes there
> > > > - return the counters cached from the last blocking call
> > >
> > > I'd rather not introduce divergent behavior like that.
> > >
> > > Is the periodic refresh really that awful? We're mostly talking error
> > > counters here so every second or every few seconds should be perfectly
> > > fine.
> > >
> >
> > That sounds pretty reasonable to me. Worst case is about a 100 kbit/s
> > management bus, and with the amount of data we're talking about that should
> > be no problem so long as there's a way to raise the interval or disable it
> > entirely.
>
> 100 Kbps = 12.5KB/s.
> sja1105 has 93 64-bit counters, and during every counter refresh cycle I

Yeah, that's quite big. The xrs700x counters are only 16 bit. They
need to be polled on an interval anyway or they will roll over.

> would need to get some counters from the beginning of that range, some
> from the middle and some from the end. With all the back-and-forth
> between the sja1105 driver and the SPI controller driver, and the
> protocol overhead associated with creating a "SPI read" message, it is
> all in all more efficient to just issue a burst read operation for all
> the counters, even ones that I'm not going to use. So let's go with
> that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> and that's just for the raw I/O, that thing which keeps the SPI mutex
> locked. You know what else I could do during that time? Anything else!
> Like for example perform PTP timestamp reconstruction, which has a hard
> deadline at 135 ms after the packet was received, and would appreciate
> if the SPI mutex was not locked for 59 ms every second.

Indeed, however if you need to acquire this data at all it's going to
burden the system at that time. So unless you're able to stretch out
the reads over a length of time, whether you're polling every second
or once a day may not matter if you can never afford to miss a
deadline.

> And all of that, for what benefit? Honestly any periodic I/O over the
> management interface is too much I/O, unless there is any strong reason
> to have it.

True enough.

>
> Also, even the simple idea of providing out-of-date counters to user
> space running in syscall context has me scratching my head. I can only
> think of all the drivers in selftests that are checking statistics
> counters before, then they send a packet, then they check the counters
> after. What do those need to do, put a sleep to make sure the counters
> were updated?

Yeah, I don't know enough about how these are used. I'll defer to you.

I apologize if I'm slow at responding. Yesterday and today are
holidays in the US.
Jakub Kicinski Nov. 27, 2020, 9:13 p.m. UTC | #11
On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:
> > Is the periodic refresh really that awful? We're mostly talking error
> > counters here so every second or every few seconds should be perfectly
> > fine.  
> 
> Humm, i would prefer error counts to be more correct than anything
> else. When debugging issues, you generally don't care how many packets
> > worked. It is how many failed you are interested in, and how that number
> of failures increases.

Right, but not sure I'd use the word "correct". Perhaps "immediately up
to date"?

High speed NICs usually go through a layer of firmware before they
access the stats, IOW even if we always synchronously ask for the stats
in the kernel - in practice a lot of NICs (most?) will return some form
of cached information.

> So long as these counters are still in ethtool -S, i guess it does not
> matter. That i do trust to be accurate, and probably consistent across
> the counters it returns.

Not in the NIC designs I'm familiar with.

But anyway - this only matters in some strict testing harness, right?
Normal users will look at the stats after they notice issues (so minutes
/ hours later) or at the very best they'll look at a graph, which will
hardly require <1sec accuracy to when the error occurred.
Vladimir Oltean Nov. 27, 2020, 9:23 p.m. UTC | #12
On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:
> > > Is the periodic refresh really that awful? We're mostly talking error
> > > counters here so every second or every few seconds should be perfectly
> > > fine.
> >
> > Humm, i would prefer error counts to be more correct than anything
> > else. When debugging issues, you generally don't care how many packets
> > worked. It is how many failed you are interested in, and how that number
> > of failures increases.
>
> Right, but not sure I'd use the word "correct". Perhaps "immediately up
> to date"?
>
> High speed NICs usually go through a layer of firmware before they
> access the stats, IOW even if we always synchronously ask for the stats
> in the kernel - in practice a lot of NICs (most?) will return some form
> of cached information.
>
> > So long as these counters are still in ethtool -S, i guess it does not
> > matter. That i do trust to be accurate, and probably consistent across
> > the counters it returns.
>
> Not in the NIC designs I'm familiar with.
>
> But anyway - this only matters in some strict testing harness, right?
> Normal users will look at a stats after they noticed issues (so minutes
> / hours later) or at the very best they'll look at a graph, which will
> hardly require <1sec accuracy to when error occurred.

Either way, can we conclude that ndo_get_stats64 is not a replacement
for ethtool -S, since the latter is blocking and, if implemented correctly,
can return the counters at the time of the call (therefore making sure
that anything that happened before the syscall has been accounted into
the retrieved values), and the former isn't?

The whole discussion started because you said we shouldn't expose some
statistics counters in ethtool as long as they have a standardized
equivalent. Well, I think we still should.
Andrew Lunn Nov. 27, 2020, 9:32 p.m. UTC | #13
> > So long as these counters are still in ethtool -S, i guess it does not
> > matter. That i do trust to be accurate, and probably consistent across
> > the counters it returns.
> 
> Not in the NIC designs I'm familiar with.

Many NICs have a way to take a hardware snapshot of all counters. You
can then read them out as fast or slow as you want, since you read the
snapshot, not the live counters. As a result you can compare counters
against each other.

> But anyway - this only matters in some strict testing harness, right?
> Normal users will look at a stats after they noticed issues (so minutes
> / hours later) or at the very best they'll look at a graph, which will
> hardly require <1sec accuracy to when error occurred.

As Vladimir has pointed out, polling once per second over an i2c bus
is expensive. And there is an obvious linear cost with the number of
ports on these switches. And we need to keep latency down so that PTP
is accurate. Do we really want to be polling, for something which is
very unlikely to be used? I think we should probably take another look
at the locking and see if it can be modified to allow block, so we can
avoid this wasteful polling.

	Andrew
Andrew Lunn Nov. 27, 2020, 9:36 p.m. UTC | #14
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S is the best source of consistent, up to date statistics we
have. It seems silly not to include everything the hardware offers
there.

	Andrew
Jakub Kicinski Nov. 27, 2020, 9:37 p.m. UTC | #15
Replying to George's email 'cause I didn't get Vladimir's email from
the ML.

On Fri, 27 Nov 2020 14:58:29 -0600 George McCollister wrote:
> > 100 Kbps = 12.5KB/s.
> > sja1105 has 93 64-bit counters, and during every counter refresh cycle I  

Are these 93 for one port? That sounds like a lot... There're usually
~10 stats (per port) that are relevant to the standard netdev stats.

> Yeah, that's quite big. The xrs700x counters are only 16 bit. They
> need to be polled on an interval anyway or they will roll over.

Yup! That's pretty common.

> > would need to get some counters from the beginning of that range, some
> > from the middle and some from the end. With all the back-and-forth
> > between the sja1105 driver and the SPI controller driver, and the
> > protocol overhead associated with creating a "SPI read" message, it is
> > all in all more efficient to just issue a burst read operation for all
> > the counters, even ones that I'm not going to use. So let's go with
> > that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> > per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> > and that's just for the raw I/O, that thing which keeps the SPI mutex
> > locked. You know what else I could do during that time? Anything else!
> > Like for example perform PTP timestamp reconstruction, which has a hard
> > deadline at 135 ms after the packet was received, and would appreciate
> > if the SPI mutex was not locked for 59 ms every second.  
> 
> Indeed, however if you need to acquire this data at all it's going to
> burden the system at that time so unless you're able to stretch out
> the reads over a length of time whether or not you're polling every
> second or once a day may not matter if you're never able to miss a
> deadline.

Exactly, either way you gotta prepare for users polling those stats.
A design where stats are read synchronously and a user (an unprivileged
user, BTW) has the ability to disturb the operation of the system
sounds really flaky.

> > And all of that, for what benefit? Honestly any periodic I/O over the
> > management interface is too much I/O, unless there is any strong reason
> > to have it.  
> 
> True enough.
> 
> > Also, even the simple idea of providing out-of-date counters to user
> > space running in syscall context has me scratching my head. I can only
> > think of all the drivers in selftests that are checking statistics
> > counters before, then they send a packet, then they check the counters
> > after. What do those need to do, put a sleep to make sure the counters
> > were updated?  

Frankly life sounds simpler on the embedded networking planet than it is
on the one I'm living on ;) High speed systems are often eventually
consistent. Either because stats are gathered from HW periodically by
the FW, or RCU grace period has to expire, or workqueue has to run,
etc. etc. I know it's annoying for writing tests but it's manageable. 

If there is a better alternative I'm all ears but having /proc and
ifconfig return zeros for error counts while ip link doesn't will lead
to too much confusion IMO. While delayed update of stats is a fact of
life for _years_ now (hence it was baked into the ethtool -C API).
Jakub Kicinski Nov. 27, 2020, 10:03 p.m. UTC | #16
On Fri, 27 Nov 2020 23:23:42 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> > On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:  
> > > > Is the periodic refresh really that awful? We're mostly talking error
> > > > counters here so every second or every few seconds should be perfectly
> > > > fine.  
> > >
> > > Humm, i would prefer error counts to be more correct than anything
> > > else. When debugging issues, you generally don't care how many packets
> > > worked. It is how many failed you are interested in, and how that number
> > > of failures increases.  
> >
> > Right, but not sure I'd use the word "correct". Perhaps "immediately up
> > to date"?
> >
> > High speed NICs usually go through a layer of firmware before they
> > access the stats, IOW even if we always synchronously ask for the stats
> > in the kernel - in practice a lot of NICs (most?) will return some form
> > of cached information.
> >  
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.  
> >
> > Not in the NIC designs I'm familiar with.
> >
> > But anyway - this only matters in some strict testing harness, right?
> > Normal users will look at a stats after they noticed issues (so minutes
> > / hours later) or at the very best they'll look at a graph, which will
> > hardly require <1sec accuracy to when error occurred.  
> 
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S stats are not 100% up to date. Not on Netronome, Intel,
Broadcom or Mellanox NICs AFAIK.

> The whole discussion started because you said we shouldn't expose some
> statistics counters in ethtool as long as they have a standardized
> equivalent. Well, I think we still should.

Users must have access to stats via standard Linux interfaces with well
defined semantics. We cannot continue to live in a world where the user
has to guess the driver-specific name for ethtool -S to find out the number
of CRC errors...

I know it may not matter to a driver developer, and it didn't matter
much to me when I was one, because in my drivers they always had the
same name. But trying to monitor a fleet of hardware from multiple
vendors is very painful with the status quo, we must do better.
We can't have users scrape through what is basically a debug interface
to get to vital information.

I'd really love to find a way out of the procfs issue, but I'm not sure
if there is one.
Jakub Kicinski Nov. 27, 2020, 10:14 p.m. UTC | #17
On Fri, 27 Nov 2020 22:32:44 +0100 Andrew Lunn wrote:
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.  
> > 
> > Not in the NIC designs I'm familiar with.  
> 
> Many NICs have a way to take a hardware snapshot of all counters.
> You can then read them out as fast or slow as you want, since you
> read the snapshot, not the live counters. As a result you can compare
> counters against each other.

Curious, does Marvell HW do it?
 
> > But anyway - this only matters in some strict testing harness,
> > right? Normal users will look at a stats after they noticed issues
> > (so minutes / hours later) or at the very best they'll look at a
> > graph, which will hardly require <1sec accuracy to when error
> > occurred.  
> 
> As Vladimir has pointed out, polling once per second over an i2c bus
> is expensive. And there is an obvious linear cost with the number of
> ports on these switches. And we need to keep latency down so that PTP
> is accurate. Do we really want to be polling, for something which is
> very unlikely to be used?

IDK I find it very questionable if the system design doesn't take into
account that statistics are retrieved every n seconds. We can perhaps
scale the default period with the speed of the bus?

> I think we should probably take another look at the locking and see
> if it can be modified to allow blocking, so we can avoid this wasteful
> polling.

It'd be great. Worst case scenario we can have very, very rare polling
+ a synchronous callback? But we shouldn't leave /proc completely
incorrect.

Converting /proc to be blocking may be a little risky, there may be
legacy daemons, and other software which people have running which reads
/proc just to get a list of interfaces or something silly, which will
suddenly start causing latencies to the entire stack.

But perhaps we can try and find out. I'm not completely opposed.
Vladimir Oltean Nov. 27, 2020, 10:42 p.m. UTC | #18
On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> Replying to George's email 'cause I didn't get Vladimir's email from
> the ML.
>
> On Fri, 27 Nov 2020 14:58:29 -0600 George McCollister wrote:
> > > 100 Kbps = 12.5KB/s.
> > > sja1105 has 93 64-bit counters, and during every counter refresh cycle I
>
> Are these 93 for one port? That sounds like a lot.. There're usually
> ~10 stats (per port) that are relevant to the standard netdev stats.

I would like to report the rx_errors as an aggregate of a few other
discrete counters that are spread far apart. Hence the idea of reading
the entire region delegated to port counters using a single SPI
transaction.

> > Yeah, that's quite big. The xrs700x counters are only 16 bit. They
> > need to be polled on an interval anyway or they will roll.
>
> Yup! That's pretty common.
>
> > > would need to get some counters from the beginning of that range, some
> > > from the middle and some from the end. With all the back-and-forth
> > > between the sja1105 driver and the SPI controller driver, and the
> > > protocol overhead associated with creating a "SPI read" message, it is
> > > all in all more efficient to just issue a burst read operation for all
> > > the counters, even ones that I'm not going to use. So let's go with
> > > that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> > > per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> > > and that's just for the raw I/O, that thing which keeps the SPI mutex
> > > locked. You know what else I could do during that time? Anything else!
> > > Like for example perform PTP timestamp reconstruction, which has a hard
> > > deadline at 135 ms after the packet was received, and would appreciate
> > > if the SPI mutex was not locked for 59 ms every second.
> >
> > Indeed. However, if you need to acquire this data at all, it's going
> > to burden the system at that time. So unless you're able to stretch
> > the reads out over a length of time, whether you're polling every
> > second or once a day may not matter if you're never allowed to miss
> > a deadline.
>
> Exactly, either way you gotta prepare for users polling those stats.
> A design where stats are read synchronously and user (an unprivileged
> user, BTW) has the ability to disturb the operation of the system
> sounds really flaky.

So I was probably overreacting with the previous estimate, since
- nobody in their right mind is going to use sja1105 at a SPI clock of
  100 kHz, but rather at least 100 times that value.
- other buses, like I2C/MDIO, don't really have the notion of variable
  buffer sizes and burst transactions. So the wait time has a smaller
  upper bound there, because of the implicit "cond_resched" given by the
  need to read word by word. In retrospect it would be wiser to not make
  use of burst reads for getting stats over SPI either.

But to your point. The worst case is worse than periodic readouts, and
the system needs to be resilient against them. Well, some things are
finicky no matter what you do and how well you protect, like phc2sys.
So for those, you need to be more radical, and you need to do cross
timestamping as close to the actual I/O as possible:
https://lwn.net/Articles/798467/
But nonetheless, point taken. As long as the system withstands the worst
case, then the impact it has on latency is not catastrophic, although it
is still there. But who would not like a system which achieves the same
thing by doing less? I feel that not sorting this out now will deter
many DSA driver writers from using ndo_get_stats64, if they don't have
otherwise a reason to schedule a periodic stats workqueue anyway.

> > > And all of that, for what benefit? Honestly any periodic I/O over the
> > > management interface is too much I/O, unless there is any strong reason
> > > to have it.
> >
> > True enough.
> >
> > > Also, even the simple idea of providing out-of-date counters to user
> > > space running in syscall context has me scratching my head. I can only
> > > think of all the drivers in selftests that are checking statistics
> > > counters before, then they send a packet, then they check the counters
> > > after. What do those need to do, put a sleep to make sure the counters
> > > were updated?
>
> Frankly life sounds simpler on the embedded networking planet than it is
> on the one I'm living on ;) High speed systems are often eventually
> consistent. Either because stats are gathered from HW periodically by
> the FW, or RCU grace period has to expire, or workqueue has to run,
> etc. etc. I know it's annoying for writing tests but it's manageable.

Yes, but all of these are problems which I do not have, so I don't see
why I would want to suffer from others' challenges.

> If there is a better alternative I'm all ears but having /proc and
> ifconfig return zeros for error counts while ip link doesn't will lead
> to too much confusion IMO. While delayed update of stats is a fact of
> life for _years_ now (hence it was baked into the ethtool -C API).

I don't know the net/core folder well enough to say anything definitive
about it. To me nothing looked obviously possible.
Although, in the mind of somebody who doesn't understand how other
pieces of code seem to get away with exposing the same object through
two different types of locking (like br_vlan_get_pvid and
br_vlan_get_pvid_rcu), maybe an approach worth trying is to take the
RTNL in net-procfs.c instead of RCU. But this is just, you know,
speculating about things I don't understand.
Andrew Lunn Nov. 27, 2020, 10:46 p.m. UTC | #19
On Fri, Nov 27, 2020 at 02:14:02PM -0800, Jakub Kicinski wrote:
> On Fri, 27 Nov 2020 22:32:44 +0100 Andrew Lunn wrote:
> > > > So long as these counters are still in ethtool -S, i guess it does not
> > > > matter. That i do trust to be accurate, and probably consistent across
> > > > the counters it returns.  
> > > 
> > > Not in the NIC designs I'm familiar with.  
> > 
> > Many NICs have a way to take a hardware snapshot of all counters.
> > You can then read them out as fast or slow as you want, since you
> > read the snapshot, not the live counters. As a result you can compare
> > counters against each other.
> 
> Curious, does Marvell HW do it?

Yes. Every generation of Marvell SOHO switch has this.

> IDK I find it very questionable if the system design doesn't take into
> account that statistics are retrieved every n seconds. We can perhaps
> scale the default period with the speed of the bus?

You don't actually have much choice. I2C is defined to run at 100 kbps.
There is a fast mode which runs at 400 kbps. How do you design around
that? MDIO is around 2.5 Mbps, but has 50% overhead from the preamble.
SPI can be up to 50 Mbps, but there is no standard set of speeds. None
of the data sheets I've seen ever talk about recommended scheduling
policies for transactions over these buses. But testing has shown that
if you want good PTP, you need to keep them loaded as lightly as
possible. If you don't have PTP it becomes less important.

   Andrew
Vladimir Oltean Nov. 27, 2020, 11:21 p.m. UTC | #20
On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> High speed systems are often eventually consistent. Either because
> stats are gathered from HW periodically by the FW, or RCU grace period
> has to expire, or workqueue has to run, etc. etc. I know it's annoying
> for writing tests but it's manageable.

Out of curiosity, what does a test writer need to do to get out of the
"eventual consistency" conundrum in a portable way and answer the
question "has my packet not been received by the interface or has the
counter just not updated"?
Andrew Lunn Nov. 27, 2020, 11:30 p.m. UTC | #21
> If there is a better alternative I'm all ears but having /proc and
> ifconfig return zeros for error counts while ip link doesn't will lead
> to too much confusion IMO. While delayed update of stats is a fact of
> life for _years_ now (hence it was baked into the ethtool -C API).

How about dev_seq_start() issues a netdev notifier chain event, asking
devices which care to update their cached rtnl_link_stats64 counters.
They can decide if their cache is too old, and do a blocking read for
new values.

Once the notifier has completed, dev_seq_start() can then
rcu_read_lock() and do the actual collection of stats from the drivers
non-blocking.

	Andrew
Vladimir Oltean Nov. 27, 2020, 11:39 p.m. UTC | #22
On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > If there is a better alternative I'm all ears but having /proc and
> > ifconfig return zeros for error counts while ip link doesn't will lead
> > to too much confusion IMO. While delayed update of stats is a fact of
> > life for _years_ now (hence it was baked into the ethtool -C API).
> 
> How about dev_seq_start() issues a netdev notifier chain event, asking
> devices which care to update their cached rtnl_link_stats64 counters.
> They can decide if their cache is too old, and do a blocking read for
> new values.
> 
> Once the notifier has completed, dev_seq_start() can then
> rcu_read_lock() and do the actual collection of stats from the drivers
> non-blocking.

That sounds smart. I can try to prototype that and see how well it
works, or do you want to?
Jakub Kicinski Nov. 27, 2020, 11:51 p.m. UTC | #23
On Sat, 28 Nov 2020 01:21:40 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> > High speed systems are often eventually consistent. Either because
> > stats are gathered from HW periodically by the FW, or RCU grace period
> > has to expire, or workqueue has to run, etc. etc. I know it's annoying
> > for writing tests but it's manageable.  
> 
> Out of curiosity, what does a test writer need to do to get out of the
> "eventual consistency" conundrum in a portable way and answer the
> question "has my packet not been received by the interface or has the
> counter just not updated"?

Retry for a reasonable amount of time for the system state to converge.
E.g. bpftool_prog_list_wait() in selftests/bpf/test_offloads.py, or
check_tables() in selftests/drivers/net/udp_tunnel_nic.sh.

Sadly I was too lazy to open source the driver tests we built at
Netronome.
Jakub Kicinski Nov. 27, 2020, 11:56 p.m. UTC | #24
What is it with my email today, didn't get this one again.

On Sat, 28 Nov 2020 01:39:16 +0200 Vladimir Oltean wrote:
> On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > If there is a better alternative I'm all ears but having /proc and
> > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > to too much confusion IMO. While delayed update of stats is a fact of
> > > life for _years_ now (hence it was baked into the ethtool -C API).
> > 
> > How about dev_seq_start() issues a netdev notifier chain event, asking
> > devices which care to update their cached rtnl_link_stats64 counters.
> > They can decide if their cache is too old, and do a blocking read for
> > new values.

Just to avoid breaking the suggestion that seqfiles don't sleep after
.start? Hm. I thought BPF slept (or did cond_resched()) in the middle of
seq iteration. We should double check that seqfiles really can't sleep.

> > Once the notifier has completed, dev_seq_start() can then
> > rcu_read_lock() and do the actual collection of stats from the drivers
> > non-blocking.
Vladimir Oltean Nov. 28, 2020, 12:02 a.m. UTC | #25
On Sat, Nov 28, 2020 at 01:39:16AM +0200, Vladimir Oltean wrote:
> On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > If there is a better alternative I'm all ears but having /proc and
> > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > to too much confusion IMO. While delayed update of stats is a fact of
> > life for _years_ now (hence it was baked into the ethtool -C API).
> >
> > How about dev_seq_start() issues a netdev notifier chain event, asking
> > devices which care to update their cached rtnl_link_stats64 counters.
> > They can decide if their cache is too old, and do a blocking read for
> > new values.
> >
> > Once the notifier has completed, dev_seq_start() can then
> > rcu_read_lock() and do the actual collection of stats from the drivers
> > non-blocking.
>
> That sounds smart. I can try to prototype that and see how well it
> works, or do you want to?

The situation is like this:

static int call_netdevice_notifiers_info(unsigned long val,
					 struct netdev_notifier_info *info);

expects a non-NULL info->dev argument.

To get a net device you need to call:

#define for_each_netdev(net, d)		\
		list_for_each_entry(d, &(net)->dev_base_head, dev_list)

which has the following protection rules:

/*
 * The @dev_base_head list is protected by @dev_base_lock and the rtnl
 * semaphore.
 *
 * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
 *
 * Writers must hold the rtnl semaphore while they loop through the
 * dev_base_head list, and hold dev_base_lock for writing when they do the
 * actual updates.  This allows pure readers to access the list even
 * while a writer is preparing to update it.
 *
 * To put it another way, dev_base_lock is held for writing only to
 * protect against pure readers; the rtnl semaphore provides the
 * protection against other writers.
 *
 * See, for example usages, register_netdevice() and
 * unregister_netdevice(), which must be called with the rtnl
 * semaphore held.
 */

This means, as far as I understand, 2 things:
1. call_netdevice_notifiers_info doesn't help, since our problem is the
   same
2. I think that holding the RTNL should also be a valid way to iterate
   through the net devices in the current netns, and doing just that
   could be the simplest way out. It certainly worked when I tried it.
   But those could also be famous last words...
Andrew Lunn Nov. 28, 2020, 12:39 a.m. UTC | #26
> This means, as far as I understand, 2 things:
> 1. call_netdevice_notifiers_info doesn't help, since our problem is the
>    same
> 2. I think that holding the RTNL should also be a valid way to iterate
>    through the net devices in the current netns, and doing just that
>    could be the simplest way out. It certainly worked when I tried it.
>    But those could also be famous last words...

DSA makes the assumption it can block in a notifier change event.  For
example, NETDEV_CHANGEUPPER calls dsa_slave_changeupper() which calls
into the driver. We have not seen any warnings about sleeping while
atomic. So maybe see how NETDEV_CHANGEUPPER is issued.

	Andrew
Vladimir Oltean Nov. 28, 2020, 1:41 a.m. UTC | #27
On Sat, Nov 28, 2020 at 01:39:12AM +0100, Andrew Lunn wrote:
> > This means, as far as I understand, 2 things:
> > 1. call_netdevice_notifiers_info doesn't help, since our problem is the
> >    same
> > 2. I think that holding the RTNL should also be a valid way to iterate
> >    through the net devices in the current netns, and doing just that
> >    could be the simplest way out. It certainly worked when I tried it.
> >    But those could also be famous last words...
>
> DSA makes the assumption it can block in a notifier change event.  For
> example, NETDEV_CHANGEUPPER calls dsa_slave_changeupper() which calls
> into the driver. We have not seen any warnings about sleeping while
> atomic. So maybe see how NETDEV_CHANGEUPPER is issued.

Yes, this is in line with what I said. Even though we are pure readers,
it is still sufficient (albeit not necessary) to hold rtnl in order to
safely iterate through net->dev_base_head (or in this case, through its
hashmaps per name and per ifindex). However, rtnl is the only one that
offers sleepable context, since it is the only one that is a mutex.

So the patch could simply look like this, no notifiers needed:

-----------------------------[cut here]-----------------------------
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index c714e6a9dad4..1429e8c066d8 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/netdevice.h>
 #include <linux/proc_fs.h>
+#include <linux/rtnetlink.h>
 #include <linux/seq_file.h>
 #include <net/wext.h>
 
@@ -21,7 +22,7 @@ static inline struct net_device *dev_from_same_bucket(struct seq_file *seq, loff
 	unsigned int count = 0, offset = get_offset(*pos);
 
 	h = &net->dev_index_head[get_bucket(*pos)];
-	hlist_for_each_entry_rcu(dev, h, index_hlist) {
+	hlist_for_each_entry(dev, h, index_hlist) {
 		if (++count == offset)
 			return dev;
 	}
@@ -51,9 +52,9 @@ static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *p
  *	in detail.
  */
 static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
+	__acquires(rtnl_mutex)
 {
-	rcu_read_lock();
+	rtnl_lock();
 	if (!*pos)
 		return SEQ_START_TOKEN;
 
@@ -70,9 +71,9 @@ static void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 }
 
 static void dev_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
+	__releases(rtnl_mutex)
 {
-	rcu_read_unlock();
+	rtnl_unlock();
 }
 
 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
-----------------------------[cut here]-----------------------------

The advantage is that it can now sleep.

The disadvantage is that now it can sleep. It is a writer, so other
writers can block it out, and it can block out other writers. More
contention. Speaking of, here's an interesting patch from someone who
seems to enjoy running ifconfig:

commit f04565ddf52e401880f8ba51de0dff8ba51c99fd
Author: Mihai Maruseac <mihai.maruseac@gmail.com>
Date:   Thu Oct 20 20:45:10 2011 +0000

    dev: use name hash for dev_seq_ops

    Instead of using the dev->next chain and trying to resync at each call to
    dev_seq_start, use the name hash, keeping the bucket and the offset in
    seq->private field.

    Tests revealed the following results for ifconfig > /dev/null
            * 1000 interfaces:
                    * 0.114s without patch
                    * 0.089s with patch
            * 3000 interfaces:
                    * 0.489s without patch
                    * 0.110s with patch
            * 5000 interfaces:
                    * 1.363s without patch
                    * 0.250s with patch
            * 128000 interfaces (other setup):
                    * ~100s without patch
                    * ~30s with patch

    Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Jakub, I would like to hear more from you. I would still like to try
this patch out. You clearly have a lot more background with the code.
You said in an earlier reply that you should have also documented that
ndo_get_stats64 is one of the few NDOs that does not take the RTNL. Is
there a particular reason for that being so, and a reason why it can't
change?
Vladimir Oltean Nov. 28, 2020, 1:45 a.m. UTC | #28
On Fri, Nov 27, 2020 at 03:56:49PM -0800, Jakub Kicinski wrote:
> What is it with my email today, didn't get this one again.
>
> On Sat, 28 Nov 2020 01:39:16 +0200 Vladimir Oltean wrote:
> > On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > > If there is a better alternative I'm all ears but having /proc and
> > > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > > to too much confusion IMO. While delayed update of stats is a fact of
> > > life for _years_ now (hence it was baked into the ethtool -C API).
> > >
> > > How about dev_seq_start() issues a netdev notifier chain event, asking
> > > devices which care to update their cached rtnl_link_stats64 counters.
> > > They can decide if their cache is too old, and do a blocking read for
> > > new values.
>
> Just to avoid breaking the suggestion that seqfiles don't sleep after
> .start? Hm. I thought BPF slept (or did cond_reshed()) in the middle of
> seq iteration. We should double check that seqfiles really can't sleep.

I don't think seqfiles are forbidden from sleeping after start(), at
least that's my interpretation, confirmed by some tests. I added a might_sleep()
in quite a few places and there is no issue now that we don't take the
RCU read-side lock any longer. Have you seen my previous reply to
George:

| > I suppose it doesn't really matter though since the documentation says
| > we can't sleep.
|
| You're talking, I suppose, about these words of wisdom in
| Documentation/filesystems/seq_file.rst?
|
| | However, the seq_file code (by design) will not sleep between the calls
| | to start() and stop(), so holding a lock during that time is a
| | reasonable thing to do. The seq_file code will also avoid taking any
| | other locks while the iterator is active.
|
| It _doesn't_ say that you can't sleep between start() and stop(), right?
| It just says that if you want to keep the seq_file iterator atomic, the
| seq_file code is not sabotaging you by sleeping. But you still could
| sleep if you wanted to.
Jakub Kicinski Nov. 28, 2020, 2:15 a.m. UTC | #29
On Sat, 28 Nov 2020 03:41:06 +0200 Vladimir Oltean wrote:
> Jakub, I would like to hear more from you. I would still like to try
> this patch out. You clearly have a lot more background with the code.

Well, I've seen people run into the problem of this NDO not being able
to sleep, but I don't have much background or knowledge of what impact
the locking will have on real systems.

We will need to bring this up with Eric (probably best after the turkey
weekend is over).

In the meantime if you feel like it you may want to add some tracing /
printing to check which processes are accessing /proc/net/dev on your
platforms of interest, see if there is anything surprising.

> You said in an earlier reply that you should have also documented that
> ndo_get_stats64 is one of the few NDOs that does not take the RTNL. Is
> there a particular reason for that being so, and a reason why it can't
> change?

I just meant that as a way of documenting the status quo. I'm not aware
of any other place reading stats under RCU (which doesn't mean it
doesn't exist :)).

That said it is a little tempting to add a new per-netdev mutex here,
instead of congesting RTNL lock further, since today no correct driver
should depend on the RTNL lock.
George McCollister Nov. 30, 2020, 4:52 p.m. UTC | #30
On Fri, Nov 27, 2020 at 8:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 28 Nov 2020 03:41:06 +0200 Vladimir Oltean wrote:
> > Jakub, I would like to hear more from you. I would still like to try
> > this patch out. You clearly have a lot more background with the code.
>
> Well, I've seen people run into the problem of this NDO not being able
> to sleep, but I don't have much background or knowledge of what impact
> the locking will have on real systems.
>
> We will need to bring this up with Eric (probably best after the turkey
> weekend is over).
>
> In the meantime if you feel like it you may want to add some tracing /
> printing to check which processes are accessing /proc/net/dev on your
> platforms of interest, see if there is anything surprising.
>
> > You said in an earlier reply that you should have also documented that
> > ndo_get_stats64 is one of the few NDOs that does not take the RTNL. Is
> > there a particular reason for that being so, and a reason why it can't
> > change?
>
> I just meant that as a way of documenting the status quo. I'm not aware
> of any other place reading stats under RCU (which doesn't mean it
> doesn't exist :)).
>
> That said it is a little tempting to add a new per-netdev mutex here,
> instead of congesting RTNL lock further, since today no correct driver
> should depend on the RTNL lock.

Another possible option could be replacing for_each_netdev_rcu with
for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
currently used nowhere else in the kernel). Has anyone considered
using sleepable RCUs or thought of a reason they wouldn't work or
wouldn't be desirable? For more info search for SRCU in
Documentation/RCU/RTFP.txt
Vladimir Oltean Nov. 30, 2020, 11:50 p.m. UTC | #31
On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> Another possible option could be replacing for_each_netdev_rcu with
> for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> currently used nowhere else in the kernel). Has anyone considered
> using sleepable RCUs or thought of a reason they wouldn't work or
> wouldn't be desirable? For more info search for SRCU in
> Documentation/RCU/RTFP.txt

Want to take the lead?
George McCollister Nov. 30, 2020, 11:58 p.m. UTC | #32
On Mon, Nov 30, 2020 at 5:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> > Another possible option could be replacing for_each_netdev_rcu with
> > for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> > currently used nowhere else in the kernel). Has anyone considered
> > using sleepable RCUs or thought of a reason they wouldn't work or
> > wouldn't be desirable? For more info search for SRCU in
> > Documentation/RCU/RTFP.txt
>
> Want to take the lead?

I certainly could take a stab at it. It would be nice to get a
"doesn't sound like a terrible idea at first glance" from Jakub (and
anyone else) before starting on it. Maybe someone has brought this up
before and was shot down for $reason. If so that would be nice to
know. Of course it's possible I could also find some reason it won't
work after investigating/implementing/testing.
Vladimir Oltean Dec. 1, 2020, 12:19 a.m. UTC | #33
On Mon, Nov 30, 2020 at 05:58:17PM -0600, George McCollister wrote:
> On Mon, Nov 30, 2020 at 5:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> > > Another possible option could be replacing for_each_netdev_rcu with
> > > for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> > > currently used nowhere else in the kernel). Has anyone considered
> > > using sleepable RCUs or thought of a reason they wouldn't work or
> > > wouldn't be desirable? For more info search for SRCU in
> > > Documentation/RCU/RTFP.txt
> >
> > Want to take the lead?
>
> I certainly could take a stab at it. It would be nice to get a
> "doesn't sound like a terrible idea at first glance" from Jakub (and
> anyone else) before starting on it. Maybe someone has brought this up
> before and was shot down for $reason. If so that would be nice to
> know. Of course it's possible I could also find some reason it won't
> work after investigating/implementing/testing.

Well, for one thing, I don't think the benefit justifies the effort, but
then again, it's not my effort... I think there are simpler ways to get
sleepable context for the callers of dev_get_stats, I have some patches
that I'm working on. Not sure if you saw the discussion here.
https://lore.kernel.org/netdev/20201129205817.hti2l4hm2fbp2iwy@skbuf/
Let that not stop you, though.
Vladimir Oltean Dec. 2, 2020, 12:28 a.m. UTC | #34
Hi Jakub,

On Fri, Nov 27, 2020 at 10:36:42PM +0100, Andrew Lunn wrote:
> > Either way, can we conclude that ndo_get_stats64 is not a replacement
> > for ethtool -S, since the latter is blocking and, if implemented correctly,
> > can return the counters at the time of the call (therefore making sure
> > that anything that happened before the syscall has been accounted into
> > the retrieved values), and the former isn't?
>
> ethtool -S is the best source of consistent, up to date statistics we
> have. It seems silly not to include everything the hardware offers
> there.

To add to this, it would seem odd to me if we took the decision to not
expose MAC-level counters any longer in ethtool. Say the MAC has a counter
named rx_dropped. If we are only exposing this counter in ndo_get_stats64,
then we could hit the scenario where this counter keeps incrementing,
but it is the network stack that increments it, and not the MAC.

dev_get_stats() currently does:
	storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
	storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
	storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);

thereby clobbering the MAC-provided counter. We would not know if it is
a MAC-level drop or not.
Jakub Kicinski Dec. 2, 2020, 12:54 a.m. UTC | #35
On Wed, 2 Dec 2020 02:28:51 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 10:36:42PM +0100, Andrew Lunn wrote:
> > > Either way, can we conclude that ndo_get_stats64 is not a replacement
> > > for ethtool -S, since the latter is blocking and, if implemented correctly,
> > > can return the counters at the time of the call (therefore making sure
> > > that anything that happened before the syscall has been accounted into
> > > the retrieved values), and the former isn't?  
> >
> > ethtool -S is the best source of consistent, up to date statistics we
> > have. It seems silly not to include everything the hardware offers
> > there.  
> 
> To add to this, it would seem odd to me if we took the decision to not
> expose MAC-level counters any longer in ethtool. Say the MAC has a counter
> named rx_dropped. If we are only exposing this counter in ndo_get_stats64,
> then we could hit the scenario where this counter keeps incrementing,
> but it is the network stack who increments it, and not the MAC.
> 
> dev_get_stats() currently does:
> 	storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
> 	storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
> 	storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);
> 
> thereby clobbering the MAC-provided counter. We would not know if it is
> a MAC-level drop or not.

Fine granularity HW stats are fine, but the aggregate must be reported
in standard stats first.

The correct stat for MAC drops (AFAIU) is rx_missed.

This should act as a generic "device had to drop valid packets"
indication and ethtool -S should serve for manual debugging to find 
out which stage of pipeline / reason caused the drop.

Patch

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index f6a0488589fc..3af373e90806 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -60,6 +60,8 @@  source "drivers/net/dsa/qca/Kconfig"
 
 source "drivers/net/dsa/sja1105/Kconfig"
 
+source "drivers/net/dsa/xrs700x/Kconfig"
+
 config NET_DSA_QCA8K
 	tristate "Qualcomm Atheros QCA8K Ethernet switch family support"
 	depends on NET_DSA
diff --git a/drivers/net/dsa/Makefile b/drivers/net/dsa/Makefile
index a84adb140a04..f3598c040994 100644
--- a/drivers/net/dsa/Makefile
+++ b/drivers/net/dsa/Makefile
@@ -24,3 +24,4 @@  obj-y				+= mv88e6xxx/
 obj-y				+= ocelot/
 obj-y				+= qca/
 obj-y				+= sja1105/
+obj-y				+= xrs700x/
diff --git a/drivers/net/dsa/xrs700x/Kconfig b/drivers/net/dsa/xrs700x/Kconfig
new file mode 100644
index 000000000000..d10a4dce1676
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/Kconfig
@@ -0,0 +1,26 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+config NET_DSA_XRS700X
+	tristate
+	depends on NET_DSA
+	select NET_DSA_TAG_XRS700X
+	select REGMAP
+	help
+	  This enables support for Arrow SpeedChips XRS7003/7004 gigabit
+	  Ethernet switches.
+
+config NET_DSA_XRS700X_I2C
+	tristate "Arrow XRS7000X series switch in I2C mode"
+	depends on NET_DSA && I2C
+	select NET_DSA_XRS700X
+	select REGMAP_I2C
+	help
+	  Enable I2C support for Arrow SpeedChips XRS7003/7004 gigabit Ethernet
+	  switches.
+
+config NET_DSA_XRS700X_MDIO
+	tristate "Arrow XRS7000X series switch in MDIO mode"
+	depends on NET_DSA
+	select NET_DSA_XRS700X
+	help
+	  Enable MDIO support for Arrow SpeedChips XRS7003/7004 gigabit Ethernet
+	  switches.
diff --git a/drivers/net/dsa/xrs700x/Makefile b/drivers/net/dsa/xrs700x/Makefile
new file mode 100644
index 000000000000..51a3a7d9296a
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/Makefile
@@ -0,0 +1,4 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NET_DSA_XRS700X) += xrs700x.o
+obj-$(CONFIG_NET_DSA_XRS700X_I2C) += xrs700x_i2c.o
+obj-$(CONFIG_NET_DSA_XRS700X_MDIO) += xrs700x_mdio.o
diff --git a/drivers/net/dsa/xrs700x/xrs700x.c b/drivers/net/dsa/xrs700x/xrs700x.c
new file mode 100644
index 000000000000..f3abdb0eeeec
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/xrs700x.c
@@ -0,0 +1,585 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 NovaTech LLC
+ * George McCollister <george.mccollister@gmail.com>
+ */
+
+#include <net/dsa.h>
+#include <linux/if_bridge.h>
+#include <linux/of_device.h>
+#include "xrs700x.h"
+#include "xrs700x_reg.h"
+
+#define XRS700X_MIB_INTERVAL msecs_to_jiffies(30000)
+
+#define XRS7003E_ID	0x100
+#define XRS7003F_ID	0x101
+#define XRS7004E_ID	0x200
+#define XRS7004F_ID	0x201
+
+const struct xrs700x_info xrs7003e_info = {XRS7003E_ID, "XRS7003E", 3};
+const struct xrs700x_info xrs7003f_info = {XRS7003F_ID, "XRS7003F", 3};
+const struct xrs700x_info xrs7004e_info = {XRS7004E_ID, "XRS7004E", 4};
+const struct xrs700x_info xrs7004f_info = {XRS7004F_ID, "XRS7004F", 4};
+
+struct xrs700x_regfield {
+	struct reg_field rf;
+	struct regmap_field **rmf;
+};
+
+struct xrs700x_mib {
+	unsigned int offset;
+	const char *name;
+};
+
+static const struct xrs700x_mib xrs700x_mibs[] = {
+	{XRS_RX_GOOD_OCTETS_L, "rx_good_octets"},
+	{XRS_RX_BAD_OCTETS_L, "rx_bad_octets"},
+	{XRS_RX_UNICAST_L, "rx_unicast"},
+	{XRS_RX_BROADCAST_L, "rx_broadcast"},
+	{XRS_RX_MULTICAST_L, "rx_multicast"},
+	{XRS_RX_UNDERSIZE_L, "rx_undersize"},
+	{XRS_RX_FRAGMENTS_L, "rx_fragments"},
+	{XRS_RX_OVERSIZE_L, "rx_oversize"},
+	{XRS_RX_JABBER_L, "rx_jabber"},
+	{XRS_RX_ERR_L, "rx_err"},
+	{XRS_RX_CRC_L, "rx_crc"},
+	{XRS_RX_64_L, "rx_64"},
+	{XRS_RX_65_127_L, "rx_65_127"},
+	{XRS_RX_128_255_L, "rx_128_255"},
+	{XRS_RX_256_511_L, "rx_256_511"},
+	{XRS_RX_512_1023_L, "rx_512_1023"},
+	{XRS_RX_1024_1536_L, "rx_1024_1536"},
+	{XRS_RX_HSR_PRP_L, "rx_hsr_prp"},
+	{XRS_RX_WRONGLAN_L, "rx_wronglan"},
+	{XRS_RX_DUPLICATE_L, "rx_duplicate"},
+	{XRS_TX_OCTETS_L, "tx_octets"},
+	{XRS_TX_UNICAST_L, "tx_unicast"},
+	{XRS_TX_BROADCAST_L, "tx_broadcast"},
+	{XRS_TX_MULTICAST_L, "tx_multicast"},
+	{XRS_TX_HSR_PRP_L, "tx_hsr_prp"},
+	{XRS_PRIQ_DROP_L, "priq_drop"},
+	{XRS_EARLY_DROP_L, "early_drop"},
+};
+
+static void xrs700x_get_strings(struct dsa_switch *ds, int port,
+				u32 stringset, u8 *data)
+{
+	int i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(xrs700x_mibs); i++) {
+		strlcpy(data, xrs700x_mibs[i].name, ETH_GSTRING_LEN);
+		data += ETH_GSTRING_LEN;
+	}
+}
+
+static int xrs700x_get_sset_count(struct dsa_switch *ds, int port, int sset)
+{
+	if (sset != ETH_SS_STATS)
+		return -EOPNOTSUPP;
+
+	return ARRAY_SIZE(xrs700x_mibs);
+}
+
+static void xrs700x_read_port_counters(struct xrs700x *priv, int port)
+{
+	struct xrs700x_port *p = &priv->ports[port];
+	int i;
+
+	mutex_lock(&p->mib_mutex);
+
+	/* Capture counter values */
+	regmap_write(priv->regmap, XRS_CNT_CTRL(port), 1);
+
+	for (i = 0; i < ARRAY_SIZE(xrs700x_mibs); i++) {
+		unsigned int high = 0, low = 0, reg;
+
+		reg = xrs700x_mibs[i].offset + XRS_PORT_OFFSET * port;
+		regmap_read(priv->regmap, reg, &low);
+		regmap_read(priv->regmap, reg + 2, &high);
+
+		p->mib_data[i] += (high << 16) | low;
+	}
+
+	mutex_unlock(&p->mib_mutex);
+}
+
+static void xrs700x_mib_work(struct work_struct *work)
+{
+	struct xrs700x *priv = container_of(work, struct xrs700x,
+					    mib_work.work);
+	int i;
+
+	for (i = 0; i < priv->ds->num_ports; i++)
+		xrs700x_read_port_counters(priv, i);
+
+	schedule_delayed_work(&priv->mib_work, XRS700X_MIB_INTERVAL);
+}
+
+static void xrs700x_get_ethtool_stats(struct dsa_switch *ds, int port,
+				      u64 *data)
+{
+	struct xrs700x *priv = ds->priv;
+	struct xrs700x_port *p = &priv->ports[port];
+
+	xrs700x_read_port_counters(priv, port);
+
+	mutex_lock(&p->mib_mutex);
+	memcpy(data, p->mib_data, sizeof(*data) * ARRAY_SIZE(xrs700x_mibs));
+	mutex_unlock(&p->mib_mutex);
+}
+
+static int xrs700x_setup_regmap_range(struct xrs700x *priv)
+{
+	struct xrs700x_regfield regfields[] = {
+		{
+			.rf = REG_FIELD_ID(XRS_PORT_STATE(0), 0, 1,
+					   priv->ds->num_ports,
+					   XRS_PORT_OFFSET),
+			.rmf = &priv->ps_forward
+		},
+		{
+			.rf = REG_FIELD_ID(XRS_PORT_STATE(0), 2, 3,
+					   priv->ds->num_ports,
+					   XRS_PORT_OFFSET),
+			.rmf = &priv->ps_management
+		},
+		{
+			.rf = REG_FIELD_ID(XRS_PORT_STATE(0), 4, 9,
+					   priv->ds->num_ports,
+					   XRS_PORT_OFFSET),
+			.rmf = &priv->ps_sel_speed
+		},
+		{
+			.rf = REG_FIELD_ID(XRS_PORT_STATE(0), 10, 11,
+					   priv->ds->num_ports,
+					   XRS_PORT_OFFSET),
+			.rmf = &priv->ps_cur_speed
+		}
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(regfields); i++) {
+		*regfields[i].rmf = devm_regmap_field_alloc(priv->dev,
+							    priv->regmap,
+							    regfields[i].rf);
+		if (IS_ERR(*regfields[i].rmf))
+			return PTR_ERR(*regfields[i].rmf);
+	}
+
+	return 0;
+}
+
+static enum dsa_tag_protocol xrs700x_get_tag_protocol(struct dsa_switch *ds,
+						      int port,
+						      enum dsa_tag_protocol m)
+{
+	return DSA_TAG_PROTO_XRS700X;
+}
+
+static int xrs700x_reset(struct dsa_switch *ds)
+{
+	struct xrs700x *priv = ds->priv;
+	unsigned int val;
+	int ret;
+
+	ret = regmap_write(priv->regmap, XRS_GENERAL, XRS_GENERAL_RESET);
+	if (ret)
+		goto error;
+
+	ret = regmap_read_poll_timeout(priv->regmap, XRS_GENERAL,
+				       val, !(val & XRS_GENERAL_RESET),
+				       10, 1000);
+error:
+	if (ret) {
+		dev_err_ratelimited(priv->dev, "error resetting switch: %d\n",
+				    ret);
+	}
+
+	return ret;
+}
+
+static void xrs700x_port_stp_state_set(struct dsa_switch *ds, int port,
+				       u8 state)
+{
+	struct xrs700x *priv = ds->priv;
+	unsigned int bpdus = 1;
+	unsigned int val;
+
+	switch (state) {
+	case BR_STATE_DISABLED:
+		bpdus = 0;
+		fallthrough;
+	case BR_STATE_BLOCKING:
+	case BR_STATE_LISTENING:
+		val = XRS_PORT_DISABLED;
+		break;
+	case BR_STATE_LEARNING:
+		val = XRS_PORT_LEARNING;
+		break;
+	case BR_STATE_FORWARDING:
+		val = XRS_PORT_FORWARDING;
+		break;
+	default:
+		dev_err(ds->dev, "invalid STP state: %d\n", state);
+		return;
+	}
+
+	regmap_fields_write(priv->ps_forward, port, val);
+
+	/* Enable/disable inbound policy added by xrs700x_port_add_bpdu_ipf()
+	 * which allows BPDU forwarding to the CPU port when the front facing
+	 * port is in disabled/learning state.
+	 */
+	regmap_update_bits(priv->regmap, XRS_ETH_ADDR_CFG(port, 0), 1, bpdus);
+
+	dev_dbg_ratelimited(priv->dev, "%s - port: %d, state: %u, val: 0x%x\n",
+			    __func__, port, state, val);
+}
+
+/* Add an inbound policy filter which matches the BPDU destination MAC
+ * and forwards to the CPU port. Leave the policy disabled, it will be
+ * enabled as needed.
+ */
+static int xrs700x_port_add_bpdu_ipf(struct dsa_switch *ds, int port)
+{
+	struct xrs700x *priv = ds->priv;
+	unsigned int val = 0;
+	int i = 0;
+	int ret;
+
+	/* Compare all 48 bits of the destination MAC address. */
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_CFG(port, 0), 48 << 2);
+	if (ret)
+		return ret;
+
+	/* match BPDU destination 01:80:c2:00:00:00 */
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_0(port, 0), 0x8001);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_1(port, 0), 0xc2);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_2(port, 0), 0x0);
+	if (ret)
+		return ret;
+
+	/* Mirror BPDU to CPU port */
+	for (i = 0; i < ds->num_ports; i++) {
+		if (dsa_is_cpu_port(ds, i))
+			val |= BIT(i);
+	}
+
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_FWD_MIRROR(port, 0), val);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, XRS_ETH_ADDR_FWD_ALLOW(port, 0), 0);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int xrs700x_port_setup(struct dsa_switch *ds, int port)
+{
+	bool cpu_port = dsa_is_cpu_port(ds, port);
+	struct xrs700x *priv = ds->priv;
+	unsigned int val = 0;
+	int ret, i;
+
+	xrs700x_port_stp_state_set(ds, port, BR_STATE_DISABLED);
+
+	/* Disable forwarding to non-CPU ports */
+	for (i = 0; i < ds->num_ports; i++) {
+		if (!dsa_is_cpu_port(ds, i))
+			val |= BIT(i);
+	}
+
+	ret = regmap_write(priv->regmap, XRS_PORT_FWD_MASK(port), val);
+	if (ret)
+		return ret;
+
+	val = cpu_port ? XRS_PORT_MODE_MANAGEMENT : XRS_PORT_MODE_NORMAL;
+	ret = regmap_fields_write(priv->ps_management, port, val);
+	if (ret)
+		return ret;
+
+	if (!cpu_port) {
+		ret = xrs700x_port_add_bpdu_ipf(ds, port);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int xrs700x_setup(struct dsa_switch *ds)
+{
+	struct xrs700x *priv = ds->priv;
+	int ret, i;
+
+	ret = xrs700x_reset(ds);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < ds->num_ports; i++) {
+		ret = xrs700x_port_setup(ds, i);
+		if (ret)
+			return ret;
+	}
+
+	schedule_delayed_work(&priv->mib_work, XRS700X_MIB_INTERVAL);
+
+	return 0;
+}
+
+static void xrs700x_teardown(struct dsa_switch *ds)
+{
+	struct xrs700x *priv = ds->priv;
+
+	cancel_delayed_work_sync(&priv->mib_work);
+}
+
+static void xrs700x_phylink_validate(struct dsa_switch *ds, int port,
+				     unsigned long *supported,
+				     struct phylink_link_state *state)
+{
+	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+	switch (port) {
+	case 0:
+		break;
+	case 1:
+	case 2:
+	case 3:
+		phylink_set(mask, 1000baseT_Full);
+		break;
+	default:
+		bitmap_zero(supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
+		dev_err(ds->dev, "Unsupported port: %i\n", port);
+		return;
+	}
+
+	phylink_set_port_modes(mask);
+
+	/* The switch only supports full duplex. */
+	phylink_set(mask, 10baseT_Full);
+	phylink_set(mask, 100baseT_Full);
+
+	bitmap_and(supported, supported, mask,
+		   __ETHTOOL_LINK_MODE_MASK_NBITS);
+	bitmap_and(state->advertising, state->advertising, mask,
+		   __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static void xrs700x_mac_link_up(struct dsa_switch *ds, int port,
+				unsigned int mode, phy_interface_t interface,
+				struct phy_device *phydev,
+				int speed, int duplex,
+				bool tx_pause, bool rx_pause)
+{
+	struct xrs700x *priv = ds->priv;
+	unsigned int val;
+
+	switch (speed) {
+	case SPEED_1000:
+		val = XRS_PORT_SPEED_1000;
+		break;
+	case SPEED_100:
+		val = XRS_PORT_SPEED_100;
+		break;
+	case SPEED_10:
+		val = XRS_PORT_SPEED_10;
+		break;
+	default:
+		return;
+	}
+
+	regmap_fields_write(priv->ps_sel_speed, port, val);
+
+	dev_dbg_ratelimited(priv->dev, "%s: port: %d mode: %u speed: %u\n",
+			    __func__, port, mode, speed);
+}
+
+static int xrs700x_bridge_common(struct dsa_switch *ds, int port,
+				 struct net_device *bridge, bool join)
+{
+	unsigned int i, cpu_mask = 0, mask = 0;
+	struct xrs700x *priv = ds->priv;
+	int ret;
+
+	for (i = 0; i < ds->num_ports; i++) {
+		if (dsa_is_cpu_port(ds, i))
+			continue;
+
+		cpu_mask |= BIT(i);
+
+		if (dsa_to_port(ds, i)->bridge_dev == bridge)
+			continue;
+
+		mask |= BIT(i);
+	}
+
+	for (i = 0; i < ds->num_ports; i++) {
+		if (dsa_to_port(ds, i)->bridge_dev != bridge)
+			continue;
+
+		ret = regmap_write(priv->regmap, XRS_PORT_FWD_MASK(i), mask);
+		if (ret)
+			return ret;
+	}
+
+	if (!join) {
+		ret = regmap_write(priv->regmap, XRS_PORT_FWD_MASK(port),
+				   cpu_mask);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int xrs700x_bridge_join(struct dsa_switch *ds, int port,
+			       struct net_device *bridge)
+{
+	return xrs700x_bridge_common(ds, port, bridge, true);
+}
+
+static void xrs700x_bridge_leave(struct dsa_switch *ds, int port,
+				 struct net_device *bridge)
+{
+	xrs700x_bridge_common(ds, port, bridge, false);
+}
+
+static const struct dsa_switch_ops xrs700x_ops = {
+	.get_tag_protocol	= xrs700x_get_tag_protocol,
+	.setup			= xrs700x_setup,
+	.teardown		= xrs700x_teardown,
+	.port_stp_state_set	= xrs700x_port_stp_state_set,
+	.phylink_validate	= xrs700x_phylink_validate,
+	.phylink_mac_link_up	= xrs700x_mac_link_up,
+	.get_strings		= xrs700x_get_strings,
+	.get_sset_count		= xrs700x_get_sset_count,
+	.get_ethtool_stats	= xrs700x_get_ethtool_stats,
+	.port_bridge_join	= xrs700x_bridge_join,
+	.port_bridge_leave	= xrs700x_bridge_leave,
+};
+
+static int xrs700x_detect(struct xrs700x *dev)
+{
+	const struct xrs700x_info *info;
+	unsigned int id;
+	int ret;
+
+	ret = regmap_read(dev->regmap, XRS_DEV_ID0, &id);
+	if (ret) {
+		dev_err(dev->dev, "error %d while reading switch id.\n", ret);
+		return ret;
+	}
+
+	info = of_device_get_match_data(dev->dev);
+	if (!info)
+		return -EINVAL;
+
+	if (info->id == id) {
+		dev->ds->num_ports = info->num_ports;
+		dev_info(dev->dev, "%s detected.\n", info->name);
+		return 0;
+	}
+
+	dev_err(dev->dev, "expected switch id 0x%x but found 0x%x.\n",
+		info->id, id);
+
+	return -ENODEV;
+}
+
+struct xrs700x *xrs700x_switch_alloc(struct device *base, void *priv)
+{
+	struct dsa_switch *ds;
+	struct xrs700x *dev;
+
+	ds = devm_kzalloc(base, sizeof(*ds), GFP_KERNEL);
+	if (!ds)
+		return NULL;
+
+	ds->dev = base;
+
+	dev = devm_kzalloc(base, sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return NULL;
+
+	INIT_DELAYED_WORK(&dev->mib_work, xrs700x_mib_work);
+
+	ds->ops = &xrs700x_ops;
+	ds->priv = dev;
+	dev->dev = base;
+
+	dev->ds = ds;
+	dev->priv = priv;
+
+	return dev;
+}
+EXPORT_SYMBOL(xrs700x_switch_alloc);
+
+static int xrs700x_alloc_port_mib(struct xrs700x *dev, int port)
+{
+	struct xrs700x_port *p = &dev->ports[port];
+	size_t mib_size = sizeof(*p->mib_data) * ARRAY_SIZE(xrs700x_mibs);
+
+	p->mib_data = devm_kzalloc(dev->dev, mib_size, GFP_KERNEL);
+	if (!p->mib_data)
+		return -ENOMEM;
+
+	mutex_init(&p->mib_mutex);
+
+	return 0;
+}
+
+int xrs700x_switch_register(struct xrs700x *dev)
+{
+	int ret;
+	int i;
+
+	ret = xrs700x_detect(dev);
+	if (ret)
+		return ret;
+
+	ret = xrs700x_setup_regmap_range(dev);
+	if (ret)
+		return ret;
+
+	dev->ports = devm_kcalloc(dev->dev, dev->ds->num_ports,
+				  sizeof(*dev->ports), GFP_KERNEL);
+	if (!dev->ports)
+		return -ENOMEM;
+
+	for (i = 0; i < dev->ds->num_ports; i++) {
+		ret = xrs700x_alloc_port_mib(dev, i);
+		if (ret)
+			return ret;
+	}
+
+	return dsa_register_switch(dev->ds);
+}
+EXPORT_SYMBOL(xrs700x_switch_register);
+
+void xrs700x_switch_remove(struct xrs700x *dev)
+{
+	cancel_delayed_work_sync(&dev->mib_work);
+
+	dsa_unregister_switch(dev->ds);
+}
+EXPORT_SYMBOL(xrs700x_switch_remove);
+
+MODULE_AUTHOR("George McCollister <george.mccollister@gmail.com>");
+MODULE_DESCRIPTION("Arrow SpeedChips XRS700x DSA driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/net/dsa/xrs700x/xrs700x.h b/drivers/net/dsa/xrs700x/xrs700x.h
new file mode 100644
index 000000000000..1f28b0ecebe0
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/xrs700x.h
@@ -0,0 +1,38 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/regmap.h>
+#include <linux/workqueue.h>
+
+struct xrs700x_info {
+	unsigned int id;
+	const char *name;
+	size_t num_ports;
+};
+
+extern const struct xrs700x_info xrs7003e_info;
+extern const struct xrs700x_info xrs7003f_info;
+extern const struct xrs700x_info xrs7004e_info;
+extern const struct xrs700x_info xrs7004f_info;
+
+struct xrs700x_port {
+	struct mutex mib_mutex; /* protects mib_data */
+	u64 *mib_data;
+};
+
+struct xrs700x {
+	struct dsa_switch *ds;
+	struct device *dev;
+	void *priv;
+	struct regmap *regmap;
+	struct regmap_field *ps_forward;
+	struct regmap_field *ps_management;
+	struct regmap_field *ps_sel_speed;
+	struct regmap_field *ps_cur_speed;
+	struct delayed_work mib_work;
+	struct xrs700x_port *ports;
+};
+
+struct xrs700x *xrs700x_switch_alloc(struct device *base, void *priv);
+int xrs700x_switch_register(struct xrs700x *dev);
+void xrs700x_switch_remove(struct xrs700x *dev);
diff --git a/drivers/net/dsa/xrs700x/xrs700x_i2c.c b/drivers/net/dsa/xrs700x/xrs700x_i2c.c
new file mode 100644
index 000000000000..fe64a7757f9e
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/xrs700x_i2c.c
@@ -0,0 +1,150 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 NovaTech LLC
+ * George McCollister <george.mccollister@gmail.com>
+ */
+
+#include <linux/bits.h>
+#include <linux/i2c.h>
+#include <linux/module.h>
+#include "xrs700x.h"
+#include "xrs700x_reg.h"
+
+static int xrs700x_i2c_reg_read(void *context, unsigned int reg,
+				unsigned int *val)
+{
+	struct device *dev = context;
+	struct i2c_client *i2c = to_i2c_client(dev);
+	unsigned char buf[4];
+	int ret;
+
+	/* Encode the 31-bit register address across four bytes; the
+	 * lowest bit of the last byte is the read (0) / write (1) flag.
+	 */
+	buf[0] = reg >> 23 & 0xff;
+	buf[1] = reg >> 15 & 0xff;
+	buf[2] = reg >> 7 & 0xff;
+	buf[3] = (reg & 0x7f) << 1;
+
+	ret = i2c_master_send(i2c, buf, sizeof(buf));
+	if (ret < 0) {
+		dev_err(dev, "xrs i2c_master_send returned %d\n", ret);
+		return ret;
+	}
+
+	ret = i2c_master_recv(i2c, buf, 2);
+	if (ret < 0) {
+		dev_err(dev, "xrs i2c_master_recv returned %d\n", ret);
+		return ret;
+	}
+
+	*val = buf[0] << 8 | buf[1];
+
+	return 0;
+}
+
+static int xrs700x_i2c_reg_write(void *context, unsigned int reg,
+				 unsigned int val)
+{
+	struct device *dev = context;
+	struct i2c_client *i2c = to_i2c_client(dev);
+	unsigned char buf[6];
+	int ret;
+
+	buf[0] = reg >> 23 & 0xff;
+	buf[1] = reg >> 15 & 0xff;
+	buf[2] = reg >> 7 & 0xff;
+	buf[3] = (reg & 0x7f) << 1 | 1;
+	buf[4] = val >> 8 & 0xff;
+	buf[5] = val & 0xff;
+
+	ret = i2c_master_send(i2c, buf, sizeof(buf));
+	if (ret < 0) {
+		dev_err(dev, "xrs i2c_master_send returned %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static const struct regmap_config xrs700x_i2c_regmap_config = {
+	.val_bits = 16,
+	.reg_stride = 2,
+	.reg_bits = 32,
+	.pad_bits = 0,
+	.write_flag_mask = 0,
+	.read_flag_mask = 0,
+	.reg_read = xrs700x_i2c_reg_read,
+	.reg_write = xrs700x_i2c_reg_write,
+	.max_register = 0,
+	.cache_type = REGCACHE_NONE,
+	.reg_format_endian = REGMAP_ENDIAN_BIG,
+	.val_format_endian = REGMAP_ENDIAN_BIG
+};
+
+static int xrs700x_i2c_probe(struct i2c_client *i2c,
+			     const struct i2c_device_id *i2c_id)
+{
+	struct xrs700x *dev;
+	int ret;
+
+	dev = xrs700x_switch_alloc(&i2c->dev, i2c);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->regmap = devm_regmap_init(&i2c->dev, NULL, &i2c->dev,
+				       &xrs700x_i2c_regmap_config);
+	if (IS_ERR(dev->regmap)) {
+		ret = PTR_ERR(dev->regmap);
+		dev_err(&i2c->dev, "Failed to initialize regmap: %d\n", ret);
+		return ret;
+	}
+
+	ret = xrs700x_switch_register(dev);
+
+	/* May fail with -EPROBE_DEFER if the DSA master is not ready yet. */
+	if (ret)
+		return ret;
+
+	i2c_set_clientdata(i2c, dev);
+
+	return 0;
+}
+
+static int xrs700x_i2c_remove(struct i2c_client *i2c)
+{
+	struct xrs700x *dev = i2c_get_clientdata(i2c);
+
+	xrs700x_switch_remove(dev);
+
+	return 0;
+}
+
+static const struct i2c_device_id xrs700x_i2c_id[] = {
+	{ "xrs700x-switch", 0 },
+	{},
+};
+
+MODULE_DEVICE_TABLE(i2c, xrs700x_i2c_id);
+
+static const struct of_device_id xrs700x_i2c_dt_ids[] = {
+	{ .compatible = "arrow,xrs7003e", .data = &xrs7003e_info },
+	{ .compatible = "arrow,xrs7003f", .data = &xrs7003f_info },
+	{ .compatible = "arrow,xrs7004e", .data = &xrs7004e_info },
+	{ .compatible = "arrow,xrs7004f", .data = &xrs7004f_info },
+	{},
+};
+MODULE_DEVICE_TABLE(of, xrs700x_i2c_dt_ids);
+
+static struct i2c_driver xrs700x_i2c_driver = {
+	.driver = {
+		.name	= "xrs700x-i2c",
+		.of_match_table = of_match_ptr(xrs700x_i2c_dt_ids),
+	},
+	.probe	= xrs700x_i2c_probe,
+	.remove	= xrs700x_i2c_remove,
+	.id_table = xrs700x_i2c_id,
+};
+
+module_i2c_driver(xrs700x_i2c_driver);
+
+MODULE_AUTHOR("George McCollister <george.mccollister@gmail.com>");
+MODULE_DESCRIPTION("Arrow SpeedChips XRS700x DSA I2C driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/net/dsa/xrs700x/xrs700x_mdio.c b/drivers/net/dsa/xrs700x/xrs700x_mdio.c
new file mode 100644
index 000000000000..4fa6cc8f871c
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/xrs700x_mdio.c
@@ -0,0 +1,162 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 NovaTech LLC
+ * George McCollister <george.mccollister@gmail.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+#include <linux/mdio.h>
+#include <linux/module.h>
+#include <linux/phy.h>
+#include "xrs700x.h"
+#include "xrs700x_reg.h"
+
+#define XRS_MDIO_IBA0	0x10
+#define XRS_MDIO_IBA1	0x11
+#define XRS_MDIO_IBD	0x14
+
+#define XRS_IB_READ	0x0
+#define XRS_IB_WRITE	0x1
+
+static int xrs700x_mdio_reg_read(void *context, unsigned int reg,
+				 unsigned int *val)
+{
+	struct mdio_device *mdiodev = context;
+	struct device *dev = &mdiodev->dev;
+	u16 uval;
+	int ret;
+
+	uval = (u16)FIELD_GET(GENMASK(31, 16), reg);
+
+	ret = mdiobus_write(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBA1, uval);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_write returned %d\n", ret);
+		return ret;
+	}
+
+	uval = (u16)((reg & GENMASK(15, 1)) | XRS_IB_READ);
+
+	ret = mdiobus_write(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBA0, uval);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_write returned %d\n", ret);
+		return ret;
+	}
+
+	ret = mdiobus_read(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBD);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_read returned %d\n", ret);
+		return ret;
+	}
+
+	*val = (unsigned int)ret;
+
+	return 0;
+}
+
+static int xrs700x_mdio_reg_write(void *context, unsigned int reg,
+				  unsigned int val)
+{
+	struct mdio_device *mdiodev = context;
+	struct device *dev = &mdiodev->dev;
+	u16 uval;
+	int ret;
+
+	ret = mdiobus_write(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBD,
+			    (u16)val);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_write returned %d\n", ret);
+		return ret;
+	}
+
+	uval = (u16)FIELD_GET(GENMASK(31, 16), reg);
+
+	ret = mdiobus_write(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBA1, uval);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_write returned %d\n", ret);
+		return ret;
+	}
+
+	uval = (u16)((reg & GENMASK(15, 1)) | XRS_IB_WRITE);
+
+	ret = mdiobus_write(mdiodev->bus, mdiodev->addr, XRS_MDIO_IBA0, uval);
+	if (ret < 0) {
+		dev_err(dev, "xrs mdiobus_write returned %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static const struct regmap_config xrs700x_mdio_regmap_config = {
+	.val_bits = 16,
+	.reg_stride = 2,
+	.reg_bits = 32,
+	.pad_bits = 0,
+	.write_flag_mask = 0,
+	.read_flag_mask = 0,
+	.reg_read = xrs700x_mdio_reg_read,
+	.reg_write = xrs700x_mdio_reg_write,
+	.max_register = XRS_VLAN(MAX_VLAN),
+	.cache_type = REGCACHE_NONE,
+	.reg_format_endian = REGMAP_ENDIAN_BIG,
+	.val_format_endian = REGMAP_ENDIAN_BIG
+};
+
+static int xrs700x_mdio_probe(struct mdio_device *mdiodev)
+{
+	struct xrs700x *dev;
+	int ret;
+
+	dev = xrs700x_switch_alloc(&mdiodev->dev, mdiodev);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->regmap = devm_regmap_init(&mdiodev->dev, NULL, mdiodev,
+				       &xrs700x_mdio_regmap_config);
+	if (IS_ERR(dev->regmap)) {
+		ret = PTR_ERR(dev->regmap);
+		dev_err(&mdiodev->dev, "Failed to initialize regmap: %d\n", ret);
+		return ret;
+	}
+
+	ret = xrs700x_switch_register(dev);
+
+	/* May fail with -EPROBE_DEFER if the DSA master is not ready yet. */
+	if (ret)
+		return ret;
+
+	dev_set_drvdata(&mdiodev->dev, dev);
+
+	return 0;
+}
+
+static void xrs700x_mdio_remove(struct mdio_device *mdiodev)
+{
+	struct xrs700x *dev = dev_get_drvdata(&mdiodev->dev);
+
+	xrs700x_switch_remove(dev);
+}
+
+static const struct of_device_id xrs700x_mdio_dt_ids[] = {
+	{ .compatible = "arrow,xrs7003e", .data = &xrs7003e_info },
+	{ .compatible = "arrow,xrs7003f", .data = &xrs7003f_info },
+	{ .compatible = "arrow,xrs7004e", .data = &xrs7004e_info },
+	{ .compatible = "arrow,xrs7004f", .data = &xrs7004f_info },
+	{},
+};
+MODULE_DEVICE_TABLE(of, xrs700x_mdio_dt_ids);
+
+static struct mdio_driver xrs700x_mdio_driver = {
+	.mdiodrv.driver = {
+		.name	= "xrs700x-mdio",
+		.of_match_table = xrs700x_mdio_dt_ids,
+	},
+	.probe	= xrs700x_mdio_probe,
+	.remove	= xrs700x_mdio_remove,
+};
+
+mdio_module_driver(xrs700x_mdio_driver);
+
+MODULE_AUTHOR("George McCollister <george.mccollister@gmail.com>");
+MODULE_DESCRIPTION("Arrow SpeedChips XRS700x DSA MDIO driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/net/dsa/xrs700x/xrs700x_reg.h b/drivers/net/dsa/xrs700x/xrs700x_reg.h
new file mode 100644
index 000000000000..d162d0b84bac
--- /dev/null
+++ b/drivers/net/dsa/xrs700x/xrs700x_reg.h
@@ -0,0 +1,205 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Register Base Addresses */
+#define XRS_DEVICE_ID_BASE		0x0
+#define XRS_GPIO_BASE			0x10000
+#define XRS_PORT_OFFSET			0x10000
+#define XRS_PORT_BASE(x)		(0x200000 + XRS_PORT_OFFSET * (x))
+#define XRS_RTC_BASE			0x280000
+#define XRS_TS_OFFSET			0x8000
+#define XRS_TS_BASE(x)			(0x290000 + XRS_TS_OFFSET * (x))
+#define XRS_SWITCH_CONF_BASE		0x300000
+
+/* Device Identification Registers */
+#define XRS_DEV_ID0			(XRS_DEVICE_ID_BASE + 0)
+#define XRS_DEV_ID1			(XRS_DEVICE_ID_BASE + 2)
+#define XRS_INT_ID0			(XRS_DEVICE_ID_BASE + 4)
+#define XRS_INT_ID1			(XRS_DEVICE_ID_BASE + 6)
+#define XRS_REV_ID			(XRS_DEVICE_ID_BASE + 8)
+
+/* GPIO Registers */
+#define XRS_CONFIG0			(XRS_GPIO_BASE + 0x1000)
+#define XRS_INPUT_STATUS0		(XRS_GPIO_BASE + 0x1002)
+#define XRS_CONFIG1			(XRS_GPIO_BASE + 0x1004)
+#define XRS_INPUT_STATUS1		(XRS_GPIO_BASE + 0x1006)
+#define XRS_CONFIG2			(XRS_GPIO_BASE + 0x1008)
+#define XRS_INPUT_STATUS2		(XRS_GPIO_BASE + 0x100a)
+
+/* Port Configuration Registers */
+#define XRS_PORT_GEN_BASE(x)		(XRS_PORT_BASE(x) + 0x0)
+#define XRS_PORT_HSR_BASE(x)		(XRS_PORT_BASE(x) + 0x2000)
+#define XRS_PORT_PTP_BASE(x)		(XRS_PORT_BASE(x) + 0x4000)
+#define XRS_PORT_CNT_BASE(x)		(XRS_PORT_BASE(x) + 0x6000)
+#define XRS_PORT_IPO_BASE(x)		(XRS_PORT_BASE(x) + 0x8000)
+
+/* Port Configuration Registers - General and State */
+#define XRS_PORT_STATE(x)		(XRS_PORT_GEN_BASE(x) + 0x0)
+#define XRS_PORT_FORWARDING		0
+#define XRS_PORT_LEARNING		1
+#define XRS_PORT_DISABLED		2
+#define XRS_PORT_MODE_NORMAL		0
+#define XRS_PORT_MODE_MANAGEMENT	1
+#define XRS_PORT_SPEED_1000		0x12
+#define XRS_PORT_SPEED_100		0x20
+#define XRS_PORT_SPEED_10		0x30
+#define XRS_PORT_VLAN(x)		(XRS_PORT_GEN_BASE(x) + 0x10)
+#define XRS_PORT_VLAN0_MAPPING(x)	(XRS_PORT_GEN_BASE(x) + 0x12)
+#define XRS_PORT_FWD_MASK(x)		(XRS_PORT_GEN_BASE(x) + 0x14)
+#define XRS_PORT_VLAN_PRIO(x)		(XRS_PORT_GEN_BASE(x) + 0x16)
+
+/* Port Configuration Registers - HSR/PRP */
+#define XRS_HSR_CFG(x)			(XRS_PORT_HSR_BASE(x) + 0x0)
+
+/* Port Configuration Registers - PTP */
+#define XRS_PTP_RX_SYNC_DELAY_NS_LO(x)	(XRS_PORT_PTP_BASE(x) + 0x2)
+#define XRS_PTP_RX_SYNC_DELAY_NS_HI(x)	(XRS_PORT_PTP_BASE(x) + 0x4)
+#define XRS_PTP_RX_EVENT_DELAY_NS(x)	(XRS_PORT_PTP_BASE(x) + 0xa)
+#define XRS_PTP_TX_EVENT_DELAY_NS(x)	(XRS_PORT_PTP_BASE(x) + 0x12)
+
+/* Port Configuration Registers - Counter */
+#define XRS_CNT_CTRL(x)			(XRS_PORT_CNT_BASE(x) + 0x0)
+#define XRS_RX_GOOD_OCTETS_L		(XRS_PORT_CNT_BASE(0) + 0x200)
+#define XRS_RX_GOOD_OCTETS_H		(XRS_PORT_CNT_BASE(0) + 0x202)
+#define XRS_RX_BAD_OCTETS_L		(XRS_PORT_CNT_BASE(0) + 0x204)
+#define XRS_RX_BAD_OCTETS_H		(XRS_PORT_CNT_BASE(0) + 0x206)
+#define XRS_RX_UNICAST_L		(XRS_PORT_CNT_BASE(0) + 0x208)
+#define XRS_RX_UNICAST_H		(XRS_PORT_CNT_BASE(0) + 0x20a)
+#define XRS_RX_BROADCAST_L		(XRS_PORT_CNT_BASE(0) + 0x20c)
+#define XRS_RX_BROADCAST_H		(XRS_PORT_CNT_BASE(0) + 0x20e)
+#define XRS_RX_MULTICAST_L		(XRS_PORT_CNT_BASE(0) + 0x210)
+#define XRS_RX_MULTICAST_H		(XRS_PORT_CNT_BASE(0) + 0x212)
+#define XRS_RX_UNDERSIZE_L		(XRS_PORT_CNT_BASE(0) + 0x214)
+#define XRS_RX_UNDERSIZE_H		(XRS_PORT_CNT_BASE(0) + 0x216)
+#define XRS_RX_FRAGMENTS_L		(XRS_PORT_CNT_BASE(0) + 0x218)
+#define XRS_RX_FRAGMENTS_H		(XRS_PORT_CNT_BASE(0) + 0x21a)
+#define XRS_RX_OVERSIZE_L		(XRS_PORT_CNT_BASE(0) + 0x21c)
+#define XRS_RX_OVERSIZE_H		(XRS_PORT_CNT_BASE(0) + 0x21e)
+#define XRS_RX_JABBER_L			(XRS_PORT_CNT_BASE(0) + 0x220)
+#define XRS_RX_JABBER_H			(XRS_PORT_CNT_BASE(0) + 0x222)
+#define XRS_RX_ERR_L			(XRS_PORT_CNT_BASE(0) + 0x224)
+#define XRS_RX_ERR_H			(XRS_PORT_CNT_BASE(0) + 0x226)
+#define XRS_RX_CRC_L			(XRS_PORT_CNT_BASE(0) + 0x228)
+#define XRS_RX_CRC_H			(XRS_PORT_CNT_BASE(0) + 0x22a)
+#define XRS_RX_64_L			(XRS_PORT_CNT_BASE(0) + 0x22c)
+#define XRS_RX_64_H			(XRS_PORT_CNT_BASE(0) + 0x22e)
+#define XRS_RX_65_127_L			(XRS_PORT_CNT_BASE(0) + 0x230)
+#define XRS_RX_65_127_H			(XRS_PORT_CNT_BASE(0) + 0x232)
+#define XRS_RX_128_255_L		(XRS_PORT_CNT_BASE(0) + 0x234)
+#define XRS_RX_128_255_H		(XRS_PORT_CNT_BASE(0) + 0x236)
+#define XRS_RX_256_511_L		(XRS_PORT_CNT_BASE(0) + 0x238)
+#define XRS_RX_256_511_H		(XRS_PORT_CNT_BASE(0) + 0x23a)
+#define XRS_RX_512_1023_L		(XRS_PORT_CNT_BASE(0) + 0x23c)
+#define XRS_RX_512_1023_H		(XRS_PORT_CNT_BASE(0) + 0x23e)
+#define XRS_RX_1024_1536_L		(XRS_PORT_CNT_BASE(0) + 0x240)
+#define XRS_RX_1024_1536_H		(XRS_PORT_CNT_BASE(0) + 0x242)
+#define XRS_RX_HSR_PRP_L		(XRS_PORT_CNT_BASE(0) + 0x244)
+#define XRS_RX_HSR_PRP_H		(XRS_PORT_CNT_BASE(0) + 0x246)
+#define XRS_RX_WRONGLAN_L		(XRS_PORT_CNT_BASE(0) + 0x248)
+#define XRS_RX_WRONGLAN_H		(XRS_PORT_CNT_BASE(0) + 0x24a)
+#define XRS_RX_DUPLICATE_L		(XRS_PORT_CNT_BASE(0) + 0x24c)
+#define XRS_RX_DUPLICATE_H		(XRS_PORT_CNT_BASE(0) + 0x24e)
+#define XRS_TX_OCTETS_L			(XRS_PORT_CNT_BASE(0) + 0x280)
+#define XRS_TX_OCTETS_H			(XRS_PORT_CNT_BASE(0) + 0x282)
+#define XRS_TX_UNICAST_L		(XRS_PORT_CNT_BASE(0) + 0x284)
+#define XRS_TX_UNICAST_H		(XRS_PORT_CNT_BASE(0) + 0x286)
+#define XRS_TX_BROADCAST_L		(XRS_PORT_CNT_BASE(0) + 0x288)
+#define XRS_TX_BROADCAST_H		(XRS_PORT_CNT_BASE(0) + 0x28a)
+#define XRS_TX_MULTICAST_L		(XRS_PORT_CNT_BASE(0) + 0x28c)
+#define XRS_TX_MULTICAST_H		(XRS_PORT_CNT_BASE(0) + 0x28e)
+#define XRS_TX_HSR_PRP_L		(XRS_PORT_CNT_BASE(0) + 0x290)
+#define XRS_TX_HSR_PRP_H		(XRS_PORT_CNT_BASE(0) + 0x292)
+#define XRS_PRIQ_DROP_L			(XRS_PORT_CNT_BASE(0) + 0x2c0)
+#define XRS_PRIQ_DROP_H			(XRS_PORT_CNT_BASE(0) + 0x2c2)
+#define XRS_EARLY_DROP_L		(XRS_PORT_CNT_BASE(0) + 0x2c4)
+#define XRS_EARLY_DROP_H		(XRS_PORT_CNT_BASE(0) + 0x2c6)
+
+/* Port Configuration Registers - Inbound Policy 0 - 15 */
+#define XRS_ETH_ADDR_CFG(x, p)		(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0x0)
+#define XRS_ETH_ADDR_FWD_ALLOW(x, p)	(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0x2)
+#define XRS_ETH_ADDR_FWD_MIRROR(x, p)	(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0x4)
+#define XRS_ETH_ADDR_0(x, p)		(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0x8)
+#define XRS_ETH_ADDR_1(x, p)		(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0xa)
+#define XRS_ETH_ADDR_2(x, p)		(XRS_PORT_IPO_BASE(x) + \
+					 (p) * 0x20 + 0xc)
+
+/* RTC Registers */
+#define XRS_CUR_NSEC0			(XRS_RTC_BASE + 0x1004)
+#define XRS_CUR_NSEC1			(XRS_RTC_BASE + 0x1006)
+#define XRS_CUR_SEC0			(XRS_RTC_BASE + 0x1008)
+#define XRS_CUR_SEC1			(XRS_RTC_BASE + 0x100a)
+#define XRS_CUR_SEC2			(XRS_RTC_BASE + 0x100c)
+#define XRS_TIME_CC0			(XRS_RTC_BASE + 0x1010)
+#define XRS_TIME_CC1			(XRS_RTC_BASE + 0x1012)
+#define XRS_TIME_CC2			(XRS_RTC_BASE + 0x1014)
+#define XRS_STEP_SIZE0			(XRS_RTC_BASE + 0x1020)
+#define XRS_STEP_SIZE1			(XRS_RTC_BASE + 0x1022)
+#define XRS_STEP_SIZE2			(XRS_RTC_BASE + 0x1024)
+#define XRS_ADJUST_NSEC0		(XRS_RTC_BASE + 0x1034)
+#define XRS_ADJUST_NSEC1		(XRS_RTC_BASE + 0x1036)
+#define XRS_ADJUST_SEC0			(XRS_RTC_BASE + 0x1038)
+#define XRS_ADJUST_SEC1			(XRS_RTC_BASE + 0x103a)
+#define XRS_ADJUST_SEC2			(XRS_RTC_BASE + 0x103c)
+#define XRS_TIME_CMD			(XRS_RTC_BASE + 0x1040)
+
+/* Time Stamper Registers */
+#define XRS_TS_CTRL(x)			(XRS_TS_BASE(x) + 0x1000)
+#define XRS_TS_INT_MASK(x)		(XRS_TS_BASE(x) + 0x1008)
+#define XRS_TS_INT_STATUS(x)		(XRS_TS_BASE(x) + 0x1010)
+#define XRS_TS_NSEC0(x)			(XRS_TS_BASE(x) + 0x1104)
+#define XRS_TS_NSEC1(x)			(XRS_TS_BASE(x) + 0x1106)
+#define XRS_TS_SEC0(x)			(XRS_TS_BASE(x) + 0x1108)
+#define XRS_TS_SEC1(x)			(XRS_TS_BASE(x) + 0x110a)
+#define XRS_TS_SEC2(x)			(XRS_TS_BASE(x) + 0x110c)
+#define XRS_PNCT0(x)			(XRS_TS_BASE(x) + 0x1110)
+#define XRS_PNCT1(x)			(XRS_TS_BASE(x) + 0x1112)
+
+/* Switch Configuration Registers */
+#define XRS_SWITCH_GEN_BASE		(XRS_SWITCH_CONF_BASE + 0x0)
+#define XRS_SWITCH_TS_BASE		(XRS_SWITCH_CONF_BASE + 0x2000)
+#define XRS_SWITCH_VLAN_BASE		(XRS_SWITCH_CONF_BASE + 0x4000)
+
+/* Switch Configuration Registers - General */
+#define XRS_GENERAL			(XRS_SWITCH_GEN_BASE + 0x10)
+#define XRS_GENERAL_TIME_TRAILER	BIT(9)
+#define XRS_GENERAL_MOD_SYNC		BIT(10)
+#define XRS_GENERAL_CUT_THRU		BIT(13)
+#define XRS_GENERAL_CLR_MAC_TBL		BIT(14)
+#define XRS_GENERAL_RESET		BIT(15)
+#define XRS_MT_CLEAR_MASK		(XRS_SWITCH_GEN_BASE + 0x12)
+#define XRS_ADDRESS_AGING		(XRS_SWITCH_GEN_BASE + 0x20)
+#define XRS_TS_CTRL_TX			(XRS_SWITCH_GEN_BASE + 0x28)
+#define XRS_TS_CTRL_RX			(XRS_SWITCH_GEN_BASE + 0x2a)
+#define XRS_INT_MASK			(XRS_SWITCH_GEN_BASE + 0x2c)
+#define XRS_INT_STATUS			(XRS_SWITCH_GEN_BASE + 0x2e)
+#define XRS_MAC_TABLE0			(XRS_SWITCH_GEN_BASE + 0x200)
+#define XRS_MAC_TABLE1			(XRS_SWITCH_GEN_BASE + 0x202)
+#define XRS_MAC_TABLE2			(XRS_SWITCH_GEN_BASE + 0x204)
+#define XRS_MAC_TABLE3			(XRS_SWITCH_GEN_BASE + 0x206)
+
+/* Switch Configuration Registers - Frame Timestamp */
+#define XRS_TX_TS_NS_LO(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + 0x0)
+#define XRS_TX_TS_NS_HI(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + 0x2)
+#define XRS_TX_TS_S_LO(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + 0x4)
+#define XRS_TX_TS_S_HI(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + 0x6)
+#define XRS_TX_TS_HDR(t, h)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x2 * (h) + 0xe)
+#define XRS_RX_TS_NS_LO(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x200)
+#define XRS_RX_TS_NS_HI(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x202)
+#define XRS_RX_TS_S_LO(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x204)
+#define XRS_RX_TS_S_HI(t)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x206)
+#define XRS_RX_TS_HDR(t, h)		(XRS_SWITCH_TS_BASE + 0x80 * (t) + \
+					 0x2 * (h) + 0xe)
+
+/* Switch Configuration Registers - VLAN */
+#define XRS_VLAN(v)			(XRS_SWITCH_VLAN_BASE + 0x2 * (v))
+
+#define MAX_VLAN			4095
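As a cross-check of the I2C register access scheme above, the 6-byte write command built by xrs700x_i2c_reg_write can be sketched in Python: a 31-bit register address spread across four bytes, the read/write flag in the lowest bit of the fourth byte, followed by the 16-bit value big-endian. The example frame below uses XRS_GENERAL (0x300010) and XRS_GENERAL_RESET (bit 15) from the register header; it is an illustration of the framing, not a hardware transaction.

```python
# Sketch of the I2C write command framing used by the xrs700x driver:
# four address bytes (31-bit register address + R/W flag) then 16-bit data.

def encode_i2c_write(reg, val):
    """Build the 6-byte frame sent via i2c_master_send for a write."""
    return bytes([
        reg >> 23 & 0xFF,       # address bits 30..23
        reg >> 15 & 0xFF,       # address bits 22..15
        reg >> 7 & 0xFF,        # address bits 14..7
        (reg & 0x7F) << 1 | 1,  # address bits 6..0, R/W flag (1 = write)
        val >> 8 & 0xFF,        # data high byte
        val & 0xFF,             # data low byte
    ])

# Trigger a switch reset: XRS_GENERAL = 0x300010, XRS_GENERAL_RESET = 0x8000
frame = encode_i2c_write(0x300010, 0x8000)
assert frame == bytes([0x00, 0x60, 0x00, 0x21, 0x80, 0x00])
```

A read uses the same four address bytes with the flag bit clear, followed by a 2-byte receive, which matches the read/write pair in xrs700x_i2c.c.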